Grep solutions tab-delimited file

andmal · April 26, 2018, 5:23pm

Hello, I am trying to find a solution to problem that's proving to be beyond my newbie skills. The below files comes from a genetics study. File 1 describes a position on the genome and file 2 does the same but is formatted differently and has more information. I am trying to match all lines in file 1 with the right line in file 2, and then print the full line from file 2. The first column in file 1 corresponds to the first and third column in file 2. File 1 has a colon as separator whereas file 2 has tab.

I tried to come up with a grep solution that would not require reformatting of files. I tried to come up with some easy manipulation of File 1 or file 2 to make the matching easier. Both failed. Any suggestions would be more than welcome.

Edit: Sorry if I made the first example look confusing. There are matches. Please consider the below example instead:

File 1

chr1:17373
chr1:17375
chr1:17398
chr1:17407

~100,000 rows

File 2

chr1    17372    17373    rs750111615
chr1    17374    17375    rs755771866
chr1    17378    17379    rs754322362
chr1    17384    17385    rs201535981
chr1    17395    17398    rs200784459
chr1    17405    17406    rs772228657
chr1    17405    17407    rs372841554

~15 M rows

The result I need is

chr1:17373 rs750111615
chr1:17375    rs755771866
chr1:17398    rs200784459
chr1:17407    rs372841554

Again, help would be much appreciated.

vgersh99 · April 26, 2018, 5:53pm

hmmm..... this is a bit confusing:

The first column in file 1 corresponds to the first and third column in file 2

.
I don't see the correlation in your sample files...
I don't follow - can you elaborate?
And maybe provide a desired result based on the samples provided.

RudiC · April 27, 2018, 4:17pm

Is there any match in your samples?

andmal · April 28, 2018, 3:49pm

Sorry if I made it look confusing. There are matches. PLease consider the below example instead:

File 1

chr1:17373
chr1:17375
chr1:17398
chr1:17407

~100,000 rows

File 2

chr1    17372    17373    rs750111615
chr1    17374    17375    rs755771866
chr1    17378    17379    rs754322362
chr1    17384    17385    rs201535981
chr1    17395    17398    rs200784459
chr1    17405    17406    rs772228657
chr1    17405    17407    rs372841554

~15 M rows

The result I need is

chr1:17373 rs750111615
chr1:17375    rs755771866
chr1:17398    rs200784459
chr1:17407    rs372841554

Again, help would be much appreciated.

RudiC · April 28, 2018, 6:55pm

Still not unambiguous. Try

awk '
NR == FNR       {T[$2] = $0
                 next
                }
($2 in T)        {print T[$2], $NF
                }
($3 in T)        {print T[$3], $NF
                }
' FS=: file1 FS="\t" file2
chr1:17373 rs750111615
chr1:17375 rs755771866
chr1:17398 rs200784459
chr1:17407 rs372841554

If you want the range $2 - $3 checked, it would become more complex and time consuming for that large files.

andmal · April 29, 2018, 3:26pm

Thanks -this produces exactly what I was looking for. Next, I'll try and understand the code and customize to other similar data -you've given me a great starting point.
Example output below:

1:40370176:G_GT    g    gt    -0.0103    0.0222    0.6434    ???????+??+- rs564192510
19:4197562:A_AGAAT    a    agaat    0.1321    0.0121    1.019e-27    ?++++??+++-+ rs79305507
19:4210401:C_CA    ca    c    0.2934    0.0137    3.00e-101    ?++++??+++++ rs6683453
19:4259184:T_TTC    t    ttc    0.5397    0.0186    3.27e-185    ++++????++?? rs12092368
19:4268341:A_G    a    g    -0.6392    0.0106    1.95e-798    ----?------- rs12090408
4:38765867:A_G    a    g    -0.0641    0.0116    3.516e-08    ------------ rs1778050