Awk - Indexing a list of numbers in file2 to print certain rows in file1

Geneanalyst · October 30, 2018, 12:37pm

Hi

Does anyone know of an efficient way to look up each value from a column of data in file2 and print the corresponding row in file1, along with the 30 rows before and after that row?

For example suppose you have a list of numbers in file2 (single column) as follows:

rs25678
rs25679
rs25680
rs25681
rs25682
rs25683

file1:

2    9658    rs25681    G    G    GT1=0.20;GT2=0.65;GT3=0.75
2    4258    rs25679    A    G    GT1=0.20;GT2=0.65;GT3=0.76
2    4258    rs25680    T    T    GT1=0.20;GT2=0.65;GT3=0.77

I would like every row in file1 that matches an entry in file2 to be printed, along with the 30 rows before and after it.

Desired output:

.    .    .    .    .    .
2    9658    rs25681    G    G    GT1=0.20;GT2=0.65;GT3=0.75
.    .    .    .    .    .
.    .    .    .    .    .
2    4258    rs25679    A    G    GT1=0.20;GT2=0.65;GT3=0.76
.    .    .    .    .    .
.    .    .    .    .    .
2    4258    rs25680    T    T    GT1=0.20;GT2=0.65;GT3=0.77
.    .    .    .    .    .
.    .    .    .    .    .

The dots stand for the 30 rows of data before and after each targeted row, which are printed along with the targeted rows listed in file2.

Thanks...

vgersh99 · October 30, 2018, 1:22pm

something along these lines:
default 2 lines before/after
awk -f gene.awk file2.txt file1.txt
or 30 lines before/after
awk -v ba=30 -f gene.awk file2.txt file1.txt
where gene.awk is:

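BEGIN {
  if (!ba) ba = 2          # default window: 2 lines before/after
}
FNR == NR {                # first file (file2): collect the IDs to look up
  f2[$1]
  next
}
{                          # second file (file1): keep every line, note the matches
  f1all[FNR] = $0
  if ($3 in f2) {
    f1pat[$3] = FNR
    f1order[++order] = $3
  }
}
END {
  for (i = 1; i <= order; i++)
    for (j = f1pat[f1order[i]] - ba; j <= f1pat[f1order[i]] + ba; j++)
      if (j in f1all) print f1all[j]   # skip window positions outside the file
}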

Or depending on your OS/version of grep you could do (for 2 lines before/after):
grep -A 2 -B 2 -F -f file2.txt file1.txt
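
One caveat with the grep route: with -F -f each pattern is matched as a substring anywhere on the line, so a shorter ID such as rs2567 would also hit rs25678. A minimal sketch of the difference, assuming a grep that supports -w (GNU and BSD grep do) and using hypothetical sample files demo_ids.txt and demo_data.txt in place of file2.txt and file1.txt:

```shell
# Hypothetical sample data standing in for file2.txt (IDs) and file1.txt (rows).
printf '%s\n' 'rs2567' > demo_ids.txt
printf '2\t100\t%s\tA\tG\n' rs25678 rs2567 rs99999 > demo_data.txt

# Substring matching: rs2567 also hits the rs25678 row.
loose=$(grep -c -F -f demo_ids.txt demo_data.txt)

# Whole-word matching: only the row with the exact ID counts.
strict=$(grep -c -w -F -f demo_ids.txt demo_data.txt)

echo "loose=$loose strict=$strict"
```

Note that -w still matches the ID in any column, not just the third, so the awk version remains the stricter test.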

Geneanalyst · October 30, 2018, 9:59pm

Works great! Initially it was outputting 80 million rows, but that was my bad because a "." had made its way into the column of data in file2.

Chubler_XL · October 30, 2018, 11:17pm

With vgersh99's solution, if matching lines are less than 30 lines apart, some lines are printed multiple times (overlapping regions).

Try this modification:

BEGIN {
  if (!ba) ba = 2
}
FNR == NR {
  f2[$1]
  next
}
{
  f1all[FNR] = $0
  if ($3 in f2)
    for (i = FNR - ba; i <= FNR + ba; i++) prn[i]   # mark each line in the window once
}
END {
  for (i = 1; i <= FNR; i++)
    if (i in prn) print f1all[i]
}
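
If holding all of file1 in memory is a concern, the same mark-then-print idea can probably be done by reading file1 twice and remembering only line numbers. This is a rough sketch, not a tested replacement for the script above: the pass counter and the demo_ids.txt / demo_data.txt files are made up for illustration, and it assumes none of the input files is empty (an empty file never triggers FNR == 1).

```shell
# Hypothetical sample data: 8 rows with IDs rs1..rs8, and two adjacent targets.
printf '%s\n' rs3 rs4 > demo_ids.txt
awk 'BEGIN { for (i = 1; i <= 8; i++) printf "2\t%d\trs%d\tA\tG\n", i * 100, i }' > demo_data.txt

# The data file is named twice on purpose: once to mark, once to print.
out=$(awk -v ba=1 '
  BEGIN { if (!ba) ba = 2 }
  FNR == 1 { pass++ }               # a new input file has started
  pass == 1 { ids[$1]; next }       # ID list: remember the wanted IDs
  pass == 2 {                       # first read of the data: mark window line numbers
    if ($3 in ids)
      for (i = FNR - ba; i <= FNR + ba; i++) keep[i]
    next
  }
  (FNR in keep)                     # second read: print each marked line once
' demo_ids.txt demo_data.txt demo_data.txt)

echo "$out"
```

With ba=1 the windows around rs3 and rs4 overlap, and the rows for rs2 through rs5 come out once each. The trade-off is a second read of file1 in exchange for not storing its lines.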

Geneanalyst · October 31, 2018, 12:13am

Works great! Initially it was outputting 40 million rows, but that was my bad because a "." had made its way into the column of data in file2, and file1 had many rows for which $3 was a ".".
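
For anyone hitting the same thing, a guard in the rule that loads file2 should keep a stray placeholder from matching. A minimal sketch, assuming "." and blank lines are the only placeholders to ignore; the ids array name and the demo files are hypothetical:

```shell
# Hypothetical sample data: the ID list contains a stray "." and a blank line.
printf '%s\n' rs2 . '' > demo_ids.txt
printf '2\t100\t%s\tA\tG\n' rs1 . rs2 . rs3 > demo_data.txt

out=$(awk '
  # Loading the ID list: ignore blank lines and the "." placeholder.
  FNR == NR { if ($1 != "" && $1 != ".") ids[$1]; next }
  # Data file: print only rows whose third column is a wanted ID.
  ($3 in ids)
' demo_ids.txt demo_data.txt)

echo "$out"
```

Only the rs2 row is printed here; the two rows with "." in column 3 are left alone. The same FNR == NR rule can be swapped into gene.awk in place of the one that fills f2.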

Chubler_XL · October 31, 2018, 2:07am

Are you sure you have the code as posted? The solution I presented shouldn't be able to print more lines than there are in file1. I suspect you don't have an END { block and the for loop is executing for every line of file1.

Geneanalyst · October 31, 2018, 6:20am

Nothing wrong with the END block. I edited my post above to outline the problem. Sorry for the trouble.