In `file1` below, if `$9` and `$12` are both `.` (dot), then the value in `$8` of `file1` is used as a key (exact match) to look up `$2` in each line of `file2`. When a match is found, the value of `$4` in `file1` is used to look for a range match within +/- 50, using the coordinate ranges in `$4` and the fields after it in `file2`. The number of fields is variable, but the ranges always start at `$4`.
For example, `ISG15` has 2 coordinate fields, starting at `$4` and ending at `$5`. `CR2` has 19 coordinate fields, starting at `$4` and ending at `$22`. The value in `$1` of `file2` tells you how many coordinate fields there are, but the first one will always be in `$4`.
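To make the variable field layout concrete, here is a minimal sketch of walking the ranges for one gene. The helper name `list_ranges` and the temp file are made up for illustration, and it assumes `file2` is space-delimited with nothing after the last range.

```shell
# Hypothetical helper: print field number, start and end of every range
# stored for one gene in file2 (ranges assumed to run from $4 to NF).
list_ranges() {   # usage: list_ranges GENE FILE2
    awk -v gene="$1" '
        $2 == gene {
            for (i = 4; i <= NF; i++) {
                split($i, r, "-")              # r[1] = start, r[2] = end
                printf "%d\t%s\t%s\n", i, r[1], r[2]
            }
        }' "$2"
}

# Sample line copied from file2 in this post
ranges_demo=$(mktemp)
cat > "$ranges_demo" <<'EOF'
2 ISG15 NM_005101.3 948846-948956 949363-949919
EOF

list_ranges ISG15 "$ranges_demo"
```

For `ISG15` this prints two rows, one for `$4` and one for `$5`. Looping to `NF` rather than to `$1 + 3` is a choice made here for simplicity; either should cover the same fields if `$1` is accurate.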
There will only be one range match. If the number is closer to the first value of the range (before the `-` (hyphen)), then `$9` of `file1` is updated from a `.` (dot) to the numerical difference between the two numbers with a `-` (minus) in front. If the number is closer to the second value (after the `-` (hyphen)), then `$9` of `file1` is updated from a `.` (dot) to the numerical difference between the two numbers with a `+` (plus) in front. An exact match is written as `0`. However, if the calculated difference is greater than 50, then `>50` is printed in `$9` of `file1`.
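The rule for a single position and a single range could be sketched roughly like this. `signed_offset` is a made-up name, and the handling of a position that falls strictly inside a range (measured to its nearer edge) is an assumption, since that case does not come up in the sample data.

```shell
# Hypothetical helper: apply the offset rule to one position and one range.
# Assumption: a position inside the range is measured to its nearer edge.
signed_offset() {   # usage: signed_offset POS START-END
    awk -v pos="$1" -v range="$2" 'BEGIN {
        split(range, r, "-")
        ds = pos - r[1]; if (ds < 0) ds = -ds    # distance to the start
        de = pos - r[2]; if (de < 0) de = -de    # distance to the end
        if (ds <= de) { d = ds; sign = "-" }     # nearer the first value
        else          { d = de; sign = "+" }     # nearer the second value
        if (d > 50)       print ">50"
        else if (d == 0)  print 0
        else              print sign d
    }'
}

signed_offset 948846 948846-948956             # line 1 of file1
signed_offset 948840 948846-948956             # line 8 of file1
signed_offset 949925 949363-949919             # line 4 of file1
signed_offset 207646923 207647145-207647230    # line 5 of file1
```

With the values from the sample files, these four calls give `0`, `-6`, `+6` and `>50`, which agrees with `$9` in the desired output.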
If `$9` or `$12` of `file1` has a value other than `.` (dot), then that line is skipped (nothing happens) and the next line is processed. In `file1`, lines 2 and 3 are skipped. The `awk` below identifies the lines that are not skipped (both `$9` and `$12` are `.`) and prints them, but I am not sure how to do the rest and need some expert help. Thank you :).
file1 tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 . . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
3 chr1 949608 949608 G A exonic ISG15 . . nonsynonymous SNV ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4 chr1 949925 949925 C T downstream ISG15 . . . .
5 chr1 207646923 207646923 G A intronic CR2 . . . .
6 chr2 3653844 3653844 T C intronic COLEC11 . . . .
7 chr1 154562623 154562625 CCG - intronic ADAR . . . .
8 chr1 948840 948840 - C upstream ISG15 . . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
19 CR2 NM_001006658.2 207627644-207627821 207639870-207640257 207641871-207642060 207642144-207642244 207642494-207642577 207643039-207643447 207644084-207644261 207644341-207644432 207644767-207644844 207646116-207646524 207647145-207647230 207647585-207647668 207648168-207648561 207649578-207649764 207651229-207651415 207652601-207652625 207653322-207653398 207658808-207658917 207662486-207663240
6 COLEC11 NM_024027.4 3642421-3642758 3651904-3652060 3660900-3660972 3687867-3687921 3691033-3691129 3691316-3692234
15 ADAR NM_001111.4 154554533-154557519 154557692-154557820 154558228-154558341 154558656-154558839 154560600-154560734 154561026-154561149 154561844-154561938 154562232-154562404 154562737-154562885 154569280-154569414 154569598-154569743 154570303-154570452 154570877-154571061 154573516-154575102 154580467-154580724
desired updated file1
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . . .
3 chr1 949608 949608 G A exonic ISG15 . . nonsynonymous SNV ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
6 chr2 3653844 3653844 T C intronic COLEC11 >50 . . .
7 chr1 154562623 154562625 CCG - intronic ADAR >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
Description of updated file1
line1:file1 $9 updated to 0 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948846 is an exact match to the first coordinate in $4 before the -
line2:not updated, skipped because $9 or $12 in file1 have a value other than . in them
line3:not updated, skipped because $9 or $12 in file1 have a value other than . in them
line4:file1 $9 updated to +6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 949925 is a range match to the second coordinate in $5 after the -
line5:file1 $9 updated to >50 because CR2 is matched to line 2, $2 of file2 and the value in $4 of file1, 207646923 is a range match to the first coordinate in $14 before the - but the difference of 222 is > 50
line6:file1 $9 updated to >50 because COLEC11 is matched to line 3, $2 of file2 and the value in $4 of file1, 3653844 is a range match to the second coordinate in $5 after the - but the difference of 1784 is > 50
line7:file1 $9 updated to >50 because ADAR is matched to line 4, $2 of file2 and the value in $4 of file1, 154562625 is a range match to the first coordinate in $12 before the - but the difference of 112 is > 50
line8: file1 $9 updated to -6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948840 is a range match to the first coordinate in $4 before the -
awk
awk -F'\t' -v OFS='\t' '{if ($9=="." && $12==".") print }' file1
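One possible direction for the remaining steps, as a sketch only: read `file2` first and keep the ranges per gene, then for each unskipped line of `file1` take the nearest range edge across all ranges. The function name `update_file1`, the temp files, and the nearest-edge interpretation are assumptions; lines whose gene is not in `file2` are simply printed unchanged here.

```shell
# Sketch only: fill $9 of file1 using the ranges in file2.
update_file1() {   # usage: update_file1 FILE2 FILE1
    awk '
        NR == FNR {                              # first file: file2
            for (i = 4; i <= NF; i++)
                ranges[$2] = ranges[$2] " " $i   # all ranges for this gene
            next
        }
        # skipped lines (header included) and unknown genes pass through
        $9 != "." || $12 != "." || !($8 in ranges) { print; next }
        {
            n = split(ranges[$8], rg, " ")
            best = -1
            for (i = 1; i <= n; i++) {
                split(rg[i], r, "-")
                ds = $4 - r[1]; if (ds < 0) ds = -ds
                de = $4 - r[2]; if (de < 0) de = -de
                if (best < 0 || ds < best) { best = ds; sign = "-" }
                if (de < best)             { best = de; sign = "+" }
            }
            if (best > 50)       $9 = ">50"
            else if (best == 0)  $9 = 0
            else                 $9 = sign best
            print
        }
    ' "$1" FS='\t' OFS='\t' "$2"                 # tabs apply to file1 only
}

# Small samples copied from this post ("|" stands in for tab in file1)
f2=$(mktemp); f1=$(mktemp)
cat > "$f2" <<'EOF'
2 ISG15 NM_005101.3 948846-948956 949363-949919
6 COLEC11 NM_024027.4 3642421-3642758 3651904-3652060 3660900-3660972 3687867-3687921 3691033-3691129 3691316-3692234
EOF
tr '|' '\t' > "$f1" <<'EOF'
R_Index|Chr|Start|End|Ref|Alt|Func.IDP.refGene|Gene.IDP.refGene|GeneDetail.IDP.refGene|Inheritence|ExonicFunc.IDP.refGene|AAChange.IDP.refGene
1|chr1|948846|948846|-|A|upstream|ISG15|.|.|.|.
3|chr1|949608|949608|G|A|exonic|ISG15|.|.|nonsynonymous SNV|ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4|chr1|949925|949925|C|T|downstream|ISG15|.|.|.|.
6|chr2|3653844|3653844|T|C|intronic|COLEC11|.|.|.|.
8|chr1|948840|948840|-|C|upstream|ISG15|.|.|.|.
EOF

update_file1 "$f2" "$f1" | cut -f1,8,9
```

The `FS='\t' OFS='\t'` assignments sit between the two file names so that `file2` is still split on whitespace while `file1` is split and rewritten with tabs. On the sample rows above, `$9` comes out as `0`, `.`, `+6`, `>50` and `-6` for lines 1, 3, 4, 6 and 8.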