Perl to update a field in a file based on a match to another file

cmccabe · July 7, 2017, 2:07pm

In the Perl below I am trying to set/update the value of $14 (the last field) in file2 by matching the NM_ value in $2 of file1 against the NM_ entries in $12 or $9 of file2.
The lengths of $9 and $12 can vary, but each entry always starts with NM_ and always ends at a ; (semicolon) or at the end of the field (if it is the last entry).
The value in $7 determines the field to use: if $7 is exonic, then $12 is used to extract from. If $7 is not exonic, then $9 is used to extract from.
There will always be a value in $7, and it is exonic the majority of the time, but not always.
This is what seems to be happening in the code:
The NM_ value of $2 in file1, after splitting on the . , is matched as a substring of $12 (the majority of the time) or $9 (in some cases).
The entry that matches is extracted starting from the NM_ up to the ; or the end of the field (if it is the last value, like in line 2 in the example).
The extracted value is used to update $14 (last field) from a . to the extracted value.
My questions are:
1. Why does the Sanger column header in $14 (last field) get removed? Does the header row need to be skipped?
2. Why does the rs3841266 after the . in line 1 get removed?
3. Since the last field in line 1 is empty, a . (dot) should result there.
I cannot seem to add these 3 things to the script to get the desired output. Thank you :).
file1 (space-delimited)

ATP13A2 NM_022089.3
PPT1 NM_000310.3
ISG15 NM_005101.3

file2 (tab-delimited)

R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger 
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 .
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 .

current file2 after the Perl script is executed (tab-delimited): the rs3841266 after the . in line 1 is removed, Sanger is removed from the header as the last column,
and since the last field in line 1 is empty, a . should result there

R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 
1 chr1 948846 948846 - A upstream ISG15 . . . . 
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G

desired output of file2 after the script is executed (tab-delimited)

R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger 
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266 . 
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G

perl

perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt
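As far as I can tell, the BEGIN block turns file1 into a lookup hash keyed on the NM_ value without its version. A minimal sketch of just that step, with the file1 lines hard-coded in an array (@file1 is only for illustration; the one-liner reads them from STDIN):

```perl
use strict;
use warnings;

# Lines of file1 (space-delimited), as shown above.
my @file1 = ("ATP13A2 NM_022089.3\n", "PPT1 NM_000310.3\n", "ISG15 NM_005101.3\n");

# Splitting on whitespace or a dot gives (gene, accession, version);
# the slice [1,0] reorders that to (accession, gene), i.e. key => value.
my %m = map { chomp; (split /[\s.]/)[1, 0] } @file1;

# Show the resulting key => value pairs.
print "$_ => $m{$_}\n" for sort keys %m;
```

The keys are what later get matched as substrings of $12 or $9 in file2.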

durden_tyler · July 7, 2017, 3:54pm

The "Sanger" column header is removed because of the "pop @F" in your code. It is the statement just before the final "print" in the one-liner below.

perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt

Here's the documentation of the "pop" function: pop - perldoc.perl.org

On whether the header row needs to be skipped: skipping the header will retain the "Sanger" column header.
The "pop" will then still remove the last column from the remaining rows.

The rs3841266 in line 1 is removed for the same reason the "Sanger" column header gets removed: the "pop" function.

On the last field in line 1 being empty: I did not understand this statement.
The last field in line 1 of "file2.txt" is "Sanger". It is not empty.
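To see the effect of "pop" in isolation, here is a minimal sketch using the header and the first data row of "file2.txt", already split into fields. The array names are only for illustration, and the sketch assumes the first data row splits into 13 fields because its trailing Sanger field is empty.

```perl
use strict;
use warnings;

# Header row: 14 fields, the last one is "Sanger".
my @header = qw(R_Index Chr Start End Ref Alt Func.refGene Gene.refGene
                GeneDetail.refGene Inheritence ExonicFunc.refGene
                AAChange.refGene avsnp147 Sanger);

# First data row: only 13 fields, so the last one is the avsnp147 value.
my @row1 = qw(1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266);

# pop removes and returns the last element, whatever it happens to be.
my $from_header = pop @header;
my $from_row1   = pop @row1;

print "header lost: $from_header\n";
print "row 1 lost:  $from_row1\n";
```

In both cases the element that "pop" removes is data you wanted to keep, not a placeholder dot.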

cmccabe · July 7, 2017, 4:22pm

I apologize, I meant line 1 after the header: if the last field is blank, then a . (dot) should result.
R_Index 1 will always be the first line with data in it, as the header row does not get an index. Thank you very much, that helps with questions 1 and 2.

durden_tyler · July 7, 2017, 10:41pm

For line # 2 of "file2.txt", this:

$F[$F[6]=~/exonic/?11:8]

returns $F[8] which is "."

However, this:

grep {$x=~/$_/} keys %m

does not return anything because none of the keys of hash %m (shown below)

'NM_005101'
'NM_000310'
'NM_022089'

exist in the string "."

Therefore the variable $r is undefined, which prints as an empty string.
And hence, this:

print join("\t",@F,$r);

appends only a tab and an empty value after the fields of @F for line # 2 of "file2.txt", so no . (dot) shows up in the last column.
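One way to get the desired output, as a minimal sketch rather than a drop-in replacement for the one-liner: leave the header row alone, keep the existing fields, default the result to a dot, and assign to field 14 directly instead of using "pop" and appending. The helper name update_line and the hard-coded %m are assumptions for illustration; in the real script %m would still be built from "file1.txt". The sketch also assumes every data row has at least 13 fields.

```perl
use strict;
use warnings;

# Accession (without version) => gene, as built from file1.txt.
my %m = (NM_022089 => 'ATP13A2', NM_000310 => 'PPT1', NM_005101 => 'ISG15');

# Hypothetical helper: returns the tab-delimited line with field 14 filled in.
sub update_line {
    my ($line, $map) = @_;
    my @F = split /\t/, $line, -1;        # -1 keeps trailing empty fields
    return $line if $F[0] eq 'R_Index';   # leave the header row untouched

    # Use $12 for exonic rows and $9 otherwise, as in the one-liner.
    my $src = $F[6] =~ /exonic/ ? $F[11] : $F[8];

    my $r = '.';                          # default when nothing matches
    for my $entry (split /;/, $src) {
        # Keep the first entry that contains one of the known accessions.
        next unless grep { $entry =~ /\Q$_\E\b/ } keys %$map;
        ($r) = $entry =~ /(NM_.*)$/;      # from NM_ to the end of the entry
        last;
    }

    $F[13] = $r;                          # set field 14; nothing is popped
    return join "\t", @F[0 .. 13];
}

# Rows 1 and 3 of file2.txt, rebuilt here as tab-delimited strings.
my $row1 = join "\t", qw(1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266);
my $row3 = join "\t", qw(3 chr1 40562993 40562993 T C UTR5 PPT1
                         NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 .);

print update_line($_, \%m), "\n" for $row1, $row3;
```

Because field 14 is assigned by index, a row that already has a dot there and a row whose last field is empty are handled the same way, and the avsnp147 value in field 13 is never touched.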

cmccabe · July 13, 2017, 7:44pm

Thank you very much:).