In the perl below I am trying to set/update the value of $14 (the last field) in file2 by matching the NM_ in $12 or $9 of file2 against the NM_ in $2 of file1.
The lengths of $9 and $12 can vary, but the start pattern is always NM_ and the end pattern is always ; (semicolon) or a break (if it is the last entry).
What is extracted into $14 (the last field) is all the text from start to end (the string from the NM_ up to the ; or break). The value in $7 determines the field to use: if $7 is exonic, then $12 is used to extract from; if $7 is not exonic, then $9 is used to extract from. There will always be a value in $7, and it is exonic the majority of the time, but not always.
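To make the rule above concrete, here is a minimal sketch of it as a standalone subroutine. The name extract_nm is hypothetical and not part of my one-liner; it also tests $7 with an exact string comparison, which is an assumption (the one-liner uses a regex match on exonic instead).

```perl
use strict;
use warnings;

# Hypothetical helper: given the tab-split fields of one file2 row and a
# versionless NM_ id from file1, return the matching entry, or "." if none.
sub extract_nm {
    my ($fields, $nm) = @_;
    # $7 (index 6) picks the source column: $12 (index 11) if exonic, else $9 (index 8)
    my $src = $fields->[6] eq 'exonic' ? $fields->[11] : $fields->[8];
    for my $entry (split /;/, $src) {
        # keep the text from the NM_ id to the end of this ;-separated entry;
        # (?![0-9]) stops NM_000310 from matching a longer id such as NM_0003101
        return $1 if $entry =~ /(\Q$nm\E(?![0-9]).*)$/;
    }
    return '.';
}

# Row 3 of the example file2 (not exonic, so $9 is searched)
my @row = (qw(3 chr1 40562993 40562993 T C UTR5 PPT1),
           'NM_001142604:c.-83A>G;NM_000310:c.-83A>G',
           qw(. . . rs6600313 .));
print extract_nm(\@row, 'NM_000310'), "\n";   # NM_000310:c.-83A>G
```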
The following seems to be happening in this code:
The NM_ value of $2 in file1, after splitting on the ., will match a substring NM_ in $12 (the majority of the time) or $9 (in some cases). The substring that matches is extracted starting from the NM_ until the ; or break (if it is the last value, as in line 2 of the example).
The text in $7 of file2 determines the field to extract from: if $7=exonic, then use $12, but if $7 is not exonic, then use $9. The extracted value is used to update $14 (the last field) from a . to the extracted value.
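As a small illustration of the splitting step, this sketch runs the same split used in the BEGIN block of my one-liner on the three file1 lines from this post (hard-coded here in a hypothetical @file1 array instead of being read from STDIN) and prints the resulting lookup.

```perl
use strict;
use warnings;

# The three file1 lines from this post, hard-coded for illustration
my @file1 = ("ATP13A2 NM_022089.3", "PPT1 NM_000310.3", "ISG15 NM_005101.3");

# Splitting on whitespace or '.' yields (gene, NM id, version);
# the slice [1,0] turns that into the pair NM id => gene.
my %m = map { (split /[\s.]/)[1, 0] } @file1;

print "$_ => $m{$_}\n" for sort keys %m;
# NM_000310 => PPT1
# NM_005101 => ISG15
# NM_022089 => ATP13A2
```

So the keys are the NM_ ids without the version suffix, which is what gets matched against $12 or $9.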
My questions are: why does the Sanger column header in $14 (the last field) get removed (does the header row need to be skipped?), and why does the rs3841266 after the . in line 1 get removed? Also, since the last field in line 1 is empty, a . (dot) should result.
I cannot seem to add these 3 things to the script to get the desired output. Thank you :).
file1 (space-delimited)
ATP13A2 NM_022089.3
PPT1 NM_000310.3
ISG15 NM_005101.3
file2 (tab-delimited)
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 .
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 .
current file2 after the perl script is executed (tab-delimited)
--- the rs3841266 after the . in line 1 is removed, Sanger is removed from the last field as the column header,
and since the last field in line 1 is empty, a . should result ---
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147
1 chr1 948846 948846 - A upstream ISG15 . . . .
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G
desired output of file2 after the script is executed (tab-delimited)
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266 .
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G
perl
perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt
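One thing that may be relevant, shown as a sketch and not a confirmed diagnosis: Perl's split drops trailing empty fields unless it is given a negative limit, and the -a autosplit does not pass one. The row below is line 1 of file2, rebuilt under the assumption that it really ends with a tab followed by an empty $14.

```perl
use strict;
use warnings;

# Line 1 of file2, assuming the 14th field is present but empty
my $row = join "\t", qw(1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266), '';

my @default = split /\t/, $row;       # trailing empty field is dropped
my @keep    = split /\t/, $row, -1;   # a negative limit keeps it

printf "default:  %d fields, last = '%s'\n", scalar @default, $default[-1];
printf "limit -1: %d fields, last = '%s'\n", scalar @keep,    $keep[-1];
# default:  13 fields, last = 'rs3841266'
# limit -1: 14 fields, last = ''
```

If that assumption holds, pop @F would remove rs3841266 (the 13th field) on that row, and on the header row it would remove Sanger while $r stays empty because nothing matched. Assigning to $F[13] directly, defaulting $r to ., and printing the line unchanged when $. == 1 might be ways around this, but I have not tested that against the real file.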