I am trying to use awk to match the NM_ identifier in file with $1 of id, which is tab-delimited. The NM_ identifier will always be on a line of file that starts with > and will come after the second _. When an NM_ identifier matches $1 of id, the value of $2 in id should replace the NM_ identifier. Each NM_ identifier may not be unique, as in the example below, but will always have a match in id.
After the third _ there is a digit (0, 1, 2, etc.). I am trying to add the word exon in front of that digit and add 1 to it. I am not sure if my awk attempt helps at all to address the first question. Thank you :).
file
>hg19_refGene_NM_001195684_0 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_NM_001195684_1 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_NM_001195684_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_3 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta
id
NM_001195684 TGFBR3
NM_001206389 FGF8
NM_001197220 PDE4D
NM_001195683 TGFBR3
desired output (the NM_ identifier is replaced with $2 of id because it matched $1 of id; the digit after the third _ is increased by one, so 0 becomes 1, and is prefixed with the word exon)
>hg19_refGene_TGFBR3_exon1 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_TGFBR3_exon2 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_TGFBR3_exon4 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta
awk
awk 'NR==FNR{a[$1];next} {k=$2; sub(/_.*/,"",k)} k in a' file id
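One thing that may be worth noting about the attempt above: it reads file first, so the array a ends up keyed on FASTA lines rather than on the NM_ identifiers in id. A minimal sketch of the opposite order (id first, then file) is below. It is only a sketch under some assumptions: the header prefix always has the shape >hg19_refGene_NM_<number>_<digit> (exactly five _-separated parts), the accession contains exactly one _, and the gene symbols in id contain no whitespace. The function name rename_headers and the array name gene are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: rename_headers ID_FILE FASTA_FILE
# Usage: rename_headers id file > renamed.fa
rename_headers() {
    awk '
    # First file (id): map accession ($1) to gene symbol ($2).
    NR == FNR { gene[$1] = $2; next }

    # Second file: only header lines starting with ">" are rewritten.
    /^>/ {
        # Assumed shape of $1: >hg19_refGene_NM_001195684_0
        # which splits on "_" into: >hg19, refGene, NM, 001195684, 0
        n = split($1, p, "_")
        acc = p[3] "_" p[4]
        if (n == 5 && (acc in gene)) {
            new = p[1] "_" p[2] "_" gene[acc] "_exon" (p[5] + 1)
            # Replace only the first field; the rest of the line is kept as-is.
            $0 = new substr($0, length($1) + 1)
        }
    }

    # Print every line, changed or not.
    { print }
    ' "$1" "$2"
}
```

Headers that do not fit the assumed five-part shape, or whose accession is not in id, are printed unchanged, so they are easy to spot afterwards. The header is rebuilt with substr rather than by assigning to $1, so the spacing in the rest of the line is left untouched.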