Match read IDs in file 1 against file 2

Hello everyone,

For each read ID in column 2 of File 1 (e.g. read ID ERR315389.743357), I want to retrieve the matching information from columns 2, 3 and 4 of File 2. File 1 has ~42k lines and File 2 has ~700k lines. The desired output is:

Count Read ID Sequence Exon Transcript ID
100 ERR315389.6445937        CTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGA 4 ENST00000267996

For background: I collapsed the redundant read IDs from File 2 with the uniq command and printed the count of each redundant read ID in the first column of File 1.

96 ERR315389.743357         GAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGG

# 96 means the read ID occurs 96 times in File 2.
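A minimal sketch of how such a count file could be produced, assuming File 2 is whitespace-separated with the read ID in column 1 and the sequence in column 4. The sample rows and the temporary file names are made up for illustration, and this is not necessarily the exact command that was used:

```shell
tmp=$(mktemp -d)

# Made-up stand-in for File 2: read ID, transcript ID, exon, sequence.
printf '%s\n' \
  'readA ENST1 4 ACGT' \
  'readA ENST2 3 ACGT' \
  'readB ENST1 2 TTGA' > "$tmp/file2"

# Keep read ID and sequence, then count identical lines.
# uniq -c only collapses adjacent duplicates, so sort first;
# the final sort puts the highest counts on top.
awk '{print $1, $4}' "$tmp/file2" | sort | uniq -c | sort -k1,1nr > "$tmp/file1"

cat "$tmp/file1"
```

The width of the leading padding that uniq -c puts in front of the count differs between implementations, which is why the later awk commands address the columns as $1, $2, ... instead of by fixed positions.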

File 1

Count Read ID Sequence
     96 ERR315389.743357         GAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGG
     96 ERR315389.5907790        TGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGC
     96 ERR315389.4298798        ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     96 ERR315389.422020         ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     96 ERR315389.2233748        ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     96 ERR315389.2069419        ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     92 ERR315389.6677500        AAGAGGCCAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGG
     92 ERR315389.4058303        GAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACG
     88 ERR315389.4648318        CATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGGGCTGAGCTCTCAG

File 2

Read ID Transcript ID Exon Sequence
ERR315389.3990366        ENST00000267996        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000288398        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000317516        3        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000334895        3        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000357980        5        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000358278        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000403994        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000404484        3        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000558264        2        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000558314        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA

Thank you for your response.

Not sure I understand your request, and having two sample files that don't match doesn't help either.

Anyhow, try

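awk     '# file1: store the count (column 1) under its read ID (column 2)
         FNR==NR        {C[$2]=$1;next}
         # file2, first line: replace the header with the new one
         FNR==1         {print "Count Read ID Sequence Exon Transcript ID"; next}
         # file2: for read IDs seen in file1, print count, ID, sequence, exon, transcript
         $1 in C        {print C[$1], $1, $4, $3, $2}
        ' file1 file2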
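To see what the command does, here is a small self-contained run on made-up data. The read IDs, transcript IDs and sequences are invented for illustration, and both files are assumed to start with a header line. FNR==NR is true only while awk reads the first file, so the first rule loads the counts and the remaining rules run on file2:

```shell
tmp=$(mktemp -d)

# file1: count, read ID, sequence (with header line).
printf '%s\n' \
  'Count ReadID Sequence' \
  '2 readA ACGT' \
  '1 readB TTGA' > "$tmp/file1"

# file2: read ID, transcript ID, exon, sequence (with header line).
printf '%s\n' \
  'ReadID TranscriptID Exon Sequence' \
  'readA ENST1 4 ACGT' \
  'readA ENST2 3 ACGT' \
  'readC ENST9 1 GGGG' > "$tmp/file2"

result=$(awk 'FNR==NR {C[$2]=$1; next}
     FNR==1  {print "Count Read ID Sequence Exon Transcript ID"; next}
     $1 in C {print C[$1], $1, $4, $3, $2}
    ' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
# Count Read ID Sequence Exon Transcript ID
# 2 readA ACGT 4 ENST1
# 2 readA ACGT 3 ENST2
```

A read that maps to several transcripts yields one output line per transcript, and reads that occur in only one of the two files (readB and readC here) produce no output.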

Hi RudiC, it's working now. Thank you so much!

Hi RudiC, with FASTA files, can I retrieve the read ID from file 2 and use it to replace the read name in file 1?

>trn_13 5570
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
>trn_1  12840
GTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGA
>trn_5  13064
AAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATT
>trn_10 6600
CTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGA
>trn_7  6890
CTTGGATCGAGCTGAGCAGGCGGAGGCCGACAAGAAGGCGGCGGAAGACAGGAGCAAGCAGCTGGAAGATGAGCTGGTGTCACTGCAAAAGAAACTCAAGG
>trn_39 6762
GAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCAT
>trn_6  7416
AAGAGATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTA
>trn_87 2210
AAGAAACTCAAGGGCACCGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGC
>trn_2  8632
>ERR315352.12390252_5250 5250
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
>ERR315352.11084391_5075 5075
CTGAAGCCGACGTAGCTTCTCTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAG
>ERR315352.13981086_4994 4994
GGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATA
>ERR315352.23465660_4888 4888
CCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATATGAGGAAGAG
>ERR315352.10301250_4862 4862
GCGGGCTGAGCTCTCAGAAGGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACT
>ERR315389.1015631_4669 4669
CTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGACGAGCTGTACGCTCAGAAACTGAAGTACAAA
>ERR315389.1003749_4576 4576
CCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGGGCTGAGCTCTCAGAAGGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTG

The desired output is:

>ERR315352.12390252_5250 5250
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
>ERR315352.11084391_5075 5075
CTGAAGCCGACGTAGCTTCTCTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAG
>ERR315352.13981086_4994 4994

That is, the read name >trn_13 5570 is changed to >ERR315352.12390252_5250 5250.

thanks!

How to match the entries? With the CGA... sequence?

Yes, by matching with the sequence, CGAAGATGAACTGGACA...

How about trying something on your own?

Nevertheless, try

 awk 'FNR==NR {if (/^>/) P=$0; else T[$0]=P; next} $0 in T {print T[$0]; print}' file2 file1
>ERR315352.12390252_5250 5250
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT

That seems to be the only fit in your sample data.
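The command above prints only the records whose sequence is found in both files. If the unmatched records should be kept with their original read name instead, a possible variant is sketched below. It assumes every sequence sits on a single line directly under its header, and the read names and sequences are made up for illustration:

```shell
tmp=$(mktemp -d)

# file1: records whose read names should be replaced.
printf '%s\n' \
  '>trn_1 10' 'ACGTACGT' \
  '>trn_2 20' 'TTTTGGGG' > "$tmp/file1"

# file2: records that carry the wanted read names.
printf '%s\n' \
  '>read_7 10' 'ACGTACGT' \
  '>read_9 30' 'CCCCAAAA' > "$tmp/file2"

result=$(awk '
    # file2: remember the header that belongs to each sequence
    FNR==NR {if (/^>/) P=$0; else T[$0]=P; next}
    # file1: hold the header back until its sequence has been seen
    /^>/    {H=$0; next}
    # file1: use the file2 header if the sequence is known, else keep the old one
            {if ($0 in T) print T[$0]; else print H; print}
    ' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
# >read_7 10
# ACGTACGT
# >trn_2 20
# TTTTGGGG
```

As in the original command, file2 is read first so that its sequences are already in memory when file1 is processed. If the same sequence occurs under several headers in file2, the last one wins.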


Thanks RudiC. I will try it myself first before asking questions. Thanks, it works.