Getting unique sequences from a multi-sequence FASTA file

Hi,

I have a FASTA file with multiple sequences. How can I get only the unique sequences from the file?

For example, my_file.fasta:

>seq1
TCTCAAAGAAAGCTGTGCTGCATACTGTACAAAACTTTGTCTGGAGAGATGGAGAATCTCATTGACTTTACAGGTGTGGACGGTCTTCAGAGATGGCTCAAGCTAACATTCCCTGACACACCTATAGGGAAAGAGCTAAC

>seq2
CAATTTTGGCTCCTTCATGTCTGTTGTGCCAGACTTGAGTGAGTTTGAACTGCAAGCAAGAAAGGCAGGTTCAGACCAAGAAAAAGATGCAATATACTCCAAGGCACTGATAGCAGCCACAAGAAAGGCGGCTCCTATTG

>seq3
CGGCCTGTGCATGGACATCAAGCAACGACATGGTGACAAAAGGGCTCAAGTGGTTCGAGGATCAGATAACAAAAGAGAATCCTAAATTTATCTCTTGGCACAAGGAGTATGAATTTTTCAAAAAGAATGTGCCCACAGTT

>seq1
TCTCAAAGAAAGCTGTGCTGCATACTGTACAAAACTTTGTCTGGAGAGATGGAGAATCTCATTGACTTTACAGGTGTGGACGGTCTTCAGAGATGGCTCAAGCTAACATTCCCTGACACACCTATAGGGAAAGAGCTAAC

>seq4
AAACTGCAACTCTCACAAGCAAAGTTGTGGCACAGTTCTCAGTTCCTGGGGTCTATGTTGTTGCTGTGCAAGATATGATCAAAGACATGGTTGCCAGAAGAGGTGGAGGGCCTAAACGCGGAGTCAGTGATGAACACATC

>seq1
TCTCAAAGAAAGCTGTGCTGCATACTGTACAAAACTTTGTCTGGAGAGATGGAGAATCTCATTGACTTTACAGGTGTGGACGGTCTTCAGAGATGGCTCAAGCTAACATTCCCTGACACACCTATAGGGAAAGAGCTAAC

>seq2
CAATTTTGGCTCCTTCATGTCTGTTGTGCCAGACTTGAGTGAGTTTGAACTGCAAGCAAGAAAGGCAGGTTCAGACCAAGAAAAAGATGCAATATACTCCAAGGCACTGATAGCAGCCACAAGAAAGGCGGCTCCTATTG

Note that there are three copies of seq1 and two copies of seq2. I want a new file that contains only one copy each of seq1, seq2, seq3 and seq4.

Thanks

awk '{ a[$0]} END{for (i in a) print i ORS}' RS= my_file.fasta

or better yet, since this one keeps the records in their original order:

awk '!a[$0]++' RS= ORS='\n\n' my_file.fasta
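
For anyone wondering why this works: an empty RS puts awk in paragraph mode, so each blank-line-separated block (header plus sequence) becomes one record in $0, and !a[$0]++ is true only the first time a given record is seen. Both one-liners therefore assume a blank line after every record and that duplicate copies are wrapped identically. If either assumption might not hold, a minimal sketch along the following lines could be used instead. Note that fasta_unique is a hypothetical helper name, not part of awk or any toolkit, and it treats two records as duplicates only when both the header and the joined sequence match.

```shell
# Hypothetical helper (the name is an assumption): print each distinct
# FASTA record once, in first-seen order. It splits records on ">" header
# lines, so it does not depend on blank lines between records.
fasta_unique() {
    awk '
        # Print the current record unless this header+sequence pair was seen.
        function emit() {
            if (!seen[hdr SUBSEP seq]++) print hdr "\n" seq "\n"
        }
        /^>/ { if (hdr != "") emit(); hdr = $0; seq = ""; next }
        NF   { seq = seq $0 }    # join wrapped sequence lines, skip blanks
        END  { if (hdr != "") emit() }
    ' "$@"
}
```

It could be called as fasta_unique my_file.fasta > unique.fasta, where unique.fasta is just an example output name. The output keeps the blank line after each record, like the input shown in the question.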

Thanks, vgersh99. The code worked perfectly.

I'd suggest following this user for similar scientific threads, like this one.