Hi, I have a FASTA file like this:
>contig00003 length=363 numreads=45 gene=isogroup00001 status=it_thresh
GATTTTTTACCCTGGGAGTGAGGAGGACGAGGTTGAGGATGAAGAAAAGAGAAAGATGAAGAGGTTGAGGATGTT
GTAGTCGGCGGTGGAATTAGGGGGAGCCGGCGAGCCCAAGTATTTTGCAGAGGTGTCTTCATCATCCAAACAACA
CGAGAGGGTGCAATTTGGTCTCTGCGTTGTTATAGATCCAAAGTTTTTGGACCCTGTTTGGCATCGTGTATCAAGTA
TTGGTTACACAGTCTATATTTTCAAGAACGAGACTGTGAAAGCTGTAAGCAACTTTTTATTtATCTATTTATTTTTATG
CTATAGCTTAtattaaactta
>contig00010 length=760 numreads=49 gene=isogroup00001 status=it_thresh
TCAAAGTTTTAGGTTCCAATTTGTATGGCTCAACTTAAGAAGTTTGTTGTAAAAAaGGAAATTCTTTCTGATCTATTA
GGGGCAGAAGTGCCACAATATATGAAGTTGAGAAATTAAaTAAAGTAATCATAGTACATTGTCTCGTTTGGATAGAC
GTAGGCTCTCAaGAAAAAAaGTTCTCATAGTTCTTGATGATGTGGATGATTTAGTGCGGCAAGTAGAACCTGGTCAA
GGGAGTAGAATAATTATGACAAGCAGAGATAGACAATTGAGTGAAGCTCTCTGCCTGTTTTGCAAGCATGCCTTCAA
GCGACAATTTCTAAGAACAGGATATTTAATAAGGCAATTGATCATGCTCAGGG
>G383C4U02H6B5W length=257
CCGGGCTCCCCATCTTCTCTATCTCTTGTGTGATTGTTGCAGAATACATCAAAGACTTGGGGTTGAGAGAGACAGCA
TCATAAACCTGATCACGGAAGCCCCTTTGAAGCATGCAGTCCACCTCATCTAGCCTTGTTGTCGTTGAAATAGTCCAT
CTGCCATCTTTAAATACATGCGCAACATAATGCCCGCATTGCGTATCTAGTCCAATATCATGCTTTATTAAAAAGATCA
ATAAGCCTTCCTGGAGTCCCCACAATCAAGTTCCAACTCCTTGCTGAAATGCGGTAGAGTTGTCCAGCCATCA
>G383C4U02IH1AO length=105
TTCAAGGAACTTTCATCCATCCAATGATCTAACCAATTTGAACCTAGTTTTGATTCATCTCTGAAGTTCGAATTTGAAC
CACATTCTTAAGAATTGAGGGCCCATCAAATTTAGTACTATAATCATGAAGTAGGTGATCCTCTCTTGTCACTCTTTTC
ATCATCAGCAAGATGACTTCTCATTGGAATGCTACCATGCTTGTTCCAAAA
.....
How can I remove the additional information from each sequence header and get a file like this:
>contig00003
GATTTTTTACCCTGGGAGTGAGGAGGACGAGGTTGAGGATGAAGAAAAGAGAAAGATGAAGAGGTTGAGGATGTT
GTAGTCGGCGGTGGAATTAGGGGGAGCCGGCGAGCCCAAGTATTTTGCAGAGGTGTCTTCATCATCCAAACAACA
CGAGAGGGTGCAATTTGGTCTCTGCGTTGTTATAGATCCAAAGTTTTTGGACCCTGTTTGGCATCGTGTATCAAGTA
TTGGTTACACAGTCTATATTTTCAAGAACGAGACTGTGAAAGCTGTAAGCAACTTTTTATTtATCTATTTATTTTTATG
CTATAGCTTAtattaaactta
>contig00010
TCAAAGTTTTAGGTTCCAATTTGTATGGCTCAACTTAAGAAGTTTGTTGTAAAAAaGGAAATTCTTTCTGATCTATTA
GGGGCAGAAGTGCCACAATATATGAAGTTGAGAAATTAAaTAAAGTAATCATAGTACATTGTCTCGTTTGGATAGAC
GTAGGCTCTCAaGAAAAAAaGTTCTCATAGTTCTTGATGATGTGGATGATTTAGTGCGGCAAGTAGAACCTGGTCAA
GGGAGTAGAATAATTATGACAAGCAGAGATAGACAATTGAGTGAAGCTCTCTGCCTGTTTTGCAAGCATGCCTTCAA
GCGACAATTTCTAAGAACAGGATATTTAATAAGGCAATTGATCATGCTCAGGG
>G383C4U02H6B5W
CCGGGCTCCCCATCTTCTCTATCTCTTGTGTGATTGTTGCAGAATACATCAAAGACTTGGGGTTGAGAGAGACAGCA
TCATAAACCTGATCACGGAAGCCCCTTTGAAGCATGCAGTCCACCTCATCTAGCCTTGTTGTCGTTGAAATAGTCCAT
CTGCCATCTTTAAATACATGCGCAACATAATGCCCGCATTGCGTATCTAGTCCAATATCATGCTTTATTAAAAAGATCA
ATAAGCCTTCCTGGAGTCCCCACAATCAAGTTCCAACTCCTTGCTGAAATGCGGTAGAGTTGTCCAGCCATCA
>G383C4U02IH1AO
TTCAAGGAACTTTCATCCATCCAATGATCTAACCAATTTGAACCTAGTTTTGATTCATCTCTGAAGTTCGAATTTGAAC
CACATTCTTAAGAATTGAGGGCCCATCAAATTTAGTACTATAATCATGAAGTAGGTGATCCTCTCTTGTCACTCTTTTC
ATCATCAGCAAGATGACTTCTCATTGGAATGCTACCATGCTTGTTCCAAAA
.....
Thanks
awk '/^>/ { NF=1 } 1' inputfile > outputfile
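For anyone unfamiliar with the idiom: on header lines (those starting with `>`), setting NF to 1 makes awk rebuild the line from its first whitespace-separated field only, and the trailing `1` is an always-true pattern that prints every line. Below is a minimal sketch on a tiny sample; the file names `sample.fa` and `cleaned.fa` and the two records are made up for illustration. Note that gawk and mawk rebuild the line when NF is lowered, but some older awk versions reportedly do not, so check the output on your system.

```shell
# Build a tiny sample FASTA file (made-up data, for illustration only).
printf '%s\n' \
  '>seq1 length=8 numreads=2' \
  'ACGTACGT' \
  '>seq2 length=4' \
  'TTGA' > sample.fa

# On header lines keep only the first whitespace-separated field;
# the trailing 1 prints every line, whether it was modified or not.
awk '/^>/ { NF=1 } 1' sample.fa > cleaned.fa

# Show the result: headers are reduced to the ID, sequences are untouched.
cat cleaned.fa
```

Sequence lines contain no spaces, so they pass through unchanged either way; the `/^>/` test just makes the intent explicit.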
Thanks, that works perfectly.
Hello the_simpsons,
The following may also help.
awk '/^>/ {$0=$1} 1' filename
Output will be as follows.
>contig00003
GATTTTTTACCCTGGGAGTGAGGAGGACGAGGTTGAGGATGAAGAAAAGAGAAAGATGAAGAGGTTGAGGATGTT
GTAGTCGGCGGTGGAATTAGGGGGAGCCGGCGAGCCCAAGTATTTTGCAGAGGTGTCTTCATCATCCAAACAACA
CGAGAGGGTGCAATTTGGTCTCTGCGTTGTTATAGATCCAAAGTTTTTGGACCCTGTTTGGCATCGTGTATCAAGTA
TTGGTTACACAGTCTATATTTTCAAGAACGAGACTGTGAAAGCTGTAAGCAACTTTTTATTtATCTATTTATTTTTATG
CTATAGCTTAtattaaactta
>contig00010
TCAAAGTTTTAGGTTCCAATTTGTATGGCTCAACTTAAGAAGTTTGTTGTAAAAAaGGAAATTCTTTCTGATCTATTA
GGGGCAGAAGTGCCACAATATATGAAGTTGAGAAATTAAaTAAAGTAATCATAGTACATTGTCTCGTTTGGATAGAC
GTAGGCTCTCAaGAAAAAAaGTTCTCATAGTTCTTGATGATGTGGATGATTTAGTGCGGCAAGTAGAACCTGGTCAA
GGGAGTAGAATAATTATGACAAGCAGAGATAGACAATTGAGTGAAGCTCTCTGCCTGTTTTGCAAGCATGCCTTCAA
GCGACAATTTCTAAGAACAGGATATTTAATAAGGCAATTGATCATGCTCAGGG
>G383C4U02H6B5W
CCGGGCTCCCCATCTTCTCTATCTCTTGTGTGATTGTTGCAGAATACATCAAAGACTTGGGGTTGAGAGAGACAGCA
TCATAAACCTGATCACGGAAGCCCCTTTGAAGCATGCAGTCCACCTCATCTAGCCTTGTTGTCGTTGAAATAGTCCAT
CTGCCATCTTTAAATACATGCGCAACATAATGCCCGCATTGCGTATCTAGTCCAATATCATGCTTTATTAAAAAGATCA
ATAAGCCTTCCTGGAGTCCCCACAATCAAGTTCCAACTCCTTGCTGAAATGCGGTAGAGTTGTCCAGCCATCA
>G383C4U02IH1AO
TTCAAGGAACTTTCATCCATCCAATGATCTAACCAATTTGAACCTAGTTTTGATTCATCTCTGAAGTTCGAATTTGAAC
CACATTCTTAAGAATTGAGGGCCCATCAAATTTAGTACTATAATCATGAAGTAGGTGATCCTCTCTTGTCACTCTTTTC
ATCATCAGCAAGATGACTTCTCATTGGAATGCTACCATGCTTGTTCCAAAA
EDIT: Adding one more solution for the same problem.
[singh@localhost awk_programming]$ awk '/^>/ {print $1} !/^>/ {print $0}' filename
>contig00003
GATTTTTTACCCTGGGAGTGAGGAGGACGAGGTTGAGGATGAAGAAAAGAGAAAGATGAAGAGGTTGAGGATGTT
GTAGTCGGCGGTGGAATTAGGGGGAGCCGGCGAGCCCAAGTATTTTGCAGAGGTGTCTTCATCATCCAAACAACA
CGAGAGGGTGCAATTTGGTCTCTGCGTTGTTATAGATCCAAAGTTTTTGGACCCTGTTTGGCATCGTGTATCAAGTA
TTGGTTACACAGTCTATATTTTCAAGAACGAGACTGTGAAAGCTGTAAGCAACTTTTTATTtATCTATTTATTTTTATG
CTATAGCTTAtattaaactta
>contig00010
TCAAAGTTTTAGGTTCCAATTTGTATGGCTCAACTTAAGAAGTTTGTTGTAAAAAaGGAAATTCTTTCTGATCTATTA
GGGGCAGAAGTGCCACAATATATGAAGTTGAGAAATTAAaTAAAGTAATCATAGTACATTGTCTCGTTTGGATAGAC
GTAGGCTCTCAaGAAAAAAaGTTCTCATAGTTCTTGATGATGTGGATGATTTAGTGCGGCAAGTAGAACCTGGTCAA
GGGAGTAGAATAATTATGACAAGCAGAGATAGACAATTGAGTGAAGCTCTCTGCCTGTTTTGCAAGCATGCCTTCAA
GCGACAATTTCTAAGAACAGGATATTTAATAAGGCAATTGATCATGCTCAGGG
>G383C4U02H6B5W
CCGGGCTCCCCATCTTCTCTATCTCTTGTGTGATTGTTGCAGAATACATCAAAGACTTGGGGTTGAGAGAGACAGCA
TCATAAACCTGATCACGGAAGCCCCTTTGAAGCATGCAGTCCACCTCATCTAGCCTTGTTGTCGTTGAAATAGTCCAT
CTGCCATCTTTAAATACATGCGCAACATAATGCCCGCATTGCGTATCTAGTCCAATATCATGCTTTATTAAAAAGATCA
ATAAGCCTTCCTGGAGTCCCCACAATCAAGTTCCAACTCCTTGCTGAAATGCGGTAGAGTTGTCCAGCCATCA
>G383C4U02IH1AO
TTCAAGGAACTTTCATCCATCCAATGATCTAACCAATTTGAACCTAGTTTTGATTCATCTCTGAAGTTCGAATTTGAAC
CACATTCTTAAGAATTGAGGGCCCATCAAATTTAGTACTATAATCATGAAGTAGGTGATCCTCTCTTGTCACTCTTTTC
ATCATCAGCAAGATGACTTCTCATTGGAATGCTACCATGCTTGTTCCAAAA
Thanks,
R. Singh
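If awk is not a requirement, the same header trimming can probably be done with sed as well. This is only a sketch, not something posted in the thread above: on lines starting with `>` it deletes everything from the first whitespace character to the end of the line. The file names `sample2.fa` and `cleaned2.fa` and the records are made up for illustration.

```shell
# Build a tiny sample FASTA file (made-up data, for illustration only).
printf '%s\n' \
  '>seq1 length=8 numreads=2' \
  'ACGTACGT' \
  '>seq2 length=4' \
  'TTGA' > sample2.fa

# Only on header lines (/^>/): remove the first whitespace character
# and everything after it, leaving just the sequence ID.
sed '/^>/ s/[[:space:]].*$//' sample2.fa > cleaned2.fa

# Show the result.
cat cleaned2.fa
```

Unlike the awk versions, this edits the text directly instead of splitting into fields, so a header with no extra information is simply left as it is.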