Hi all, I have a large FASTA (DNA sequence) file. I would like to extract a portion of each header as well as the sequence (the line below the header).
Input:
Output:
All accession values (the term I want to preserve, which is the string including and directly following "GL") are different, but I believe they are the same length.
I'm a command-line beginner. I tried to adapt code I found online. It preserves the sequence line, but it only cuts off the part of the header after the accession, not the part before it.
sed 's/| [^ ].* *//g'
I have also tried:
grep -o 'GL\d{6}\.1'
but it also doesn't work.
Any suggestions?
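A note on the grep attempt, as a hedged sketch rather than a full answer: in POSIX basic regular expressions `\d` is not a digit class and an unescaped `{6}` is not a repetition count, so the pattern is unlikely to match these headers (some grep builds add extensions, so behaviour may vary). The version below uses `-E` with `[0-9]`. The temporary file name `sample.fa` is made up for the demo, and the `[0-9]+` version suffix is an assumption about how the accessions end. Note that `grep -o` prints only the matched accession, so the sequence lines are dropped either way.

```shell
# Build a small sample from two of the records posted in this thread.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fa" <<'EOF'
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
EOF

# -E enables extended regular expressions, where {6} is a repetition count.
# [0-9] is the portable way to write "one digit".
accessions=$(grep -oE 'GL[0-9]{6}\.[0-9]+' "$tmpdir/sample.fa")
printf '%s\n' "$accessions"

rm -rf "$tmpdir"
```

This is only useful for listing the accessions; to keep the sequences as well, the header has to be rewritten in place, as in the answers that follow.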
agama
August 25, 2012, 6:47pm
Assuming that all of the file consists of the two records (alternating) that you have posted, then this will print the portion of the first record you've pointed out, and the entire next record.
If there are other records in the file this won't work.
awk -F \| ' { printf( ">%s\n", $4 ); getline; print; }' input-file >output-file
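To see why `$4` is the accession, here is a small sketch that splits one of the posted headers on the pipe symbol and prints every field with its number. The variable names `header`, `fields`, and `accession` are only for illustration.

```shell
# One header line copied from the sample posted in this thread.
header='>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence'

# -F'|' makes awk split each line on the pipe symbol.
# The loop prints every field together with its field number.
fields=$(printf '%s\n' "$header" |
  awk -F'|' '{ for (i = 1; i <= NF; i++) printf("$%d = [%s]\n", i, $i) }')
printf '%s\n' "$fields"

# The accession is the fourth field.
accession=$(printf '%s\n' "$header" | awk -F'|' '{ print $4 }')
printf '%s\n' "$accession"
```

The first field is `>gi`, the second is the GI number, the third is `gb`, the fourth is the accession, and the fifth is the description with a leading space.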
Each record is different. Here is a slightly expanded excerpt from the file:
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
>gi|299507454|gb|GL349623.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold3, whole genome shotgun sequence
AATATTAAATAATTAATCTAAATAAATTAAATACCTCATTAGTCATTAACACACATTTTTTTCTTAGTTTTAATGTATAA
agama
August 25, 2012, 7:14pm
The awk I posted should work. Here it is with a small tweak (which I don't think is necessary) so that it only acts on header lines:
awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline; print; }' input-file >output-file
By "the same" I meant the same format: record a, with fields separated by pipe symbols (|), followed by record b (no implied format).
Thanks, agama, but with the following command I get the following error:
awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline print; }' test3.fa
awk: syntax error at source line 1
context is
/^>gi/ { printf( ">%s\n", $4 ); getline >>> print <<< ; }
awk: illegal statement at source line 1
cat test3.fa
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
>gi|299507454|gb|GL349623.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold3, whole genome shotgun sequence
AATATTAAATAATTAATCTAAATAAATTAAATACCTCATTAGTCATTAACACACATTTTTTTCTTAGTTTTAATGTAT
agama
August 25, 2012, 7:42pm
Sorry -- I tested it as a multi-line programme, but joined the lines when I posted it and forgot a semicolon.
awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline; print; }' test3.fa
Perl
perl -nle 'if ($flg){print;$flg=0}if(/^\>gi/){ print ">",((split(/\|/))[3]);$flg=1}' File1
sed '/^>gi/s/\(.*\)\(GL.\{8\}\)\(.*\)/>\2/g' infile
or
sed '/^>gi/s/\(.*\)\(GL[^\|]*\)\(.*\)/>\2/g' infile
awk -F\| '/^>/{$0=">" $4}1' infile
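One difference between the approaches is worth checking if a sequence ever spans more than one line. The `getline` version prints only the single line after each header, while the version that rewrites header lines and passes everything else through keeps every sequence line. The sketch below compares the two on a made-up file, `wrapped.fa`, in which the first sample sequence has been shortened and wrapped by hand. That wrapping is an assumption for the demo, not something shown in the original file.

```shell
# Hypothetical test file: the first record's sequence is wrapped over two lines.
tmpdir=$(mktemp -d)
cat > "$tmpdir/wrapped.fa" <<'EOF'
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAAC
AATATATCAGGAGCCTTGTA
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATAC
EOF

# getline approach: prints the new header plus exactly one following line.
getline_out=$(awk -F'|' '/^>gi/ { printf(">%s\n", $4); getline; print }' "$tmpdir/wrapped.fa")

# Rewrite approach: replaces header lines, then the bare pattern 1 prints every line.
rewrite_out=$(awk -F'|' '/^>/ { $0 = ">" $4 } 1' "$tmpdir/wrapped.fa")

printf 'getline version:\n%s\n\n' "$getline_out"
printf 'rewrite version:\n%s\n' "$rewrite_out"

rm -rf "$tmpdir"
```

On this input the `getline` version drops the second line of the wrapped sequence, so the rewrite form is the safer choice when the layout of the file is not certain.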