Extract regular expression and line below

pathunkathunk · August 25, 2012, 6:40pm

Hi all, I have a large FASTA (DNA sequence) file. I would like to extract a portion of each header as well as the sequence (the line below the header).

Input:

Output:

All accession values (the term I want to preserve, which is the string including and directly following "GL") are different, but I believe they are the same length.

I'm a command-line beginner. I tried to adapt code I found online; it does preserve the sequence line, but it only cuts off the portion of the header following the accession, not the portion before it.

sed 's/| [^ ].* *//g'

I have also tried:

grep -o 'GL\d{6}\.1'

but it also doesn't work.

Any suggestions?
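
A note on the grep attempt above: `\d` is Perl-style syntax, and in grep's default basic regular expressions `{6}` is read as literal braces rather than a repeat count, so the pattern cannot match. A minimal sketch using extended regular expressions follows. It assumes a grep that supports `-o` (GNU and BSD grep do), and it prints only the accession, not the sequence line, so on its own it is not a full solution.

```shell
# Sample header copied from the test3.fa excerpt later in the thread.
header='>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence'

# -E switches to extended regular expressions, where {6} is a repeat count;
# [0-9] stands in for the Perl-only \d.
accession=$(printf '%s\n' "$header" | grep -oE 'GL[0-9]{6}\.1')
echo "$accession"
```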

agama · August 25, 2012, 6:47pm

Assuming that all of the file consists of the two records (alternating) that you have posted, then this will print the portion of the first record you've pointed out, and the entire next record.

If there are other records in the file this won't work.

awk -F \| ' { printf( ">%s\n", $4 ); getline;  print; }'  input-file >output-file

pathunkathunk · August 25, 2012, 6:56pm

Each record is different. Here is a slightly expanded excerpt from the file:

agama · August 25, 2012, 7:14pm

The awk I posted should work. Here it is with a small tweak (which I don't think is strictly necessary):

awk -F \| ' /^>gi/  { printf( ">%s\n", $4 ); getline;  print; }'  input-file >output-file

By "the same" I meant the same format: record A, with fields separated by pipe symbols (|), followed by record B (no implied format).

pathunkathunk · August 25, 2012, 7:29pm

Thanks, agama, but with the following command I get the following error:

awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline print; }' test3.fa

awk: syntax error at source line 1
context is
/^>gi/ { printf( ">%s\n", $4 ); getline >>> print <<< ; }
awk: illegal statement at source line 1

cat test3.fa 
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
>gi|299507454|gb|GL349623.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold3, whole genome shotgun sequence
AATATTAAATAATTAATCTAAATAAATTAAATACCTCATTAGTCATTAACACACATTTTTTTCTTAGTTTTAATGTAT

agama · August 25, 2012, 7:42pm

Sorry -- I tested it as a multi-line programme, but joined the lines when I posted it and forgot a semicolon.

awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline; print; }' test3.fa
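
To illustrate what the corrected command does, here is a small self-contained sketch that recreates the first two records of test3.fa in a temporary file (the temp file and the variable names are assumptions made for the demonstration). With `-F \|` the header is split on pipes, so the accession is field 4; `getline` then reads the following sequence line, and `print` writes it out unchanged.

```shell
# Recreate two records from the thread's test3.fa in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
EOF

# Field 4 of a pipe-split header is the accession; getline loads the next
# (sequence) line into $0, and print writes that line as-is.
getline_out=$(awk -F \| '/^>gi/ { printf(">%s\n", $4); getline; print }' "$tmp")
echo "$getline_out"
rm -f "$tmp"
```

Note that this approach relies on each header being followed by exactly one sequence line, as in the excerpt; any further lines before the next header would be dropped.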

pravin27 · August 26, 2012, 12:51am

Perl

perl -nle 'if ($flg){print;$flg=0}if(/^\>gi/){ print ">",((split(/\|/))[3]);$flg=1}' File1

complex.invoke · August 26, 2012, 2:41am

sed '/^>gi/s/\(.*\)\(GL.\{8\}\)\(.*\)/>\2/g' infile

or

sed '/^>gi/s/\(.*\)\(GL[^\|]*\)\(.*\)/>\2/g' infile

Scrutinizer · August 26, 2012, 4:07am

awk -F\| '/^>/{$0=">" $4}1' infile
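
A brief sketch of how this last one-liner behaves: on every line starting with `>`, the whole record is replaced by `>` plus field 4, and the trailing `1` is an always-true pattern whose default action prints the (possibly modified) line. Because no `getline` is involved, sequences wrapped over several lines should pass through intact. The second sequence line below is a made-up continuation, included only to show that case.

```shell
# One header from test3.fa plus a sequence split over two lines
# (the split is hypothetical, for illustration only).
oneliner_out=$(printf '%s\n' \
  '>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence' \
  'TTTACAATTGCTATTGTAACAATATATCAGG' \
  'AGCCTTGTATTAAATTTTCACGC' |
  awk -F\| '/^>/ { $0 = ">" $4 } 1')
echo "$oneliner_out"
```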