Extract regular expression and line below

Hi all, I have a large FASTA (DNA sequence) file. I would like to extract a portion of each header as well as the sequence (the line below the header).

Input:

Output:

All accession values (the term I want to preserve, which is the string including and directly following "GL") are different, but I believe they are the same length.

I'm a command-line beginner. I tried to adapt code I found online. It does preserve the sequence line, but it only cuts off the portion of the header following the accession, not the portion before it.

sed 's/| [^ ].* *//g'

I have also tried:

grep -o 'GL\d{6}\.1'

but it also doesn't work.

Any suggestions?
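A note on the grep attempt above: the likely reason it matches nothing is that `\d` and `{6}` are Perl-style and extended-regex syntax, while grep uses basic regular expressions by default. Below is a minimal sketch, assuming a grep that supports `-o` and `-E` (GNU and BSD grep both do) and the header format shown in this thread. The variable names are only for illustration. It extracts the accession alone, not the sequence line, so it does not solve the whole problem.

```shell
# One header line in the format used in this thread.
header='>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence'

# -E enables the {6} repetition; [0-9] stands in for \d,
# which grep does not treat as "digit" in basic or extended mode.
accession=$(printf '%s\n' "$header" | grep -oE 'GL[0-9]{6}\.1')
echo "$accession"
```

The pattern assumes every accession is "GL", six digits, then ".1", as in the sample records.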

Assuming that all of the file consists of the two records (alternating) that you have posted, then this will print the portion of the first record you've pointed out, and the entire next record.

If there are other records in the file this won't work.

awk -F \| ' { printf( ">%s\n", $4 ); getline;  print; }'  input-file >output-file

Each record is different. Here is a slightly expanded excerpt from the file:

The awk I posted should work. Here it is with a small tweak (which I don't think is necessary):

awk -F \| ' /^>gi/  { printf( ">%s\n", $4 ); getline;  print; }'  input-file >output-file

By "the same" I meant the same format: record A, with fields separated by pipe symbols (|), followed by record B (no implied format).

Thanks, agama, but with the following command I get the following error:

awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline print; }' test3.fa

awk: syntax error at source line 1
context is
/^>gi/ { printf( ">%s\n", $4 ); getline >>> print <<< ; }
awk: illegal statement at source line 1

cat test3.fa 
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
>gi|299507454|gb|GL349623.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold3, whole genome shotgun sequence
AATATTAAATAATTAATCTAAATAAATTAAATACCTCATTAGTCATTAACACACATTTTTTTCTTAGTTTTAATGTAT

Sorry -- I tested it as a multi-line programme, but joined the lines when I posted it and forgot a semicolon.

awk -F \| ' /^>gi/ { printf( ">%s\n", $4 ); getline; print; }' test3.fa
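To see what the corrected command does, here is a small self-contained run on the first two records of test3.fa. It is only a sketch: the temporary file and the variable names are mine, and it assumes each sequence sits on exactly one line, as in the sample.

```shell
# Recreate the first two records of test3.fa in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1, whole genome shotgun sequence
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAATAAAATTTTATTGAT
>gi|299507455|gb|GL349622.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold2, whole genome shotgun sequence
GTATGCGCGCATCTCCATACCGTCCGATAAATTCGCAGTAAAAAAAATGTGATTCACATTGTCGATTATAATAAAAAAAT
EOF

# Fields are split on "|", so $4 is the accession.
# On each header line: print ">" plus the accession, then getline
# loads the following (sequence) line into $0 and print emits it.
result=$(awk -F '|' '/^>gi/ { printf(">%s\n", $4); getline; print }' "$tmp")
rm -f "$tmp"
echo "$result"
```

Each header is reduced to `>GL349621.1`, `>GL349622.1`, and so on, with the sequence line printed unchanged underneath.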

Perl

perl -nle 'if ($flg){print;$flg=0}if(/^\>gi/){ print ">",((split(/\|/))[3]);$flg=1}' File1

sed '/^>gi/s/\(.*\)\(GL.\{8\}\)\(.*\)/>\2/g' infile

or

sed '/^>gi/s/\(.*\)\(GL[^\|]*\)\(.*\)/>\2/g' infile

awk -F\| '/^>/{$0=">" $4}1' infile
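One difference worth noting about that last awk one-liner: it rewrites only the header lines and prints every other line as-is, so it should also cope with sequences that wrap over several lines, whereas the getline version prints just the single line after each header. A minimal sketch, using a made-up record with a wrapped sequence (the short sequence lines are hypothetical, not from the original file):

```shell
# Hypothetical record whose sequence is wrapped over two lines.
wrapped=$(printf '%s\n' \
  '>gi|299507456|gb|GL349621.1| Acyrthosiphon pisum unplaced genomic scaffold Scaffold1' \
  'TTTACAATTG' \
  'CTATTGTAAC')

# On header lines, replace the whole line with ">" plus field 4.
# The trailing 1 is an always-true pattern, so every line is printed.
all_lines=$(printf '%s\n' "$wrapped" | awk -F '|' '/^>/ { $0 = ">" $4 } 1')
echo "$all_lines"
```

Both sequence lines survive under the shortened `>GL349621.1` header.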