Remove empty records

yifangt · May 11, 2012, 8:17am

Hello:
Is there a simple way to remove empty records of FASTA format file?
A FASTA format consists of two parts: header and sequence (for non-biologist, Wiki for details of course!). The header always start with ">" for the name of the sequence. The header must be in this ONLY single line. Following the header are the sequence that can be in single or multiple rows. Example:

>header1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATG
>other header name
GCGGCGGGGCGTTTAAA
ATCGAT

I have a file some of which only have headers but not sequence (i.e. empty sequences). I need to remove them.
Input file:

>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>other header name
>some thing maybe long but single line for sure
GCTAGCTAGCA
>something 
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>some description  
>other description  2
CGTAGCTAGGTAGATTTA
>something not good for me

ouput

>head1
ACGATAGCTCTAGCTAGCTA
 >header2
 GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
 >some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
 >other description  2
 CGTAGCTAGGTAGATTTA

There are bioperl scripts and other tools to do the job. I was thinking there must be a simpler handy way with sed or awk to do it, especially on the terminal for smaller files. Perl oneliner works too, but I have problem with the "$/" variable as the input sequence can be different multiple rows.
Could not figure this out by myself. Post it here for to get experts' ideas. Thanks in advance!
Yifang

clx · May 11, 2012, 8:26am

So whats the rule? How do you know that from

>other header name
>some thing maybe long but single line for sure

only

>some thing maybe long but single line for sure

is needed. Is that mean the last entry in the same pattern?
Or may be couldn't figure out it by me.

bartus11 · May 11, 2012, 8:26am

Try:

perl -p0e 's/>.*\n>/>/g;s/>.*\n$//' file

Scrutinizer · May 11, 2012, 8:36am

Hi, try:

awk '$2{print RS}$2' FS='\n' RS=\> ORS= infile

birei · May 11, 2012, 8:48am

Hi yifangt,

One more way using perl:

$ cat infile
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>other header name
>some thing maybe long but single line for sure
GCTAGCTAGCA
>something 
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>some description  
>other description  2
CGTAGCTAGGTAGATTTA
>something not good for me
$ cat script.pl
use warnings;
use strict;

my ($prev_line, $is_prev_header);

while ( <> ) {
        chomp;

        if ( $. == 1 ) {
                $prev_line = $_;
                $is_prev_header = is_header( $_ );
                next unless eof;
        }

        if ( ! $is_prev_header || ! is_header( $_ ) ) {
                printf qq[%s\n], $prev_line;
        }

        if ( eof && ! is_header( $_ ) ) {
                printf qq[%s\n], $_;
                exit 0;
        }

        $prev_line = $_;
        $is_prev_header = is_header( $_ );
}

sub is_header {
        return substr( $_[0], 0, 1 ) eq q[>] ? 1 : 0;
}

$ perl script.pl infile
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>other description  2
CGTAGCTAGGTAGATTTA

ygemici · May 11, 2012, 9:03am

# sed '/^>/N;s/.*\n\(>.*\)/\1/' infile|sed '$s/^>/delete_sed_X/;/^delete_sed_X/d'
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>other description  2
CGTAGCTAGGTAGATTTA

yifangt · May 11, 2012, 11:59am

The first sub is very smart, but the second sub removed the header, which needs stay.
Did not work with my system as Ubuntu 11.10/Bash. Could you explain the option "-p0e" for me?
Thanks again!

---------- Post updated at 11:56 AM ---------- Previous update was at 11:52 AM ----------

ygemici:

# sed '/^>/N;s/.*\n\(>.*\)/\1/' infile|sed '$s/^>/delete_sed_X/;/^delete_sed_X/d'
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>other description  2
CGTAGCTAGGTAGATTTA

Did not work with my system. Could you please explain the first part?

sed '/^>/N;s/.*\n\(>.*\)/\1/'

---------- Post updated at 11:59 AM ---------- Previous update was at 11:56 AM ----------

anchal_khare:

So whats the rule? How do you know that from
>other header name
>some thing maybe long but single line for sure
only
>some thing maybe long but single line for sure
is needed. Is that mean the last entry in the same pattern?
Or may be couldn't figure out it by me.

The headers are the descriptions of each sequence which can be random length so that I simulated some. Wiki FASTA format will give you more information. Thanks anyway!

krsnadasa · May 17, 2012, 4:47am

Hi Scrut,

I found this post very useful. Can you kindly explain this code like how each term is working here

awk '$2{print RS}$2' FS='\n' RS=\> ORS= infile

Sorry I am looking this post little late.

Thanks
Krsnadasa

Scrutinizer · May 17, 2012, 5:07am

I think I wrote that a bit quickly. It is correct, but probably this equivalent code is probably somewhat easier to understand:

awk '$2{print RS $0}' FS='\n' RS=\> ORS= infile

This means:

use ">" as the records separator
use a newline as the field separator (so $1 is the header and $2 is the first line of the content)
do not use an output record separator (this is because we want it printed before instead of after the record, so we cannot use ORS for that)
If field 2 exists ( $2{...} ) then we are processing a record that consists of more than just a header, and we can print it preceded by the record separator (>) print RS $0

krsnadasa · May 17, 2012, 6:18am

Thanks Scrut,

Very useful information..

Krsnadasa...