Remove empty records

Hello:
Is there a simple way to remove empty records of FASTA format file?
A FASTA format consists of two parts: header and sequence (for non-biologist, Wiki for details of course!). The header always start with ">" for the name of the sequence. The header must be in this ONLY single line. Following the header are the sequence that can be in single or multiple rows. Example:

>header1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATG
>other header name
GCGGCGGGGCGTTTAAA
ATCGAT

I have a file some of which only have headers but not sequence (i.e. empty sequences). I need to remove them.
Input file:

>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>other header name
>some thing maybe long but single line for sure
GCTAGCTAGCA
>something 
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>some description  
>other description  2
CGTAGCTAGGTAGATTTA
>something not good for me

ouput

>head1
ACGATAGCTCTAGCTAGCTA
 >header2
 GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
 >some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
 >other description  2
 CGTAGCTAGGTAGATTTA

There are bioperl scripts and other tools to do the job. I was thinking there must be a simpler handy way with sed or awk to do it, especially on the terminal for smaller files. Perl oneliner works too, but I have problem with the "$/" variable as the input sequence can be different multiple rows.
Could not figure this out by myself. Post it here for to get experts' ideas. Thanks in advance!
Yifang

So whats the rule? How do you know that from

>other header name
>some thing maybe long but single line for sure

only

>some thing maybe long but single line for sure

is needed. Is that mean the last entry in the same pattern?
Or may be couldn't figure out it by me.

Try:

perl -p0e 's/>.*\n>/>/g;s/>.*\n$//' file

Hi, try:

awk '$2{print RS}$2' FS='\n' RS=\> ORS= infile
1 Like

Hi yifangt,

One more way using perl:

$ cat infile
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>other header name
>some thing maybe long but single line for sure
GCTAGCTAGCA
>something 
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>some description  
>other description  2
CGTAGCTAGGTAGATTTA
>something not good for me
$ cat script.pl
use warnings;
use strict;

my ($prev_line, $is_prev_header);

while ( <> ) {
        chomp;

        if ( $. == 1 ) {
                $prev_line = $_;
                $is_prev_header = is_header( $_ );
                next unless eof;
        }

        if ( ! $is_prev_header || ! is_header( $_ ) ) {
                printf qq[%s\n], $prev_line;
        }

        if ( eof && ! is_header( $_ ) ) {
                printf qq[%s\n], $_;
                exit 0;
        }

        $prev_line = $_;
        $is_prev_header = is_header( $_ );
}

sub is_header {
        return substr( $_[0], 0, 1 ) eq q[>] ? 1 : 0;
}

$ perl script.pl infile
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>other description  2
CGTAGCTAGGTAGATTTA
1 Like
# sed '/^>/N;s/.*\n\(>.*\)/\1/' infile|sed '$s/^>/delete_sed_X/;/^delete_sed_X/d'
>head1
ACGATAGCTCTAGCTAGCTA
>header2
GGCGCTATGGCGACTGATCAGC
CCGAAAGATGCT
>some thing maybe long but single line for sure
GCTAGCTAGCA
>strange header 2
AGCTAGCTGAGGGAGGAGGGA
>other description  2
CGTAGCTAGGTAGATTTA

The first sub is very smart, but the second sub removed the header, which needs stay.
Did not work with my system as Ubuntu 11.10/Bash. Could you explain the option "-p0e" for me?
Thanks again!

---------- Post updated at 11:56 AM ---------- Previous update was at 11:52 AM ----------

Did not work with my system. Could you please explain the first part?

sed '/^>/N;s/.*\n\(>.*\)/\1/' 

---------- Post updated at 11:59 AM ---------- Previous update was at 11:56 AM ----------

The headers are the descriptions of each sequence which can be random length so that I simulated some. Wiki FASTA format will give you more information. Thanks anyway!

Hi Scrut,

I found this post very useful. Can you kindly explain this code like how each term is working here

awk '$2{print RS}$2' FS='\n' RS=\> ORS= infile

Sorry I am looking this post little late.

Thanks
Krsnadasa

I think I wrote that a bit quickly. It is correct, but probably this equivalent code is probably somewhat easier to understand:

awk '$2{print RS $0}' FS='\n' RS=\> ORS= infile

This means:

  • use ">" as the records separator
  • use a newline as the field separator (so $1 is the header and $2 is the first line of the content)
  • do not use an output record separator (this is because we want it printed before instead of after the record, so we cannot use ORS for that)
  • If field 2 exists ( $2{...} ) then we are processing a record that consists of more than just a header, and we can print it preceded by the record separator (>) print RS $0
1 Like

Thanks Scrut,

Very useful information.. :b:

Krsnadasa...