How to extract records whose header matches a type and a minimum length

patrick87 · March 16, 2010, 12:39am

Input file:

>position_10 sample:68711 coords:5453-8666 number:3 type:complete len:344
MSINQYSSDFHYHSLMWQQQQQQQQHQNDVVEEKEALFEKPLTPSDVGKLNRLVIPKQHA
ERYFPLAAAAADAVEKGLLLCFEDEEGKPWRFRYSYWNSSQSYVLTKGWSRYVKEKHLDA
NRTS*
>position_4 sample:68711 coords:553-866 number:4 type:partial len:483
MSGVVRSSPGSSQPPPPPPHHPPSSPVPVTSTPVIPPIRRHLAFASTKPPFHPSDDYHRF
KITPSDVENDESDYWLLSNAEISMTDIWKTDSGIDWDYGIADVSTPPPGMGEIAPTAVDS
TPR*
>position_7 sample:68711 coords:453-86 number:2 type:partial len:214
KAAETLEVQKRRIYDITNVLEGIDLIEKPFKNRILWKGVDACPGDEDADVSVLQLQAEIE
NLALEEQALDNQIRWLFVTEEDIKSLPGFQNQTLIAVKAPHGTTLEVPDPDEAADHPQRR
TDSGIDWDYGIADVSTPPPGMGEIAPTAVDSTPR*
>position_11 sample:68711 coords:53-86 number:1 type:complete len:558
MLGDFIIRLLVLILGYTYPAFECFKTVEKNKVDIEELRFWCQYWILLALISSFERVGDFF
RAPRPLNKSLSALRSLEKQTSRGRKWPPPTPPPTPGRDSAGTFNGDDGVNIPDTIPGSPL
TDARAKLRRSNSRTQPAA*
.
.

Output file:

>position_10 sample:68711 coords:5453-8666 number:3 type:complete len:344
MSINQYSSDFHYHSLMWQQQQQQQQHQNDVVEEKEALFEKPLTPSDVGKLNRLVIPKQHA
ERYFPLAAAAADAVEKGLLLCFEDEEGKPWRFRYSYWNSSQSYVLTKGWSRYVKEKHLDA
NRTS*
>position_11 sample:68711 coords:53-86 number:1 type:complete len:558
MLGDFIIRLLVLILGYTYPAFECFKTVEKNKVDIEELRFWCQYWILLALISSFERVGDFF
RAPRPLNKSLSALRSLEKQTSRGRKWPPPTPPPTPGRDSAGTFNGDDGVNIPDTIPGSPL
TDARAKLRRSNSRTQPAA*
.
.

I would like to extract the records (header line and sequence) that match the criteria below:

The header must contain the word "complete" (e.g. type:complete).
The length must be greater than or equal to 300 (e.g. len:344, len:558, etc.).
It seems that perl, awk, or sed should be able to achieve this.
Thanks a lot for any advice.

abubacker · March 16, 2010, 12:58am

I hope this snippet gives you an idea:

use strict;
use warnings;

open my $fh, '<', 'testfile' or die "$0: $!";

my $keep = 0;    # true while we are inside a record that should be printed

while ( my $line = <$fh> ) {

    # a header line starts a new record, so decide again whether to keep it
    if ( $line =~ /^>/ ) {

        # keep only type:complete headers with len greater than or equal to 300
        $keep = (  $line =~ /type:complete\b/
                && $line =~ /len:(\d+)/
                && $1 >= 300 );
    }

    # print the header and every sequence line of a kept record
    print $line if $keep;
}

close $fh;

rdcwayx · March 16, 2010, 1:20am

awk 'BEGIN{RS=">";FS="\n"} {split($1,a," |:")} a[2]~"complete" && a[4]>=300'  urfile

dennis.jacob · March 16, 2010, 2:58am

Another one,

awk -F: '/complete/ && $2>=300  {c=1} c && c++<=4' file

patrick87 · March 16, 2010, 3:35am

Hi rdcwayx,
Thanks for your previous suggestion.
Could I ask for your advice once more?
I have changed the header format a bit.
Thanks again for your advice.

rdcwayx · March 16, 2010, 5:28am

awk 'BEGIN{RS=">";FS="\n"} {split($1,a," |:")} {if (a[9]~"complete" && a[11]>=300) print ">"$0}' ORS="" urfile
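
A note for later readers: the indexes a[9] and a[11] are tied to this exact header layout, so they have to be recounted whenever a field is added or removed. Below is a minimal sketch of the same record-per-">" idea that looks the tags up by name instead. It assumes the header carries space-separated tokens of the form type:complete and len:N; the function name filter_by_tag and the optional minimum-length argument are illustrative and not from this thread.

```shell
# filter_by_tag FILE [MIN_LEN]   (hypothetical helper; MIN_LEN defaults to 300)
# Prints every record whose header has a type:complete token and a len:N
# token with N >= MIN_LEN, wherever those tokens sit in the header.
filter_by_tag() {
    awk -v min="${2:-300}" '
        BEGIN { RS = ">"; FS = "\n"; ORS = "" }   # one record per ">" block
        {
            ok = 0; len = -1
            n = split($1, tok, " ")               # $1 is the header line
            for (i = 1; i <= n; i++) {
                if (tok[i] == "type:complete")
                    ok = 1
                else if (tok[i] ~ /^len:[0-9]+$/)
                    len = substr(tok[i], 5) + 0   # digits after "len:"
            }
            # RS consumed the leading ">", so put it back when printing
            if (ok && len >= min + 0) print ">" $0
        }
    ' "$1"
}
```

Called as filter_by_tag urfile it applies the 300 threshold from the question; filter_by_tag urfile 500 would raise the threshold to 500.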

dennis.jacob · March 16, 2010, 5:40am

A modified version, based on the changes you just made:

awk -F: '/complete/ && $NF>=300 {c=1} c && c++<=4' file
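
The counter-based one-liner prints the header plus the next three lines, which fits the sample because every record there has three sequence lines. If records can be longer or shorter, a flag that is reset at each header may be safer. This is only a sketch under the same assumption as above, namely that len is the last colon-separated field of the header; the function name keep_complete and the optional minimum-length argument are made up for illustration.

```shell
# keep_complete FILE [MIN_LEN]   (hypothetical helper; MIN_LEN defaults to 300)
# Prints whole records, however many sequence lines they have, when the
# header contains type:complete and its last ":"-separated field is >= MIN_LEN.
keep_complete() {
    awk -F: -v min="${2:-300}" '
        # on every header line, decide whether the record that follows is kept
        /^>/ { keep = ($0 ~ /type:complete/ && $NF + 0 >= min + 0) }
        # print the current line (header or sequence) while the flag is set
        keep
    ' "$1"
}
```

Usage is keep_complete file, or keep_complete file 500 for a different threshold.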

patrick87 · March 16, 2010, 6:24am

Hi rdcwayx,
Your awk code works perfectly in my case.
Thanks a lot ^^