Help extracting info when a required condition is fulfilled

perl_beginner · May 19, 2012, 10:01pm

Input file (4 DATA records shown in this case):

DATA       AA0110            
ACCESSION   AA0110
VERSION     AA0110  GI:157412239
FEATURES             Location/Qualifiers
     length            1..1170
                      1..1700
                     /length="1170"
     position            1..1170
                     /length="1170"
     band             1..948
                     /length="948"
//

DATA       BC599              
DEFINITION  USA
ACCESSION   BC599
VERSION     BC599  GI:239744030
FEATURES             Location/Qualifiers

     position          1..3159
                     /length="3159"
     length            1..40000
                     /length="40000"
//

DATA       HI101               
DEFINITION  UK
ACCESSION   HI101
VERSION     HI101  GI:239745142

FEATURES             Location/Qualifiers

     band             1..757
                     /length="757"
     length            1..747
                     /length="747"
//

DATA       AVE111
ACCESSION   AVE111
VERSION     AVE111  GI:157412223
FEATURES             Location/Qualifiers
     position            1..1170
                     /length="1170"
//

Desired output file:

157412239 1170
239744030 40000
239745142 747
157412223 -

Condition required:

The first column of the desired output file is taken from the "VERSION" line: extract the content after GI:;
The second column of the desired output file is taken from the /length="XXX" line that follows the "length" feature;
If the first column is available but the second column is missing, just put a "-" in its place in the desired output file;

Command tried:

awk 'BEGIN {RS=""; FS="//"} /VERSION/ {for (i=1;i<=NF;i++) {if ($i~/\/length=/) {print $i}}}' input_file.txt
DATA       AA0110            
ACCESSION   AA0110
VERSION     AA0110  GI:157412239
FEATURES             Location/Qualifiers
     length            1..1170
                     /length="1170"
     position            1..1170
                     /length="1170"
     band             1..948
                     /length="948"

The command I tried fails to give my desired output.
I was thinking of using "//" as the field separator for each record.

Thanks for any advice.

---------- Post updated at 09:01 PM ---------- Previous update was at 04:54 AM ----------

Is there any advice or hint that could help me solve this?
I'm still stuck on this problem.
Thanks in advance!

neutronscott · May 19, 2012, 11:16pm

I think you want RS to be //, not FS.

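#!/usr/bin/awk -f
# Records end at "//" (a multi-character RS is an extension, e.g. gawk, mawk);
# each field is one line with its leading whitespace removed.
BEGIN { RS="//"; FS="\n[[:space:]]*" }

{
        ver=len=""

        for (i=1;i<=NF;i++) {
                # GI number: everything after "GI:" on the VERSION line
                if (match($i,/^VERSION .* GI:/))
                        ver=substr($i,RSTART+RLENGTH)
                # length: the /length="..." value on the line right after "length"
                if ($i ~ /^length / && split($(i+1),b,/="/))
                        len=b[2]
        }
        # 0+len forces a number, which drops the trailing quote
        if (ver) print ver, len ? 0+len : "-"
}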

Scrutinizer · May 20, 2012, 1:16am

// cannot be used as a record separator in standard awk; RS needs to be a single character. The special case RS= (split the records on empty lines, i.e. two consecutive newlines) cannot be used here either, because there are empty lines inside the records.

Try:

awk -F'[ \t:"=]*' '$1=="VERSION"{if(p)print p; printf "%s ",$4; p="-"} $2=="length"{ getline; if($2=="/length") p=$3 } END{print p}' infile
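
In case it helps with reading the one-liner, here is a rough multi-line sketch of the same line-by-line idea. It is not the exact command above: it uses the default field separator, and the function name extract_gi_length and the variable names are made up for illustration. Like the one-liner, it only looks at the line directly after "length".

```shell
# Hypothetical helper: prints "GI length" for each record, reading line by line.
extract_gi_length() {
    awk '
        # A VERSION line starts a new record: flush the previous one first.
        $1 == "VERSION" {
            if (seen) print gi, len
            gi = $NF
            sub(/^GI:/, "", gi)      # keep only what follows "GI:"
            len = "-"                # default when no length is found
            seen = 1
            want = 0
            next
        }
        # Feature key "length": the next line may carry /length="..."
        $1 == "length" { want = 1; next }
        # Only the line right after "length" is examined.
        want {
            want = 0
            if ($1 ~ /^\/length="[0-9]+"$/) {
                len = $1
                gsub(/[^0-9]/, "", len)   # strip everything but the digits
            }
        }
        # Flush the last record.
        END { if (seen) print gi, len }
    ' "$@"
}
```

Usage would be something like: extract_gi_length input_file.txt. No RS trick is needed because the VERSION line itself marks where a new record begins.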

perl_beginner · May 20, 2012, 2:45am

Thanks neutronscott.
Many thanks for your awk script.
Give me some time to digest it

---------- Post updated at 01:45 AM ---------- Previous update was at 01:43 AM ----------

Dear Scrutinizer,

Your awk script works perfectly for my case!
I really appreciate your detailed explanation.
I will keep the point about "//" with FS and RS in mind in the future.
Currently I'm trying to understand your awk command.
I will ask you later if I get stuck on it.
Thanks a lot!

perl_beginner · May 22, 2012, 12:45pm

Hi Scrutinizer,

Do you have any idea what to do if one of my records looks like this:

     length            1..1170
                      1..1700
                     /length="1170"

instead of:

      length            1..1170
                      /length="1170"

In the new case, the awk code that you wrote doesn't give "1170".
It gives "-" instead.

157412239 - 
239744030 40000 
239745142 747 
157412223 -

I just found out that some of my /length="XXX" lines do not appear on the line immediately after "length".

Thanks for any advice.

---------- Post updated at 11:45 AM ---------- Previous update was at 11:41 AM ----------

Hi neutronscott,

I just found out that some of my /length="XXX" lines do not appear on the line immediately after "length".
I tried your awk code.
It doesn't work if /length="XXX" is not on the line right after "length".
Thanks for any further advice.

neutronscott · May 22, 2012, 2:08pm

Can /length be shown on the same line as length? I have no idea what data I am looking at, but it seems to use indentation width to separate the feature categories. So I assume we're in FEATURE length until we reach a line preceded by 5 or fewer spaces.

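#!/usr/bin/awk -f
# GI number follows "GI:" in the third field; default length is "-"
$1 == "VERSION" { ver=substr($3,4); len="-" }
# entering the "length" feature
$1 == "length" { l=1; next }
# stay in the feature while the indent is more than 5 characters;
# substr($1,10) skips the 9 characters of /length="
l&&match($0,/^[[:space:]]*/)&&(l=(RLENGTH>5)) { len=0+substr($1,10) }
$1 == "//" { print ver, len }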

$ ./script input
157412239 1170
239744030 40000
239745142 747
157412223 -

awk -F'[ \t:"=]*' '$1=="VERSION"{v=$4;l="-"}$1=="//"{print v,l}$2=="length"{p=1;next}p&&match($0,/^[[:space:]]*/)&&(p=(RLENGTH>5)){l=$3}' input
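
A slightly more defensive sketch of the same indentation idea, in case a "length" feature ever has continuation lines but no /length line at all: it only takes a value from lines that actually start with /length=". The function name extract_gi_length_multiline and the variable names are made up for illustration, and the 5-character indent threshold is an assumption taken from the sample data.

```shell
# Hypothetical helper: prints "GI length" per record, allowing extra
# continuation lines between the "length" key and its /length="..." line.
extract_gi_length_multiline() {
    awk '
        NF == 0 { next }                       # ignore blank lines
        $1 == "VERSION" {
            gi = $NF
            sub(/^GI:/, "", gi)                # keep only what follows "GI:"
            len = "-"                          # default when no length is found
            in_length = 0
            next
        }
        $1 == "//" {                           # end of record
            if (gi != "") print gi, len
            gi = ""
            in_length = 0
            next
        }
        # Number of blanks before the first non-blank character.
        { indent = match($0, /[^ \t]/) - 1 }
        # Shallow indent (assumed <= 5): a header or a new feature key.
        indent <= 5 { in_length = ($1 == "length"); next }
        # Deeper indent: a continuation line of the current feature.
        in_length && $1 ~ /^\/length="/ {
            len = $1
            gsub(/[^0-9]/, "", len)            # strip everything but the digits
        }
    ' "$@"
}
```

Usage would be something like: extract_gi_length_multiline input. Records without a VERSION line are skipped, since there is nothing to print in the first column.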