Parsing a portion of data from a text file

Lucky_Ali · September 27, 2010, 9:34pm

Hi All,
I need some help to effectively parse out a subset of results from a big results file.

Below is an example of the text file. Each block that I need to parse starts with "Output of GENE for sequence file 100.fasta" (the next block starts with a different number). I have included only the part of each block that is needed for parsing; the rest of each block is omitted.

# Output of GENE for sequence file 100.fasta
#
#
#
#
# 
# 
# Maximum BLAST-like scores:
# Inner      Max         Sim     S.D.s above     S.D. of
#  frags    Score      P-value    sim. mean        sims
# SCORE     4.145      0.6043        -0.01       0.0274
# OuterSeq
#  frags    0.125      1.0000         0.00       0.0000
#
#
#
#Output of GENE for sequence file 101.fasta
#
#
#
#
#
## Maximum BLAST-like scores:
# Inner      Max         Sim     S.D.s above     S.D. of
#  frags    Score      P-value    sim. mean        sims
# SCORE     2.665      0.8360         0.44       0.0439
# OuterSeq
#  frags  Not found      0.0000         0.00       0.0000
#
#
#
#
#Output of GENE for sequence file 103.fasta
#
#
#
#
#
## Maximum BLAST-like scores:
# Inner      Max         Sim     S.D.s above     S.D. of
#  frags    Score      P-value    sim. mean        sims
# SCORE     3.665      0.8705         1.44       0.0039
# OuterSeq
#  frags  Not found      1.0000         2.00       0.0000

I would like to extract the number (for example, 100 from the block "Output of GENE for sequence file 100.fasta") together with the Sim P-value of each block, like this:

100 0.6043
101 0.8360
103 0.8705

Please let me know the simplest way to do this using awk or sed.

LA

kurumi · September 27, 2010, 10:06pm

$ ruby -ane 'num=$_.scan(/^.*\b(\d+)\.fasta/)[0] if  /Output/; print "#{num[0]} #{$F[3]}\n" if /SCORE/  ' file
100 0.6043
101 0.8360
103 0.8705

Lucky_Ali · September 27, 2010, 10:20pm

Sorry, I don't have ruby on my computer.

Mubby · September 27, 2010, 11:29pm

I just tested this with GNU awk (gawk), and it worked for me.

awk -f awk_parser.awk the_file

I'm new to this forum and its editor, so the script may not tab-align properly:

BEGIN {

        # this_num denotes which sequence file we're currently handling;
        # it's used as an index into the associative array called "pval"
        this_num = 0
}

/Output of GENE/ , /SCORE/ {

        # capture the fasta number
        if( $0 ~ /Output of GENE/ ) {

                where = match( $0, /[0-9]+\.fasta/ )
                fasta_str = substr( $0, where, RLENGTH )

                where = match( fasta_str, /^[0-9]+/ )
                num = substr( fasta_str, where, RLENGTH )

                # print "Located gene sequence file: " num
                this_num = num

        }
        else if( $0 ~ /SCORE/ ) {
                # print "\thandling Sim p-value for SCORE row"
                pval[this_num] = $4
        }
}

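END {
        # note: awk does not guarantee the order of a for-in loop,
        # so the rows may not come out in file order
        for( seq_file in pval ) {
                print seq_file, pval[seq_file]
        }
}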

=====
My output:

100 0.6043
101 0.8360
103 0.8705

=====

This was "quick and dirty": it reads the P-value from field $4, so it requires that your utility output the SCORE line exactly as you posted it here (a pound sign, a space, the word SCORE, then the values).
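
A slightly more tolerant variant is sketched below. It assumes only that, after any leading pound signs, SCORE appears as its own whitespace-separated word and that the Sim P-value is the second value after it. The function name extract_pvalues is made up for illustration. Each row is printed as soon as it is seen, so the output follows file order rather than the unspecified order of an awk for-in loop.

```shell
# Hypothetical helper: print "<file number> <Sim P-value>" for each block.
# Reads the files given as arguments, or stdin when there are none.
extract_pvalues() {
    awk '
    # drop leading pound signs; awk re-splits the fields afterwards
    { sub(/^#+/, "") }

    /Output of GENE/ {
        # keep the digits only, without the 6 characters of ".fasta"
        if (match($0, /[0-9]+\.fasta/))
            num = substr($0, RSTART, RLENGTH - 6)
    }

    {
        # look for the SCORE word wherever it sits on the line
        for (i = 1; i + 2 <= NF; i++)
            if ($i == "SCORE") { print num, $(i + 2); break }
    }' "$@"
}
```

Called as "extract_pvalues the_file" on the sample from the first post, it should give the same three rows as above.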

walid2mi · September 28, 2010, 12:15am

awk '/fasta$/{split($NF,m,".");printf m[1]}/SCORE/{printf " %s\n",$3}'  file

Mubby · September 28, 2010, 12:19am

I ran that, and it prints the Max Score column instead of the Sim P-value:

$ awk '/fasta$/{split($NF,m,".");printf m[1]}/SCORE/{printf " %s\n",$3}'  the_file

100 4.145
101 2.665
103 3.665

walid2mi · September 28, 2010, 12:24am

Try it with $4:

awk '/fasta$/{split($NF,m,".");printf m[1]}/SCORE/{printf " %s\n",$4}'  file
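
The difference comes from the leading pound sign: with awk's default whitespace splitting, "#" is field 1 and "SCORE" is field 2, which puts Max Score in $3 and the Sim P-value in $4. A quick way to check this, using the SCORE row from the first block of the sample (the show_fields name is just for illustration):

```shell
# Illustrative helper: number every whitespace-separated field of each input line.
show_fields() {
    awk '{ for (i = 1; i <= NF; i++) printf "$%d=%s\n", i, $i }'
}

printf '%s\n' '# SCORE     4.145      0.6043        -0.01       0.0274' | show_fields
```

The third and fourth lines of the output are $3=4.145 and $4=0.6043.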

Mubby · September 28, 2010, 12:26am

$ awk '/fasta$/{split($NF,m,".");printf m[1]}/SCORE/{printf " %s\n",$4}' the_file

100 0.6043
101 0.8360
103 0.8705

rdcwayx · September 28, 2010, 12:38am

awk '/Output of GENE / {split($NF,a,".")} /SCORE/{print a[1],$4}' infile
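
This works because a[1] still holds the number from the most recent "Output of GENE" line when the SCORE row comes along. Since the question also asked about sed, here is a rough sed sketch of the same idea. It assumes the SCORE row is laid out as in the sample (optional pound signs, SCORE, the Max Score, then the Sim P-value, separated by spaces) and that the file number sits directly before ".fasta". The sed_pvalues wrapper name is made up for illustration.

```shell
# Hypothetical wrapper; reads the files given as arguments, or stdin when there are none.
# On an "Output of GENE" line: reduce the line to the digits before ".fasta"
#   and copy them to the hold space (h).
# On a SCORE line: reduce the line to the second value after SCORE, append it
#   to the hold space (H), swap (x), turn the newline into a space, and print.
sed_pvalues() {
    sed -n '
    /Output of GENE/ {
        s/.*[^0-9]\([0-9][0-9]*\)\.fasta.*/\1/
        h
    }
    /SCORE/ {
        s/^#* *SCORE  *[^ ]*  *\([^ ]*\).*/\1/
        H
        x
        s/\n/ /
        p
    }' "$@"
}
```

Unlike the awk versions, the patterns here use literal spaces, so tab-separated output would need [[:space:]] in their place.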