Converting text files to xls through awk script for specific data format

Dear Friends,
I am in urgent need of an awk/sed/sh script to convert a specific data format (.txt) to .xls.
The input is as follows:

>gi|1234|ref|
  
 Query = 1 - 65, Target = 1677 - 1733
 Score = 8.38, E = 0.6529, P = 0.0001513, GC =  46
  fd sdfsdfsdfsdf
fsdfdsfdfdfdfdfdf
dfdfdfdfedfsdfsdfdsf
dfdfdfsdfsdfsdfsdfdsf
sdfdsfdfdfdfdsfdfdfdfd
  
>gi|54367|ref|


  fd sdfsdfsdfsdf
fsdfdsfdfdfdfdfdf
dfdfdfdfedfsdfsdfdsf
dfdfdfsdfsdfsdfsdfdsf
sdfdsfdfdfdfdsfdfdfdfd

 Query = 1 - 65, Target = 6780 - 6812
 Score = 8.13, E = 0.7692, P = 0.0001782, GC =  36

//
................EOF................................

The desired output Excel table will then contain the headers and data in the following order:

            Serial   Query start    Query End   target start   target end   Score   E   P   GC
       >gi|1234|ref|   1   65   1677   1733   8.38   0.6529   0.000151   46
       >gi|54367|ref|   1   65   6780   6812   8.13   0.7692   0.000178   36   

I have to perform this conversion iteratively on many (1000+) .txt files.

awk '/gi/ {print ;}'
awk '/Query/{print $3,$5,$8,$10;}'
awk '/Score/ {print $3, $6, $9, $12;}' 

I was trying with these preliminary awk one-liners and using cat to concatenate the three outputs into one .xls file, but that did not fulfil the purpose.
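In other words, cat stacks the three outputs on top of each other, while I need them joined side by side, one row per record. A rough sketch of what I mean, using paste (the here-document just recreates one sample record for illustration):

```shell
#!/bin/sh
# Rough sketch only: cat concatenates the three outputs vertically,
# whereas the columns need to be joined horizontally, one row per record.
set -e
cat > input.txt <<'EOF'
>gi|1234|ref|

 Query = 1 - 65, Target = 1677 - 1733
 Score = 8.38, E = 0.6529, P = 0.0001513, GC =  46
EOF

awk '/gi/ {print}'                    input.txt > serial.col
awk '/Query/ {print $3, $5, $8, $10}' input.txt > query.col
awk '/Score/ {print $3, $6, $9, $12}' input.txt > score.col

# One line per record, columns separated by tabs (the trailing commas
# from the one-liners above are still attached to some fields).
paste serial.col query.col score.col > table.txt
```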
I need urgent help! Looking forward to your responses (a new script or a modified version of the one-liners).
P.S.: The input and output files are attached for more clarity.
Best regards,
Amit

It will look something like this; I leave it to you to debug it:

awk -f x.awk inputfile

where x.awk is...

(then you import it as a comma-delimited file)

BEGIN{print "Serial,Query Start,Query End,Target Start,Target End,E,P,GC"}
/^\>g/ { serial=$0 }
/Query =/ {
        split($0,a," ")
        query_start=a[3]
        query_end=a[5]
        query_end=gsub(/,/,"",query_end)
        target_start=a[8]
        target_end=a[10]

/Score =/{
        split($0,a," ")
        score=a[3]
        score=gsub(/,/,"",score)
        e=a[6]
        e=gsub(/,/,"",e)
        p=a[1]
        p=gsub(/,/,"",p)
        gc=a[12]

        printf("%s,%s,%s,%s,%s,%s,%s,%s\n",query_start, query_end,target_start,target_end,score,e,p,gc)
        }

Thanks for the help, I will try it out.
Best regards,
Amit

A closing } is missing for the /Query =/ block.

Note also that gsub(ERE, replacement, string) returns the number of substitutions performed, not the modified string. And, since there are multiple commas on the Score line, it might make more sense to get rid of all of the commas on the input line in one shot before splitting out the fields than to perform gsub() calls on the split fields. In fact, if we set the field separator differently, we can avoid the need to perform any substitutions. The following creates a comma separated values file that can easily be loaded into Excel:

awk '
BEGIN { FS = "[ ,]*"
        OFS = ","
        print "Serial", "Query start", "Query End", "target start", "target end", "Score", "E", "P", "GC"
}
/^>/ {  S = $1 }
/Query/ {
        QS = $4
        QE = $6
        TS = $9
        TE = $11
}
/Score/ {
        print S, QS, QE, TS, TE, $4, $7, $10, $13
}' "Input file.txt"

which, with your sample input produces:

Serial,Query start,Query End,target start,target end,Score,E,P,GC
>gi|1234|ref|,1,65,1677,1733,8.38,0.6529,0.0001513,46
>gi|54367|ref|,1,65,6780,6812,8.13,0.7692,0.0001782,36

If you save this output in a file named input.csv and open it with Excel, you'll get your desired spreadsheet. Then, if .csv format isn't suitable, you can have Excel save it to a file in .xlsx format.

Note that you can give this script multiple input files and it will still produce a single .csv output file. You might, however, run into ARG_MAX limitations on your system if you replace the last line of the script with:

}' *.txt

with over a thousand .txt files. I'm sure you can figure out how to group sets of .txt files into suitable chunks and combine the resulting output files (stripping off the header in all output files except the first one) to get a single, large .csv file to feed to Excel.
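One hedged way to sketch that chunking (file and directory names here are placeholders, and the small awk program merely stands in for the real one above): find -exec ... + batches the argument list itself, so it never exceeds ARG_MAX, and a second awk pass drops the repeated header lines.

```shell
#!/bin/sh
# Sketch only: assumes the extraction program is saved as extract.awk
# and the 1000+ input files live under ./data (both names hypothetical).
set -e
mkdir -p data

# Tiny stand-in for the real awk program: one header line per run,
# one CSV row per input file.
cat > extract.awk <<'EOF'
BEGIN { print "Serial,Score" }
FNR == 1 { print FILENAME ",1.0" }
EOF

printf '>gi|1|ref|\n' > data/a.txt
printf '>gi|2|ref|\n' > data/b.txt

# find -exec ... + batches the file list itself, so ARG_MAX is never hit;
# awk may run more than once, printing its header once per batch.
find ./data -name '*.txt' -exec awk -f extract.awk {} + > combined.tmp

# Keep the first "Serial" header line and drop any repeats.
awk '/^Serial/ { if (seen++) next } { print }' combined.tmp > combined.csv
rm -f combined.tmp
```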


Assuming you want to extract the numeric values from the lines matching the patterns Query and Score:

awk '
function parse(){
                   gsub(/[[:alpha:]]|[-,=]/,x)  # x is unset, i.e. "": strip letters and - , =
                   $0=$0;$1=$1                  # force a re-split so OFS applies
                   q = q ? q OFS $0 : $0        # collect the numeric fields
                }
           NR==1{
                 print "Serial", "Query start", "Query End", "target start", "target end", "Score", "E", "P", "GC"
                }      
            /^>/{
                   s=$0
                   next
                }
/Query/||/Score/{
                   parse()
                if(++i==2){
                                print s,q
                                q=s=i=""
                          }          
                }
    ' OFS="\t" file
Serial    Query start    Query End    target start    target end    Score    E    P    GC
>gi|1234|ref|    1    65    1677    1733    8.38    0.6529    0.0001513    46
>gi|54367|ref|    1    65    6780    6812    8.13    0.7692    0.0001782    36

Dear Don and Akshay,
Thanks for your help, this will definitely ease my work.

Best regards,
Amit