How to get this script to work on multiple input files

Daniel8472 · August 26, 2010, 5:51am

Hello guys!

I would like to use awk to perform data extraction from several files. The data files look like this:

 DWT26R 1 PEP1 CA 1 OH2 SKIPPED: 0 STEP: 1
0.29000E+01 0.55005E-02 0.60012E-03
0.30000E+01 0.11149E+00 0.13603E-01
0.31000E+01 0.39719E+00 0.63013E-01
0.32000E+01 0.94264E+00 0.18784E+00
0.33000E+01 0.17744E+01 0.43749E+00
0.35000E+01 0.32350E+01 0.13273E+01
0.36000E+01 0.34913E+01 0.19104E+01
.
.
.

The first line is unique for each file and contains information I would like to add to the output. In fact, I need to search for the highest value in $2 and print it together with the first line of that file. Then the next file needs to be processed the same way.

For a single file it works fine, but how can I do this with multiple files? I think I somehow need to assign the information from the unique first line to the values of each file and store it in an array. At the end I would simply print that array containing this information... However, I really could not get it to work so far...

The current code that works for a single file is:

BEGIN     {
    print "trajectory= traj molecules= mol Peptide= pep resid(CA?)= res contact= so (max)solv/sphere= n Radius(A)= r";
    print "traj", "mol", "pep", "res", "co", "n", "     r"; #just a header for the output
    }



# read substrings of the field to rebuild the number from mantissa and exponent
    {
    if (NR==1)    {
            expo=0;
            comp=0;
            co=0;
            max=0;
            maxline=0;    
            traj=$2;
            mol=$1;
            pep=$3;
            res=$5;
            so=$6;
            } #saving file information and resetting comparison set
    else         {
            expo=10^(substr($2,9,3)); #extract exponent
            comp=(substr($2,3,5)/100000); 
            co=comp*expo;
            if (co > max) {max=co; maxline=substr($1,3,5)/100000*10^(substr($1,9,3))} # extract highest value from file
            }
    }



END     { 
    print traj, mol, pep, res, so, max, maxline; #print highest value and information from the first line
    }

Hope you guys can help me out.

Cheers,
Daniel

kevintse · August 26, 2010, 9:04am

Like this?

awk ' BEGIN { h=0 } FNR == 1 { if(header!="") { printf "%s\n%d\n", header, h } header=$0; next  } \
          $2>h { h=$2 } END { printf "%s\n%d\n", header, h } ' file1 file2 file3 ....

Daniel8472 · August 26, 2010, 9:31am

Thanks for your suggestion. I will try this as soon as I am back at the lab. However, I am not sure whether I understand everything in your code.

Will this produce output like this:

First_line_of_file1 $2_of_file1_with_highest_value_in_file1
First_line_of_file2 $2_of_file2_with_highest_value_in_file2
First_line_of_file3 $2_of_file3_with_highest_value_in_file3
.
.
.

cheers,
Daniel

kevintse · August 26, 2010, 9:52am

It will produce the following if the three files are identical:

 DWT26R 1 PEP1 CA 1 OH2 SKIPPED: 0 STEP: 1
3
 DWT26R 1 PEP1 CA 1 OH2 SKIPPED: 0 STEP: 1
3
 DWT26R 1 PEP1 CA 1 OH2 SKIPPED: 0 STEP: 1
3

Daniel8472 · August 26, 2010, 10:18am

Looks good! I am just wondering where the 3 is coming from (is the value rounded?). In the example the highest value is 0.34913E+01 for $2.

And another question, just for my understanding: where is it specified that h is reset to 0 when awk starts reading a new file, so that the highest value of that specific file is extracted?

I am trying to grasp as much as possible; that is why I ask so much.

Cheers,
Daniel

Edit:
Oh, maybe I got it! Right after BEGIN you set h=0.

Edit2:

Oh, now I got it: printf "%s\n%d\n" prints header as a string and h as an integer, hence the 3 :)

kevintse · August 26, 2010, 11:38am

Correct.

"BEGIN { h=0 }" initializes h (the highest value) to zero so it can be compared with every $2 in the file.

And, actually, you can use printf "%e", number to print numbers in exponential format, and printf "%f", number to print them in floating-point format.
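
To make the difference between the format letters concrete, here is a small sketch that prints the largest $2 value from the sample data with %d, %f, %e and %.4f. The one-line input and the "|" separator are made up for the illustration, and LC_ALL=C is set only so that the decimal separator is a period regardless of the local settings.

```shell
# Print the same field with four printf conversions.
# %d keeps only the integer part, which is where the "3" came from.
result=$(echo "0.36000E+01 0.34913E+01 0.19104E+01" |
  LC_ALL=C awk '{ printf "%d|%f|%e|%.4f\n", $2, $2, $2, $2 }')
echo "$result"
# prints: 3|3.491300|3.491300e+00|3.4913
```

So replacing %d with something like %.4f in the printf call should keep the decimals.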

Daniel8472 · August 27, 2010, 3:27am

Good morning!

I am back in the lab and just used the script (I needed to modify some parts because some files contain "," instead of ".").

One problem remains so far. When a new file is read, the value of "h" is not reset to zero. So unless the following file contains a higher value in $2, the highest $2 value of the previous files is kept and printed.

BEGIN { h=0 } 
    FNR == 1    { if(header!="") 
                {             
                #printf "%s\n%d\n", header, h;
                print header, h; 
                } 
            header=$0; next; 
            } \
         
    substr($2,3,5)/100000*10^(substr($2,9,3))>h         { h=substr($2,3,5)/100000*10^(substr($2,9,3)) } 

END     { 
    print header, h;
    #printf "%s\n%d\n", header, h;
    }

Output:

 DWT26R 1 PEP1 CA 10 OH2 SKIPPED: 0 STEP: 1 3,2829
 DWT26R 1 PEP2 CA 10 OH2 SKIPPED: 0 STEP: 1 3,5248
 DWT26R 1 PEP1 CA 11 OH2 SKIPPED: 0 STEP: 1 4,3229
 DWT26R 1 PEP2 CA 11 OH2 SKIPPED: 0 STEP: 1 4,3229
 DWT26R 1 PEP1 CA 12 OH2 SKIPPED: 0 STEP: 1 6,8575
 DWT26R 1 PEP2 CA 12 OH2 SKIPPED: 0 STEP: 1 6,8575
 DWT26R 1 PEP1 CA 13 OH2 SKIPPED: 0 STEP: 1 6,8575
 DWT26R 1 PEP2 CA 13 OH2 SKIPPED: 0 STEP: 1 6,8575
 DWT26R 1 PEP1 CA 14 OH2 SKIPPED: 0 STEP: 1 6,8575
 DWT26R 1 PEP2 CA 14 OH2 SKIPPED: 0 STEP: 1 6,8575

kevintse · August 27, 2010, 7:55am

That was a problem I didn't notice, but it can be easily solved.
Remove "BEGIN { h=0 }", and add "h=0" to the "FNR == 1" action part.
i.e. "FNR == 1 { h=0; if(header!="") ....}"
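
Put together, the suggestion above could look roughly like the following sketch. The two sample files and their values are invented for the demonstration, and the sub() call is only an assumption about how the decimal commas mentioned earlier might be handled; it is not taken from the scripts in this thread.

```shell
# Sketch: per-file maximum of column 2, resetting h on the first line of each file.
tmp=$(mktemp -d)
printf '%s\n' 'DWT26R 1 PEP1 CA 1 OH2 SKIPPED: 0 STEP: 1' \
  '0.29000E+01 0.55005E-02 0.60012E-03' \
  '0.36000E+01 0.34913E+01 0.19104E+01' > "$tmp/a.dat"
# The second file uses decimal commas, like some of the files described above.
printf '%s\n' 'DWT26R 1 PEP2 CA 1 OH2 SKIPPED: 0 STEP: 1' \
  '0,29000E+01 0,11149E+00 0,13603E-01' \
  '0,30000E+01 0,94264E+00 0,18784E+00' > "$tmp/b.dat"

result=$(LC_ALL=C awk '
  FNR == 1 {                                # first line of every input file
      if (header != "") print header, h     # report the previous file
      header = $0; h = 0                    # fresh header and fresh maximum
      next
  }
  {
      sub(/,/, ".", $2)                     # assumed fix for decimal commas
      v = $2 + 0                            # force a numeric value
      if (v > h) h = v
  }
  END { if (header != "") print header, h } # report the last file
' "$tmp/a.dat" "$tmp/b.dat")
echo "$result"
rm -rf "$tmp"
```

With these two sample files it prints 3.4913 after the first header and 0.94264 after the second. Starting from h = 0 assumes that the values in $2 are never negative.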

Daniel8472 · August 27, 2010, 8:40am

:D Awesome, it works great now. Thanks a lot!

Daniel8472 · August 28, 2010, 8:43am

Hi again,

I thought about modifying the script a little to add more options for the output.

For example, I would like to get the 4 values before and after a local maximum within a file. For this I think I need to store the values of 9 successive lines for comparison.

if ($2 in line X < $2 in line X-1) {print $0 of X-4, $0 of X-3, $0 of X-2, $0 of X-1, $0 of X, $0 of X+1, $0 of X+2, $0 of X+3, $0 of X+4}

However, I am not sure how and where to implement this in the current script. I tried to create an array, but this doesn't seem to work...

What approach would you suggest?

Have a great weekend!
Daniel

ps: my current code

BEGIN     {
    print "trajectory= traj molecules= mol Peptide= pep resid(CA?)= res contact= so (max)solv/sphere= n Radius(A)= r first hill= gear";
    printf "%6s %6s %6s %6s %6s %6s %6s %6s\n", "traj", "mol", "pep", "res", "co", "n", "r", "gear";
    }

    FNR == 1    {
            # first line of a new file: print the result of the previous file, if any
            if(header!="")
                {
                printf "%6.0f %6s %6s %6s %6s %6.1f %6.1f %6.1f\n", traj, mol, pep, res, so, h, max, gear;
                }
            # save the file information and reset the comparison values
            traj=$2;
            mol=$1;
            pep=$3;
            res=$5;
            so=$6;
            h=0;
            max=0;
            help=0;
            #gear=0;
            header=$0; next;
            }

    # data lines: keep the highest $2 and the $1 that belongs to it
    {
    if (substr($2,3,5)/100000*10^(substr($2,9,3))>h)     {
            h=substr($2,3,5)/100000*10^(substr($2,9,3));
            max=(substr($1,3,5)/100000)*10^(substr($1,9,3));
            }
    }

END     {
    printf "%6.0f %6s %6s %6s %6s %6.1f %6.1f %6.1f\n", traj, mol, pep, res, so, h, max, gear;
    }
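
One possible approach to the local-maximum question, sketched under a few assumptions: all data lines of a file are buffered in arrays, and once the file is complete the first interior local maximum of $2 is located and up to 4 lines on either side of it are printed. The function name flush_file, the array names line and val, and the sample values are invented for this illustration. A "local maximum" is taken here to mean a value larger than both of its neighbours, and $2 is assumed to use a period as decimal separator.

```shell
tmp=$(mktemp -d)
# Made-up sample: $2 rises to a first peak on the 6th data line, then falls.
cat > "$tmp/a.dat" <<'EOF'
DWT26R 1 PEP1 CA 1 OH2 SKIPPED: 0 STEP: 1
0.29000E+01 0.10000E+00
0.30000E+01 0.20000E+00
0.31000E+01 0.30000E+00
0.32000E+01 0.50000E+00
0.33000E+01 0.80000E+00
0.34000E+01 0.12000E+01
0.35000E+01 0.90000E+00
0.36000E+01 0.70000E+00
0.37000E+01 0.60000E+00
0.38000E+01 0.50000E+00
0.39000E+01 0.40000E+00
0.40000E+01 0.95000E+00
EOF

result=$(LC_ALL=C awk '
  # Report the buffered file: header, then the window around the first peak.
  function flush_file(   i, p, lo, hi) {
      if (n == 0) return                        # nothing buffered yet
      p = 0
      for (i = 2; i < n; i++)                   # first interior local maximum
          if (val[i] > val[i-1] && val[i] > val[i+1]) { p = i; break }
      print header
      if (p == 0) {
          print "no local maximum found"
      } else {
          lo = (p - 4 < 1) ? 1 : p - 4          # clip the window to the file
          hi = (p + 4 > n) ? n : p + 4
          for (i = lo; i <= hi; i++) print line[i]
      }
      n = 0                                     # empty the buffer for the next file
  }
  FNR == 1 { flush_file(); header = $0; next }  # new file: report the previous one
  { n++; line[n] = $0; val[n] = $2 + 0 }        # buffer every data line
  END { flush_file() }                          # report the last file
' "$tmp/a.dat")
echo "$result"
rm -rf "$tmp"
```

For the sample file this prints the header followed by the nine lines from 0.30000E+01 to 0.38000E+01. Buffering the whole file keeps the logic simple; a rolling window of 9 lines would need less memory but more bookkeeping.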