Help with AWK and Scripting!

SriJit · August 12, 2010, 7:00pm

Hi,
This is the first time I am working with awk, and I am not familiar with any of its commands. I have managed most of my task, with just one piece left, and I need your help!

I have to extract only the matrix (written within []) from a text file. For example:

1JTJ_0006_ACGC_NPNP_A_12_15.pdb  0.61 0.54  [     0.43 0.51 0.71 0.81]
1JTJ_0011_ACGC_NPNP_A_12_15.pdb  0.46 0.44  [0.37      0.30 0.42 0.70]

I have thousands of text files in a folder. I decided to use awk since I found an awk command on the internet that extracts the substring within []. For each file read, the output should be stored in a file named after the first part of the input filename, followed by _out.txt. I have almost completed the work, but my output is:

     0.43 0.51 0.71 0.81
0.37      0.30 0.42 0.70

I have to process this matrix further in MATLAB, so I just want one opening square bracket at the start and one closing bracket at the end, as in:

[    0.43 0.51 0.71 0.81
0.37      0.30 0.42 0.70 ]

Here is what I have done so far,

for f in *.pair.txt;do awk -v RS=[ -F] '/]/{split (FILENAME,d,".");}{print   $1  >> d[1]"_out.txt"}' $f;done

I tried {print "[" > d[1]"out.txt"} in many places, but it didn't work.

If there is any other easy way of doing my task, please let me know.

Thanks,
Sriji

agama · August 12, 2010, 9:28pm

I might have written the awk a bit differently, but working with what you've got, this should add the leading and trailing square brackets.

awk  -F "]" -v RS="["  '
        NR == 1 {                              # establish file name just once at beginning
                split( FILENAME, d, ".");
                outfile = d[1] "_out.txt";
                printf( "[" ) >outfile;         # opening bracket to the output
        }
        /\]/{                                   # for each record, print the last one if there, save this one
                if( last )
                        print last >outfile;    # print the one from the last time
                last = $1;                     # save this one til next record is read or end
        }
        END {
                printf( "%s]\n", last ) >outfile;       # print the last line of the matrix and closing bracket
        }
        ' "$f"

The logic saves the current matrix row and prints it when the next one is read. At the end of the input file, the last row of the matrix is printed with the closing bracket.
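
The save-then-print idea described above can be seen in isolation in the following minimal sketch. It is an illustration only, not agama's script: the function name wrap_in_brackets is made up for the example, and it assumes the matrix rows have already been extracted, one per line.

```shell
# Hypothetical helper: wrap already-extracted rows in one pair of brackets.
wrap_in_brackets() {
    awk '
    NR > 1 { print prev }                      # previous row is not the last, so print it as is
    { prev = (NR == 1 ? "[" : "") $0 }         # hold the current row; first row gets the "["
    END { if (NR) print prev "]" }             # only the final row gets the closing "]"
    '
}

printf '%s\n' '0.43 0.51' '0.37 0.30' | wrap_in_brackets
# prints:
# [0.43 0.51
# 0.37 0.30]
```

Holding each row back by one step is what lets the script know which row is the last one without reading the file twice.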

SriJit · August 13, 2010, 1:58pm

Thank you so much!!!

---------- Post updated 08-13-10 at 01:58 PM ---------- Previous update was 08-12-10 at 09:46 PM ----------

Hi,
Can you help me with this?
The matrix in the input file looks like this (blank spaces indicate that there is no value in the matrix):

[    4   5  6  7 ]
[3       7  9  10]
[4   5      11 9]

When I make it look like

[0  4   5  6   7 
 3  0   7  9   10
 4  5   0  11  9]

in the output file, how do I insert 0 in the blank diagonal position? How can this be done along with the previous code?

Srijit

agama · August 13, 2010, 10:05pm

I'd write it like this (it completely replaces the previous example). There might be a better way, but this is straightforward.

It assumes that a pair of spaces indicates the need for a zero. If there are 4 spaces it will insert two zero values, 6 spaces 3 zeros, etc. It also assumes that an opening bracket followed by a space, or a space followed by a closing bracket, indicates the need for a zero at the beginning or end. If this isn't quite right, it should be straightforward to make the needed changes.

for f in *.pair.txt
do
        awk '
        BEGIN {printf( "[" );   }               # opening bracket
  
        /[.*]/  {
                if( last )
                        printf( "%s\n",  last );  # print the last one we saw

                gsub( ".*\\[", "[", $0 );       # trash all before [, but keep [
                gsub( "].*", "]", $0 );         # trash all after ], but keep ]
                gsub( "\\[ ", "0.0 ", $0 );     # opening bracket, space becomes 0 space
                gsub( "\\[", "", $0 );          # ditch opening bracket
                gsub( " ]", " 0.0", $0 );       # trailing space, bracket becomes 0
                gsub( "]", "", $0 );            # ditch trailing bracket
                gsub( "  ", " 0.0 ", $0 );      # two spaces becomes 0 space
                gsub( "  ", " ", $0 );          # cleanup if two or more 0s inserted

                last = $0;               # save to add trailing ] if this is the last one
        } 

        END {
                printf( "%s]\n", last );
        }
        ' "$f" > "${f%%.*}_out.txt"
done

kurumi · August 13, 2010, 10:32pm

grep -Eo "\[(.[^]]*)\]" file | sed '1s/[ \t]*\]//;$s/^\[[ \t]*//'

SriJit · August 16, 2010, 11:12am

Hi Agama,

I ran your second code. It gives me the wrong output: it even prints the text lines from the input.

Here is a sample input,

#### ACGG_NNNP.pairwiseRMSDs.txt
#### Output from pdb_extract.py
#### Created 2010-08-12 13:59:31.122708

5 structures aligned

1OSW_0002_ACGG_NNNP_A_10_13.pdb.pdb most representative structure of pool
 with lowest average pair-wise RMSD of 0.81

    Mean global RMSD: 0.91  (0.81 to 0.98A)
 Mean global bb RMSD: 0.72  (0.65 to 0.82A)

                                 Avg  Avgbb  Pair-Wise RMSD Matrix:
Structure:                       RMSD RMSD  (Top-Right=Heavy atom alignment, Bottom-Left=Backbone atom alignment)
-------------------------------  ---- ----   0_13 0_13 0_13 0_13 0_13
1OSW_0002_ACGG_NNNP_A_10_13.pdb  0.81 0.65  [     0.84 0.77 0.95 0.70]
1OSW_0005_ACGG_NNNP_A_10_13.pdb  0.85 0.76  [0.85      1.01 0.59 0.94]
1OSW_0015_ACGG_NNNP_A_10_13.pdb  0.98 0.68  [0.43 0.83      1.10 1.04]
1OSW_0019_ACGG_NNNP_A_10_13.pdb  0.95 0.82  [0.91 0.51 0.94      1.14]
1OSW_0021_ACGG_NNNP_A_10_13.pdb  0.95 0.67  [0.42 0.83 0.52 0.91     ]

END

Can you please run the code on this input and check that it prints only the matrix, with 0 inserted in the blank positions along the diagonal? If not, can you tell me what changes I should make to get that output? The matrix in the above input is 5*5, with a blank along the diagonal.

Thanks,
Srijit

agama · August 16, 2010, 8:27pm

Yes, I ran the script against data that always had a matrix component so I didn't notice the mistake. Add backslashes to escape the square brackets and this will print just the matrix:

/\[.*\]/  {
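
To see why the escaping matters: in a regular expression an unescaped [.*] is a bracket expression that matches a single "." or "*", so the unescaped pattern selects any line containing a period, header lines included. A small sketch, with made-up function names and a shortened sample:

```shell
# Unescaped: [.*] is a bracket expression, i.e. "one character that is . or *".
count_unescaped() { awk '/[.*]/ { n++ } END { print n + 0 }'; }

# Escaped: a literal "[", then anything, then a literal "]".
count_escaped() { awk '/\[.*\]/ { n++ } END { print n + 0 }'; }

sample='Mean global RMSD: 0.91
5 structures aligned
a.pdb  0.81 0.65  [     0.84 0.77]'

printf '%s\n' "$sample" | count_unescaped   # prints 2 (both lines that contain a ".")
printf '%s\n' "$sample" | count_escaped     # prints 1 (only the matrix row)
```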

Here is a sample input,
--------------- snip -----------------------------------------------------

-------------------------------  ---- ----   0_13 0_13 0_13 0_13 0_13
1OSW_0002_ACGG_NNNP_A_10_13.pdb  0.81 0.65  [     0.84 0.77 0.95 0.70]
1OSW_0005_ACGG_NNNP_A_10_13.pdb  0.85 0.76  [0.85      1.01 0.59 0.94]
1OSW_0015_ACGG_NNNP_A_10_13.pdb  0.98 0.68  [0.43 0.83      1.10 1.04]
1OSW_0019_ACGG_NNNP_A_10_13.pdb  0.95 0.82  [0.91 0.51 0.94      1.14]
1OSW_0021_ACGG_NNNP_A_10_13.pdb  0.95 0.67  [0.42 0.83 0.52 0.91     ]

END

The way I originally wrote the code assumed that one extra space indicated a missing 0. Please use code tags so that spacing is preserved. Your example here shows 4 spaces, so the code will need to be a bit different:

        awk '
        BEGIN {printf( "[" );   }               # opening bracket

        /\[.*\]/  {          
                if( last )
                        printf( "%s\n",  last );  # print the last one we saw

                gsub( ".*\\[", "[", $0 );       # trash all before [, but keep [
                gsub( "].*", "]", $0 );         # trash all after ], but keep ]
                gsub( "\\[", "", $0 );          # ditch opening bracket
                gsub( "]", "", $0 );            # ditch trailing bracket

                gsub( "    ", " 0.00", $0 );  # four spaces becomes 0.00 

                gsub( " $", "", $0 );          # cleanup trailing space if there
                gsub( "^ ", "", $0 );          # cleanup leading space if there

                last = $0;               # save to add trailing ] if this is the last one
        }

        END {
                printf( "%s]\n", last );
        }
        ' "$f"

Lines in bold are new or changed. Several lines have been removed as they became unnecessary.
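
As an alternative to counting spaces (a different technique from the script above, sketched here only for comparison), each row can be cut into fixed-width cells. This assumes every cell occupies five characters, a value such as 0.84 plus one separating space, which is what the sample rows show; values wider than four characters would break it. The function name fill_fixed_width and the width variable w are made up for the example.

```shell
# Hypothetical helper: read matrix rows as fixed-width cells, filling blanks with 0.00.
fill_fixed_width() {
    awk -v w=5 '
    /\[.*\]/ {
        lb = index($0, "[")
        inner = substr($0, lb + 1)                        # text after the "["
        inner = substr(inner, 1, index(inner, "]") - 1)   # ... up to the "]"
        row = ""
        for (i = 1; i <= length(inner); i += w) {
            cell = substr(inner, i, w)                    # one assumed 5-character cell
            gsub(/ /, "", cell)                           # drop padding
            if (cell == "") cell = "0.00"                 # blank cell means missing value
            row = row (row == "" ? "" : " ") cell
        }
        print row
    }'
}

printf '%s\n' 'b.pdb  0.85 0.76  [0.85      1.01 0.59 0.94]' | fill_fixed_width
# prints: 0.85 0.00 1.01 0.59 0.94
```

Lines without a bracketed part are skipped, so header text does not reach the output.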

rdcwayx · August 16, 2010, 9:48pm

If your awk supports gensub() (a GNU awk extension), try this code.

awk '
BEGIN {printf "["}  
/\[/ {if ( last ) { printf "%s\n",  last}
       last=gensub(/.+\[(.+)\]/,"\\1","g")
      } 
END {printf( "%s]\n", last )}
' urfile

[ 0.84 0.77 0.95 0.70
0.85 1.01 0.59 0.94
0.43 0.83 1.10 1.04
0.91 0.51 0.94 1.14
0.42 0.83 0.52 0.91 ]
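
For an awk without gensub(), a similar extraction can be sketched with the POSIX functions index() and substr(). This is only the extraction step: it does not add the outer brackets or fill in zeros, and the function name extract_inner is made up for the example.

```shell
# Hypothetical helper: print only the text between the first "[" and the next "]".
extract_inner() {
    awk '
    {
        lb = index($0, "[")               # position of the first "[" (0 if none)
        if (lb == 0) next
        rest = substr($0, lb + 1)
        rb = index(rest, "]")             # position of the closing "]" in the remainder
        if (rb == 0) next
        print substr(rest, 1, rb - 1)
    }'
}

printf '%s\n' 'header line' 'x.pdb  0.85 0.76  [0.85      1.01 0.59 0.94]' | extract_inner
# prints: 0.85      1.01 0.59 0.94
```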

SriJit · August 17, 2010, 12:17pm

Sweet...Thanks

SriJit · August 18, 2010, 5:29pm

Hi Agama,

I have another problem with the code. Some of my inputs are like,

------------------------------- ---- ---- 0_13 0_13 0_13 0_13 0_13
1OSW_0002_ACGG_NNNP_A_10_13.pdb 0.81 0.65 [ 0.84 0.77 0.95 0.70........
12 12 34 5 5654 6 7.....]
1OSW_0005_ACGG_NNNP_A_10_13.pdb 0.85 0.76 [0.85 1.01 0.59 0.94]
1OSW_0015_ACGG_NNNP_A_10_13.pdb 0.98 0.68 [0.43 0.83 1.10 1.04]
1OSW_0019_ACGG_NNNP_A_10_13.pdb 0.95 0.82 [0.91 0.51 0.94 1.14]
1OSW_0021_ACGG_NNNP_A_10_13.pdb 0.95 0.67 [0.42 0.83 0.52 0.91 ]

END
The rest of the rows in the matrix will look like the first row.

That is, if the matrix is larger than 30*30, only 30 columns of each row are printed on one line; the rest are wrapped onto the following line(s), but still within a single [] for each row of the matrix.

For this type of input, your awk code doesn't write anything to the output file, although it does create the output file.

Can you please modify the code, or let me know how to modify it?

Thanks,
Sri

agama · August 18, 2010, 8:40pm

Yes, the original wouldn't have printed anything, as it only picked up matrix
data when both the opening bracket and the closing bracket were on the same line.
This version assumes that they can be split, and that there can be multiple lines
that have neither opening nor closing brackets. It also assumes that if the
'blank indicates a zero' set of spaces exists at the end of the line, those
blanks are present. If not, it will not add 0.00 correctly.

Try this (with the standard caveat: no guarantee that it will work).

        awk '
        BEGIN {printf( "[" );   }               # opening bracket

        /]/  {
                if( last )
                        printf( "%s\n",  last );  # print the last one we saw

                if( partial )                           # add current line to partial buffer
                        buffer = partial " " $0;
                else
                        buffer = $0;                    # no partial, just use current line

                gsub( ".*\\[", "[", buffer );      # trash all before [, but keep [
                gsub( "].*", "]", buffer );        # trash all after ], but keep ]
                gsub( "\\[", "", buffer );         # ditch opening bracket
                gsub( "]", "", buffer );           # ditch trailing bracket
                gsub( "    ", " 0.00", buffer );   # four spaces becomes 0.00 space
                gsub( "  ", " ", buffer );         # cleanup if multiple spaces
                gsub( " $", "", buffer );          # cleanup trailing space if there
                gsub( "^ ", "", buffer );          # cleanup leading space if there

                last = buffer;               # save to add trailing ] if this is the last one
                join = 0;
                partial = "";

                next;
        }

        /\[/ {                                  # beginning of matrix, but not end
                gsub( "^.*\\[", "[", $0 );      # ditch beginning junk
                partial = $0;                   # start a partial buffer
                join = 1;                       # join next line(s) if not end of matrix
                next;
        }

        join == 1 {
                partial = partial " " $0;       # add this line to the partial matrix
                next;
        }

        END {
                printf( "%s]\n", last );
        }
        ' "$f"

I ran a quick test with data that I dummied up. It seems to give sane output, but I didn't look too closely. You should be able to tweak this if it's not just right.

SriJit · August 19, 2010, 10:49am

Hi,

The code works fine in reading the lines. But since it prints 0.00 wherever it sees spaces within [], there is a problem when the input is like:

1FCG_234_455 35 36 [1 2 3 ....30
31 32...60
61 .... ]

I am unable to show you the spacing here, but in the input, if there are more than 30 columns per row, the columns after the 30th are wrapped to the next line, and the first number of the second line starts in the same position as the first number of the first line. For example, 31 on the second line of the example is positioned in the same column as 1; 1, 31, and 61 are all in the same column of the text file. So there are empty spaces up to that position.

The empty spaces at the beginning of each continuation line within [] are substituted with 0.00, so the matrix becomes larger: a 60*60 matrix becomes 90*90. I need to manipulate this matrix, so it has to keep the same size, with 0 only on the diagonal.

Is it possible to modify your code to make this work? I tried but didn't succeed.

Thanks,
Srijit

---------- Post updated at 10:49 AM ---------- Previous update was at 10:33 AM ----------

Is it possible to do what I asked in the previous post in the following way?

First, let it insert 0.00 for every run of four spaces it sees in the input.

Then, maybe in another awk command or in the same one, it can check the output of the previous step for consecutive zeroes and remove them. In my input, 0.00 will appear only along the diagonal.

Since I am new to awk, I am unable to write the code for this myself.

Thanks,
Srijit

agama · August 19, 2010, 10:51pm

It can all be done with a single awk. It just would have been helpful to know that the continuation lines were indented. In the future, if you place "code tags" around programmes or output, it preserves spacing and can illustrate things like the multiple blanks. To insert code tags, click on the '#' button at the top of the edit window and then type/paste your text between the tags that are inserted in the window.

Here is a revision to the awk that will allow for multiple spaces at the beginning of a continued line. It assumes that the text is aligned below the numbers like this:

leading junk tokens on first line [01.00 02.00 03.00 04.00 05.00 06.00 07.00 08.00 09.00 10.00 11.00 12.00 13.00
                                   14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00
                                   27.00 28.00 29.00 30.00 31.00 32.00 33.00 34.00 35.00 36.00 37.00 38.00 39.00]

The programme will properly handle the case where the blanks in the data are at the beginning of a continued record -- it does not blindly delete all blanks at the beginning of the line to provide for this.

        awk '
        BEGIN {printf( "[" );   }               # opening bracket

        /]/  {
                if( last )
                        printf( "%s\n",  last );  # print the last one we saw

                if( partial )                           # add current line to partial buffer
                        buffer = partial " " substr( $0, indent );      # ditch leading spaces, but dont trash the "blank == 0"
                else
                        buffer = $0;                    # no partial, just use current line

                gsub( ".*\\[", "[", buffer );      # trash all before [, but keep [
                gsub( "].*", "]", buffer );        # trash all after ], but keep ]
                gsub( "\\[", "", buffer );         # ditch opening bracket
                gsub( "]", "", buffer );           # ditch trailing bracket
                gsub( "  +", " 0.00 ", buffer );   # two or more spaces causes 0.00 to insert
                gsub( "  ", " ", buffer );         # cleanup if multiple spaces
                gsub( " $", "", buffer );          # cleanup trailing space if there
                gsub( "^ ", "", buffer );          # cleanup leading space if there

                last = buffer;               # save to add trailing ] if this is the last one
                join = 0;
                partial = "";

                next;
        } 

        /\[/ {                                  # beginning of matrix, but not end
                indent = index( $0, "[" ) + 1;  # number of spaces to skip for secondary lines
                gsub( "^.*\\[", "[", $0 );      # ditch beginning junk
                partial = $0;                   # start a partial buffer
                join = 1;                       # join next line(s) if not end of matrix
                next;
        }

        join == 1 {
                buffer = substr( $0, indent )   # ditch leading spaces, but dont trash the "blank == 0"
                partial = partial " " buffer;   # add this line to the partial matrix
                next;
        }
        
        END {
                printf( "%s]\n", last );
        }
        ' "$f"
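
The indent handling above can be tried in isolation with the sketch below. It only shows how index() and substr() remove the alignment padding while keeping blanks that stand for a missing value; the function name strip_indent and the sample rows are made up for the example.

```shell
# Hypothetical helper: drop everything up to and including "[" on the first line,
# and the same number of characters on each continuation line.
strip_indent() {
    awk '
    /\[/ {
        indent = index($0, "[") + 1       # first column of matrix data
        print substr($0, indent)
        next
    }
    indent { print substr($0, indent) }   # continuation line: skip the alignment padding only
    '
}

pad=$(printf '%10s' '')                   # 10 spaces, the width of "row1 0.5 ["
printf '%s\n' 'row1 0.5 [1.00 2.00' "${pad}3.00 4.00]" \
              'row2 0.5 [5.00 6.00' "${pad}     8.00]" | strip_indent
# prints:
# 1.00 2.00
# 3.00 4.00]
# 5.00 6.00
#      8.00]
```

The five spaces in front of 8.00 survive, so a later substitution can still turn them into 0.00.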

Hope this works better for you.

SriJit · August 20, 2010, 12:23pm

Hi,

I still have to make a small change. I get the output as:
[0.00 0.20 0.49 0.31 0.32 1.04 1.64 1.64 1.47 3.00 3.00 3.00 3.00 3.00 1.41 1.38 1.47 2.37 1.78 1.94 1.67 2.12 3.00 2.95 2.96 3.02 2.94 1.28 1.41 1.55 0.00 3.30 3.05 2.85

The values up to 1.55 are from the first line of the row; the values from the second 0.00 onward are from the second line of the same row.

For the above, the first row of the matrix spanned two lines in the input. The code reads the second line too and writes it to the output file on the same line, but it inserts another 0.00 when it reads the beginning of the second line (the 0.00 after 1.55 above).
How should I modify the code?

Thanks,
Srijit

agama · August 20, 2010, 10:33pm

There is probably a trailing blank, or blanks, on the first line. Initially I was deleting all trailing blanks, but that wouldn't work if the last 'value' on the line was really a set of blanks intended to be replaced with 0.00.

It could also be the way the data is aligned on the second row. If there is an extra space ahead of the data on the second line, that'd be a problem as well.

If either of these two problems is the cause, and it's a single space that's causing the grief, then modify this line to have 4 spaces before the + instead of 2. If that doesn't work, then please post the first few lines, and it is important that you put them inside code tags to preserve the spacing.

gsub( "    +", " 0.00 ", buffer );   # 4 or more spaces causes 0.00 to insert

SriJit · August 21, 2010, 11:55am

Hi,
Here is what my input looks like.

Structure:                       RMSD RMSD  (Top-Right=Heavy atom alignment, Bottom-Left=Backbone atom alignment)

1E4P_0020_CCGA_PPNN_A_17_20.pdb  2.83 1.79  [     2.90 2.83 2.10 2.14 2.03 2.00 3.96 3.97 2.07 4.05 1.95 2.05 2.13 1.93 4.04 3.83 3.91 2.49 2.11 3.68 2.88 3.08 3.09 2.73 2.86 3.08 2.88 2.93 2.87 
                                             2.80 2.96 3.25 2.94 2.93 3.05 2.09 2.64 2.64 2.61]
1EHT_0002_CCGA_PPNN_A_12_15.pdb  3.33 2.14  [2.45      0.98 3.56 3.58 3.50 3.50 4.15 4.17 3.44 4.09 3.29 3.39 3.48 3.49 4.51 4.52 4.40 3.81 3.50 3.67 3.19 3.21 3.71 3.02 3.41 3.40 3.36 3.29 3.39 
                                             3.37 3.67 4.02 3.51 3.57 3.56 3.46 0.90 0.89 0.97]
1EHT_0007_CCGA_PPNN_A_12_15.pdb  3.37 1.97  [2.22 0.84      3.49 3.47 3.41 3.39 4.30 4.32 3.44 4.23 3.20 3.37 3.33 3.35 4.77 4.77 4.66 3.63 3.45 3.63 3.20 3.23 3.77 2.90 3.43 3.47 3.40 3.34 3.40 
                                             3.35 3.80 4.10 3.57 3.59 3.60 3.45 1.30 1.26 1.21]
1LDZ_0002_CCGA_PPNN_A_5_8.pdb    2.55 1.60  [1.17 2.20 2.03      0.45 0.49 0.55 3.31 3.26 0.88 3.47 0.79 1.09 0.67 0.81 3.29 3.07 3.25 0.89 0.71 3.58 3.37 3.58 3.10 3.46 3.15 3.35 3.22 3.40 3.18 
                                             3.18 3.09 3.10 3.21 2.97 3.21 0.80 3.50 3.47 3.45]

Please let me know how I should modify the code.

Thanks,
Srijit

agama · August 21, 2010, 12:07pm

Thanks for posting the sample data -- very useful.

Small change. I was inserting an extra blank when joining multiple lines, and that was causing the problem. Not sure why my test wasn't showing that, or maybe I wasn't looking closely enough.

The two statements that join lines were changed: the " " between partial and the newly read text is gone.

        awk '
        BEGIN {printf( "[" );   }               # opening bracket

        /]/  {
                if( last )
                        printf( "%s\n",  last );  # print the last one we saw

                if( partial )                           # add current line to partial buffer
                        buffer = partial substr( $0, indent );  # ditch leading spaces, but dont trash the "blank == 0"
                else
                        buffer = $0;                    # no partial, just use current line

                gsub( ".*\\[", "[", buffer );      # trash all before [, but keep [
                gsub( "].*", "]", buffer );        # trash all after ], but keep ]
                gsub( "\\[", "", buffer );         # ditch opening bracket
                gsub( "]", "", buffer );           # ditch trailing bracket
                gsub( "  +", " 0.00 ", buffer );   # two or more spaces causes 0.00 to insert
                gsub( "  ", " ", buffer );         # cleanup if multiple spaces
                gsub( " $", "", buffer );          # cleanup trailing space if there
                gsub( "^ ", "", buffer );          # cleanup leading space if there

                last = buffer;               # save to add trailing ] if this is the last one
                join = 0;
                partial = "";

                next;
        }

        /\[/ {                                  # beginning of matrix, but not end
                indent = index( $0, "[" ) + 1;  # number of spaces to skip for secondary lines
                gsub( "^.*\\[", "[", $0 );      # ditch beginning junk
                partial = $0;                   # start a partial buffer
                join = 1;                       # join next line(s) if not end of matrix
                next;
        }

        join == 1 {
                buffer = substr( $0, indent )   # ditch leading spaces, but dont trash the "blank == 0"
                partial = partial buffer;       # add this line to the partial matrix (changed)
                next;
        }

        END {
                printf( "%s]\n", last );
        }
        ' "$f"
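
The effect of the extra blank can be reproduced in isolation. In the posted sample, the first physical line of each row ends with a trailing space, so joining with another " " produces two spaces, which the "  +" substitution then treats as a missing value. A minimal sketch, with a made-up function name join_demo and shortened values:

```shell
# Hypothetical demo: $1 is the separator used when joining two physical lines.
join_demo() {
    awk -v sep="$1" 'BEGIN {
        first  = "1.41 1.55 "             # first line ends with a trailing space
        second = "3.30 3.05"              # continuation line after indent removal
        row = first sep second
        gsub("  +", " 0.00 ", row)        # two or more spaces become 0.00
        gsub("  ", " ", row)              # cleanup if multiple spaces
        print row
    }'
}

join_demo " "    # prints: 1.41 1.55 0.00 3.30 3.05   (spurious 0.00)
join_demo ""     # prints: 1.41 1.55 3.30 3.05
```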

Maybe this time! Have a great day.

SriJit · August 23, 2010, 2:49pm

It works fine.

Thanks,
Srijit