AWK script - extracting min and max values from selected lines

grincz · February 2, 2012, 4:57pm

Hi guys!
I'm new to scripting and I need to write a script in awk.

Here is example of file on which I'm working

 ATOM   4688  HG1 PRO A 322      18.080  59.680 137.020  1.00  0.00            
ATOM   4689  HG2 PRO A 322      18.850  61.220 137.010  1.00  0.00            
ATOM   4690  CD  PRO A 322      18.800  60.090 135.140  1.00  0.00            
ATOM   4691  HD1 PRO A 322      17.770  60.020 134.790  1.00  0.00            
ATOM   4692  HD2 PRO A 322      19.330  60.890 134.620  1.00  0.00            
ATOM   4693  C   PRO A 322      20.020  56.920 136.410  1.00  0.00            
ATOM   4694  O1  PRO A 322      18.780  56.610 136.300  1.00  0.00            
ATOM   4695  O2  PRO A 322      20.890  56.130 136.870  1.00  0.00            
ATOM   4696  C1   TB B 323      85.140  34.010  62.880  1.00  0.00            
ATOM   4697  C2   TB B 323      84.350  35.240  62.580  1.00  0.00            
ATOM   4698  C3   TB B 323      84.790  35.750  61.220  1.00  0.00            
ATOM   4699  C4   TB B 323      83.900  36.900  60.810  1.00  0.00            
ATOM   4700  O5   TB B 323      84.420  38.000  60.510  1.00  0.00            
ATOM   4701  O6   TB B 323      82.490  36.550  60.770  1.00  0.00            
ATOM   4702  C7   TB B 323      81.780  37.770  60.540  1.00  0.00            
ATOM   4703  C8   TB B 323      80.240  37.650  60.340  1.00  0.00

I would like to obtain min and max values of column 9 from lines having TB in 4th column. I was able to extract min and max values from whole file but couldn't restrict it to the lines with TB in 4th column. Any advice?

TY in advance!

birei · February 2, 2012, 5:10pm

Hi grincz,

One way using awk:

$ cat infile
 ATOM   4688  HG1 PRO A 322      18.080  59.680 137.020  1.00  0.00            
ATOM   4689  HG2 PRO A 322      18.850  61.220 137.010  1.00  0.00            
ATOM   4690  CD  PRO A 322      18.800  60.090 135.140  1.00  0.00            
ATOM   4691  HD1 PRO A 322      17.770  60.020 134.790  1.00  0.00            
ATOM   4692  HD2 PRO A 322      19.330  60.890 134.620  1.00  0.00            
ATOM   4693  C   PRO A 322      20.020  56.920 136.410  1.00  0.00            
ATOM   4694  O1  PRO A 322      18.780  56.610 136.300  1.00  0.00            
ATOM   4695  O2  PRO A 322      20.890  56.130 136.870  1.00  0.00            
ATOM   4696  C1   TB B 323      85.140  34.010  62.880  1.00  0.00            
ATOM   4697  C2   TB B 323      84.350  35.240  62.580  1.00  0.00            
ATOM   4698  C3   TB B 323      84.790  35.750  61.220  1.00  0.00            
ATOM   4699  C4   TB B 323      83.900  36.900  60.810  1.00  0.00            
ATOM   4700  O5   TB B 323      84.420  38.000  60.510  1.00  0.00            
ATOM   4701  O6   TB B 323      82.490  36.550  60.770  1.00  0.00            
ATOM   4702  C7   TB B 323      81.780  37.770  60.540  1.00  0.00            
ATOM   4703  C8   TB B 323      80.240  37.650  60.340  1.00  0.00
$ awk 'BEGIN { max = -1; min = -1 }
  $4 == "TB" {
    max = max < $9 ? $9 : max;
    min = (min == -1 || min > $9) ? $9 : min
  }
  END { printf "max = %.3f\nmin = %.3f\n", max, min }
' infile
max = 62.880
min = 60.340

Regards,
Birei

grincz · February 2, 2012, 5:25pm

Birei, thank You very much!
I just noticed that some of the results must be from 8th column - some lines lack 5th column from early lines. Maybe it is possible to count columns backwards? Or make a condition saying that if in the 5th column there is a letter script should take value from 9th column, else from 8th?

 ATOM   4688  HG1 PRO A 322      18.080  59.680 137.020  1.00  0.00            
ATOM   4689  HG2 PRO A 322      18.850  61.220 137.010  1.00  0.00            
ATOM   4690  CD  PRO A 322      18.800  60.090 135.140  1.00  0.00            
ATOM   5178  C21  TB X 345      78.520  55.030  66.630  1.00  0.00            
ATOM   5179  C1   TB Y 346      54.110  41.980  81.650  1.00  0.00            
ATOM   5180  C2   TB Y 346      55.480  42.240  82.250  1.00  0.00            
ATOM   5181  C3   TB Y 346      56.470  41.760  81.170  1.00  0.00            
ATOM   5182  C4   TB Y 346      57.930  41.990  81.460  1.00  0.00            
ATOM   5183  O5   TB Y 346      58.450  41.810  82.590  1.00  0.00            
ATOM   5184  O6   TB Y 346      58.660  42.140  80.220  1.00  0.00            
ATOM   5185  C7   TB Y 346      60.070  42.430  80.520  1.00  0.00            
ATOM   5186  C8   TB Y 346      61.130  42.800  79.430  1.00  0.00            
ATOM   5187  O9   TB Y 346      62.430  43.290  79.860  1.00  0.00            
ATOM   5188  C10  TB Y 346      63.400  42.500  80.470  1.00  0.00                      
ATOM   5198  C20  TB Y 346      58.830  43.040  74.180  1.00  0.00            
ATOM   5199  C21  TB Y 346      59.260  42.850  72.710  1.00  0.00            
ATOM   5200  C1   TB Z 347      66.200  64.420  74.140  1.00  0.00            
ATOM   5201  C2   TB Z 347      65.770  63.120  73.420  1.00  0.00            
ATOM   5202  C3   TB Z 347      65.520  62.060  74.480  1.00  0.00            
ATOM   5203  C4   TB Z 347      65.220  60.710  73.880  1.00  0.00            
ATOM   5204  O5   TB Z 347      65.740  60.380  72.810  1.00  0.00            
ATOM   5205  O6   TB Z 347      64.790  59.800  74.890  1.00  0.00                      
ATOM   5221  C1   TB   348      82.400  42.410  76.490  1.00  0.00            
ATOM   5222  C2   TB   348      81.300  43.360  76.020  1.00  0.00            
ATOM   5223  C3   TB   348      81.800  44.780  75.790  1.00  0.00            
ATOM   5224  C4   TB   348      80.550  45.700  75.480  1.00  0.00            
ATOM   5225  O5   TB   348      80.390  46.740  76.150  1.00  0.00            
ATOM   5226  O6   TB   348      79.690  45.360  74.310  1.00  0.00            
ATOM   5227  C7   TB   348      78.480  46.220  74.270  1.00  0.00            
ATOM   5228  C8   TB   348      77.460  46.020  73.150  1.00  0.00            
ATOM   5229  O9   TB   348      76.250  46.810  73.220  1.00  0.00            
ATOM   5230  C10  TB   348      76.160  47.920  72.370  1.00  0.00            
ATOM   5231  O11  TB   348      77.230  48.280  71.940  1.00  0.00            
ATOM   5232  C12  TB   348      74.880  48.650  72.320  1.00  0.00            
ATOM   5233  C13  TB   348      74.880  49.800  71.380  1.00  0.00            
ATOM   5234  C14  TB   348      73.520  50.550  71.610  1.00  0.00            
ATOM   5235  C15  TB   348      77.190  44.510  73.170  1.00  0.00            
ATOM   5236  O16  TB   348      76.640  44.120  71.890  1.00  0.00            
ATOM   5237  C17  TB   348      75.870  42.970  72.030  1.00  0.00            
ATOM   5238  O18  TB   348      75.580  42.440  73.130  1.00  0.00            
ATOM   5239  C19  TB   348      75.490  42.400  70.680  1.00  0.00            
ATOM   5240  C20  TB   348      76.150  41.020  70.640  1.00  0.00            
ATOM   5241  C21  TB   348      75.980  40.350  69.280  1.00  0.00            
ATOM   5242  C1   TB   349      74.410  58.030  87.730  1.00  0.00

birei · February 2, 2012, 5:41pm

Try with this. What modified? In case fifth field is an alphabetic character, asign the ninth column to value variable, else assign the eigth one. And work with that.

$ awk 'BEGIN { max = -1; min = -1 }
  $4 == "TB" {
    value = $5 ~ /^[[:alpha:]]$/ ? $9 : $8;  
    max = max < value ? value : max;
    min = (min == -1 || min > value) ? value : min
  }
  END { printf "max = %.3f\nmin = %.3f\n", max, min }
' infile
max = 87.730
min = 66.630

Regards,
Birei

grincz · February 2, 2012, 6:03pm

Unfortunately it doesn't work. I have much larger file - around 100k lines and it gives me such results...

max = 99,000
min = 100,000

Any ideas? I checked and there are no more variants of the TB lines...

ahamed101 · February 2, 2012, 10:29pm

Try this...

awk '/TB/{f=$(NF-2);!min?min=f:0;max=f>max?f:max;min=f>min?min:f}END{print max":"min}' infile

--ahamed

grincz · February 3, 2012, 3:07am

Hi ahamed,
results are more possible bus still not there..

max = 99,000
min = -1,000

I know that max should be over 160 and min around 40-50.
@Birei script is working on a smaller files but with the bigger files it doesn't work.

Maybe the problem is that some lines have 4digit values in the 5th column - like this?

ATOM  25935  O9   TB  1334      -0.810  62.300  67.500  1.00  0.00            
ATOM  25936  C10  TB  1334       0.460  61.650  67.370  1.00  0.00            
ATOM  25937  O11  TB  1334       1.350  61.990  66.560  1.00  0.00            
ATOM  25938  C12  TB  1334       0.690  60.320  68.140  1.00  0.00            
ATOM  25939  C13  TB  1334       1.260  60.630  69.520  1.00  0.00            
ATOM  25940  C14  TB  1334       1.220  59.500  70.510  1.00  0.00            
ATOM  25941  C15  TB  1334      98.740  64.820  67.620  1.00  0.00            
ATOM  25942  O16  TB  1334      98.500  66.090  67.050  1.00  0.00            
ATOM  25943  C17  TB  1334      98.350  67.110  68.070  1.00  0.00            
ATOM  25944  O18  TB  1334      98.930  67.020  69.120  1.00  0.00            
ATOM  25945  C19  TB  1334      97.930  68.370  67.370  1.00  0.00            
ATOM  25946  C20  TB  1334      96.460  68.870  67.670  1.00  0.00            
ATOM  25947  C21  TB  1334      96.170  70.140  66.910  1.00  0.00

ahamed101 · February 3, 2012, 3:52am

For the data you have pasted, it is working for me. May be you should paste or upload for which it is not working.

--ahamed

grincz · February 3, 2012, 5:17am

Thanks Ahamed,
I uploaded example of the file on dropbox
http://dl.dropbox.com/u/2765863/delta.txt

ahamed101 · February 3, 2012, 5:27am

root@bt> awk '/TB/{f=$(NF-2);!min?min=f:0;max=f>max?f:max;min=f>min?min:f}END{print max":"min}' infile
110.300:51.130

Not correct?

--ahamed

grincz · February 3, 2012, 5:30am

No, maybe the problem is that I'm on a mac with bash version ?

GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin11)

ctsgnb · February 3, 2012, 6:16am

Your code may return erroneous min value if the min value is equal to zero .

Example :

min=0
f=3

!min?min=f ---> this will set min to 3 and will lose the zero value

min=f>min?min:f ---> this will then evaluate 3>3? to false ---> min=f=3 ... intead of keeping the zero 0

grincz · February 3, 2012, 10:48am

Thanks for answering guys!

I figured out why I have this problem. In my file there are "." [dots] not "," [commas] between numbers and this was ruining my results.

BEGIN { max = -1; min = 1000 }
  $4 == "TB" {
    value = $5 ~ /^[[:alpha:]]$/ ? $9 : $8;  
    max = max < value ? value : max;
    min = min > value ? value : min;
  }
  END { printf "max = %.3f\n min = %.3f\n", max, min }

This changed script from @bieri is working for me - I had to change min value because when it was set to -1 the result was -1 as well [there were no smaller values in the file].

Do You have any propositions how I could obtain results without changing all the [dots] to [commas]?

Corona688 · February 3, 2012, 10:51am

awk -F. ?

grincz · February 3, 2012, 3:25pm

The Corona688 method didn't work. Probably it's just a problem with EastEuropean coding of signs or some other minor thing like that. I'll change the dots into comas and will use the file only for analysis.

Thanks everyone for replaying and trying to help, really appreciate it!

birei · February 3, 2012, 3:56pm

@grincz, there is one thing I don't understand. You linked a file and there were no commas. And script seems to give the correct output.

Why don't you paste your input data and check the reason the script is failing.

Regards,
Birei.

grincz · February 3, 2012, 4:13pm

Exactly there were no commas instead there were dots [this way the script wasn't working on my system] and after changing all the dots into commas script started to work. As I mentioned before I'm on a mac and probably this is the fault of the differences in the functioning of the script.

The input data which I pasted before is good representation of whole input file - with dots the script wasn't working on neither and with commas it works on both.

I think that for further work it would be good if I will install linux distribution

Take care

Corona688 · February 3, 2012, 4:21pm

You mean, you had to turn 59.680 into 59,680 for it to be taken as a floating point number?

That probably has more to do with your locale/language settings than your version of awk!

grincz · February 3, 2012, 6:24pm

Exactly!

I think so too...