Hi guys!
I'm new to scripting and I need to write a script in awk.
Here is example of file on which I'm working
ATOM 4688 HG1 PRO A 322 18.080 59.680 137.020 1.00 0.00
ATOM 4689 HG2 PRO A 322 18.850 61.220 137.010 1.00 0.00
ATOM 4690 CD PRO A 322 18.800 60.090 135.140 1.00 0.00
ATOM 4691 HD1 PRO A 322 17.770 60.020 134.790 1.00 0.00
ATOM 4692 HD2 PRO A 322 19.330 60.890 134.620 1.00 0.00
ATOM 4693 C PRO A 322 20.020 56.920 136.410 1.00 0.00
ATOM 4694 O1 PRO A 322 18.780 56.610 136.300 1.00 0.00
ATOM 4695 O2 PRO A 322 20.890 56.130 136.870 1.00 0.00
ATOM 4696 C1 TB B 323 85.140 34.010 62.880 1.00 0.00
ATOM 4697 C2 TB B 323 84.350 35.240 62.580 1.00 0.00
ATOM 4698 C3 TB B 323 84.790 35.750 61.220 1.00 0.00
ATOM 4699 C4 TB B 323 83.900 36.900 60.810 1.00 0.00
ATOM 4700 O5 TB B 323 84.420 38.000 60.510 1.00 0.00
ATOM 4701 O6 TB B 323 82.490 36.550 60.770 1.00 0.00
ATOM 4702 C7 TB B 323 81.780 37.770 60.540 1.00 0.00
ATOM 4703 C8 TB B 323 80.240 37.650 60.340 1.00 0.00
I would like to obtain min and max values of column 9 from lines having TB in 4th column. I was able to extract min and max values from whole file but couldn't restrict it to the lines with TB in 4th column. Any advice?
$ cat infile
ATOM 4688 HG1 PRO A 322 18.080 59.680 137.020 1.00 0.00
ATOM 4689 HG2 PRO A 322 18.850 61.220 137.010 1.00 0.00
ATOM 4690 CD PRO A 322 18.800 60.090 135.140 1.00 0.00
ATOM 4691 HD1 PRO A 322 17.770 60.020 134.790 1.00 0.00
ATOM 4692 HD2 PRO A 322 19.330 60.890 134.620 1.00 0.00
ATOM 4693 C PRO A 322 20.020 56.920 136.410 1.00 0.00
ATOM 4694 O1 PRO A 322 18.780 56.610 136.300 1.00 0.00
ATOM 4695 O2 PRO A 322 20.890 56.130 136.870 1.00 0.00
ATOM 4696 C1 TB B 323 85.140 34.010 62.880 1.00 0.00
ATOM 4697 C2 TB B 323 84.350 35.240 62.580 1.00 0.00
ATOM 4698 C3 TB B 323 84.790 35.750 61.220 1.00 0.00
ATOM 4699 C4 TB B 323 83.900 36.900 60.810 1.00 0.00
ATOM 4700 O5 TB B 323 84.420 38.000 60.510 1.00 0.00
ATOM 4701 O6 TB B 323 82.490 36.550 60.770 1.00 0.00
ATOM 4702 C7 TB B 323 81.780 37.770 60.540 1.00 0.00
ATOM 4703 C8 TB B 323 80.240 37.650 60.340 1.00 0.00
$ awk 'BEGIN { max = -1; min = -1 }
$4 == "TB" {
max = max < $9 ? $9 : max;
min = (min == -1 || min > $9) ? $9 : min
}
END { printf "max = %.3f\nmin = %.3f\n", max, min }
' infile
max = 62.880
min = 60.340
Birei, thank You very much!
I just noticed that some of the results must be from 8th column - some lines lack 5th column from early lines. Maybe it is possible to count columns backwards? Or make a condition saying that if in the 5th column there is a letter script should take value from 9th column, else from 8th?
ATOM 4688 HG1 PRO A 322 18.080 59.680 137.020 1.00 0.00
ATOM 4689 HG2 PRO A 322 18.850 61.220 137.010 1.00 0.00
ATOM 4690 CD PRO A 322 18.800 60.090 135.140 1.00 0.00
ATOM 5178 C21 TB X 345 78.520 55.030 66.630 1.00 0.00
ATOM 5179 C1 TB Y 346 54.110 41.980 81.650 1.00 0.00
ATOM 5180 C2 TB Y 346 55.480 42.240 82.250 1.00 0.00
ATOM 5181 C3 TB Y 346 56.470 41.760 81.170 1.00 0.00
ATOM 5182 C4 TB Y 346 57.930 41.990 81.460 1.00 0.00
ATOM 5183 O5 TB Y 346 58.450 41.810 82.590 1.00 0.00
ATOM 5184 O6 TB Y 346 58.660 42.140 80.220 1.00 0.00
ATOM 5185 C7 TB Y 346 60.070 42.430 80.520 1.00 0.00
ATOM 5186 C8 TB Y 346 61.130 42.800 79.430 1.00 0.00
ATOM 5187 O9 TB Y 346 62.430 43.290 79.860 1.00 0.00
ATOM 5188 C10 TB Y 346 63.400 42.500 80.470 1.00 0.00
ATOM 5198 C20 TB Y 346 58.830 43.040 74.180 1.00 0.00
ATOM 5199 C21 TB Y 346 59.260 42.850 72.710 1.00 0.00
ATOM 5200 C1 TB Z 347 66.200 64.420 74.140 1.00 0.00
ATOM 5201 C2 TB Z 347 65.770 63.120 73.420 1.00 0.00
ATOM 5202 C3 TB Z 347 65.520 62.060 74.480 1.00 0.00
ATOM 5203 C4 TB Z 347 65.220 60.710 73.880 1.00 0.00
ATOM 5204 O5 TB Z 347 65.740 60.380 72.810 1.00 0.00
ATOM 5205 O6 TB Z 347 64.790 59.800 74.890 1.00 0.00
ATOM 5221 C1 TB 348 82.400 42.410 76.490 1.00 0.00
ATOM 5222 C2 TB 348 81.300 43.360 76.020 1.00 0.00
ATOM 5223 C3 TB 348 81.800 44.780 75.790 1.00 0.00
ATOM 5224 C4 TB 348 80.550 45.700 75.480 1.00 0.00
ATOM 5225 O5 TB 348 80.390 46.740 76.150 1.00 0.00
ATOM 5226 O6 TB 348 79.690 45.360 74.310 1.00 0.00
ATOM 5227 C7 TB 348 78.480 46.220 74.270 1.00 0.00
ATOM 5228 C8 TB 348 77.460 46.020 73.150 1.00 0.00
ATOM 5229 O9 TB 348 76.250 46.810 73.220 1.00 0.00
ATOM 5230 C10 TB 348 76.160 47.920 72.370 1.00 0.00
ATOM 5231 O11 TB 348 77.230 48.280 71.940 1.00 0.00
ATOM 5232 C12 TB 348 74.880 48.650 72.320 1.00 0.00
ATOM 5233 C13 TB 348 74.880 49.800 71.380 1.00 0.00
ATOM 5234 C14 TB 348 73.520 50.550 71.610 1.00 0.00
ATOM 5235 C15 TB 348 77.190 44.510 73.170 1.00 0.00
ATOM 5236 O16 TB 348 76.640 44.120 71.890 1.00 0.00
ATOM 5237 C17 TB 348 75.870 42.970 72.030 1.00 0.00
ATOM 5238 O18 TB 348 75.580 42.440 73.130 1.00 0.00
ATOM 5239 C19 TB 348 75.490 42.400 70.680 1.00 0.00
ATOM 5240 C20 TB 348 76.150 41.020 70.640 1.00 0.00
ATOM 5241 C21 TB 348 75.980 40.350 69.280 1.00 0.00
ATOM 5242 C1 TB 349 74.410 58.030 87.730 1.00 0.00
Try with this. What modified? In case fifth field is an alphabetic character, asign the ninth column to value variable, else assign the eigth one. And work with that.
$ awk 'BEGIN { max = -1; min = -1 }
$4 == "TB" {
value = $5 ~ /^[[:alpha:]]$/ ? $9 : $8;
max = max < value ? value : max;
min = (min == -1 || min > value) ? value : min
}
END { printf "max = %.3f\nmin = %.3f\n", max, min }
' infile
max = 87.730
min = 66.630
I figured out why I have this problem. In my file there are "." [dots] not "," [commas] between numbers and this was ruining my results.
BEGIN { max = -1; min = 1000 }
$4 == "TB" {
value = $5 ~ /^[[:alpha:]]$/ ? $9 : $8;
max = max < value ? value : max;
min = min > value ? value : min;
}
END { printf "max = %.3f\n min = %.3f\n", max, min }
This changed script from @bieri is working for me - I had to change min value because when it was set to -1 the result was -1 as well [there were no smaller values in the file].
Do You have any propositions how I could obtain results without changing all the [dots] to [commas]?
The Corona688 method didn't work. Probably it's just a problem with EastEuropean coding of signs or some other minor thing like that. I'll change the dots into comas and will use the file only for analysis.
Thanks everyone for replaying and trying to help, really appreciate it!
Exactly there were no commas instead there were dots [this way the script wasn't working on my system] and after changing all the dots into commas script started to work. As I mentioned before I'm on a mac and probably this is the fault of the differences in the functioning of the script.
The input data which I pasted before is good representation of whole input file - with dots the script wasn't working on neither and with commas it works on both.
I think that for further work it would be good if I will install linux distribution