finding least out of selected lines

CAch · September 1, 2011, 2:44pm

Hello,

I have a file, which looks like:

1 PRO 3 ILE 6.816858
2 GLN 4 THR 6.763534
3 ILE 5 LEU 6.659603
4 THR 6 TRP 5.887478
5 LEU 7 GLN 5.224145
6 TRP 4 THR 5.887478
7 GLN 5 LEU 5.224145
8 ARG 10 LEU 5.922154
9 PRO 23 LEU 5.841176
10 LEU 23 LEU 4.665862
11 VAL 22 ALA 5.404240
12 THR 21 GLU 4.437617
13 ILE 66 ILE 4.792131
14 LYS 19 LEU 4.804988
15 ILE 17 GLY 5.244380
16 GLY 18 GLN 5.444090
17 GLY 15 ILE 5.244380
18 GLN 15 ILE 5.435863
19 LEU 14 LYS 4.804988
20 LYS 13 ILE 5.280103
21 GLU 12 THR 4.437617
22 ALA 83 ASN 4.773669
23 LEU 10 LEU 4.665862
24 SER 85 ILE 5.104049
25 ASP 86 GLY 5.401655
26 THR 28 ALA 5.716655

For "PRO" in the second column, I want to print the row that has the minimum value in the fifth column among all "PRO" rows, and likewise for every other string that appears in the second column.

I am using :

 
filename=list
exec<$filename
while read line
do
awk '{print $2,"\t"$5,"\t"$1,"\t"$3,"\t"$4}' $line | sort | uniq | awk '{if ($1 != prev_1 && $2 != prev_2){print}; prev_1=$1; prev_2=$2}' > $line"20m"
done

I am getting results, but I don't understand this command.
Also, if a string such as "SER" occurs only once in the 2nd column, it is not printed in the output file, whereas I want every string to appear with its minimum fifth-column value.
Can anyone please suggest a fix, or help me understand the command?

bartus11 · September 1, 2011, 3:19pm

Try:

awk '!a[$2]{a[$2]=$0;m[$2]=$5}$5<m[$2]{a[$2]=$0;m[$2]=$5}END{for (i in a) print a[i]}' file
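The same idea can be spelled out with comments. This is only a sketch, assuming whitespace-separated fields with the residue name in column 2 and the value in column 5; the function name min_per_key and the sample file sample.txt are made up for illustration. Adding 0 to $5 forces a numeric comparison.

```shell
# min_per_key [FILE...]: for each distinct value in column 2, print the row
# whose column 5 is numerically smallest (the first such row wins on ties).
min_per_key() {
  awk '
    # Key not seen before, or a strictly smaller value: remember this row.
    !($2 in row) || $5 + 0 < min[$2] {
      row[$2] = $0
      min[$2] = $5 + 0
    }
    END {
      # The order of "for (k in row)" is unspecified; sort afterwards if needed.
      for (k in row) print row[k]
    }
  ' "$@"
}

# Hypothetical sample built from a few rows of the data in the question.
cat > sample.txt <<'EOF'
1 PRO 3 ILE 6.816858
9 PRO 23 LEU 5.841176
3 ILE 5 LEU 6.659603
13 ILE 66 ILE 4.792131
15 ILE 17 GLY 5.244380
24 SER 85 ILE 5.104049
EOF

# Restore row-number order, since awk may emit the keys in any order.
min_per_key sample.txt | sort -n
```

On this sample it prints rows 9 (PRO), 13 (ILE) and 24 (SER), so a name that occurs only once is still reported.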

DGPickett · September 1, 2011, 3:31pm

Well, use shell or awk, not both.

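# Command substitution keeps key5 in the current shell;
# "... | read key5" loses the value in bash, where read runs in a subshell.
key5=$(sed '
  /^[0-9]* PRO /!d
  s/.* //
 ' your_file | sort -n | head -n 1)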
 
grep "^[0-9]* PRO .* $key5$" your_file

As for your script: the name of a file holding a list of file names goes into a variable, and exec makes that file stdin for the rest of the script. Each line is read into the variable line, and awk is called on that file name to rearrange the fields with tab separators. The result is sorted left to right as text (not numerically, so 10 may sort before 9) and made unique (sort does that better with -u). A second awk then tests the first two fields against the saved first two fields of the previous line and prints a line only when both differ, so only the first line of any set of values is printed. The output goes to a file with the same name plus the suffix 20m.
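To illustrate the point about numeric sorting, here is a sketch of a sort-based variant. It assumes the same whitespace-separated layout as the sample data; min_by_sort is a made-up name, and LC_ALL=C is there only so that "." is treated as the decimal point whatever the locale.

```shell
# min_by_sort [FILE...]: sort by column 2 as text, then by column 5 as a
# number, and keep only the first row seen for each column-2 value.
min_by_sort() {
  LC_ALL=C sort -b -k2,2 -k5,5n "$@" | awk '!seen[$2]++'
}

# Example: 9.8 must win over 10.5, which a plain text sort gets wrong.
printf '1 PRO 3 ILE 10.5\n9 PRO 23 LEU 9.8\n24 SER 85 ILE 5.1\n' | min_by_sort
```

The output comes out ordered by the name in column 2, one row per name, and names that occur only once are kept.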