Selecting lowest and highest values in columns 1 and 2, based on subsets in column 3

hubleo · April 25, 2012, 7:15am

Hi,

I have a file with the following columns:

361459  447394  CHL1
290282  290282  CHL1
361459  447394  CHL1
361459  447394  CHL1
178352861  178363529  AGA
178352861  178363529  AGA
178363657  178363657  AGA

Essentially, using CHL1 as an example. For any line that has CHL1 in column 3, I want to select the lowest value in column 1 , and the highest value in column 2. These should then produce a single line that looks like

 290282 447394 CHL1

Using the same principle, for AGA, the line should look like

 178352861 178363657 AGA

The whole file contains a about 500 unique names in column 3. CHL1 and AGA would be 2/500. Some sort of loop to run through each of these names would be perfect. I'm very new to Linux and have a bit of knowledge in very basic use of AWK, sed, grep ect but I'm unsure how to maniplulate the file to get the output as stated above.

Any help would be very much appreciated!

jim_mcnamara · April 25, 2012, 7:31am

awk  '{ max[$3]=($2>max[$3])? $2: max[$3];
           if(! $3 in min) {min[$3]=999999999};  
           min[$3]=($1<min[$3])? $1: min[$3];
           next;
        }
        END {
               for(i in min){ print  min, max, i}
        } '   inputfilename  > outputfilename

Assuming I understood... try this.

hubleo · April 25, 2012, 7:59am

Hi, many thanks for your reply! That prints the max value in column 2 for the specific name in column 3, but not the minimum value in column 1. I'll have a play and see if I can get that to work. One solution that was provided elsewhere was, which works perfectly is:

#!/bin/bash

awk '{
if ( larr[$3] == "" ) {
  larr[$3] = $1
  harr[$3] = $2 
}
if ( larr[$3] != "" ) {
  if ( larr[$3] >= $1 ) { larr[$3] = $1 }
  if ( harr[$3] <= $2 ) { harr[$3] = $2 }
}
}
END { for (x in larr)
     print larr[x], harr[x],x
}' infile > outfile

Thanks again for your help!