Find highest records in table

GDC · July 2, 2013, 11:56am

Dear All,

I have table files with different measurements for the same sensors. The first column indicate the sensor (7 different sensors and 16 measurements in the example below). I would like to find the best measurement for each sensor. The best measurement is determined by the higher value of ($6 x $7). If two or more measurements are equally good I would need the first one. The number of measurements per sensor varies between 3-6.

cat infile.txt
M01072-1    FJ973371    238617596    400    1979    97.91    383    8    0    1    383    1105    723
M01072-1    DQ980471    116222186    400    3178    97.91    383    8    0    1    383    1070    688
M01072-1    AM901334    195944118    400    988    90.74    378    26    9    1    373    986    613
M01072-2    AM901325    195944109    400    988    90.74    378    26    9    1    373    986    613
M01072-2    AM901311    195944095    400    988    89.82    393    29    11    1    387    986    599
M01072-2    EF205010    147743297    443    1269    99.55    443    2    0    1    443    869    427
M01011-2    HQ670228    339699497    443    1313    93.72    446    22    6    1    443    871    429
M01011-2    HQ670227    339699496    443    1210    93.72    446    22    6    1    443    817    375
M01072-4    HM483399    299790116    443    3656    92.63    448    23    9    1    443    3314    2872
M01072-4    HM483396    299790113    443    3541    92.60    446    27    6    1    443    3298    2856
M01072-5    AF183464    6120004    403    2120    93.20    397    22    5    8    403    539    931
M01072-5    AF183460    6120000    403    2126    93.20    397    22    5    8    403    538    930
M01072-6    AF395513    15281673    403    2104    93.25    400    12    13    1    392    510    902
M01072-6    AF183469    6120009    403    2142    93.23    399    14    13    1    392    529    921
M01072-6    DQ015714    66864259    403    1684    92.73    399    20    9    8    403    196    588
M01072-7    EF205010    147743297    358    1269    99.43    351    2    0    8    358    791    441

cat outfile.txt
M01072-1    FJ973371    238617596    400    1979    97.91    383    8    0    1    383    1105    723
M01072-2    EF205010    147743297    443    1269    99.55    443    2    0    1    443    869    427
M01011-2    HQ670228    339699497    443    1313    93.72    446    22    6    1    443    871    429
M01072-4    HM483399    299790116    443    3656    92.63    448    23    9    1    443    3314    2872
M01072-6    AF395513    15281673    403    2104    93.25    400    12    13    1    392    510    902
M01072-7    EF205010    147743297    358    1269    99.43    351    2    0    8    358    791    441

The following solution works but it is slow.

a. Add new columns 
awk '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$1,$6*$7}' infile.txt > infile.new
b. Get a list of all the sensors:
awk '{ print $1}' infile.txt | sort -u > sensor.list
c. Sort file according to the last column
for SENSOR in `cat sensor.list`
do
  grep "${SENSOR}" infile.new | sort -k 15 -n | head -n 1 >  ${SENSOR}_best.res
done

Does anybody have a better, faster solution? Thank you very much for considering my request.

Yoda · July 2, 2013, 12:21pm

awk '
        {
                $1 = $1
                V = $6 * $7
                if ( $1 in A )
                {
                        if ( A[$1] < V )
                        {
                                A[$1] = V
                                R[$1] = $0 OFS V
                        }
                }
                else
                {
                        A[$1] = V
                        R[$1] = $0 OFS V
                }
        }
        END {
                for ( k in R )
                        print R[k]
        }
' OFS='\t' file

Don_Cragun · July 2, 2013, 12:51pm

gdc:

Dear All,

I have table files with different measurements for the same sensors. The first column indicate the sensor (7 different sensors and 16 measurements in the example below). I would like to find the best measurement for each sensor. The best measurement is determined by the higher value of ($6 x $7). If two or more measurements are equally good I would need the first one. The number of measurements per sensor varies between 3-6.

cat infile.txt
M01072-1    FJ973371    238617596    400    1979    97.91    383    8    0    1    383    1105    723
M01072-1    DQ980471    116222186    400    3178    97.91    383    8    0    1    383    1070    688
M01072-1    AM901334    195944118    400    988    90.74    378    26    9    1    373    986    613
M01072-2    AM901325    195944109    400    988    90.74    378    26    9    1    373    986    613
M01072-2    AM901311    195944095    400    988    89.82    393    29    11    1    387    986    599
M01072-2    EF205010    147743297    443    1269    99.55    443    2    0    1    443    869    427
M01011-2    HQ670228    339699497    443    1313    93.72    446    22    6    1    443    871    429
M01011-2    HQ670227    339699496    443    1210    93.72    446    22    6    1    443    817    375
M01072-4    HM483399    299790116    443    3656    92.63    448    23    9    1    443    3314    2872
M01072-4    HM483396    299790113    443    3541    92.60    446    27    6    1    443    3298    2856
M01072-5    AF183464    6120004    403    2120    93.20    397    22    5    8    403    539    931
M01072-5    AF183460    6120000    403    2126    93.20    397    22    5    8    403    538    930
M01072-6    AF395513    15281673    403    2104    93.25    400    12    13    1    392    510    902
M01072-6    AF183469    6120009    403    2142    93.23    399    14    13    1    392    529    921
M01072-6    DQ015714    66864259    403    1684    92.73    399    20    9    8    403    196    588
M01072-7    EF205010    147743297    358    1269    99.43    351    2    0    8    358    791    441

cat outfile.txt
M01072-1    FJ973371    238617596    400    1979    97.91    383    8    0    1    383    1105    723
M01072-2    EF205010    147743297    443    1269    99.55    443    2    0    1    443    869    427
M01011-2    HQ670228    339699497    443    1313    93.72    446    22    6    1    443    871    429
M01072-4    HM483399    299790116    443    3656    92.63    448    23    9    1    443    3314    2872
M01072-6    AF395513    15281673    403    2104    93.25    400    12    13    1    392    510    902
M01072-7    EF205010    147743297    358    1269    99.43    351    2    0    8    358    791    441

The following solution works but it is slow.

a. Add new columns 
awk '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$1,$6*$7}' infile.txt > infile.new
b. Get a list of all the sensors:
awk '{ print $1}' infile.txt | sort -u > sensor.list
c. Sort file according to the last column
for SENSOR in `cat sensor.list`
do
  grep "${SENSOR}" infile.new | sort -k 15 -n | head -n 1 >  ${SENSOR}_best.res
done

Does anybody have a better, faster solution? Thank you very much for considering my request.

Why doesn't your desired output (as shown in outfile.txt) contain an entry for sensor M01072-5?
Yoda provided a solution that should be faster (and use a lot fewer resources) than your script. If the input for each of your sensors is grouped together by sensor (as shown in your example), it can be simplified even more.

So, is the input for each of your sensors always grouped as in your sample above?

GDC · July 2, 2013, 3:42pm

Dear Yoda,

thank you very much for your help. I am not quite sure if I understand your awk script. Would you please explain me the use of "if ( $1 in A )"?

Thanks again,

Yoda · July 2, 2013, 3:47pm

if ( $1 in A ) statement scans if field $1 in present in array A index.

To know more about this expression check: Scanning an Array

Basically the code is checking if a 1st field is already part of array index, if yes then we will perform the comparison to determine whether to keep or discard existing record:

                        if ( A[$1] < V )
                        {
                                A[$1] = V
                                R[$1] = $0 OFS V
                        }

GDC · July 2, 2013, 3:49pm

Dear Don Cragun,

thanks for point out the missing value for one of the sensor. Finding the best values for each sensor is part of a larger script and at the end a few sensors have to be removed.

The list is concatenated and therefore the sensors are not sorted. I could, however, add a extra step to sort the file first.

Thanks for your suggestions!

rdrtx1 · July 2, 2013, 4:02pm

try also:

awk '
{if (($6 * $7) > b[$1]) {a[$1]=$0; b[$1]=$6 * $7}}
END { for (i in a) print a }
' infile.txt

pipe into sort if needed.

Don_Cragun · July 2, 2013, 4:10pm

I wasn't sure if sensor M01072-5 was missing or if sensor M01072-4 was there by mistake. You originally said that there would be three to six measurements per sensor and both of these sensors only had two measurements in your sample data.

Sorting your input file takes time and might rearrange the order of lines that have the same sort key, so if picking the 1st measurement from measurements with the same calculated "best" measurement matters, Yoda's suggestion is probably much better than sorting and then using a simpler awk script.

The script rdrtx1 just provided also looks good as long as none of your sensor reading calculations ($6 * $7) produce negative values.

RudiC · July 2, 2013, 4:49pm

I'm not sure I understand rdrtx1's script (please explain; thanks!), but it will not print the best measurement (i.e. the one with max $6*$7), but the first line for each sensor.

Don_Cragun · July 3, 2013, 12:03am

RudiC is correct; rdrtx1's script uses the array a[$1] to store both the computed value of the best measurement and the contents of the line containing the best measurement. It can't do both.

Yoda's script uses a tab as the field separator (where the requested separator seems to be four spaces) and not only prints the line with the best measurement for each sensor, but also the computed measurement (which was not requested). Rearranging Yoda's script, changing OFS, leaving off the computed measurement, and dropping one unneeded assignment ( $1 = $1 ); I came up with this script:

awk '
{       V = $6 * $7
        if(!($1 in A) || A[$1] < V) {
                A[$1] = V
                R[$1] = $0
        }
}
END {   for(k in R)
                print R[k]
}
' OFS='    ' infile.txt

which produces the following output:

M01072-1    FJ973371    238617596    400    1979    97.91    383    8    0    1    383    1105    723
M01072-2    EF205010    147743297    443    1269    99.55    443    2    0    1    443    869    427
M01072-4    HM483399    299790116    443    3656    92.63    448    23    9    1    443    3314    2872
M01072-5    AF183464    6120004    403    2120    93.20    397    22    5    8    403    539    931
M01072-6    AF395513    15281673    403    2104    93.25    400    12    13    1    392    510    902
M01072-7    EF205010    147743297    358    1269    99.43    351    2    0    8    358    791    441
M01011-2    HQ670228    339699497    443    1313    93.72    446    22    6    1    443    871    429

which matches the requested output except that the output order is different and the output for sensor M01072-5 (shown in red above) was missed in the original request. If the output order difference is important, you'll need to specify what order is needed. We can then either sort the results externally or modify the awk script (to preserve the input order) depending on what output order is required.

As always, if you're going to use this on a Solaris system; use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of /usr/bin/awk or /bin/awk .

Hope this helps.

GDC · July 3, 2013, 10:47am

Dear All,

I tested the awk script provided by rdrtxt and it worked ... although I see now the double assignment problem.

---------- Post updated at 04:47 PM ---------- Previous update was at 04:24 PM ----------

Dear Don Cragun,

thanks for pointing out the problem with rdrtx1's awk script. I used a sorted table and did not realize the problem. I will use the modified script originally suggested by Yoda. Thanks!