I have table files with different measurements for the same sensors. The first column indicates the sensor (7 different sensors and 16 measurements in the example below). I would like to find the best measurement for each sensor. The best measurement is the one with the highest value of $6 * $7. If two or more measurements are equally good, I need the first one. The number of measurements per sensor varies between 3 and 6.
a. Add new columns
awk '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$1,$6*$7}' infile.txt > infile.new
b. Get a list of all the sensors:
awk '{ print $1}' infile.txt | sort -u > sensor.list
c. Sort file according to the last column
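for SENSOR in $(cat sensor.list)
do
    # Highest $6*$7 first; -s keeps input order among ties (-s is not POSIX, but GNU and BSD sort support it)
    grep "${SENSOR}" infile.new | sort -s -k 15,15nr | head -n 1 > "${SENSOR}_best.res"
done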
Does anybody have a better, faster solution? Thank you very much for considering my request.
awk '
{
    $1 = $1
    V = $6 * $7
    if ( $1 in A )
    {
        if ( A[$1] < V )
        {
            A[$1] = V
            R[$1] = $0 OFS V
        }
    }
    else
    {
        A[$1] = V
        R[$1] = $0 OFS V
    }
}
END {
    for ( k in R )
        print R[k]
}
' OFS='\t' file
Why doesn't your desired output (as shown in outfile.txt) contain an entry for sensor M01072-5?
Yoda provided a solution that should be faster (and use a lot fewer resources) than your script. If the input for each of your sensors is grouped together by sensor (as shown in your example), it can be simplified even more.
So, is the input for each of your sensors always grouped as in your sample above?
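To illustrate the simplification hinted at above, here is a minimal sketch that assumes all lines for a sensor are adjacent in the input. The sample rows and the variable names key, max, and best are made up for illustration and are not taken from the original data. Because only the current group is held in memory, no arrays are needed and the output keeps the input order.

```shell
# Hypothetical sample: column 1 = sensor, columns 6 and 7 = the two factors.
result=$(printf '%s\n' \
    'S1 a b c d 2 3' \
    'S1 a b c d 4 5' \
    'S2 a b c d 1 1' \
    'S2 a b c d 1 1 dup' |
awk '
NR == 1 || $1 != key {          # a new sensor group starts
    if (NR > 1) print best      # flush the best line of the previous group
    key = $1; max = $6 * $7; best = $0
    next
}
$6 * $7 > max {                 # strictly greater, so ties keep the first line
    max = $6 * $7; best = $0
}
END { if (NR > 0) print best }  # flush the last group
')
printf '%s\n' "$result"
```

This only works if the grouping assumption holds; with interleaved sensors it would print several lines per sensor.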
Basically, the code checks whether the first field is already an index of array A. If it is, the stored value is compared with the new one to decide whether to keep or replace the existing record; if it is not, the record is simply stored.
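The logic described above can be traced on a tiny example. This is only a sketch: the sample rows are invented, the computed value is not appended to the output, and the result is piped through sort because the order of "for (k in R)" is unspecified in awk.

```shell
# Hypothetical sample: column 1 = sensor, columns 6 and 7 = the two factors.
sample=$(mktemp)
cat > "$sample" <<'EOF'
S1 a b c d 2 3
S1 a b c d 4 5
S1 a b c d 5 4
S2 a b c d 1 1
EOF

best=$(awk '
{
    V = $6 * $7                 # score of the current line
    if ($1 in A) {              # sensor already seen?
        if (A[$1] < V) {        # strictly better: replace (ties keep the earlier line)
            A[$1] = V
            R[$1] = $0
        }
    } else {                    # first line for this sensor: store it
        A[$1] = V
        R[$1] = $0
    }
}
END { for (k in R) print R[k] }
' "$sample" | sort)
rm -f "$sample"
printf '%s\n' "$best"
```

For S1 the second and third rows both score 20; the strict "<" comparison leaves the earlier one in place, which is the tie-breaking rule requested in the first post.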
Thanks for pointing out the missing value for one of the sensors. Finding the best values for each sensor is part of a larger script, and at the end a few sensors have to be removed.
The list is concatenated and therefore the sensors are not sorted. I could, however, add an extra step to sort the file first.
I wasn't sure if sensor M01072-5 was missing or if sensor M01072-4 was there by mistake. You originally said that there would be three to six measurements per sensor and both of these sensors only had two measurements in your sample data.
Sorting your input file takes time and might rearrange the order of lines that have the same sort key, so if picking the 1st measurement from measurements with the same calculated "best" measurement matters, Yoda's suggestion is probably much better than sorting and then using a simpler awk script.
The script rdrtx1 just provided also looks good as long as none of your sensor reading calculations ($6 * $7) produce negative values.
I'm not sure I understand rdrtx1's script (please explain; thanks!), but it will not print the best measurement (i.e. the one with max $6*$7), but the first line for each sensor.
RudiC is correct; rdrtx1's script uses the array a[$1] to store both the computed value of the best measurement and the contents of the line containing the best measurement. It can't do both.
Yoda's script uses a tab as the field separator (where the requested separator seems to be four spaces) and not only prints the line with the best measurement for each sensor, but also the computed measurement (which was not requested). Rearranging Yoda's script, changing OFS, leaving off the computed measurement, and dropping one unneeded assignment ( $1 = $1 ), I came up with this script:
awk '
{   V = $6 * $7
    if(!($1 in A) || A[$1] < V) {
        A[$1] = V
        R[$1] = $0
    }
}
END {   for(k in R)
            print R[k]
}
' OFS=' ' infile.txt
which matches the requested output except that the output order is different and the output for sensor M01072-5 (shown in red above) was missed in the original request. If the output order difference is important, you'll need to specify what order is needed. We can then either sort the results externally or modify the awk script (to preserve the input order) depending on what output order is required.
As always, if you're going to use this on a Solaris system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of /usr/bin/awk or /bin/awk .
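If the required order turns out to be "first appearance in the input", one possible modification is sketched below. The extra array order and the counter n are names introduced here for illustration, and the sample rows are invented; this is one way to do it under that assumption, not necessarily what the final script should look like.

```shell
# Hypothetical, deliberately interleaved sample: S2 appears before S1.
ordered=$(printf '%s\n' \
    'S2 a b c d 1 1' \
    'S1 a b c d 2 3' \
    'S2 a b c d 3 3' \
    'S1 a b c d 1 1' |
awk '
{
    V = $6 * $7
    if (!($1 in A)) order[++n] = $1     # remember first appearance of each sensor
    if (!($1 in A) || A[$1] < V) {      # new sensor, or strictly better value
        A[$1] = V
        R[$1] = $0
    }
}
END { for (i = 1; i <= n; i++) print R[order[i]] }
')
printf '%s\n' "$ordered"
```

Looping over a numeric index in END avoids the unspecified ordering of "for (k in R)", at the cost of one more small array.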
I tested the awk script provided by rdrtx1 and it worked ... although I see now the double assignment problem.
Dear Don Cragun,
Thanks for pointing out the problem with rdrtx1's awk script. I used a sorted table and did not realize the problem. I will use the modified script originally suggested by Yoda. Thanks!