I have a file test.txt with the lines below:
$ cat test.txt
AAA 1
AAA 2
BBB 5
BBB 7
BBB 9
CCC 3
CCC 4
DDD 6
EEE 5
I want to filter the file above so that each value in the first column appears on only one row: whenever several rows share the same first-column value, only the row with the biggest number in the second column should be kept.
e.g. There are three rows that have "BBB" in the first column, but I want only the one with the biggest number in the second column, "BBB 9", to be displayed.
My desired result after filtering is:
AAA 2
BBB 9
CCC 4
DDD 6
EEE 5
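To restate the rule: sort so that, within each key, the biggest second-column value comes first, then keep only the first line seen per key. A sketch of that statement (it assumes the two-column, space-separated layout shown above):

```shell
# Recreate the sample input so the sketch is self-contained.
cat > test.txt <<'EOF'
AAA 1
AAA 2
BBB 5
BBB 7
BBB 9
CCC 3
CCC 4
DDD 6
EEE 5
EOF

# Sort by key, then numerically descending on the second column,
# and let awk keep the first (i.e. biggest) line for each key.
sort -k1,1 -k2,2nr test.txt | awk '!seen[$1]++'
```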
I have created the ksh script below, which produces the result above, but when there are many rows it takes very long to run, so I would like some help getting the same result in less time.
#!/bin/ksh
INPUT=$1

# Sort by the first column, then numerically by the second, so that for
# each key the line with the biggest value comes last (plain sort is
# lexical and would put e.g. "10" before "9").
sort -k1,1 -k2,2n "$INPUT" > "$$" && mv "$$" "$INPUT"

cut -d' ' -f1 "$INPUT" | sort | uniq -d > ROWS_WITH_MULTIPLE_VALUES
cut -d' ' -f1 "$INPUT" | sort | uniq -u > ROWS_WITH_SINGLE_VALUES

if [[ -s ROWS_WITH_MULTIPLE_VALUES ]]
then
    while read VALUE_1
    do
        # Anchor the pattern so a key matches only at the start of a line.
        grep "^$VALUE_1 " "$INPUT" | tail -1
    done < ROWS_WITH_MULTIPLE_VALUES
fi

if [[ -s ROWS_WITH_SINGLE_VALUES ]]
then
    while read VALUE_1
    do
        grep "^$VALUE_1 " "$INPUT"
    done < ROWS_WITH_SINGLE_VALUES
fi

rm ROWS_WITH_MULTIPLE_VALUES ROWS_WITH_SINGLE_VALUES
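For comparison, the same result can be computed in a single pass with awk, holding one running maximum per key in memory instead of re-scanning the whole file with grep for every key. This is only a sketch under the same two-column, space-separated assumption:

```shell
# Recreate the sample input so the sketch is self-contained.
cat > test.txt <<'EOF'
AAA 1
AAA 2
BBB 5
BBB 7
BBB 9
CCC 3
CCC 4
DDD 6
EEE 5
EOF

# Single pass: keep the biggest second-column value seen for each
# first-column key, then print one line per key at the end.
awk '
    !($1 in max) || $2 + 0 > max[$1] { max[$1] = $2 + 0 }
    END { for (k in max) print k, max[k] }
' test.txt | sort
```

The `for (k in max)` loop visits keys in an unspecified order, hence the final `sort`; if the original input order mattered, a second array recording the order in which keys were first seen could be used instead.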
Any help will be greatly appreciated.
Cheers
Steve