Selecting lines having same values for first two columns

manojmalhotra13 · December 27, 2012, 2:01am

Hello to all.

This is first post. Kindly excuse me if I do not adhere to any rules and regulations of this forum.

I have a file containing some rows with three columns each per row(separeted by a space). There are certain rows for which first two columns have same value but the value in third column is different. In that case I wish to retain only that row which has the maximum value for third column. Further I want to wirte the rows not required to a file for further processing.

 
1123022201 9310777627 3976
1127021871 9312262893 3600  -- not required, sent to another file.
1127021871 9312262893 4016  -- to be retained.
1122000518 9350066745 4464
1127455152 7827493958 3600  -- not required, sent to another file.
1127455152 7827493958 5138  -- to be retained.

Kindly help me in this regard.

Thanks in anticipation.

Manoj

pamu · December 27, 2012, 2:13am

Try

awk 'NR==FNR{A[$1,$2]++; B[$1,$2]=B[$1,$2]?B[$1,$2]<$3?$3:B[$1,$2]:$3;next}{if(A[$1,$2]>1){if(B[$1,$2]==$3){print}else{print > "another_file"}}else{print }}' file file

sathyaonnuix · December 27, 2012, 3:08am

PS: Look the below code as written by an awk noob

sort -k 1,2 file | awk '{print $1,$2}' | uniq -d > temp #Identifies the reduntant values and stores it in a temp file

while read i
do
row1=$(grep "$i" file | head -1) #Takes the first row from the redundant rows
row2=$(grep "$i" file | tail -1) #Takes the second row from the redundant rows

diff=$(grep "$i" file | awk '{gsub(/[a-zA-Z: ]+/," ")
m=split($3,a," ");
for (i=1;i<=m;i++)
if (NR==1) b=a; else print a - b }') #Finds the difference between the rows

if [ $diff -lt 0 ];
then
 echo "$row2 moved to new file"
 echo $row2 >> new_file #With the difference the least row is found and stored in another file
else
 echo "$row1 moved to new file"
 echo $row1 >> new_file
fi 
done < temp

---------- Post updated at 03:08 AM ---------- Previous update was at 03:06 AM ----------

By comparing the above two comments, [pamu and myself], you can see how powerful awk is...

RudiC · December 27, 2012, 3:37am

You did not specify what should happen
a) if the third column's values are identical
b) if values can go negative
c) to lines that do not have duplicates.
Thus any proposal might need further refinement. Try:

$ awk   '{x=$1" "$2}
         $3 > Ar[x] {if (Ar[x]) print x, Ar[x] > "further"; Ar[x]=$3}
         $3 < Ar[x] {print  > "further"}
         END {for (x in Ar) print x, Ar[x]}
        ' file
1127455152 7827493958 5138
1122000518 9350066745 4464
1127021871 9312262893 4016
1123022201 9310777627 3976
$ cat further
1127021871 9312262893 3600
1127455152 7827493958 3600

manojmalhotra13 · December 27, 2012, 4:10am

Thanks for replying so quickly and providing me the desired solution.

I tried the solution provided by RudiC and Sathyaonunix. Both are providing me the solutions. I am looking into the script by Pamu ( I am finding it difficult to understand as my expertise in Unix is not good, so kindly give me some time.)

As regarding query raised by RudiC

(i) if the third column entries are identical, then only one row is to be retained in the original file.

(ii) values cannot be negative.

(iii) the lines without duplicate values to be retained in the orginal file as such.

Thank you once again.

Manoj

pamu · December 27, 2012, 4:17am

Thanks for providing above details.

As per your current requirement I've modified script lit bit.

Now please check.

awk 'NR==FNR{A[$1,$2]++; B[$1,$2]=B[$1,$2]?B[$1,$2]<$3?$3:B[$1,$2]:$3;next}
{if(A[$1,$2]>1){if(B[$1,$2]==$3){B[$1,$2]=0;print }else{print > "another_file"}}else{print }}' file file

A[$1,$2]++ # Increments the array index for $1 and $2.

B[$1,$2]=B[$1,$2]?B[$1,$2]<$3?$3:B[$1,$2]:$3 # Here we compare is $3 is greater than previous $3 for $1 and $2 and retain max value and assign it to B[$1,$2]

I hope this helps

pamu

manojmalhotra13 · December 27, 2012, 10:37pm

@Pamu

The script provided by you is working fine and correctly.

Thanks.

Manoj