Remove the partial duplicates by checking the length of a field

asyed · September 23, 2011, 9:15am

Hi Folks -

I'm quite new to awk and haven't come across this kind of problem before. I have a pipe-delimited file in which some records share the same values in the 3rd and 4th fields. A sample is below:

aaaaaa|a12|45|56
abbbbaaa|a12|45|56
bbaabb|b1|51|45
bbbbbabbb|b2|51|45
aaabbbaaaa|a11|45|56

Here, the combination of field 3 and field 4 is the same for several records, e.g. 45|56 for the first two rows and the last row, and 51|45 for the third and fourth rows.

Now, the output file is expected to look like this:

aaabbbaaaa|a11|45|56
bbbbbabbb|b2|51|45

That is, for the rows where field 3 and field 4 match, compare the length of the first field and return the row with the longest first field. So one row is picked from each set of duplicates, based on the length of the first field.

Could you please help with a one line awk command to achieve this?

bartus11 · September 23, 2011, 9:24am

Try:

awk -F"|" 'length($1)>l[$3"|"$4]{l[$3"|"$4]=length($1);a[$3"|"$4]=$0}END{for (i in a) print a[i]}' file

radoulov · September 23, 2011, 9:25am

awk -F\| 'END { 
 for (R in r)
   print r[R]
 }
length($1) > l[$3, $4] {  
  l[$3, $4] = length($1)
  r[$3, $4] = $0
  }' infile
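
A note on output order: both answers above print from a for-in loop, and awk does not define the order in which that loop visits array keys, so the rows may not come out in the order shown in the question. The following is only a sketch of one way to keep the groups in first-seen order. The names key, len, row, order and n are made up for this example, and the sample data is written to a scratch file created with mktemp.

```shell
# Write the sample input from the question to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
aaaaaa|a12|45|56
abbbbaaa|a12|45|56
bbaabb|b1|51|45
bbbbbabbb|b2|51|45
aaabbbaaaa|a11|45|56
EOF

result=$(awk -F'|' '
{
    key = $3 FS $4                  # group on fields 3 and 4
    if (!(key in row)) {            # first time this group is seen
        order[++n] = key            # remember the order groups appear in
        len[key] = -1               # so that any first row is accepted
    }
    if (length($1) > len[key]) {    # strictly longer: ties keep the earlier row
        len[key] = length($1)
        row[key] = $0
    }
}
END {
    # Walk the groups by index instead of for-in, so the order is fixed.
    for (i = 1; i <= n; i++) print row[order[i]]
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With the sample above this prints the aaabbbaaaa row first and the bbbbbabbb row second, matching the expected output in the question. When two rows in a group have first fields of equal length, this sketch keeps whichever came first; change > to >= if the later one should win.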

asyed · September 23, 2011, 9:33am

Hi Bartus, thanks a lot. Your solution worked.