Rewriting GNU uniq in awk

mij · October 23, 2012, 9:55am

Within a shell script I use

uniq -w 16 -D

in order to process all lines in which the first 16 characters are duplicated.

Now I also want to run that script on a BSD-based system, where the included version of uniq does not support the -w (--check-chars) option. To get around this I have written an awk script to use instead when GNU uniq is not available. It seems to work with both GNU and BSD versions of awk, but it is pretty ugly.

awk -v b=0 '{if (substr(a,1,16) == substr($0,1,16) || substr(b,1,16) == substr(a,1,16)) print a; b = a; a = $0}
END{ if (substr(a,1,16) == substr(b,1,16)) print a }'

I am wondering if this can be simplified, or whether there is another, better solution.

Thanks.

vgersh99 · October 23, 2012, 10:07am

How about:

awk '{print substr($0,1,16) "\t" $0}' | uniq -D | cut -f2-

mij · October 23, 2012, 10:32am

Unfortunately, uniq will still check the whole line for uniqueness. Annoyingly, BSD uniq does not include the -D (--all-repeated) switch either.

radoulov · October 23, 2012, 10:38am

Is the input data sorted?

This will work for both sorted and unsorted data (but uniq may produce a different output if the data is not sorted):

awk 'END {
  for (i = 0; ++i <= NR;)
    if (u[substr(d[i], 1, l)] > 1)
      print d[i]
  }
{ 
  u[substr($0, 1, l)]++
  d[NR] = $0
  }' l=16  infile

Your solution is more efficient though.

Tweaking your solution a little bit:

awk 'NR > 1 && (substr(a, 1, l) == substr($0, 1, l) || substr(b, 1, l) == substr(a, 1, l)) {
  print a
  }
{ b = a; a = $0 }
END {
  if (substr(a, 1, l) == substr(b, 1, l))
    print
  }' l=16

alister · October 23, 2012, 11:06am

What does the data look like? Does it adhere to some format? Does it contain whitespace? Are certain characters guaranteed to appear? Are certain characters guaranteed to not appear? Knowing what we're dealing with might suggest alternative approaches.

Regards,
Alister

mij · October 23, 2012, 11:18am

In my script I use sort before uniq so the data is sorted. I assume that would still be more efficient than having awk perform the sort itself?

Thanks.

radoulov · October 23, 2012, 11:20am

The script I provided doesn't sort the data at all (and it doesn't require the data to be sorted). It just prints the desired lines.
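
To illustrate the difference, here is a small sketch run on unsorted input. The sample lines and the function name show_dups are made up for the example, and the key width is 3 rather than 16 to keep it short:

```shell
# Hypothetical helper: print every line whose first "l" characters
# occur on more than one line, keeping the original input order.
show_dups() {
  awk -v l="$1" '
    {
      u[substr($0, 1, l)]++   # count how often each prefix occurs
      d[NR] = $0              # remember every line (whole input is held in memory)
    }
    END {
      for (i = 1; i <= NR; i++)
        if (u[substr(d[i], 1, l)] > 1)
          print d[i]
    }'
}

# Made-up sample data, deliberately not sorted.
printf '%s\n' 'aaa one' 'bbb two' 'aaa three' 'ccc four' 'bbb five' | show_dups 3
```

This should print the "aaa" and "bbb" lines in their original order and drop "ccc four". A uniq that only compares adjacent lines would find no duplicates in the same unsorted input.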

mij · October 23, 2012, 11:32am

Each line has a number, right aligned with leading spaces, which takes up the first 16 characters, a space, then an unquoted string of variable length that can include any characters. There is another version that is sometimes used which has a 32 character MD5 hash, followed again by a space then the string.
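
Since the key is either 16 or 32 characters wide, the width could be passed in as a parameter. A minimal sketch for sorted input follows; the name dup_prefix and the sample lines are invented for illustration, and it is one possible simplification rather than a tested replacement:

```shell
# Hypothetical helper: emulate "uniq -w WIDTH -D" on sorted input.
# Usage: dup_prefix WIDTH < sorted_input
dup_prefix() {
  awk -v w="$1" '
    { key = substr($0, 1, w) }
    NR > 1 && key == prevkey {
      if (!shown) print prev      # first line of the group, printed once
      print                       # current line belongs to the same group
      shown = 1
    }
    NR == 1 || key != prevkey { shown = 0 }   # a new group starts here
    { prev = $0; prevkey = key }
  '
}

# 16-character right-aligned number, a space, then the string.
printf '%16d %s\n' 7 'first string' 7 'second string' 42 'only one' | dup_prefix 16
```

For the MD5 variant the same function would be called as dup_prefix 32. Tracking whether the previous line has already been printed avoids the need for an END block.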

The data is sorted, so a simple comparison with the previous line is enough to find a match. It could consist of any number of lines, from just a few to tens of thousands. Similarly, any number of lines could share a duplicated first field. The initial number or hash is used to group different strings, which will always be unique. The lines in a duplicated group are then piped into a "while read" loop for processing.
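
A rough sketch of how the pieces might fit together. The function names dup_lines and process_dups, the probe for GNU uniq, and the loop body are all assumptions made for illustration, not part of the actual script:

```shell
# Assumption: probing with empty input is enough to tell whether this uniq
# accepts both -w and -D (GNU does; the BSD uniq in this thread does not).
if uniq -w 1 -D </dev/null >/dev/null 2>&1; then
  dup_lines() { uniq -w "$1" -D; }
else
  dup_lines() {
    # Fallback for sorted input: compare each key with the previous one.
    awk -v w="$1" '
      { key = substr($0, 1, w) }
      NR > 1 && key == prevkey { if (!shown) print prev; print; shown = 1 }
      NR == 1 || key != prevkey { shown = 0 }
      { prev = $0; prevkey = key }
    '
  }
fi

# Expects sorted input on stdin; $1 is the key width (16 or 32).
process_dups() {
  dup_lines "$1" |
  while IFS= read -r line; do
    # IFS= keeps the leading spaces, -r keeps backslashes literal.
    printf 'processing: [%s]\n' "$line"   # placeholder for the real work
  done
}

printf '%16d %s\n' 1 'a\b' 1 'c d' 2 'e' | process_dups 16
```

Without IFS= the leading spaces of the right-aligned number would be stripped by read, and without -r any backslashes in the string would be interpreted rather than passed through.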