in order to process all lines in which the first 16 characters are duplicated.
Now I want to also run that script on a BSD-based system, where the included version of uniq does not support the -w (--check-chars) option. To get around this I have written an awk script to use instead when GNU uniq is not available. It seems to work with both the GNU and BSD versions of awk, but it is pretty ugly.
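The GNU invocation being replaced is not quoted above; with GNU coreutils it would presumably be something along these lines, using -w 16 to compare only the first 16 characters and -D (--all-repeated) to print every member of each duplicate group (the sample data here is made up for illustration):

```shell
# GNU uniq only: -w 16 limits the comparison to the first 16 characters,
# -D prints all lines belonging to a duplicated group (input must be sorted).
printf '%16d %s\n' 1 first 1 second 2 only |
    uniq -D -w 16
```

This prints the two lines whose 16-character number field is "1" and drops the unduplicated "2" line.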
awk -v b=0 '{
    if (substr(a, 1, 16) == substr($0, 1, 16) || substr(b, 1, 16) == substr(a, 1, 16))
        print a
    b = a; a = $0
}
END { if (substr(a, 1, 16) == substr(b, 1, 16)) print a }'
I am wondering if this can be simplified, or whether there is another, better solution.
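One possible simplification, keeping the same adjacent-line comparison (the script above already assumes sorted input): compute the 16-character key once per line, and hold the first line of a potential group until a second member confirms the duplicate. This is a sketch with made-up sample data, using only POSIX awk features:

```shell
# Print every line whose first 16 characters match those of an adjacent
# line. "saved" holds the first line of a group until a duplicate appears,
# so no END block or look-back variable is needed.
printf '%16d %s\n' 1 first 1 second 2 only |
awk '
    { key = substr($0, 1, 16) }
    key == prev {
        if (saved != "") { print saved; saved = "" }
        print
    }
    key != prev { saved = $0; prev = key }
'
```

Because saved is emptied after the group's first line is printed, each member of a duplicate group is printed exactly once, and singleton lines are silently replaced in saved and never printed.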
What does the data look like? Does it adhere to some format? Does it contain whitespace? Are certain characters guaranteed to appear? Are certain characters guaranteed to not appear? Knowing what we're dealing with might suggest alternative approaches.
Each line has a number, right-aligned with leading spaces, which takes up the first 16 characters, followed by a space and then an unquoted string of variable length that can include any characters. There is another version, sometimes used, which has a 32-character MD5 hash, again followed by a space and then the string.
The data is sorted, so a simple comparison with the previous line is enough to find a match. It could consist of any number of lines, from just a few to tens of thousands; similarly, any number of them could have a duplicated first field. The initial number or hash is used to group the different strings, which will always be unique. The lines in a duplicated group are then piped into a "while read" loop for processing.
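A sketch of that last step, for the 16-character number variant (the field widths and the loop body here are illustrative assumptions, not the actual processing script):

```shell
# Feed a duplicated group into a "while read" loop. IFS= and read -r
# preserve the leading spaces of the right-aligned key and any
# backslashes in the string.
printf '%16d %s\n' 7 alpha 7 beta |
while IFS= read -r line; do
    key=$(printf '%s\n' "$line" | cut -c 1-16)   # number (or hash) field
    str=$(printf '%s\n' "$line" | cut -c 18-)    # string after the space
    printf 'group %s: %s\n' "$key" "$str"        # placeholder processing
done
```

For the MD5 variant the cut ranges would become 1-32 and 34- respectively.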