Compare a tab-separated field against all lines with AWK and print lines with unique fields.

Hi.

I have a tab-separated file that contains a couple of nearly identical lines. When I run:

sort file | uniq > file.new

the nearly identical lines pass straight through because, well, they are still unique.

a)
I want to check uniqueness on field x only: if the content of field x is the same as field x in any other line, move the duplicate lines to a new file called file.duplicates.

b)
I also want to be able to check uniqueness on field x only and, if the content of field x is the same as field x in an earlier line, remove the later lines that carry the duplicate field x.

Thanks in advance!

Try this on for size:

#!/usr/bin/env ksh
if [[ -z $1 ]]  ||  [[ -z $2 ]]
then
   echo "missing parms"
   exit 1
fi

# sort on the chosen field; the file is tab separated, so give sort and awk
# an explicit tab delimiter (otherwise fields containing spaces would mis-split)
sort -t$'\t' -k "$1,$1" "$2" | awk -F'\t' -v col="$1" -v dup_file="$2.dups" '{
    if( last == $(col) )
        print >dup_file;    # same key as the previous line: a duplicate
    else
        print;              # first line seen with this key
    last = $(col);
}'

exit $?

The command line arguments to the script are the column number (1-based) and the input file name. If your key field is numeric, use -k ${1}n,$1 instead.
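
For example, assuming the script is saved as dedupe.ksh and the key is in column 3 of a file called data.tab (both names are just for illustration):

./dedupe.ksh 3 data.tab > data.unique

The first occurrence of each key lands in data.unique, and the later duplicates end up in data.tab.dups.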

Also, if you use sort -u, sort will do the uniq for you without invoking a second process.
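
As a rough sketch of both ideas, assuming tab-separated input with the key in field 3 (adjust the field number and file names to your data):

# one process instead of "sort file | uniq"
sort -u file > file.new

# awk-only alternative: keep the first line seen for each value of field 3,
# preserve the original order, and send later duplicates to file.duplicates
awk -F'\t' '!seen[$3]++ { print; next } { print > "file.duplicates" }' file > file.new

The awk version never sorts, so file.new keeps the input order, which may or may not matter for your case.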