Hi,
I have a requirement. For example: I have a text file with the pipe symbol (|) as delimiter and 4 columns a, b, c, d. Here a and b are the primary key columns.
I want to process that file to find duplicates and null values in the primary key columns (a, b). I want to write the unique records whose PKs are not null into one file, and the duplicate records and the records having null PK columns into another file.
awk -F\| '
NR == 1 {                               # skip the header line
    next
}
{
    I = $1 OFS $2                       # composite key built from the two PK columns
    if ( ( I in U ) || !($1 && $2) )    # key seen before, or a PK column is null
        print $0 > "dupl.txt"
}
$1 && $2 && !(I in U) {                 # first occurrence with both PKs non-null
    U[I] = $0                           # remember it, keyed by the composite key
}
END {
    for ( k in U )                      # note: awk iterates arrays in unspecified order
        print U[k] > "uniq.txt"
}
' abc.txt
This program creates two output files: dupl.txt, with the duplicate and null-PK records, and uniq.txt, with the unique records.
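For instance, with a hypothetical abc.txt like this (first line is a header):

a|b|c|d
1|2|x|y
1|2|p|q
|5|m|n
3|4|r|s

uniq.txt would contain the first occurrence of each non-null key pair:

1|2|x|y
3|4|r|s

and dupl.txt would contain the repeated key and the null-PK record:

1|2|p|q
|5|m|n

Note that uniq.txt may not preserve the input order, since for ( k in U ) iterates in an unspecified order.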
What should the output be? Or, more explicitly, does order matter: are the keys 11|55 and 55|11 duplicates? And does each pair of keys have to be unique, or does each individual key have to be unique: are 55|11 and 11|30 duplicates because 11 is a common key? (If the answer to either of these is yes, Yoda's script won't work for you.)
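If key order should not matter, one way to handle it is to canonicalize the key before the lookup, so 11|55 and 55|11 map to the same array entry. This is only a sketch of that idea, reusing the script above with the rest of the logic unchanged:

awk -F\| '
NR == 1 {
    next
}
{
    # Canonical key: put the smaller PK first, so 11|55 and 55|11
    # collide. awk compares numerically when both fields look numeric,
    # and as strings otherwise.
    I = ($1 <= $2) ? ($1 OFS $2) : ($2 OFS $1)
    if ( ( I in U ) || !($1 && $2) )
        print $0 > "dupl.txt"
}
$1 && $2 && !(I in U) {
    U[I] = $0
}
END {
    for ( k in U )
        print U[k] > "uniq.txt"
}
' abc.txt

The individual-key interpretation (55|11 and 11|30 clashing on 11) would need a different structure, e.g. one seen-array per PK column, which is a bigger change than the one-line tweak above.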