I have a requirement where I need to remove duplicates from a fixed-width file that has multiple key columns. I also need to capture the duplicate records into another file.
File has 8 columns.
Key columns are col1 and col2.
Col1 has a length of 8 and col2 has a length of 3.
Please give a sample input file (showing field contents and separators), and provide the outputs that you expect to get from that input. Please use code tags when you post the input and output files.
Assuming your input file is named Input, the following awk script will create a file named Output containing the records with duplicates removed, and a file named Duplicates containing the duplicate records:
awk -v df=Duplicates -v of=Output '
# The key is the first 11 characters of each record: col1 (8) + col2 (3).
substr($0, 1, 11) in key {
        # Key already seen: write this duplicate record and skip to the next line.
        print > df
        next
}
{
        # First occurrence: referencing key[...] creates the array entry.
        key[substr($0, 1, 11)]
        print > of
}' Input
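Since you have not yet posted a sample input, here is a quick way to try the script with made-up data (the record contents below are hypothetical; only the key layout — 8 characters for col1 followed by 3 for col2 — matches your description):

```shell
# Create a small hypothetical fixed-width input file.
# Records 1 and 3 share the same key (col1="AAAAAAAA", col2="001").
printf '%s\n' \
    'AAAAAAAA001rest-of-record-1' \
    'BBBBBBBB002rest-of-record-2' \
    'AAAAAAAA001rest-of-record-3' > Input

# Run the de-duplication script.
awk -v df=Duplicates -v of=Output '
substr($0, 1, 11) in key {
        print > df
        next
}
{
        key[substr($0, 1, 11)]
        print > of
}' Input

cat Output      # first occurrence of each key
cat Duplicates  # later occurrences of an already-seen key
```

With this input, Output should hold the first and second records and Duplicates should hold the third, since its first 11 characters repeat a key already seen.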