To remove duplicates from a pipe-delimited file

Hi, can someone please help me remove duplicates from a pipe-delimited file based on the first two columns?

123|asdf|sfsd|qwrer
431|yui|qwer|opws
123|asdf|pol|njio

Here my first record and last record are duplicates. As per my requirement, I want only the latest records written to one file.

I want the output to look like below:

431|yui|qwer|opws
123|asdf|pol|njio

My file has around 20 million records, so I need a fast solution.

sort -ut '|' -k 1,2 file.txt

Thanks for the reply, but it's not working correctly. It doesn't show me any error, but it isn't giving me the correct result either; it just displays whatever is in the file.

Can you post real sample data, and at the same time state your OS and version?

Here is an awk solution that doesn't require sorting:

awk -F"|" '!x[$1 $2]++' file.txt

This is a much harder problem than it appears at first glance.
The sort solution proposed by danmero should give just one line for each set of lines with identical values in the 1st 2 fields, but which one is printed depends on the sort order of the remaining fields. The order ginkrf requested was that the last line in the (unsorted) file be printed for each set of lines with identical values in the 1st two fields.

The awk solution proposed by mjf will print the 1st line of each matching set instead of the last line of each matching set. (And, if the 1st 2 fields when concatenated yield the same key even though the fields are different, some desired output lines may be skipped. For example if $1 is "ab" and $2 is "c" in one record and "a" and "bc" in another, they will both have key "abc".)

Since ginkrf didn't say whether the order of the lines in the output has to match the order in which they appeared in the input, I won't try to guess at an efficient way to do what has been requested. If the order is important, the input file could be reversed, fed through mjf's awk script (with !x[$1 $2]++ changed to !x[$1,$2]++), and then the order of the output reversed again. Depending on the output constraints this might or might not be grossly inefficient.
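
A rough sketch of that reverse/filter/reverse pipeline, assuming GNU tac is available (on BSD systems tail -r serves the same purpose); the $1,$2 subscript joins the fields with SUBSEP and so avoids the concatenation collisions described above:

tac file.txt | awk -F'|' '!x[$1,$2]++' | tac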

If the output order is not important, it could be done easily with an awk script, but it could require almost 400 MB of virtual address space to process 20 million 20-byte records.

With a better description of the input (is there anything in a record other than its position in the input that can be used to determine which of several lines with the 1st two fields matching should be printed) and the output constraints, we might be able to provide a better solution. Are there ever more than two lines with the same 1st two fields? If yes, out of the 20 million input records, how many output records do you expect to be produced? Are there likely to be lots of lines that only have one occurrence of the 1st two fields? What are the file sizes (input and output) in bytes (instead of records)? What is the longest input line in bytes?

What OS and hardware are you using? How much memory? How much swap space?

Hi.

We once needed code that would run on a number of different systems, yet produce consistent results. We ran into the situation that the uniq utility was not consistent among the systems. We introduced an option:

--last
    allows over-writing, effectively keeping the most-recently
    seen instance. Some versions of uniq on other *nix systems
    keep the most recent (Solaris); the default is compatibility
    with GNU/Linux uniq, which keeps the first occurrence.

By substituting this idea for the system version of uniq, we were able to produce consistent results.

I think this problem can be approached with danmero's sort idea, but with the stable option set and a "final filter" that eliminates duplicates. Because the file is already sorted, no additional storage is needed: in the final filter, if the key fields of the incoming record differ from those of the record in storage, write out the saved line and save the new line; if the fields are the same, just save the new instance of the line. Our code was in perl, but awk could be used just as easily.
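
A rough awk rendering of that final filter, assuming a sort that supports the stable flag -s (GNU and BSD sort do); note that the output comes out in sorted key order rather than in the original input order:

sort -s -t'|' -k1,1 -k2,2 file.txt |
awk -F'|' '
    { key = $1 FS $2 }
    NR > 1 && key != prev { print saved }   # key changed: emit the last line of the previous group
    { prev = key; saved = $0 }              # remember the most recent line of the current group
    END { if (NR) print saved }             # emit the final group
'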

Best wishes ... cheers, drl

One more approach

$ awk -F'|' '{if (a[$1 FS $2]) next}a[$1 FS $2]=$0' file

You misunderstood the problem. A correct solution must make more than a single pass over the data.

The simplest AWK solution would make two passes. The first determines key frequency. The second decrements each key's value and prints a record only when that value becomes zero.
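
A minimal sketch of that two-pass idea; the key is built with FS as the separator, which is unambiguous here because "|" cannot occur inside a field:

awk -F'|' '
    NR == FNR { cnt[$1 FS $2]++; next }   # 1st pass: count how many times each key appears
    --cnt[$1 FS $2] == 0                  # 2nd pass: print a line only when its count drops to zero, i.e. at the last occurrence
' file file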

Regards,
Alister

Thanks Alister, this will work as the user requested in #1:

$ awk -F'|' '{a[$1 FS $2]=$0}END{for (i in a) print a[i]}' file
431|yui|qwer|opws
123|asdf|pol|njio

Only if ginkrf doesn't care about the output line order being different from the input line order...

for (index in array) ...

is allowed to produce a random (unrelated to the order in which elements were added to the array and unrelated to the collation sequence or numeric sequence of the indices) output order.
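
If gawk happens to be the awk in use, the traversal order can at least be made deterministic by setting PROCINFO["sorted_in"], although that gives sorted key order rather than input order, so it still may not be what ginkrf asked for:

awk -F'|' '
    BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }   # gawk only: scan arrays in sorted key order
    { a[$1 FS $2] = $0 }                               # last occurrence of each key wins
    END { for (k in a) print a[k] }
' file.txt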

That's true, Don.

With two passes, and using minimal memory:

awk -F'|' '
    { k = $1 FS $2 }
    NR == FNR { A[k] = NR; next }   # 1st pass: remember the record number of the last occurrence of each key
    A[k] == FNR                     # 2nd pass: print a line only if it is that last occurrence
' file file