Fastest way to remove duplicates

I have searched the FAQ using terms like sort, duplicates, etc., but I didn't find any articles or results on it.

Currently, I am using:
sort -u file1 > file2
to remove duplicates. For a file of approximately 1 GB, the time taken to remove duplicates is 1 hr 21 min.

Is there a faster way to remove duplicates? Our files could grow to 10 to 12 GB.

Appreciate any pointers.
Thanks,
Radhika.

Just a thought.

Why not use a divide-and-conquer approach?

Vino

That's about 200 KB/s. Pretty crap.
I presume you're thrashing swap?

One thing to check: if you don't need multibyte (locale-aware) sorting, prepend the sort command with LANG=C.
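
For example (same file names as above):

LANG=C sort -u file1 > file2   # byte-wise collation in the C locale, skips multibyte comparison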

Sounds like you need a database (indexes) to be honest.

If the output is a small percentage of the input, then explicitly partitioning the input would be beneficial,
i.e. sort -u each chunk and then sort -u the combined results (see the sketch below).
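
Something along these lines (a rough sketch; the chunk size, the chunk. prefix and the file names are arbitrary):

split -l 1000000 file1 chunk.
for f in chunk.*; do
    LANG=C sort -u "$f" > "$f.sorted"   # dedupe each piece independently
done
LANG=C sort -mu chunk.*.sorted > file2  # merge the already-sorted pieces, dropping duplicates
rm -f chunk.*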

Try out this one...

sed '$!N; /^\(.*\)\n\1$/!P; D'

# Only the first of the duplicate lines is kept; the rest are deleted.

I have tested this with a file of around 1 GB.

It took about 13 minutes to process that file. Much, much faster than the sort command.

:cool:
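
For reference, a line-by-line reading of that one-liner (each piece of the sed script and what it does):

$!N               on every line except the last, append the next input line to the pattern space
/^\(.*\)\n\1$/!P  if the two lines in the pattern space are not identical, print the first of them
D                 delete up to the first embedded newline and restart the cycle with what remains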

Haven't tried your sed. But doesn't it assume that all the entries are already sorted, and then it removes the duplicates?

and/or

If the file is unsorted, then only adjacent duplicate entries are removed, since sed makes just one pass through the file.

Or did I get it wrong ?

vino

Hi Vino,

This command will keep the first entry as it is and delete the other entries, irrespective of whether the file is sorted or not.

No prior assumptions while executing this command.

Hi Amit,

>> sed '$!N; /^\(.*\)\n\1$/!P; D'

Could you explain the command - bit by bit if you don't mind.

Thanks!

It's equivalent to uniq, so it won't help you.
If your data is in fact already sorted then just use `uniq` instead of `sort -u`
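
A quick way to see why, on a small hypothetical unsorted input:

printf '123\n145\n123\n' | uniq
# prints 123, 145, 123; the repeated 123s are not adjacent, so nothing is collapsed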

No, my data is not sorted.

The best possible approach would be to push all the data into Oracle using SQL*Loader,
create an index on the key you want to be unique,
and fire a query to get the unique records.

Any better alternatives?

I am not sure if I want to reload all that data again into another table and .....

As it is, I am pulling the data from a table using a select * from the table into a text file and then doing sort -u file1 > file2.

Although I could try doing a select distinct on the columns from the table and see whether it takes more time than my original approach. Is it worth trying? I don't know.

I just don't have the luxury of trying different options at will on a production database unless I know it's worth trying.

It's already in a database!
Just add an order by clause to the select and
index the appropriate fields.
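
Roughly like this (a sketch only; the connect string, table name and column name are placeholders, and select distinct pushes the deduplication onto the database):

#!/bin/ksh
# Spool the de-duplicated rows straight out of Oracle instead of sorting a flat file
sqlplus -s user/password@db <<'EOF' > file2
set pagesize 0
set feedback off
set heading off
select distinct key_col from mytable order by key_col;
exit
EOF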

Definitely, it's worth a try.

Precautions you can take:

  1. Make sure all the distinct columns are indexed.
  2. If it is one table, then you need not worry about joins; otherwise make sure the joins are written for maximum throughput rather than minimum response time.
  3. Run the query at a time when no other big activity is going on in the same table, because if the query runs for a long time it can give a "snapshot too old" rollback segment error.

All the best.

Sorry for replying back ....

>> Hi Amit,

>> sed '$!N; /^\(.*\)\n\1$/!P; D'

>> Could you explain the command - bit by bit if you don't mind.

>> Thanks!

I think you can refer to the man page of sed and look for the section on sed addresses.

I think the topic is self-explanatory...

BTW ...

I tested this command with a file of more than 1 GB.

It took about 13 minutes to process that file. Much, much faster than the sort command.

Amit,

I tried sed '$!N; /^\(.*\)\n\1$/!P; D'
to remove duplicates. It didn't work:

ex:
file test1.txt has the following rows:
123
123
145
123
123

I used the following command to remove duplicates:
sort.sh test1.txt > test2.txt

sort.sh script has your sed command:
#!/bin/ksh
# Run the sed one-liner over the file passed as the first argument
file1=$1
sed '$!N; /^\(.*\)\n\1$/!P; D' < "$file1"

Do you know for sure that this sed command works? Or is there something that I am doing wrong? The result file test2.txt has the following rows, so it didn't remove all the duplicates:

123
123
145
123

Appreciate any pointers.
Radhika.

Why don't you try sorting first and then taking the unique records?
Is it not a good idea?
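
For example, on the test1.txt above (sort -u does the same thing in a single step):

sort test1.txt | uniq > test2.txt
# test2.txt now contains just 123 and 145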