Fastest way to remove duplicates

I have searched the FAQ using terms like sort, duplicates, etc., but I didn't find any articles or results on it.

Currently, I am using:
sort -u file1 > file2
to remove duplicates. For a file of approximately 1 GB, the time taken to remove duplicates is 1 hr 21 min.

Is there a faster way to remove duplicates? Our files could grow to 10 to 12 GB.

Appreciate any pointers.
Thanks,
Radhika.

Just a thought.

Why not use a divide-and-conquer approach?

Vino

That's about 200 KB/s. Pretty crap.
I presume you're thrashing swap?

One thing to check: if you don't need multibyte (locale-aware) sorting, prepend the sort command with LANG=C.
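
For example (same file names as above):

LANG=C sort -u file1 > file2   # byte-wise collation in the C locale, skips multibyte comparison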

Sounds like you need a database (indexes) to be honest.

If the output is a small percentage of the input, then explicitly partitioning the input would be beneficial,
i.e. sort -u each chunk and then sort -u the combined results (see the sketch below).
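
Something along these lines (a rough sketch; the chunk size, the chunk. prefix and the file names are arbitrary):

split -l 1000000 file1 chunk.
for f in chunk.*; do
    LANG=C sort -u "$f" > "$f.sorted"   # dedupe each piece independently
done
LANG=C sort -mu chunk.*.sorted > file2  # merge the already-sorted pieces, dropping duplicates
rm -f chunk.*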

Try out this one...

sed '$!N; /^\(.*\)\n\1$/!P; D'

# Only the first of the duplicate lines is kept; the rest are deleted.

I have tested this with a file of around 1 GB.

It took about 13 minutes to process that file. Much, much faster than the sort command.

:cool:
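
For reference, a line-by-line reading of that one-liner (each piece of the sed script and what it does):

$!N               on every line except the last, append the next input line to the pattern space
/^\(.*\)\n\1$/!P  if the two lines in the pattern space are not identical, print the first of them
D                 delete up to the first embedded newline and restart the cycle with what remains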

Haven't tried your sed. But doesn't it assume that all the entries are already sorted, and then it removes the duplicates?

and/or

If the file is unsorted, then only adjacent duplicate entries are removed, since sed makes just one pass through the file.

Or did I get it wrong ?

vino

Hi Vino,

This command will keep the first entry as it is and delete the other entries, irrespective of whether the file is sorted or not.

No prior assumptions while executing this command.

Hi Amit,

>> sed '$!N; /^\(.*\)\n\1$/!P; D'

Could you explain the command - bit by bit if you don't mind.

Thanks!

It's equivalent to uniq, so it won't help you.
If your data is in fact already sorted then just use `uniq` instead of `sort -u`
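
A quick way to see why, on a small hypothetical unsorted input:

printf '123\n145\n123\n' | uniq
# prints 123, 145, 123; the repeated 123s are not adjacent, so nothing is collapsed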

No, my data is not sorted.

The best possible approach would be to push all the data into Oracle using SQL*Loader,
create an index on the key you want to be unique,
and fire a query to get the unique records.

Any better alternatives?

I am not sure if I want to reload all that data again into another table and .....

As it is, I am pulling the data from a table using a select * from the table into a text file and then doing sort -u file1 > file2.

Although I could try doing a select distinct on the columns from the table and see whether it takes more time than my original approach. Is it worth trying? I don't know.

I just don't have the luxury of trying different options at will on a production database unless I know it's worth trying.

It's already in a database!
Just add an order by clause to the select and
index the appropriate fields.
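
Roughly like this (a sketch only; the connect string, table name and column name are placeholders, and select distinct pushes the deduplication onto the database):

#!/bin/ksh
# Spool the de-duplicated rows straight out of Oracle instead of sorting a flat file
sqlplus -s user/password@db <<'EOF' > file2
set pagesize 0
set feedback off
set heading off
select distinct key_col from mytable order by key_col;
exit
EOF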

Definitely, it's worth a try.

Precautions you can take:

  1. Make sure all the distinct columns are indexed.
  2. If it is one table, then you need not worry about joins; otherwise make sure the joins are written for maximum throughput rather than minimum response time.
  3. Run the query at a time when no other big activity is going on in the same table, because if the query runs for a long time it can give a "snapshot too old" rollback segment error.

All the best.

Sorry for replying back ....

>> Hi Amit,

>> sed '$!N; /^\(.*\)\n\1$/!P; D'

>> Could you explain the command - bit by bit if you don't mind.

>> Thanks!

I think you can refer to the man page of sed and look for the section on sed addresses.

I think the topic is self-explanatory...

BTW ...

I tested this command with a file of more than 1 GB.

It took about 13 minutes to process that file. Much, much faster than the sort command.

Amit,

I tried sed '$!N; /^\(.*\)\n\1$/!P; D'
to remove duplicates. It didn't work:

ex:
file test1.txt has the following rows:
123
123
145
123
123

I used the following command to remove duplicates:
sort.sh test1.txt > test2.txt

sort.sh script has your sed command:
#!/bin/ksh
# Run the sed one-liner over the file passed as the first argument
file1=$1
sed '$!N; /^\(.*\)\n\1$/!P; D' < "$file1"

Do you know for sure that this sed command works? Or is there something that I am doing wrong? The result file test2.txt has the following rows, so it didn't remove all the duplicates:

123
123
145
123

Appreciate any pointers.
Radhika.

Why don't you try sorting first and then taking the unique records?
Is it not a good idea?
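
For example, on the test1.txt above (sort -u does the same thing in a single step):

sort test1.txt | uniq > test2.txt
# test2.txt now contains just 123 and 145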