I have searched the FAQ using terms like sort, duplicates, etc., but I didn't find any articles or results on it.
Currently I am using:
sort -u file1 > file2
to remove duplicates. For a file of roughly 1 GB, removing the duplicates takes about 1 hour 21 minutes.
Is there any faster way to remove duplicates? Our files could grow to 10 or 12 GB.
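The only variation I have come across so far is giving sort a bigger in-memory buffer and a local temp directory. I believe these are GNU sort options (the buffer size and path below are only examples), and I don't even know whether our sort supports them:

# current command
sort -u file1 > file2

# possible variation, assuming GNU sort: -S sets the in-memory buffer
# size, -T points the temporary files at a faster local directory
sort -u -S 1G -T /local/tmp file1 > file2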
The best approach would be to push all the data into Oracle using SQL*Loader.
Create an index on the fly for the key you want to be unique.
Then fire a query to pull just the unique records.
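Something roughly along these lines; the staging table, column, delimiter and login below are just placeholders to show the idea:

#!/bin/ksh
# 1. load the flat file into a staging table with SQL*Loader
#    (stage_dup, key_col, the '|' delimiter and scott/tiger are examples)
cat > stage.ctl <<'EOF'
LOAD DATA
INFILE 'file1'
APPEND
INTO TABLE stage_dup
FIELDS TERMINATED BY '|'
(key_col)
EOF
sqlldr userid=scott/tiger control=stage.ctl log=stage.log direct=true

# 2. index the key and spool only the unique records
sqlplus -s scott/tiger <<'EOF'
CREATE INDEX stage_dup_ix ON stage_dup (key_col);
SET PAGESIZE 0 FEEDBACK OFF HEADING OFF
SPOOL unique_records.txt
SELECT DISTINCT key_col FROM stage_dup;
SPOOL OFF
EOF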
I am not sure I want to reload all that data into another table and .....
I am already pulling the data from a table with a select * from <table name> into a text file and then doing sort -u file1 > file2.
I could, though, try a select distinct <columns> from the table and see whether it takes more time than my original approach did. Is it worth trying? I don't know.
I just don't have the luxury of trying different options at will, since it is a production database, unless I know it's worth trying.
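If I did go the distinct route, I imagine it would be something like this, spooled straight to a file instead of dumping everything and sorting (the table, column names and login here are made up):

sqlplus -s scott/tiger <<'EOF'
SET PAGESIZE 0 FEEDBACK OFF HEADING OFF LINESIZE 200
SPOOL file2
SELECT DISTINCT col1, col2 FROM source_table;
SPOOL OFF
EOF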
If it is a single table, then you need not worry about joins; otherwise, make sure the joins are written so that you get maximum throughput rather than the quickest response time.
Run the query at a time when no other big activity is going on against the same table, because if the query runs for a long time it can fail with a rollback segment "snapshot too old" error.
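On the throughput point, Oracle's ALL_ROWS hint is one way to tell the optimizer to favour total throughput over first-row response time; the table, column and login below are only an illustration:

sqlplus -s scott/tiger <<'EOF'
SELECT /*+ ALL_ROWS */ DISTINCT key_col
FROM stage_dup;
EOF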
I tried
sed '$!N; /^\(.*\)\n\1$/!P; D'
to remove duplicates, but it didn't work. For example, the file test1.txt has the following rows:
123
123
145
123
123
I used the following command to remove duplicates:
sort.sh test1.txt > test2.txt
The sort.sh script contains your sed command:
#!/bin/ksh
# pass the input file as the first argument
file1=$1
sed '$!N; /^\(.*\)\n\1$/!P; D' < "$file1"
Do you know for sure that this sed command works, or is there something I am doing wrong? The result file test2.txt has the following rows: it didn't remove all the duplicates.
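I wonder if the one-liner only collapses consecutive duplicate lines, the way uniq does; if so, sorting the input first ought to make all of them go away (just a guess on my part):

# the sed expression compares pairs of adjacent lines, so duplicates
# that are not next to each other survive; sorting first groups them
sort test1.txt | sed '$!N; /^\(.*\)\n\1$/!P; D' > test3.txt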