Find and remove duplicate records and print both lists

Gents,

I need to delete duplicate values and keep only unique values based on columns 2-27.

We should always keep the last record found...

I need to store one clean file and another file with the removed duplicates.

Input:

S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

Desired Output:
cleaned.txt

S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

removed.txt

S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733

Thanks in advance!

Could you please try sort -u or uniq -u?
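(For reference: sort -u keeps one copy of every distinct line or key, while uniq -u prints only lines that are never repeated in sorted input; neither gives a "keep the last occurrence" rule by itself. A quick illustration:)

printf 'a\nb\nb\n' | sort -u           # prints: a, b  (one copy of each)
printf 'a\nb\nb\n' | sort | uniq -u    # prints: a     (only never-repeated lines)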

Dear Anshuman,

Please let me know how to sort the input file in order to get the output file I requested.

sort -u -t  "    " -k 2.27 filename
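(As written, this command cannot work: sort's -t takes a single delimiter character, and -k 2.27 means "start the key at character 27 of field 2", not "columns 2 through 27", which is presumably why it is reported as failing further down. A field-based sketch that at least deduplicates on the first two fields, though with -u it is unspecified which duplicate survives, so it does not honour the keep-the-last rule:)

sort -u -k1,2 file    # unique on the key spanning fields 1-2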

Can you please let me know how I can get the output file removed.txt with the removed points?
Thanks

cut -d " " -f3 filename | sort | uniq -d > removed.txt
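(Note that uniq -d prints one copy of each line that occurs more than once in sorted input, so this yields the duplicated key values, not the complete removed records:)

printf 'a\nb\nb\n' | sort | uniq -d    # prints: b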


Thanks a lot, Anshuman.

Then I will have both files, clean and removed, as in my desired output.

---------- Post updated at 04:16 PM ---------- Previous update was at 02:46 PM ----------

Guys,
Is there any option to do it using awk? I don't want to sort the output file, only remove the duplicated values and print both files as I described. Thanks for your help.

---------- Post updated 11-07-12 at 01:24 AM ---------- Previous update was 11-06-12 at 04:16 PM ----------

The sort commands do not work...

Can somebody please help me solve this issue? Thanks a lot.

Use this:

Just change $1 to whichever column you want to find duplicates on.

awk '{if(!X[$1]++){print > "clean.txt"}else{print > "remove.txt"}}' file

For multiple columns:

if(!X[$2,$3]++)

Hope this helps you:)
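(To spell out the idiom: X[$1]++ returns the key's previous count, so !X[$1]++ is true only the first time a key is seen; that is, this keeps the first occurrence. A commented restatement of the same one-liner:)

awk '{
    if (!X[$1]++) print > "clean.txt"    # first time this key appears: keep it
    else          print > "remove.txt"   # every later appearance is a duplicate
}' file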

pamu

Hi Pamu

I tried:

awk '{if(!X[$1,$2]++){print > "clean.txt"}else{print > "remove.txt"}}' file

but I got the error "X[: Event not found"?

Please advise,

Thanks
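(For anyone hitting the same message: "Event not found" typically comes from csh/tcsh, where ! triggers history expansion even inside single quotes. Running the one-liner from a POSIX shell avoids it:)

bash    # or sh/ksh; there, '!' is literal inside single quotes
awk '{if(!X[$1,$2]++){print > "clean.txt"}else{print > "remove.txt"}}' file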

---------- Post updated at 03:17 AM ---------- Previous update was at 02:08 AM ----------

Dear Pamu,

I would like to always keep the last value found; it looks like the code always keeps the first one... Please advise.

---------- Post updated at 03:54 AM ---------- Previous update was at 03:17 AM ----------

Please help me to get the two output files as I show below. The objective is to delete the duplicate records, always keeping the last one...

The columns containing the duplicates are $2 and $3 (columns 2-25), and they have an index identifier (column 26). For example, these values appear in the input file:

S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

Therefore I should get:

file # 1 cleaned.txt

S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

file # 2 removed.txt

S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733

The complete input file follows:

S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

Desired Output: 2 files
file # 1 cleaned.txt

S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

file # 2 removed.txt

S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733

Thanks in advance.

Try this:

awk '{if(!X[$1]){X[$1]=$0}else{print X[$1] > "remove.txt";X[$1]=$0}}END{for(i in X)print X[i] >"clean.txt"}' file

You may need to sort the output later.
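(A commented restatement of the same code: the array always holds the newest line per key, so the END loop emits the last occurrence of each key, while every displaced older line goes to remove.txt:)

awk '{
    if (!X[$1]) { X[$1] = $0 }              # first time this key is seen: remember the line
    else { print X[$1] > "remove.txt"       # repeat: the OLDER line is the duplicate...
           X[$1] = $0 }                     # ...and the newer line replaces it
} END {
    for (i in X) print X[i] > "clean.txt"   # the last occurrence of each key survives
}' file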

I don't think there are any duplicates in $2 and $3 in the first sample, and there are only 7 columns.

Dear Pamu

The duplicate values are in $1 and $2 (awk fields); in a text editor, columns 2-25.

I will try it and let you know.

Thanks for your help

---------- Post updated at 06:50 AM ---------- Previous update was at 04:10 AM ----------

Dear Pamu

Using the code:

awk '{if(X[$1]){X[$1]=$0}else{print X[$1] > "remove.txt";X[$1]=$0}}END{for(i in X)print X[i] >"clean.txt"}' file

Input file:

S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18 303373051 30337305
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430 303373052 30337305
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130 303573051 30357305
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542 334972751 33497275
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657 334972801 33497280
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809 334972802 33497280
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958 335172701 33517270
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846 335372701 33537270
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336 335372751 33537275
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922 335372801 33537280
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224 335572751 33557275
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733 335572752 33557275
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034 335572753 33557275

I have sorted the values on column #9.

Then I got the following:

clean.txt

S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542 334972751 33497275
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130 303573051 30357305
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846 335372701 33537270
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034 335572753 33557275
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922 335372801 33537280
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430 303373052 30337305
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958 335172701 33517270
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336 335372751 33537275
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809 334972802 33497280

but the file
remove.txt

is empty???

Can you please advise me where the problem is...

I should get this

S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733

Thanks for your help and time

---------- Post updated at 06:52 AM ---------- Previous update was at 06:50 AM ----------

Dear Pamu,

This is the code that I am using:

awk '{if(X[$9]){X[$9]=$0}else{print X[$9] > "remove.txt";X[$9]=$0}}END{for(i in X)print X[i] >"clean.txt"}' file

I have only changed the column number.
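(A likely cause of the empty-looking remove.txt: compared with the one-liner in post 10, the ! has been dropped from the test (perhaps to work around the csh "Event not found" problem above), which inverts the logic:)

if(!X[$9])   # original test: true on the FIRST occurrence of a key
if( X[$9])   # as used here: true only on repeats, so the else branch fires on every
             # NEW key and prints the still-empty X[$9]; remove.txt fills with blank lines

clean.txt still comes out right because the END loop only depends on the last assignment per key.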

I'm not sure I understand your "duplicate" criterion. In one post, it's 7275.01 (6 digits + "."), in the other it's just 4 digits before the period. On top, your input files vary from post to post. This does not help us to help you.

Try this; you may want to sort both files afterwards:

$ awk '{Ar[$2]=$0} END{for (i in Ar) print Ar[i]}' inputfile >clean.txt
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
$ grep -vf clean.txt inputfile >removed.txt
S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
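(The grep step works because every kept line in clean.txt is used as a pattern to filter the input, so whatever survives is exactly the removed duplicates. Since -f treats each line as a regular expression, the decimal points could in principle match any character; a stricter fixed-string variant would be:)

grep -vFf clean.txt inputfile > removed.txt    # -F: match the lines as fixed strings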

Dear RudiC,

The only change I made in the file was to add two more columns: I concatenated 4 digits from column 1 and 4 digits from column 2 and saved the result in column 9, to use as the reference for finding duplicated records... For that I have used column 9...

Can you please let me know where the error is in the code I am using, and why the removed file is empty?

Thanks a lot

I'd prefer pamu to explain his code to you. Did you give my proposal a try? The removed file is not empty with that approach. Right now, it is using the full 7270.01 for testing uniqueness; could be adapted to 4 digits by minor modifications.

Dear RudiC

Your code is working perfectly. Thanks a lot!

Actually, after a little rearranging, pamu's code should do as well:

awk     'X[$9] {print X[$9] > "removed.txt"}
         {X[$9]=$0}
         END {for (i in X) print X[i] > "clean.txt"}
        ' file
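(One caveat: for (i in X) iterates in arbitrary order, hence the earlier remark about sorting the output. If the original line order matters in both files, a two-pass sketch, keyed on $9 like the code above, preserves it:)

awk 'NR==FNR {last[$9]=NR; next}                 # pass 1: note the last line number per key
     FNR==last[$9] {print > "clean.txt"; next}   # pass 2: that line is the keeper
     {print > "removed.txt"}                     # everything else was displaced by a later line
    ' file file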

Dear RudiC,

Thanks a lot for your help and time; your support is greatly appreciated. The code works perfectly and I got the 2 output files at the same time.

I don't think there is any difference between my script (post 10) and RudiC's script (same logic, post 16).
But I don't know what you were trying,
or how one script gave you the desired output and the other didn't...


Dear Pamu
First of all I appreciate your help and support.
Regarding the code, it was very strange that it was not working well; perhaps I did something wrong. Anyway, thanks again for your help. Now my problem is solved, thanks to all.

Regards