Removing all the duplicates

pandeesh · August 16, 2011, 3:18am

I want to remove all the duplicates from a file. I don't want to keep even a single entry for a duplicated key.

For the input data:

12345|12|34
12345|13|23
3456|12|90
15670|12|13
12345|10|14
3456|12|13

I need the data below in one file:

15670|12|13

and the data below in another file:

12345|12|34
12345|13|23
12345|10|14
3456|12|90
3456|12|13

I am identifying duplicates based on the first field alone.

If I use sort -t"|" -u -k 1,1 it gives:

 
12345|10|14
15670|12|13
3456|12|13

But I don't want to keep even that single entry for the duplicated keys.

Please help me.

Also, if I want to sort based on the 10th field, can I use sort -k10 or sort -k 10,10?

What's the difference between those?

Thanks

bartus11 · August 16, 2011, 3:57am

Try:

awk -F"|" '{a[$1]++;b[$1]=b[$1]?b[$1]"\n"$0:$0}END{for(i in a){if(a[i]==1){print b[i]>"file1"}else{print b[i]>"file2"}}}' input

It will create two files: file1 and file2.
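
As for the sort question: as far as I know, -k10 makes the sort key start at field 10 and run to the end of the line, while -k 10,10 limits the key to field 10 alone. A small sketch with made-up three-field data, where field 2 stands in for your field 10 (sample.txt and the variable names are only for illustration):

```shell
# Made-up sample: every line has the same value in field 2.
printf '%s\n' 'b|1|z' 'a|1|y' 'c|1|x' > sample.txt

# Key is field 2 only; the ties are then broken by comparing whole lines.
only_field=$(LC_ALL=C sort -t'|' -k 2,2 sample.txt)

# Key runs from field 2 to the end of the line, so field 3 takes part too.
to_end=$(LC_ALL=C sort -t'|' -k 2 sample.txt)

printf '%s\n---\n%s\n' "$only_field" "$to_end"
```

So if only the 10th field should decide the order, -k 10,10 is probably the one you want.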

pandeesh · August 16, 2011, 4:09am

But it reports an illegal statement near line 1 and a syntax error at line 1.
I am running it on SunOS.

itkamaraj · August 16, 2011, 4:11am

Use nawk.

pandeesh · August 16, 2011, 4:20am

Yes, with nawk it's working. But I want to make the 10th field the key field, so what do I need to change in that script?
Shall I replace $1 with $10?

Thanks

bartus11 · August 16, 2011, 4:23am

Yes.
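
The field number can also be passed in as a variable, so the script itself does not need editing. A sketch of the same idea written out with comments; the variable name k, the file name input10 and the ten-field sample are made up for illustration, and on SunOS it would need nawk in place of awk:

```shell
# Made-up sample with ten fields; lines 1 and 3 share the value in field 10.
cat > input10 <<'EOF'
a1|x|x|x|x|x|x|x|x|100
a2|x|x|x|x|x|x|x|x|200
a3|x|x|x|x|x|x|x|x|100
EOF

rm -f file1 file2    # start clean so old output cannot be mistaken for new

awk -F'|' -v k=10 '
{
    count[$k]++                          # how many times this key was seen
    if ($k in rows) {
        rows[$k] = rows[$k] "\n" $0      # append to the lines for this key
    } else {
        rows[$k] = $0                    # first line for this key
    }
}
END {
    for (key in count) {
        if (count[key] == 1) {
            print rows[key] > "file1"    # key seen exactly once
        } else {
            print rows[key] > "file2"    # key seen more than once
        }
    }
}' input10
```

Lines that share a key stay together in file2, but the order of the keys themselves is whatever order awk walks the array in.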

pandeesh · August 16, 2011, 4:41am

I have changed it like this:

 
awk -F"|" '{a[$10]++;b[$10]=b[$10]?b[$10]"\n"$0:$0}END{for(i in a){if(a[i]==1){print b[i]>"file1"}else{print b[i]>"file2"}}}' input

But it's not giving the correct result.
Is there anything else I need to change?

Thanks

---------- Post updated at 02:11 PM ---------- Previous update was at 02:02 PM ----------

In file1 I am getting the unique records.

But in file2 I am getting all the records.

Is there anything else I need to change in the code below to make the 10th field the key?

awk -F"|" '{a[$10]++;b[$10]=b[$10]?b[$10]"\n"$0:$0}END{for(i in a){if(a[i]==1){print b[i]>"file1"}else{print b[i]>"file2"}}}' input

I have tried $(10) too.

Please help me. Thanks

bartus11 · August 16, 2011, 4:42am

Can you post a sample of your real data?

pandeesh · August 16, 2011, 5:36am

The data looks like this:

12116|  |12116     |C                  |M                 |                         |8913   |189  |111189  |12119249  |8000       |E|029|W Clock| ger                 |0|E 12th Street                      |                                        |  |FL |60       |U |111189      | 
 
12116|  |12116     |k               |Dsd                   |Y                    |10   |124  |224  |19621192 |850       |E|D007| |SMr                 |0|. J- 12      |                                        |Wrs            |FL |3331       |US |111224      |

I need to find the duplicates based on the 10th field.

---------- Post updated at 03:06 PM ---------- Previous update was at 02:40 PM ----------

Is there anything I need to change in the code below for that?

 
awk -F"|" '{a[$10]++;b[$10]=b[$10]?b[$10]"\n"$0:$0}END{for(i in a){if(a[i]==1){print b[i]>"file1"}else{print b[i]>"file2"}}}' input

bartus11 · August 16, 2011, 6:41am

I've checked that code with the following data:

12116|  |12116     |C                  |M                 |                         |8913   |189  |111189  |12119249  |8000       |E|029|W Clock| ger                 |0|E 12th Street                      |                                        |  |FL |60       |U |111189      | 
22116|  |12116     |C                  |M                 |                         |8913   |189  |111189  |12119249  |8000       |E|029|W Clock| ger                 |0|E 12th Street                      |                                        |  |FL |60       |U |111189      | 
12116|  |12116     |k               |Dsd                   |Y                    |10   |124  |224  |19621192 |850       |E|D007| |SMr                 |0|. J- 12      |                                        |Wrs            |FL |3331       |US |111224      |

And got the following result:

solaris% cat file1
12116|  |12116     |k               |Dsd                   |Y                    |10   |124  |224  |19621192 |850       |E|D007| |SMr                 |0|. J- 12      |                                        |Wrs            |FL |3331       |US |111224      | 
solaris% cat file2
12116|  |12116     |C                  |M                 |                         |8913   |189  |111189  |12119249  |8000       |E|029|W Clock| ger                 |0|E 12th Street                      |                                        |  |FL |60       |U |111189      | 
22116|  |12116     |C                  |M                 |                         |8913   |189  |111189  |12119249  |8000       |E|029|W Clock| ger                 |0|E 12th Street                      |                                        |  |FL |60       |U |111189      |

So it is working as expected for this sample... Can you post sample data that gives incorrect results?
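
One thing that may be worth checking in the meantime, since the fields are padded with spaces: awk compares $10 as a string, so values that differ only in padding count as different keys. A quick sketch for listing the keys awk actually sees, with brackets to make the padding visible (the two-line sample and the file name padded are made up; use nawk on SunOS):

```shell
# Made-up sample: the same number in field 10, padded differently.
cat > padded <<'EOF'
r1|x|x|x|x|x|x|x|x|12119249  |rest
r2|x|x|x|x|x|x|x|x|12119249 |rest
EOF

# Count per raw key: the two lines land under two different keys.
raw=$(awk -F'|' '{ c[$10]++ } END { for (k in c) print c[k], "[" k "]" }' padded | LC_ALL=C sort)

# Count per key after stripping blanks: both lines share one key.
trimmed=$(awk -F'|' '{ key = $10; gsub(/ /, "", key); c[key]++ }
                     END { for (k in c) print c[k], "[" k "]" }' padded)

printf '%s\n%s\n' "$raw" "$trimmed"
```

If the counts on your real file look different from what you expect, stripping the blanks from the key before counting might be enough.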