Find duplicates in the first column of a text file

Hello,

My text file has input of the form

abc dft45.xml
ert  rt653.xml
abc ert57.xml

I need to write a Perl or shell script to find duplicates in the first column and write them to a text file of the form...

abc dft45.xml
abc ert57.xml

Can someone help me, please?

Hi

awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 1)print;}' file file

You need to give the filename twice as shown above.
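
For example, with your sample saved as input.txt (the name is just an illustration), it produces exactly the output you asked for:

$ awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 1)print;}' input.txt input.txt
abc dft45.xml
abc ert57.xml

Redirect the output (e.g. with > dupli.txt) to write it into a text file.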

Guru.

Can you please explain what the awk command is doing, and why you have given "file" twice?

Hi
The first time the file is processed, it counts the duplicates in the 1st column. The second time it is processed, it prints those lines whose count is more than 1.
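
The same command, spread out with comments:

awk 'NR==FNR {       # true only while the first copy of the file is being read
         a[$1]++     # first pass: count how often each column-1 value occurs
         next        # skip the block below during the first pass
     }
     {               # second pass: NR != FNR now
         if (a[$1] > 1) print
     }' file file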

btw, did it work?

Guru.


A single-pass version (with an increased RAM requirement, since all lines of the file are stored for use in the END block):

 awk '{a[NR]=$0; a[NR,"k"]=$1; k[$1]++} END {for (i=1; i<=NR; i++) if (k[a[i,"k"]] > 1) print a[i]}' data
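
The same logic, spread over multiple lines with comments:

 awk '{
          a[NR] = $0        # store every line, keyed by its line number
          a[NR,"k"] = $1    # remember the line's column-1 value
          k[$1]++           # count occurrences of each column-1 value
      }
      END {
          for (i = 1; i <= NR; i++)             # replay the stored lines in order
              if (k[a[i,"k"]] > 1) print a[i]   # print only those with a repeated key
      }' data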

Regards,
Alister

What is meant by "first time" and "second time" processing?

I will try it out and comment here ASAP, because there is a problem with my machine.


Can you explain what the code is doing?


It worked! Thanks... But I need to find the count of each occurrence.

awk '{ per[$1] += 1 }
     END { for (i in per)
               print i, per[i] }' dupli.txt > dupli_count.txt

In the above code I also need to print the total count as "Sum=????" (i.e., I need to total the 2nd column.)
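
One way to do that (a sketch, assuming "Sum" should be the grand total of the counts in the 2nd column of the output):

awk '{ per[$1] += 1 }
     END {
         for (i in per) {
             print i, per[i]   # key and its count
             sum += per[i]     # accumulate the counts
         }
         print "Sum=" sum      # grand total of the 2nd column
     }' dupli.txt > dupli_count.txt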