xshang
November 6, 2012, 12:02am
1
We usually use the following awk code to delete or find duplicate records.
awk -F, '{a[$1]++} END {for (i in a) if (a[i]>=2) print i, a[i]}' file
My question is how can I print the whole record. The following code doesn't work.
awk -F, '{a[$1]++} END {for (i in a) if (a[i]>=2) print $0}' file
Thank you!
ripat
November 6, 2012, 12:58am
2
Try this:
awk '{a[$0]++} END {for (i in a) if (a[i]>=2) print i}' file
1 Like
And one needn't wait until the whole file has been read to detect and print the duplicate records:
awk 'a[$0]++==1' file
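For illustration, a minimal run (the sample records are made up, not from the thread):

```shell
# a[$0]++ returns the count seen so far, so the pattern is true only on the
# second occurrence of a record: each duplicate line prints exactly once,
# as soon as it is seen, with no END block needed.
printf 'a,1\nb,2\na,1\n' | awk 'a[$0]++==1'
# Prints:
# a,1
```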
1 Like
xshang
November 6, 2012, 1:18am
4
Sorry, I couldn't express my requirement clearly. What I want is to print the whole record whenever its $1 is duplicated.
Hi,
Try this one,
awk -F, '{a[$1]++; if (v[$1]) {v[$1]=v[$1] ORS $0} else {v[$1]=$0}} END {for (i in a) if (a[i]>=2) print v[i]}' file
If you want to display only the duplicated lines,
awk -F, '{a[$1]++} a[$1]>1 {if (v[$1]) {v[$1]=v[$1] ORS $0} else {v[$1]=$0}} END {for (i in a) if (a[i]>=2) print v[i]}' file
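For illustration, a run of the grouping idea from the first script above on made-up data (the sample records and the written-out subscripts a[i]/v[i] are mine, not from the thread):

```shell
# Hypothetical three-record input; key "1" occurs twice, key "2" once.
printf '1,aa\n2,bb\n1,cc\n' |
awk -F, '{ a[$1]++                           # count occurrences of field 1
           if (v[$1]) v[$1] = v[$1] ORS $0   # append record to its key group
           else       v[$1] = $0 }           # first record for this key
         END { for (i in a) if (a[i] >= 2) print v[i] }'  # duplicated groups only
# Prints:
# 1,aa
# 1,cc
```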
Cheers,
Ranga
1 Like
With some assumptions:
sort -t, -k1,1 file|awk -F, 'p1==$1{if(p) print p0;p=0;print;next}{p1=$1;p0=$0;p=1}'
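A quick demonstration of this pipeline on sample data (the records are made up for illustration):

```shell
# Hypothetical sample: key "x" is duplicated, key "y" is unique.
# After sorting on field 1, the awk script prints every record whose
# first field matches its neighbour's.
printf 'x,first\ny,only\nx,second\n' |
sort -t, -k1,1 |
awk -F, 'p1==$1{if(p) print p0;p=0;print;next}{p1=$1;p0=$0;p=1}'
# Prints:
# x,first
# x,second
```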
1 Like
xshang
November 6, 2012, 1:38am
7
rangarasan:
Hi,
Try this one,
awk -F, '{a[$1]++; if (v[$1]) {v[$1]=v[$1] ORS $0} else {v[$1]=$0}} END {for (i in a) if (a[i]>=2) print v[i]}' file
If you want to display only the duplicated lines,
awk -F, '{a[$1]++} a[$1]>1 {if (v[$1]) {v[$1]=v[$1] ORS $0} else {v[$1]=$0}} END {for (i in a) if (a[i]>=2) print v[i]}' file
Cheers,
Ranga
It works, thank you!
---------- Post updated at 02:38 AM ---------- Previous update was at 02:35 AM ----------
Nice! Thank you! Can you explain the awk code? I have never seen that kind of code.
Sure. First of all, I'd made a mistake in my earlier script. Corrected now in my post.
It sorts the input file on the first field (delimited by commas) so that the duplicate records (w.r.t. the first field) are adjacent.
The awk script keeps track of the previous first field (p1) and the previous record (p0). When p1 matches the current first field, that marks the start of a duplicate "bunch".
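The same one-liner, expanded with comments (the input records here are made up for illustration):

```shell
printf 'k1,a\nk2,b\nk1,c\nk1,d\n' |
sort -t, -k1,1 |
awk -F, '
p1 == $1 {            # first field repeats: we are inside a duplicate bunch
    if (p) print p0   # emit the stored first record of the bunch, only once
    p = 0             # mark the stored record as printed
    print             # emit the current duplicate record
    next
}
{ p1 = $1; p0 = $0; p = 1 }   # new key: remember record, flag it unprinted
'
# Prints:
# k1,a
# k1,c
# k1,d
```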
1 Like
xshang
November 6, 2012, 2:12am
9
elixir_sinari:
Sure. First of all, I'd made a mistake in my earlier script. Corrected now in my post.
It sorts the input file on the first field (delimited by commas) so that the duplicate records (w.r.t. the first field) are adjacent.
The awk script keeps track of the previous first field (p1) and the previous record (p0). When p1 matches the current first field, that marks the start of a duplicate "bunch".
I know a little more about awk again. Thanks!