Search for string duplicates in column

system · February 19, 2008, 10:10am

Hi

I have a file with one column. There are a few duplicates in this column; that is, some lines are exactly the same. I want to know which ones occur twice.

Inputfile.xml
"AAH.dbEUR"
"ECT.dbEUR"
"AEGN.dbEUR"
"AAH.dbEUR"
"AKZO.dbEUR"
...

Here I would like to be informed that "AAH.dbEUR" occurs twice.

Thanks

radoulov · February 19, 2008, 10:14am

sort filename|uniq -d
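
uniq only compares adjacent lines, which is why the input has to be sorted first. A minimal sketch, assuming the sample lines from the question (supplied inline through a shell variable named sample, which is just an illustration and not part of the original command):

```shell
# Sample data from the question, supplied inline for illustration only.
sample='"AAH.dbEUR"
"ECT.dbEUR"
"AEGN.dbEUR"
"AAH.dbEUR"
"AKZO.dbEUR"'

# sort puts identical lines next to each other;
# uniq -d then prints one copy of each line that is repeated.
dups=$(printf '%s\n' "$sample" | sort | uniq -d)
printf '%s\n' "$dups"    # prints "AAH.dbEUR"

# Optional variant: uniq -c prefixes every distinct line with its count.
counts=$(printf '%s\n' "$sample" | sort | uniq -c)
printf '%s\n' "$counts"
```

With a real file, replace the printf with the file name as in the one-liner above.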

jim_mcnamara · February 19, 2008, 10:15am

awk '{ arr[$0]++ }
       END { for (i in arr) { if (arr[i] > 1) { print i, arr[i] } } }' file
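
For illustration, here is a sketch of the same counting approach run on the sample lines from the question. The variable names sample, seen and line are my own choices for the example, and the data is piped in rather than read from a file:

```shell
# Sample data from the question, supplied inline for illustration only.
sample='"AAH.dbEUR"
"ECT.dbEUR"
"AEGN.dbEUR"
"AAH.dbEUR"
"AKZO.dbEUR"'

report=$(printf '%s\n' "$sample" | awk '
    { seen[$0]++ }                      # count every occurrence of the whole line
    END {
        for (line in seen)              # iteration order is unspecified in awk
            if (seen[line] > 1)
                print line, seen[line]  # the duplicated line, then its count
    }')
printf '%s\n' "$report"    # prints "AAH.dbEUR" 2
```

Unlike sort | uniq -d, this also reports how many times each duplicate occurs, but the order of the output lines is not guaranteed when there are several duplicates.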

radoulov · February 19, 2008, 10:22am

Or (if I'm not missing something):

awk 'x[$0]++==1' filename

system · February 19, 2008, 10:37am

Thanks a lot. Works perfectly!

Lukeadams · February 19, 2008, 2:08pm

Wow.... So the output is any line that appears more than once - but only printed once.

Can you explain what's going on here?

jim_mcnamara · February 19, 2008, 2:16pm

awk arrays are associative - they hash array indexes.
The syntax says: add one to the array element indexed by $0 (the whole current line).
But, since the ++ is after the arr[] it means evaluate the value of arr[] before you add one.

So - if arr[ $0 ] is one -- meaning it has been seen before - print $0 because it is a duplicate, then add one to arr[ $0 ]. Now: arr[ $0 ] == 2 so we never print it again no matter how many times it appears.

radoulov · February 19, 2008, 3:32pm

Two things first:

In Awk, uninitialized variables have the value zero (or the empty string, depending on the context).
The '++' operator adds one; it can increment a variable either before or after its value is taken.
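
A small sketch of both points on a plain variable, before moving on to arrays (the names x, y and z are arbitrary and only used for this illustration):

```shell
out=$(awk 'BEGIN {
    print "[" x "]"   # unset variable in string context: the empty string
    print x + 0       # the same variable in numeric context: 0
    y = x++           # post-increment: y gets the old value 0, then x becomes 1
    print y, x
    z = ++x           # pre-increment: x becomes 2 first, then z gets 2
    print z, x
}')
printf '%s\n' "$out"
# Output:
# []
# 0
# 0 1
# 2 2
```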

Consider this:

$ print 'one
two
two
three
three
three'|awk '{printf "$0 is %s, first x[$0] is %s ,",$0,x[$0]++}{print "then x[$0] is",x[$0]}'
$0 is one, first x[$0] is 0 ,then x[$0] is 1
$0 is two, first x[$0] is 0 ,then x[$0] is 1
$0 is two, first x[$0] is 1 ,then x[$0] is 2
$0 is three, first x[$0] is 0 ,then x[$0] is 1
$0 is three, first x[$0] is 1 ,then x[$0] is 2
$0 is three, first x[$0] is 2 ,then x[$0] is 3

So now this should be easier to understand:

$ print 'one
two
two
three
three
three'|awk 'x[$0]++==1'
two
three

... and this:

$ print 'one
two
two
three
three
three'|awk '++x[$0]==2'
two
three