How to find duplicate entries

buzzme · April 25, 2013, 11:52am

I have a file with contents as below:

I/P:

Like the above, I have thousands of entries.
I need output as below:
O/P:

123456
678905

I'm using uniq -d filename. It shows results, but it is missing a few duplicate entries and I don't know why. Please help me.

Yoda · April 25, 2013, 11:55am

awk 'A[$0]++==1' file

elixir_sinari · April 25, 2013, 12:07pm

Better to use $1 instead of $0 to avoid skipping some duplicate numbers due to leading/trailing whitespace.
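
A minimal sketch of the difference, assuming one number per line. The file name sample.txt and its contents are made up for illustration (they are not the OP's data); the first 123456 carries a trailing space:

```shell
# Hypothetical sample data; the first line has a trailing space.
printf '%s\n' '123456 ' '123456' '234567' '678905' '678905' > sample.txt

# Keying on the whole line ($0) treats "123456 " and "123456" as different.
by_line=$(awk 'A[$0]++==1' sample.txt)

# Keying on the first field ($1) ignores leading/trailing whitespace.
by_field=$(awk 'A[$1]++==1 { print $1 }' sample.txt)

printf 'by $0:\n%s\nby $1:\n%s\n' "$by_line" "$by_field"
```

With this sample, the $0 version should report only 678905, while the $1 version should report both 123456 and 678905.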

rveri · April 25, 2013, 12:38pm

Hi Buzzme,

awk '!d[$0]++' file

Correction: this will print only the unique entries.

elixir_sinari · April 25, 2013, 12:41pm

That will not work.
Revisit the OP's requirement.

rveri · April 25, 2013, 6:03pm

elixir_sinari,
you are correct, that will not work! I did not understand the problem at first. Thanks for correcting me.

Buzzme,
> I'm using uniq -d filename. It shows results, but it is missing a few duplicate entries and I don't know why.

You may need to use sort before uniq -d to have it work correctly. I wonder whether you have tried that.

Please try it with a numeric sort:

sort -n file|uniq -d

Here is another version with uniq that also gives numerically sorted output:

sort -n file|uniq -c|awk '{if ($1>1) print $2}'

Enjoy, have fun!

alister · April 25, 2013, 6:43pm

rveri:

> Please try it with a numeric sort:
> sort -n file|uniq -d
> ...
>
> Here is another version with uniq that also gives numerically sorted output:
> sort -n file|uniq -c|awk '{if ($1>1) print $2}'

You cannot meaningfully use a numeric sort, since uniq expects its data to be sorted lexicographically.

uniq will not consider "01" to be equal to "1", nor "1.0" to "1.00", nor " 1" to "1". If leading/trailing zeroes or whitespace are a concern, then either the file needs to be preprocessed to normalize the entries, or a more capable tool should be used, e.g. perl or AWK.
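
A rough sketch of the AWK route, assuming every entry is a plain number, so that adding 0 forces a numeric value. The variable names and sample values are illustrative only:

```shell
# Report each number that occurs more than once, comparing numerically.
# $1 + 0 converts the first field to a number, so "01", " 1" and "1.0"
# all map to the same key as "1".
dups=$(printf '%s\n' 1 01 '  1' 2 3 03 |
    awk '{ n = $1 + 0 } seen[n]++ == 1 { print n }')

printf '%s\n' "$dups"
```

Note that this prints the normalized value (1, not 01), and that no sorting is needed because the counts are kept in an array rather than relying on adjacent lines.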

Demonstration:

$ printf '%s\n' 1 01 001 '  1'
1
01
001
  1
$ printf '%s\n' 1 01 001 '  1' | sort -un
1
$ printf '%s\n' 1 01 001 '  1' | sort -n | uniq
  1
001
01
1

Notice how sort -un knows that it's doing a numeric comparison and considers all 4 terms to be equal. However, uniq considers each value to be distinct.

Regards,
Alister

hanson44 · April 25, 2013, 6:45pm

Then they are not really duplicates.
Check if you have trailing spaces.

$ cat input # first line has trailing space
123456
123456
234567
987654
678905
678905

$ uniq -d input
678905

$ tr -d " " < input | uniq -d
123456
678905

alister · April 25, 2013, 7:06pm

It could also be the result of the file not being sorted, as was mentioned earlier.

$ printf '%s\n' 1 2 1 2 1 2 1 2 > file
$ cat file
1
2
1
2
1
2
1
2
$ uniq -d file
$ sort file | uniq -d
1
2

Regards,
Alister

hanson44 · April 25, 2013, 7:43pm

Good point. I assumed that was already taken into account. But the sample input provided is not sorted.

Back to the original poster: uniq only works correctly on a sorted file. It runs on whatever you provide, but to get meaningful results the input to uniq must be sorted, because uniq only looks for adjacent duplicated lines.
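
Putting both points from this thread together, here is a rough sketch, assuming one entry per line with no spaces inside an entry. The file name entries.txt and its contents are made up for illustration:

```shell
# Hypothetical unsorted sample; the first 123456 has a trailing space.
printf '%s\n' '123456 ' '678905' '234567' '123456' '987654' '678905' > entries.txt

# Strip spaces and tabs, sort so equal lines become adjacent, then
# let uniq -d print one copy of each line that is repeated.
dups=$(tr -d ' \t' < entries.txt | sort | uniq -d)

printf '%s\n' "$dups"
```

On this sample it should print 123456 and 678905. If entries can differ by leading zeroes as well, this is not enough and an awk-based comparison is the safer choice.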