Count total duplicates

mikloz · March 24, 2015, 9:11pm

Hi all,

I have found other threads about counting duplicate lines, but I am interested in obtaining the total number of duplicates. For example:

#file.txt

a1
a2
a1
a3
a1
a2
a4
a5

#out

3 (lines are duplicates)

Thank you!

Don_Cragun · March 24, 2015, 10:36pm

Is this a homework assignment?

What have you tried?

mikloz · March 25, 2015, 12:30am

Hi Don Cragun,

It is not homework. I know it must be easy to do, but I can't find the right way.

I know how to obtain the count for each duplicated line:

sort file.txt | uniq -c -d

3 a1
2 a2

or the number of unique reads:

sort -u -k1 file.txt | wc -l

But not for my case.

Thank you

durden_tyler · March 25, 2015, 12:41am

$
$ cat f28
a1
a2
a1
a3
a1
a2
a4
a5
$
$ sort f28 | uniq -c | awk '$1 > 1{sum += $1 - 1} END{print sum" are duplicates"}'
3 are duplicates
$
$

$
$ awk '{a[$0] == 1 ? sum++ : a[$0]=1} END {print sum" are duplicates"}' f28
3 are duplicates
$
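A third variant, offered only as a rough sketch: the number of duplicates is the total line count minus the number of distinct lines, so the two counts can be subtracted directly. The temporary file and the variable names below are illustrative and not part of the original post.

```shell
# Sample data from the first post, written to a temporary file (illustrative).
f=$(mktemp)
printf '%s\n' a1 a2 a1 a3 a1 a2 a4 a5 > "$f"

total=$(wc -l < "$f")               # every line in the file
distinct=$(sort -u "$f" | wc -l)    # one line per distinct value
dups=$((total - distinct))          # lines that repeat an earlier line

echo "$dups are duplicates"
rm -f "$f"
```

This reads the file twice, so the single-pass awk versions are likely the better choice for large inputs.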

RavinderSingh13 · March 25, 2015, 12:48am

Hello Mikloz,

The following may help:

awk '{A[$1]++} END{;for(i in A){if(A[i]>1){S=S?S+A[i]-1:A[i]-1;}};print S}' Input_file

With the input you provided, the output will be 3.
For testing, suppose we change the file to the following:

cat check_count
a1
a2
a1
a3
a1
a2
a4
a5
a5
a1

Now, when we run the above command, we get the following:

awk '{A[$1]++} END{;for(i in A){if(A[i]>1){S=S?S+A[i]-1:A[i]-1;}};print S}' check_count
5

Hope this helps.

NOTE: This assumes your input file has a single column; otherwise, use A[$0] in place of A[$1].
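To illustrate the difference between the two keys, here is a small sketch on hypothetical two-column data (the file contents and variable names are made up for this example). Keying on $1 counts repeats of the first field, while keying on $0 counts repeats of the whole line.

```shell
# Hypothetical two-column data: the first field repeats, the full lines do not.
f=$(mktemp)
printf '%s\n' 'a1 x' 'a1 y' 'a2 x' > "$f"

# d+0 forces a numeric 0 when no duplicate was ever counted.
by_field=$(awk 'seen[$1]++ {d++} END {print d+0}' "$f")   # keyed on first field
by_line=$(awk 'seen[$0]++ {d++} END {print d+0}' "$f")    # keyed on whole line

echo "by first field: $by_field, by whole line: $by_line"
rm -f "$f"
```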

Thanks,
R. Singh

Don_Cragun · March 25, 2015, 1:07am

I would suggest the slightly simpler:

awk 'l[$0]++{d++}END{print d, "(lines are duplicates)"}' file.txt
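One detail worth knowing, shown here as a small sketch on made-up input: if the file contains no duplicates, d is never set and prints as an empty string. Writing d+0 forces a numeric zero instead. The sample data and variable names are assumptions for illustration only.

```shell
# Illustrative input with no repeated lines.
f=$(mktemp)
printf '%s\n' a1 a2 a3 > "$f"

# Original form: d is uninitialized, so the count prints as an empty string.
plain=$(awk 'l[$0]++{d++}END{print d, "(lines are duplicates)"}' "$f")

# d+0 converts the uninitialized variable to the number 0.
forced=$(awk 'l[$0]++{d++}END{print d+0, "(lines are duplicates)"}' "$f")

echo "[$plain]"
echo "[$forced]"
rm -f "$f"
```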

mikloz · March 25, 2015, 1:09am

Thanks to all.

And if I am interested in the rate?

$ cat f28 
a1 
a2 
a1 
a3 
a1 
a2 
a4 
a5

3/8

RavinderSingh13 · March 25, 2015, 1:11am

Hello Mikloz,

Then the following may help:

awk '{A[$1]++} END{;for(i in A){if(A[i]>1){S=S?S+A[i]-1:A[i]-1;}};print S"/" NR}'  Input_file

Thanks,
R. Singh

Don_Cragun · March 25, 2015, 1:17am

Or:

awk 'l[$0]++{d++}END{printf("%d/%d\n",d,NR)}' file.txt
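If the rate is wanted as a decimal fraction rather than as "3/8", the same counter can be divided by NR. This is only a sketch: the three-decimal format, the "no input" message, and the check for empty input are choices made for this example, not something requested in the thread.

```shell
# Sample data from the thread, written to a temporary file (illustrative).
f=$(mktemp)
printf '%s\n' a1 a2 a1 a3 a1 a2 a4 a5 > "$f"

# Divide duplicates by records read; skip the division when nothing was read.
rate=$(awk 'l[$0]++ {d++}
END {
	if (NR) printf("%.3f\n", d / NR)
	else    print "no input"
}' "$f")

echo "$rate"
rm -f "$f"
```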

mikloz · March 25, 2015, 9:05pm

Perfect! Thank you!

Don Cragun, can you please explain it? I suppose you create a list and add each line to it. If the line is already in the list, you increment the d variable. Is that correct?

Don_Cragun · March 25, 2015, 9:50pm

Yes, the script:

awk 'l[$0]++{d++}END{printf("%d/%d\n",d,NR)}' file.txt

can be rewritten as:

awk '		# Run awk with the following script...
l[$0]++ {	# Set array l[] indexed by the contents of the current input line to
		# the number of times this line has been seen so far and return the
		# number of times this line had been seen before this line.  If the
		# value returned is not zero and is not the empty string, execute
		# the commands in this section.  (This will happen any time this line
		# has been seen before.)

	d++	# Increment the number of duplicates seen.
}
END {		# After all lines have been read from all input files given to
		# this invocation of awk, run the commands in this section.

	printf("%d/%d\n", d, NR)	# Print the number of duplicates seen and
				# the Number of Records read from all of the input files
				# given to this invocation of awk.
}' file.txt	# End the script and specify the input file(s) to be processed by this
		# invocation of awk.
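To watch the post-increment at work, the following sketch prints, for each input line, the value that l[$0]++ returned and the running duplicate count. The variable names before and trace are introduced here for illustration and do not appear in the original script.

```shell
# Sample data from the thread, written to a temporary file (illustrative).
f=$(mktemp)
printf '%s\n' a1 a2 a1 a3 a1 a2 a4 a5 > "$f"

trace=$(awk '{
	before = l[$0]++	# value returned is the count BEFORE this line
	if (before) d++		# non-zero means the line was seen earlier
	print $0, "seen before:", before, "duplicates so far:", d + 0
}' "$f")

printf '%s\n' "$trace"
rm -f "$f"
```

The third line of the trace is the first one where "seen before" is non-zero, which is where d first increases.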

I hope this helps.

mikloz · March 26, 2015, 12:43am

Got it! Thanks!

ken6503 · March 26, 2015, 10:55pm

great explanation, thanks.