Hi all,
I have found other threads about counting duplicate lines, but I am interested in obtaining the total number of duplicates. For example:
#file.txt
a1
a2
a1
a3
a1
a2
a4
a5
#out
3 (lines are duplicates)
Thank you!
Is this a homework assignment?
What have you tried?
Hi Don Cragun,
It is not homework. I know it must be easy to do, but I can't find the right way.
I know how to get the number of occurrences of each duplicated line:
sort file.txt | uniq -c -d
3 a1
2 a2
or the number of unique lines:
sort -u -k1 file.txt | wc -l
5
But neither gives me what I need.
Thank you
$
$ cat f28
a1
a2
a1
a3
a1
a2
a4
a5
$
$ sort f28 | uniq -c | awk '$1 > 1{sum += $1 - 1} END{print sum" are duplicates"}'
3 are duplicates
$
$
$
$ awk '{a[$0] == 1 ? sum++ : a[$0]=1} END {print sum" are duplicates"}' f28
3 are duplicates
$
Hello Mikloz,
The following may help.
awk '{A[$1]++} END{;for(i in A){if(A[i]>1){S=S?S+A[i]-1:A[i]-1;}};print S}' Input_file
With the input you provided, the output will be 3.
For testing, let's say we change the file to the following:
cat check_count
a1
a2
a1
a3
a1
a2
a4
a5
a5
a1
Now when we run the above command, we get the following:
awk '{A[$1]++} END{;for(i in A){if(A[i]>1){S=S?S+A[i]-1:A[i]-1;}};print S}' check_count
5
Hope this helps.
NOTE: This assumes that your input file has a single column; otherwise, use A[$0] in place of A[$1].
Thanks,
R. Singh
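On the $1 versus $0 note above: a minimal sketch of how the two keys can differ once lines have more than one column. The function names and the two-column sample input are hypothetical and only serve the illustration; the counting idiom is a compact equivalent, not the exact command posted above.

```shell
# Count duplicates keyed on the first field only.
dups_first_field() { awk 'seen[$1]++ { d++ } END { print d + 0 }' "$@"; }

# Count duplicates keyed on the whole line.
dups_whole_line() { awk 'seen[$0]++ { d++ } END { print d + 0 }' "$@"; }

# "a1 x" and "a1 y" share a first field but are different lines.
printf '%s\n' 'a1 x' 'a1 y' 'a2 x' | dups_first_field   # prints 1
printf '%s\n' 'a1 x' 'a1 y' 'a2 x' | dups_whole_line    # prints 0
```

For a single-column file like the one in this thread, both keys give the same count.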
I would suggest the slightly simpler:
awk 'l[$0]++{d++}END{print d, "(lines are duplicates)"}' file.txt
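One thing to be aware of with the one-liner above: if the file has no duplicates at all, d is never set and awk prints an empty string instead of 0. A small sketch of one way to handle that, assuming you want a plain number as output (count_dups is a hypothetical helper name, not something from this thread):

```shell
# Print the total number of duplicate lines; "d + 0" forces a numeric
# result, so an input without duplicates prints 0 rather than nothing.
count_dups() {
    awk 'seen[$0]++ { d++ } END { print d + 0 }' "$@"
}

printf '%s\n' a1 a2 a1 a3 a1 a2 a4 a5 | count_dups   # prints 3
printf '%s\n' a1 a2 a3 | count_dups                  # prints 0
```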
Thanks to all.
And what if I am interested in the rate? For example:
$ cat f28
a1
a2
a1
a3
a1
a2
a4
a5
3/8
Hello Milkoz,
Then the following may help.
awk '{A[$1]++} END{;for(i in A){if(A[i]>1){S=S?S+A[i]-1:A[i]-1;}};print S"/" NR}' Input_file
Thanks,
R. Singh
Or:
awk 'l[$0]++{d++}END{printf("%d/%d\n",d,NR)}' file.txt
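If the rate is also wanted as a decimal number, something along these lines might work. It is only a sketch: the dup_rate name, the three-decimal output format, and the "0/0" result for empty input are my own assumptions, not something requested in this thread.

```shell
# Print "duplicates/total = ratio"; guard against dividing by zero
# when the input is empty (NR is 0 in that case).
dup_rate() {
    awk 'seen[$0]++ { d++ }
         END {
             if (NR > 0) {
                 printf("%d/%d = %.3f\n", d, NR, d / NR)
             } else {
                 print "0/0"
             }
         }' "$@"
}

printf '%s\n' a1 a2 a1 a3 a1 a2 a4 a5 | dup_rate   # prints 3/8 = 0.375
```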
Perfect! Thank you!
Please, Don Cragun, can you explain it? I suppose you create a list and add each line to it. If the line is already in the list, you increase the d variable. Is that correct?
Yes, the script:
awk 'l[$0]++{d++}END{printf("%d/%d\n",d,NR)}' file.txt
can be rewritten as:
awk '       # Run awk with the following script...
l[$0]++ {   # Set array l[] indexed by the contents of the current input line to
            # the number of times this line has been seen so far and return the
            # number of times this line had been seen before this line. If the
            # value returned is not zero and is not the empty string, execute
            # the commands in this section. (This will happen any time this line
            # has been seen before.)
    d++     # Increment the number of duplicates seen.
}
END {       # After all lines have been read from all input files given to
            # this invocation of awk, run the commands in this section.
    printf("%d/%d\n", d, NR)    # Print the number of duplicates seen and
            # the Number of Records read from all of the input files
            # given to this invocation of awk.
}' file.txt # End the script and specify the input file(s) to be processed by this
            # invocation of awk.
I hope this helps.
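To watch the explanation above in action, here is a small tracing sketch. It splits l[$0]++ into an explicit step so the value returned by the post-increment can be printed for each line. The trace_dups name, the before variable, and the output format are hypothetical and only for illustration.

```shell
# For every input line, show how many times it had been seen before
# (the value that l[$0]++ returns) and the running duplicate count.
trace_dups() {
    awk '{
        before = l[$0]++        # value before the increment
        if (before) d++         # non-zero means the line was seen already
        printf("%s seen_before=%d duplicates=%d\n", $0, before, d)
    }' "$@"
}

printf '%s\n' a1 a2 a1 a1 | trace_dups
# a1 seen_before=0 duplicates=0
# a2 seen_before=0 duplicates=0
# a1 seen_before=1 duplicates=1
# a1 seen_before=2 duplicates=2
```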
Got it! Thanks!
great explanation, thanks.