Search Duplicates, Print Line #

genehunter · March 24, 2010, 5:59pm

Masters,

I have a text file in the following format.

vrsonlviee	RVEBAALSKE
lyolzteglx	UUOSIWMDLR
pcybtapfee	DKGFJBHBJO
ozhrucfeau	YQXATYMGJD
cjwvjolrcv	YDHALRYQTG
mdukphspbc	CQZRIOWEUB
nbiqomzsgw	DYSUBQSSPZ
xovgvkneav	HJFQQYBLAF
boyyzdmzka	BVTVUDHSCR
vrsonlviee	TGTKUCUYMA
pcybtapfee	CQZRIOWEUB

I want to find duplicates in column 2 and then get their line numbers.
I also want a way to remove them using those line numbers.
I chose line numbers so that I can decide which of the duplicate lines to remove, taking the value in column 1 into account.
Awk, sed, or egrep preferred.

Thanks

Scott · March 24, 2010, 6:09pm

You can find the duplicates with something like:

awk 'A[$2]++ { print NR }' file

You don't need the line numbers to remove them; this keeps only the first line for each value in column 2:

awk '!A[$2]++' file
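
If the goal is to see every line that shares a column 2 value, including the first occurrence, one option is to read the file twice. The following is a minimal sketch, assuming whitespace-separated input; `sample.txt` is a hypothetical scratch file built from the data in the question.

```shell
# Hypothetical scratch copy of the sample data from the question (tab-separated).
printf '%s\t%s\n' \
  vrsonlviee RVEBAALSKE \
  lyolzteglx UUOSIWMDLR \
  pcybtapfee DKGFJBHBJO \
  ozhrucfeau YQXATYMGJD \
  cjwvjolrcv YDHALRYQTG \
  mdukphspbc CQZRIOWEUB \
  nbiqomzsgw DYSUBQSSPZ \
  xovgvkneav HJFQQYBLAF \
  boyyzdmzka BVTVUDHSCR \
  vrsonlviee TGTKUCUYMA \
  pcybtapfee CQZRIOWEUB > sample.txt

# Pass 1 (NR == FNR): count each column 2 value.
# Pass 2: print "line number:line" for every line whose value occurs more than once.
dups=$(awk 'NR == FNR { count[$2]++; next }
            count[$2] > 1 { print FNR ":" $0 }' sample.txt sample.txt)
printf '%s\n' "$dups"
```

On the sample data this should list lines 6 and 11, so column 1 can be compared before deciding which line to delete.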

ldapswandog · March 24, 2010, 7:21pm

Someone who knows awk well will probably provide a much better solution, but I can at least offer one that works.

# # get the list of duplicates in column 2
awk '{print $2}' file | sort | uniq -c | sort -n | awk '$1>1 {print $2}' > list_dups

# # for each duplicate in column 2 grep the entries from the file with line numbers
for x in $(< list_dups); do grep -n "$x" file; done

# # output
6:mdukphspbc    CQZRIOWEUB
11:pcybtapfee   CQZRIOWEUB

# # now remove the duplicate on line 6
sed '6d' file > file2

# # output after removing line 6
cat file2
vrsonlviee      RVEBAALSKE
lyolzteglx      UUOSIWMDLR
pcybtapfee      DKGFJBHBJO
ozhrucfeau      YQXATYMGJD
cjwvjolrcv      YDHALRYQTG
nbiqomzsgw      DYSUBQSSPZ
xovgvkneav      HJFQQYBLAF
boyyzdmzka      BVTVUDHSCR
vrsonlviee      TGTKUCUYMA
pcybtapfee      CQZRIOWEUB
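
The deletion step can also be done in awk, which makes it easy to drop several chosen lines at once. This is a minimal sketch, assuming the line numbers are passed as a comma-separated list in a variable named `drop` here; `sample.txt` and `cleaned.txt` are hypothetical file names.

```shell
# Hypothetical scratch copy of the sample data from the question (tab-separated).
printf '%s\t%s\n' \
  vrsonlviee RVEBAALSKE \
  lyolzteglx UUOSIWMDLR \
  pcybtapfee DKGFJBHBJO \
  ozhrucfeau YQXATYMGJD \
  cjwvjolrcv YDHALRYQTG \
  mdukphspbc CQZRIOWEUB \
  nbiqomzsgw DYSUBQSSPZ \
  xovgvkneav HJFQQYBLAF \
  boyyzdmzka BVTVUDHSCR \
  vrsonlviee TGTKUCUYMA \
  pcybtapfee CQZRIOWEUB > sample.txt

# Split the comma-separated list into a lookup table, then print only
# the lines whose number is not in that table.
awk -v drop="6" '
  BEGIN { n = split(drop, nums, ","); for (i = 1; i <= n; i++) skip[nums[i]] = 1 }
  !(FNR in skip)
' sample.txt > cleaned.txt
cat cleaned.txt
```

Setting `drop="6,11"` would remove both lines in one pass; with sed the rough equivalent is `sed '6d;11d'`.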