Find lines with duplicate values in a particular column

kaktus · October 10, 2019, 6:34pm

I have a file with 5 columns. I want to pull out all records where the value in column 4 is not unique. For example in the sample below, I would want it to print out all lines except for the last two.

40991764	2419	724	47182	Cand A
40992936	3591	724	47182	Cand B
40993016	3671	724	47182	Cand C
40993876	4531	724	10154	Strep A
40993878	4533	724	10154	Strep B
40993990	4645	724	58899	Cala A
40993991	4646	724	63849	Myco A

I tried this:

awk -F '\t' 'a=x[$4]{print a"\n"$0;} {x[$4]=$0;}'

It works well if a value has only one duplicate (10154 above), but if a value has more than one duplicate (47182 above), it prints one of the matched lines twice (Cand B):

40991764	2419	724	47182	Cand A
40992936	3591	724	47182	Cand B
40992936	3591	724	47182	Cand B
40993016	3671	724	47182	Cand C
40993876	4531	724	10154	Strep A
40993878	4533	724	10154	Strep B

How can I get it to print each unique line only once?

Chubler_XL · October 10, 2019, 8:23pm

Try this:

awk -F '\t' '{ if($4 in x){ print (x[$4]?x[$4]"\n":"")$0;x[$4]=""} else x[$4]=$0}'

edit: or this

awk -F '\t' '{if($4 in x){if(x[$4]) print x[$4]; print;x[$4]=""} else x[$4]=$0}'

kaktus · October 10, 2019, 9:44pm

Thanks! Both of these get me the desired output. I don't fully understand how it works though. Would you mind breaking it down?

Chubler_XL · October 10, 2019, 9:55pm

No worries, it works like this.

On the first occurrence of a new $4 value, ($4 in x) is false, so x[$4] is assigned the whole record.

On the second occurrence, $4 is in x (we assigned it on the first occurrence) and x[$4] is non-blank, so we do print x[$4], which prints the first record. Then we do print to print the current record, and set x[$4] to blank.

On the third and further occurrences, $4 is still in x but the array item is blank now, so we just print the current record and set x[$4] to blank again.
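
The three cases above can also be written out as a commented script. This is only a sketch of the same logic as the one-liners earlier in the thread: the function name dup_col4 is made up for illustration, and the blank test is written as x[$4] != "" rather than a bare x[$4], a small deviation that should behave the same on this data.

```shell
# Hypothetical helper (name is an assumption): print every record whose
# 4th tab-separated field occurs more than once, each record only once.
dup_col4() {
  awk -F '\t' '
    {
      if ($4 in x) {                     # key seen before: 2nd or later occurrence
        if (x[$4] != "") print x[$4]     # first record still held back, so print it
        print                            # print the current record
        x[$4] = ""                       # blank marks the first record as printed
      } else {
        x[$4] = $0                       # 1st occurrence: hold the record back
      }
    }' "$@"
}

# Sample data from the first post, rebuilt with tabs between the columns.
printf '%s\t%s\t%s\t%s\t%s\n' \
  40991764 2419 724 47182 'Cand A' \
  40992936 3591 724 47182 'Cand B' \
  40993016 3671 724 47182 'Cand C' \
  40993876 4531 724 10154 'Strep A' \
  40993878 4533 724 10154 'Strep B' \
  40993990 4645 724 58899 'Cala A' \
  40993991 4646 724 63849 'Myco A' |
  dup_col4
```

Records with a unique key (58899 and 63849 here) are stored but never printed, because their key is never seen a second time.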

Edit:
One thing to be careful of is that awk will create an array item as soon as it's referenced, for example:

$ awk 'BEGIN { print T["test"]; print ("test" in T) }'

1

Using key in array is safe and does not create an item:

$ awk 'BEGIN { print ("test" in T); print ("test" in T) }'
0
0
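
To make the side effect visible, here is a small sketch that counts the array items before and after a bare reference. The function name ref_demo is made up for illustration, and the count uses a for-in loop so it does not rely on length() accepting an array.

```shell
# Hypothetical demo (name is an assumption): show that a membership test
# leaves the array empty, while reading T["test"] creates the item.
ref_demo() {
  awk 'BEGIN {
    if ("test" in T) print "in: yes"; else print "in: no"   # membership test only
    n = 0; for (k in T) n++
    print "elements after in-test: " n
    if (T["test"] == "") print "value is empty"             # this read creates the item
    n = 0; for (k in T) n++
    print "elements after reference: " n
  }'
}

ref_demo
```

This matters for the solution above because it keeps two pieces of state per key: whether the key is present at all (seen before) and whether its value is blank (first record already printed).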

kaktus · October 10, 2019, 10:29pm

Thanks for the additional explanation. I can follow it now.

RudiC · October 11, 2019, 3:40am

Provided your uniq supports all the options shown (-D and -w are GNU extensions), try

sort -k4,4 file | uniq -D -f3 -w5
40993876    4531    724    10154    Strep A
40993878    4533    724    10154    Strep B
40991764    2419    724    47182    Cand A
40992936    3591    724    47182    Cand B
40993016    3671    724    47182    Cand C