I have a file with 5 columns. I want to pull out all records where the value in column 4 is not unique. For example in the sample below, I would want it to print out all lines except for the last two.
40991764 2419 724 47182 Cand A
40992936 3591 724 47182 Cand B
40993016 3671 724 47182 Cand C
40993876 4531 724 10154 Strep A
40993878 4533 724 10154 Strep B
40993990 4645 724 58899 Cala A
40993991 4646 724 63849 Myco A
I tried this:
awk -F '\t' 'a=x[$4]{print a"\n"$0;} {x[$4]=$0;}'
It works well when a value in column 4 occurs only twice (10154 above), but when it occurs more than twice (47182 above), it prints one of the matched duplicates (Cand B) twice:
40991764 2419 724 47182 Cand A
40992936 3591 724 47182 Cand B
40992936 3591 724 47182 Cand B
40993016 3671 724 47182 Cand C
40993876 4531 724 10154 Strep A
40993878 4533 724 10154 Strep B
How can I get it to print each unique line only once?
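Keep the first record seen for each $4 value and flush it the first time a duplicate turns up. Here is a sketch of a one-liner along those lines, reconstructed from the walkthrough below (file is a placeholder for your input, assumed tab-separated):

awk -F '\t' '
    ($4 in x) {            # $4 has been seen before
        if (x[$4] != "")   # the first record has not been printed yet
            print x[$4]
        print              # print the current duplicate record
        x[$4] = ""         # blank the entry so the first record prints only once
        next
    }
    { x[$4] = $0 }         # first occurrence: remember the whole record
' file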
On the first occurrence of a new $4 value, ($4 in x) is false, so x[$4] is assigned the record value.
On the second occurrence, $4 is in x (we assigned it on the first occurrence) and x[$4] is non-blank, so we print x[$4], which outputs the first record; then we print the current record and set x[$4] to blank.
On the third and further occurrences, $4 is still in x, but the array item is now blank, so we just print the current record and set x[$4] to blank again.
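With the sample data above, the sketched command prints the first five lines, and Cand B appears only once:

40991764 2419 724 47182 Cand A
40992936 3591 724 47182 Cand B
40993016 3671 724 47182 Cand C
40993876 4531 724 10154 Strep A
40993878 4533 724 10154 Strep B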
Edit:
One thing to be careful of is that awk will create an array item as soon as it is referenced. For example:
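A minimal illustration (a hypothetical snippet, not from the original answer): merely comparing x["foo"] creates the element, even though nothing was ever assigned to it:

awk 'BEGIN {
    if (x["foo"] == "")        # this comparison creates x["foo"]
        print "element looks empty"
    for (k in x)               # yet the key now exists in the array
        print "created:", k
}'

which prints:

element looks empty
created: foo

This is why the solution tests membership with ($4 in x), which does not create an element, rather than testing x[$4] directly the way the original attempt does with a=x[$4].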