How to identify duplicate columns in a row

suresh3566 · November 12, 2009, 7:11am

Hi,

How can I identify the rows in which a column value is duplicated?

Input data (each row may have as many as 30 columns):

9211480750 LK 120070417 920091030
9211480893 AZ 120070607
9205323621 O7 120090914 120090914 1420090914 2020090914 2020090914
9211479568 AZ 120070327 320090730
9211479571 MM 120070326
9211480892 MM 120070324
9211479945 AZ 120070306 320091109 920091002
9211480855 AZ 120070330 920090913
9211479857 AZ 120070306 920090916
9211480863 MM 120070314
9211479935 MM 120070306
9211479588 AZ 120070323
9211479565 MM 120070311
9289819968 OD null
9211479947 AZ 120070306 120070306
9211479939 ID 120070306 220091105 920091031 1220091105

Expected output (only the rows in which a value repeats):

9205323621 O7 120090914 120090914 1420090914 2020090914 2020090914
9211479947 AZ 120070306 120070306

Franklin52 · November 12, 2009, 7:41am

Try this:

awk '{
  for(i=1;i<=NF;i++){
    for(j=i+1;j<=NF;j++){
      if($i==$j){print; next}
    }
  }
}' file
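
One caveat, in case it matters for your data: awk compares two fields numerically when both look like numbers, so $i==$j would treat 010 and 10 as the same value. A minimal sketch that forces a string comparison instead (dup_rows_str is just a made-up wrapper name so it can be tried on a pipe or a file):

```shell
# dup_rows_str is a hypothetical helper name, not something from the thread.
dup_rows_str() {
  awk '{
    for (i = 1; i <= NF; i++)
      for (j = i + 1; j <= NF; j++)
        # Appending "" turns both operands into strings,
        # so the fields are compared character by character.
        if (($i "") == ($j "")) { print; next }
  }' "$@"
}

# Two test lines: a real duplicate, and a numeric look-alike (010 vs 10).
printf '%s\n' '9211479947 AZ 120070306 120070306' '7 010 10' | dup_rows_str
```

With the two test lines above, only the first is printed; with a plain $i==$j the second would be printed too.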

radoulov · November 12, 2009, 8:44am

With Perl:

perl -ane'
  grep $_{$_}++, @F and print; undef %_
  ' infile

And another one with AWK:

awk '{
  for (i=1; i<=NF; i++)
    if (_[$i]++) { print; break }
  split(x, _)   # x is unset, so this empties the array before the next row
}' infile

If your AWK implementation supports delete <array>:

awk '{
  for (i=1; i<=NF; i++)
    if (_[$i]++) { print; break }
  delete _
}' infile

rdcwayx · November 16, 2009, 1:02am

Nice solution, using split in awk to clear the array. I have updated the code to make it easier to understand.

awk '{
  for (i=1; i<=NF; i++)
    if (A[$i]++) { print; break }
  split("", A)   # splitting an empty string leaves A with no elements
}' infile
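
For anyone who wants to try the array approach on the sample data, here is a minimal sketch wrapped in a shell function. The names dup_rows and seen are mine, not from the posts above, and the input is a few rows copied from the question.

```shell
# dup_rows is a hypothetical wrapper name; the awk body follows the approach in the thread.
dup_rows() {
  awk '{
    split("", seen)            # empty the array at the start of each row
    for (i = 1; i <= NF; i++)
      if (seen[$i]++) {        # non-zero: this value already appeared in the row
        print
        break
      }
  }' "$@"
}

# A few rows taken from the question.
dup_rows <<'EOF'
9211480750 LK 120070417 920091030
9205323621 O7 120090914 120090914 1420090914 2020090914 2020090914
9289819968 OD null
9211479947 AZ 120070306 120070306
EOF
```

This should print the 9205323621 and 9211479947 rows. Clearing the array at the top of the block is only a stylistic choice here; the effect is the same as clearing it after the loop.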