Print unique names in each row of a specific column using awk

quincyjones · December 12, 2012, 11:36am

Is it possible to remove redundant names in the 4th column?

input

cqWE    100    200    singapore;singapore
AZO    300    400    brazil;america;germany;ireland;germany
....
....

output

cqWE    100    200    singapore
AZO    300    400    brazil;america;germany;ireland

radoulov · December 12, 2012, 11:46am

With Perl you could write something like this:

perl -lpe'
  %_ = ();
  s/[^\s]+$/join ";", grep !$_{$_}++, split ";", $&/e
  ' infile

With awk the code will be more noisy.

vgersh99 · December 12, 2012, 11:48am

A 'noisy' awk version, run as: awk -f quincy.awk myFile
quincy.awk:

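{
  split("",t)                 # clear the lookup table for each line
  n=split($4, a,";")          # a[1..n] holds the names from column 4
  $4=""
  for(i=1;i<=n;i++)
    if( !(a[i] in t)) {       # first time this name appears on the line
      $4=(i==1)?a[i]:$4 ";" a[i]
      t[a[i]]                 # referencing the element marks it as seen
    }
  print
}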
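To try this approach on the two sample rows without a separate script file, something like the following should work. It is only a sketch: the function name dedup_col4 is made up for illustration, and the tab OFS is an assumption about the desired output. Assigning to $4 makes awk rebuild the record with OFS, so the original column spacing is not kept.

```shell
# dedup_col4: hypothetical helper name, not from the thread.
# Keeps the first occurrence of each ';'-separated name in field 4.
dedup_col4() {
  awk -v OFS='\t' '{
    split("", seen)                 # clear the per-line lookup table
    n = split($4, name, ";")
    out = sep = ""
    for (i = 1; i <= n; i++)
      if (!(name[i] in seen)) {
        seen[name[i]]               # referencing the element creates it
        out = out sep name[i]
        sep = ";"
      }
    $4 = out                        # rebuilds $0 with OFS (assumed: tab)
    print
  }' "$@"
}

printf '%s\n' \
  'cqWE    100    200    singapore;singapore' \
  'AZO    300    400    brazil;america;germany;ireland;germany' |
  dedup_col4
```

The columns come out tab-separated. If the original spacing matters, the record has to be rebuilt by hand instead of by assigning to a field.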

radoulov · December 12, 2012, 11:50am

radoulov · December 12, 2012, 12:23pm

If you prefer awk and have a recent GNU awk (one that supports the four-argument split),
you can rebuild each record exactly after the change, including variable-width field separators,
and so preserve the original formatting:

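awk '{
  # gawk only: s[i] receives the separator text that follows field i
  split($0, t, FS, s)
  # print every field but the last, each followed by its original separator
  for (i = 0; ++i < NF;)
    printf "%s", $i s[i]
  # the loop exits with i == NF, so t[i] is the last field
  n = split(t[i], tt, fs)
  delete _; lf = x
  # append each name the first time it is seen, followed by fs
  for (i = 0; ++i <= n;)
    lf = lf (_[tt[i]]++ ? x : tt[i] fs)
  # drop the trailing fs
  print substr(lf, 1, length(lf) - 1)
  }' fs=\; infile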
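Without GNU awk, a similar result may be possible with match() and substr(), which most awk implementations provide. This is a rough sketch, not a tested replacement. The function name dedup_last_field is made up for illustration, and it assumes the names are in the last field and the line has no trailing blanks, much like the Perl regex earlier in the thread.

```shell
# dedup_last_field: hypothetical helper name, not from the thread.
# Rewrites only the last whitespace-delimited field; the text before it
# (including the original spacing) is copied through untouched.
dedup_last_field() {
  awk '{
    if (match($0, /[^ \t]+$/)) {          # last run of non-blank characters
      head = substr($0, 1, RSTART - 1)    # everything before the last field
      n = split(substr($0, RSTART), name, ";")
      split("", seen)                     # clear the per-line lookup table
      out = sep = ""
      for (i = 1; i <= n; i++)
        if (!(name[i] in seen)) {
          seen[name[i]]                   # mark the name as seen
          out = out sep name[i]
          sep = ";"
        }
      $0 = head out
    }
    print
  }' "$@"
}

printf '%s\n' \
  'cqWE    100    200    singapore;singapore' \
  'AZO    300    400    brazil;america;germany;ireland;germany' |
  dedup_last_field
```

A line that ends in blanks does not match the pattern and is printed unchanged.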