Print unique names in a specific column using awk

Is it possible to modify a file like this?

  1. Remove all duplicate names in a given column, i.e. the 4th column
  2. Count the number of unique names (separated by ";") and print the count as a 5th column

Thanks in advance!
Q

input

c1	30	3	Eh2
c10	96	3	Frp
c41	396	3	Ua5;Lop;Kol;Kol
c62	2	30	Fmp;Fmp;Fmp

output

c1	30	3	Eh2	1
c10	96	3	Frp	1
c41	396	3	Ua5;Lop;Kol	3
c62	2	30	Fmp	1

Try

awk     '       {n=split ($4, T, ";")
                 for (i=n; i>=1; i--) {
                   for (j=i-1; j>=1; j--)
                     if (T[i]==T[j]) {n--; break}
                    }
                 $4 = T[1]
                 for (i=2; i<=n; i++) $4=$4 ";" T[i]
                 $5 = n
                }
         1
        ' OFS="\t" file
c1     30    3    Eh2    1
c10    96    3    Frp    1
c41    396   3    Ua5;Lop;Kol    3
c62    2    30    Fmp    1

I just noticed that one of my 4th-column entries has 300 names (most of them duplicates). The script fails in this case.

Are you sure it's all on the same line and not split across two lines in your file?

Yes I am sure.

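There was a glitch in the logic: the first version only decremented n for each duplicate, so it dropped entries from the end of the list rather than the duplicates themselves. That works only when the repeats sit at the end.

Modified version: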

 
 
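awk     '       {n=split ($4, T, ";")
                 # delete every element that also occurs earlier in the list
                 for (i=n; i>=1; i--) {
                   for (j=i-1; j>=1; j--)
                     if (T[i]==T[j]) {delete T[i]; break}
                    }
                 $4 = T[1]
                 # "i in T" keeps surviving names even if they are "0" or empty
                 for (i=2; i<=n; i++) {if (i in T) { $4=$4 ";" T[i]}}
                 $5 = split($4,A,";")
                }
         1
        ' OFS="\t" filename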
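
For comparison, here is a sketch of a single-pass variant that remembers names in a lookup array instead of comparing every pair, which may be easier on fields with hundreds of names. The wrapper name dedup_col4 and the array names parts and seen are my own choices, not from the scripts above, and it assumes whitespace- or tab-separated input like the sample.

```shell
# dedup_col4 is a hypothetical helper name; it reads from stdin or from the files given
dedup_col4() {
    awk 'BEGIN { OFS = "\t" }
    {
        n = split($4, parts, ";")
        split("", seen)              # portable way to empty the array for each line
        out = ""; count = 0
        for (i = 1; i <= n; i++) {
            if (!(parts[i] in seen)) {               # first time this name appears
                seen[parts[i]] = 1
                out = (count++ ? out ";" : "") parts[i]
            }
        }
        $4 = out                     # unique names, in order of first appearance
        $5 = count                   # number of unique names
        print
    }' "$@"
}
```

Call it as `dedup_col4 filename`. Each name is looked up once, so the work per line grows with the number of names rather than with its square.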