Print unique names in a specific column using awk

quincyjones · May 2, 2013, 4:01am

Is it possible to modify a file like this?

Remove all the duplicate names in a given column, i.e. the 4th column
Count the number of unique names (separated by ";") and print it as a 5th column

Thanks in advance!!
Q

input

c1	30	3	Eh2
c10	96	3	Frp
c41	396	3	Ua5;Lop;Kol;Kol
c62	2	30	Fmp;Fmp;Fmp

output

c1	30	3	Eh2	1
c10	96	3	Frp	1
c41	396	3	Ua5;Lop;Kol	3
c62	2	30	Fmp	1

RudiC · May 2, 2013, 4:48am

Try

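awk     '       {n=split ($4, T, ";")                # T[1..n] = names in column 4
                 for (i=n; i>=1; i--) {
                   for (j=i-1; j>=1; j--)
                     if (T[i]==T[j]) {n--; break}    # T[i] also occurs earlier: one name fewer
                    }
                 $4 = T[1]
                 for (i=2; i<=n; i++) $4=$4 ";" T[i] # rebuild column 4 from the first n names
                 $5 = n
                }
         1
        ' OFS="\t" file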
c1	30	3	Eh2	1
c10	96	3	Frp	1
c41	396	3	Ua5;Lop;Kol	3
c62	2	30	Fmp	1

quincyjones · May 2, 2013, 5:00am

I just noticed that one of my 4th-column entries has 300 names (most of them duplicates). The script fails in this case.
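
A possible explanation, offered as a sketch rather than a diagnosis of the actual file: assuming the array subscripts in the script above are T[i], its loop only decrements n and never removes the repeated entry, so column 4 is rebuilt from the first n names of the original list. That works when every repeat sits at the end of the list, as in the sample input, but not when repeats appear earlier. The wrapper name first_try and the test line are made up for illustration.

```shell
# first_try: hypothetical wrapper around the logic of the script above.
first_try() {
    awk -v OFS='\t' '{
        n = split($4, T, ";")
        for (i = n; i >= 1; i--)
            for (j = i - 1; j >= 1; j--)
                if (T[i] == T[j]) { n--; break }    # only the count shrinks
        $4 = T[1]
        for (i = 2; i <= n; i++) $4 = $4 ";" T[i]   # still the first n original names
        $5 = n
    }
    1'
}

# Made-up line where the repeat comes first: the count (2) is right,
# but column 4 comes out as Kol;Kol instead of Kol;Ua5.
printf 'c41\t396\t3\tKol;Kol;Ua5\n' | first_try
```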

vidyadhar85 · May 2, 2013, 5:19am

Are you sure it's all on the same line and not split across two lines in your file?

quincyjones · May 2, 2013, 5:25am

Yes I am sure.

vidyadhar85 · May 2, 2013, 6:01am

There was a glitch in the logic.

I modified it:

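awk     '       {n=split ($4, T, ";")                # T[1..n] = names in column 4
                 for (i=n; i>=1; i--) {
                   for (j=i-1; j>=1; j--)
                     if (T[i]==T[j]) {delete T[i]; break}   # drop a name that occurs earlier
                    }
                 $4 = T[1]                           # T[1] is never deleted
                 for (i=2; i<=n; i++) {if (i in T) { $4=$4 ";" T[i]}}  # keep surviving names only
                 $5 = split($4,A,";")                # count the names that are left
                }
         1
        ' OFS="\t" filename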
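
For long lists such as the 300-name case, the same result can probably be had in a single pass by remembering which names have already been seen, instead of comparing every pair. This is only a sketch under a few assumptions: the input is tab-separated, the first occurrence of each name is the one to keep, and uniq_col4 is a made-up helper name, not something from the thread.

```shell
# uniq_col4: hypothetical helper. Reads tab-separated lines on stdin,
# removes repeated names from column 4 (first occurrence wins) and
# writes the number of distinct names as column 5.
uniq_col4() {
    awk 'BEGIN { FS = OFS = "\t" }      # assumption: tab-separated input
    {
        n = split($4, T, ";")
        split("", seen)                 # portable way to empty the array for each line
        out = ""
        cnt = 0
        for (i = 1; i <= n; i++) {
            if (T[i] in seen) continue  # name already kept, skip it
            seen[T[i]] = 1
            if (cnt++) out = out ";" T[i]
            else       out = T[i]
        }
        $4 = out
        $5 = cnt
        print
    }'
}

# Two lines from the sample input in the first post
printf 'c41\t396\t3\tUa5;Lop;Kol;Kol\nc62\t2\t30\tFmp;Fmp;Fmp\n' | uniq_col4
```

With a real file it would be called as uniq_col4 < filename. The lookup in seen does one membership test per name, so the work grows with the length of the list and not with its square.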