Remove duplicates in a row after checking the first similar name

manigrover · August 11, 2012, 9:56pm

Hi all

I have a big file arranged in rows and columns. From the 2nd column onwards, each column is followed by its description: the 3rd column is the description of the 2nd column, the 5th column is the description of the 4th column, and so on.

All columns are separated by commas:

CHST3,docetaxel,xyznox,tyurppw,notavailble,docetaxel,xyznox,jfhdkg,notavailable

ESRT4,ghtscjgh,notavailable,Ghjfuti,notavailable,manhfd, kdcvgh,Ghjfuti,not available,manhfd, kdcvgh

I want to remove duplicates. The catch is that it should check whether the entry in column n equals the entry in column n+2; if it does, columns n+2 and n+3 should be removed, otherwise not.

So the expected output is:

CHST3,docetaxel,xyznox,tyurppw,notavailble,jfhdkg,notavailable

ESRT4,ghtscjgh,notavailable,Ghjfuti,notavailable,manhfd,kdcvgh

agama · August 12, 2012, 12:14am

If I understand your requirements, this might do the trick:

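awk '
    {
        n = split( $0, a, "," );    # a[1] is the ID; (a[2],a[3]), (a[4],a[5]), ... are name/description pairs
        for( i = 2; i < n; i += 2 )
        {
            if( a[i] != "" )        # skip pairs already blanked as duplicates
                for( j = i+2; j < n; j += 2 )
                    if( a[i] == a[j] && a[i+1] == a[j+1] )
                        a[j] = a[j+1] = "";     # blank the later duplicate pair
        }

        sep = "";
        for( i = 1; i <= n; i++ )
            if( a[i] != "" )
            {
                printf( "%s%s", sep, a[i] );    # separator goes before each field, so no trailing comma
                sep = ",";
            }
        printf( "\n" );
    }
' input-file >output-file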
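
The script above only drops a pair when both the name and its description repeat. The expected output in the question also drops "Ghjfuti,not available", whose description differs from the first one, and prints "kdcvgh" without its leading space. If that is what is wanted, a variant that compares names only and trims the padding around each field might be closer. This is only a sketch under those two assumptions; the function name dedupe_by_name is made up for illustration.

```shell
# Hypothetical helper: reads comma-separated rows on stdin,
# writes the de-duplicated rows to stdout.
dedupe_by_name() {
    awk -F',' '
    {
        split( "", seen );                  # reset the set of names for this row
        out = $1;                           # column 1 (the row ID) is always kept
        for( i = 2; i <= NF; i += 2 )
        {
            name = $i;
            desc = ( i < NF ) ? $(i+1) : "";
            gsub( /^[ \t]+|[ \t]+$/, "", name );    # assumption: padding is not significant
            gsub( /^[ \t]+|[ \t]+$/, "", desc );
            if( name in seen )
                continue;                   # later repeat: drop name and description
            seen[name] = 1;
            out = out "," name;
            if( i < NF )
                out = out "," desc;
        }
        print out;
    }'
}
```

It would be run as: dedupe_by_name <input-file >output-file. Only the first occurrence of each name in a row is kept, together with the description that follows it.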