Remove duplicates in a dataframe (table) keeping all the different cells of just one of the columns

pedro88 · March 11, 2019, 7:08am

Hello all,
I need to filter a dataframe with several columns of data, removing duplicate rows according to one of the columns. I did that with pandas. At the same time, I need the last column, whose values are all different (not redundant), to be preserved in the output, like this:

A         B         C         D
a1        b1        c1        d1
a2        b2        c2        d2

output:

A         B         C         D
ad        bd        cd        d1,d2

where ad, bd and cd are the deduplicated values, and column D holds, for each unique row, all of the original D values separated by commas in a single cell.

joeyg · March 11, 2019, 7:51am

You may want to try to explain that again.
I know that I do not see how you get from that example of 3 lines to 2 lines.

pedro88 · March 11, 2019, 8:39am

Basically, I have a tabular file with 4 columns (A, B, C, D) and several rows (1, 2, 3, 4, 5, 6, 7, ...).
The values in column A are redundant, like this:

A                           B        C                  D
apple                  15        aaa           agcacagcagc
apple                  25        bbb         acgacgacgcga
banana               12        cccc        acagcgaagccga
cherry                 36        ddd        actgctgtcgagtag
berry                   55        eee        gactgatgctgtcgtc
banana               36        ffff         cacacgtgtgct

I need output like this:

A                         B              C            D
apple                25           aaa         agcacagcagc;acgacgacgcga
banana            36           cccc       acagcgaagccga;cacacgtgtgct
cherry              36           ddd        actgctgtcgagtag
berry                55            eee        gactgatgctgtcgtc

I don't really mind column C, so whatever it keeps in the output is OK. For column B I keep the highest value. (I managed to do that with pandas, but I'm not able to do the trick on column D.)

thanks

nezabudka · March 11, 2019, 9:35am

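awk '
# Requires GNU awk (gawk) 4.0 or later for arrays of arrays.
($1 in A)       { # key seen before: keep the larger $2 and append $4
                  if($2 > A[$1][2]) A[$1][2] = $2
                  A[$1][4] = A[$1][4] ";" $4
                  next
                }
                { # first occurrence of the key: store every field of the row
                  for(n = split($0, M); n; n--) A[$1][n] = M[n]
                }
END             { # for (i in A) visits the keys in an unspecified order
                  for(i in A) {
                        for(j = 1; j <= NF; j++) printf "%s ",  A[i][j]
                        print ""
                  }
                }' file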

Don_Cragun · March 14, 2019, 5:32pm

Moderator comments were removed during original forum migration.

RavinderSingh13 · March 14, 2019, 11:18pm

Hello pedro88,

Could you please try the following as well? It reads Input_file twice and keeps, for each $1, the complete line with the highest $2. The output is in the same order in which each $1 value first appears in Input_file.

awk '
FNR==NR{
  a[$1]=a[$1]>$2?a[$1]:$2
  b[$1]=a[$1]>$2?b[$1]?b[$1]:$0:$0
  next
}
($1 in a){
  print b[$1]
  delete a[$1]
}
'   Input_file  Input_file

Output will be as follows.

A                           B        C                  D
apple                  25        bbb         acgacgacgcga
banana               36        ffff         cacacgtgtgct
cherry                 36        ddd        actgctgtcgagtag
berry                   55        eee        gactgatgctgtcgtc

Thanks,
R. Singh
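
Since the original question mentions pandas, here is a minimal sketch of the same grouping done there, assuming the column names A, B, C, D and the sample rows from the question. The variable names (raw, df, merged) are only illustrative. It groups on A, keeps the highest B and the first C, and joins all D values of each group with ";". Passing sort=False keeps the groups in order of first appearance.

```python
import io

import pandas as pd

# Sample data copied from the question; columns are whitespace-separated.
raw = """\
A B C D
apple 15 aaa agcacagcagc
apple 25 bbb acgacgacgcga
banana 12 cccc acagcgaagccga
cherry 36 ddd actgctgtcgagtag
berry 55 eee gactgatgctgtcgtc
banana 36 ffff cacacgtgtgct
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

merged = (
    df.groupby("A", sort=False)          # sort=False: keep first-appearance order
      .agg(
          B=("B", "max"),                # highest B per group
          C=("C", "first"),              # any C will do; take the first one
          D=("D", ";".join),             # all D values of the group in one cell
      )
      .reset_index()
)

print(merged.to_string(index=False))
```

For a real file, the io.StringIO(raw) argument would be replaced by the file path, and the separator adjusted to match the file (for example sep="\t" for a tab-separated table).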