Hello all,
I need to filter a dataframe composed of several columns of data to remove the duplicates according to one of the columns. I did it with pandas. In the main time, I need that the last column that contains all different data ( not redundant) is conserved in the output like this:
A B C D
a1 b1 c1 d1
a2 b2 c2 d2
output:
A B C D
ad bd cd d1,d2
where ad bd and cd are the dereplicated output rows and in D we have that for each of the unique rows we have all the data separated by a comma in one single cell for each unique row.
Basically, I have a tabular file with 4 columns (A,B,C,D). and several rows (1,2,3,4,5,6,7,....)
Considering column A the data are redundant (like :
A B C D
apple 15 aaa agcacagcagc
apple 25 bbb acgacgacgcga
banana 12 cccc acagcgaagccga
cherry 36 ddd actgctgtcgagtag
berry 55 eee gactgatgctgtcgtc
banana 36 ffff cacacgtgtgct
I need to output like:
A B C D
apple 25 aaa agcacagcagc;acgacgacgcga
banana 36 cccc acagcgaagccga;cacacgtgtgct
cherry 36 ddd actgctgtcgagtag
berry 55 eee gactgatgctgtcgtc
I don't really mind column C so whatever he keeps in the output it's ok. for column B I keep the higher ( I managed to do it with pandas but i'm not able to do the trick on column D)