Merging rows based on same ID in First column.

Hellow,

I have a tab-delimited file with 3 columns :

BINPACKER.13259.1.p2    SSF48239    
BINPACKER.13259.1.p2    PF13243    
BINPACKER.13259.1.p2    G3DSA:1.50.10.20
BINPACKER.13259.2.p2    SSF48239    
BINPACKER.13259.2.p2    PF13243    
BINPACKER.13259.2.p2    G3DSA:1.50.10.20
BINPACKER.31705.4.p1    PF00176    GO:0005524
BINPACKER.31705.4.p1    SM00490    
BINPACKER.31705.4.p1    SSF52540    
BINPACKER.31705.4.p1    G3DSA:3.40.50.300
BINPACKER.31705.4.p1    mobidb-lite
BINPACKER.31705.4.p1    SM00487    
BINPACKER.31705.4.p1    PS51194    
BINPACKER.31705.4.p1    cd00079    
BINPACKER.31705.4.p1    PF00271    
BINPACKER.31705.4.p1    PS51192    
BINPACKER.31705.4.p1    cd00046
BINPACKER.31705.4.p1    G3DSA:3.40.50.10810    
BINPACKER.31705.4.p1    SSF52540    
BINPACKER.9719.7.p1    PF00443    GO:0016579|GO:0036459
BINPACKER.9719.7.p1    SSF57850
BINPACKER.9719.7.p1    PS50235    
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    PF02148    GO:0008270
BINPACKER.9719.7.p1    SSF54001    
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    cd02669    GO:0000245|GO:0006397
BINPACKER.9719.7.p1    PS50271    GO:0008270
BINPACKER.9719.7.p1    SM00290    GO:0008270
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    G3DSA:3.30.40.10    
BINPACKER.9719.7.p1    G3DSA:3.90.70.10
BINPACKER.937.4.p1    PS51032    GO:0003700|GO:0006355
BINPACKER.937.4.p1    PIRSF038123    GO:0003700
BINPACKER.937.4.p1    cd00018    GO:0003700|GO:0006355
BINPACKER.937.4.p1    SSF54171    GO:0003677
BINPACKER.937.4.p1    G3DSA:3.30.730.10    GO:0003700|GO:0006355
BINPACKER.937.4.p1    PR00367    GO:0003700|GO:0006355

I want to mege the rows based on first column with same ID. In column 2, I want only ID starting with PF and in 3rd column, want to concatenate all GO term seperated with comma. in each case there should be no duplicate eg:

BINPACKER.13259.1.p2    PF13243    NA
BINPACKER.13259.2.p2  PF13243                    NA
 BINPACKER.31705.4.p1    PF00176,PF00271    GO:0005524
BINPACKER.9719.7.p1    PF00443,PF02148    GO:0016579,GO:0036459,GO:0008270,GO:0000245,GO:0006397
BINPACKER.937.4.p1    NA    GO:0003700,GO:0006355,GO:0003677

Thankyou

Great and thanks for posting.

Please show the code you have written so far and share your platform details.

Thanks.

According to requirements PF02148 should be in column 2 of the output shown (line 3). Also, there should be in the output: BINPACKER.13259.2.p2 PF13243 NA

Thankyou, you are right, i have edited my outpur.

awk -F'\t' '
{column_one[$1]=$1; gsub(" *[,|] *", ",");
 if ($2 ~ /^PF/) {
    if (! length(pf_string[$1,$2])) out_pf_string[$1]=out_pf_string[$1] $2 ",";
    pf_string[$1,$2]=$2;
 }
 c=split($3, column_three, " *, *");
 for (i=1; i<=c; i++) {
     if (column_three ~ /^GO/) {
        if (! length(go_string[$1,column_three])) out_go_string[$1]=out_go_string[$1] column_three ",";
        go_string[$1,column_three]=column_three;
     }
 }
}
END {
   for (i in column_one) {
      sub(",*$", "", out_pf_string);
      sub(",*$", "", out_go_string);
      out_pf_string=(length(out_pf_string)) ? out_pf_string : "NA";
      out_go_string=(length(out_go_string)) ? out_go_string : "NA";
      print i, out_pf_string, out_go_string;
   }
}' OFS='\t' infile
1 Like

code:

perl -lane'
  ($n, $p) =@F;
  $s{$n}++ or push @r, $n;
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ",\t";
    print "$_\t@{$h{$_}}" for @r;
  }
'

It is concatenating all values and separate them with comma, but i don't know how to remove duplicate entries and also need to replace NA in columns having no entries as depicted in my desirable output.

Try also

awk -F"[|       ]" '
$2 ~ /^PF/ &&
PF[$1] !~ $2    {PF[$1] = PF[$1] DLP[$1] $2
                 DLP[$1] = "," 
                }
NF > 2          {for (i=3; i<=NF; i++) if (GO[$1] !~ $i)        {GO[$1] = GO[$1] DLG[$1] $i
                                                                 DLG[$1] = ","
                                                                }
                }
END             {for (p in PF)  {print p, PF[p], GO[p]?GO[p]:"NA"
                                 delete GO[p]
                                }
                 for (g in GO)   print g, "NA", GO[g]
                }
' OFS="\t" file
BINPACKER.13259.2.p2    PF13243    NA
BINPACKER.31705.4.p1    PF00176,PF00271    GO:0005524
BINPACKER.9719.7.p1     PF00443,PF02148    GO:0016579,GO:0036459,GO:0008270,GO:0000245,GO:0006397
BINPACKER.13259.1.p2    PF13243    NA
BINPACKER.937.4.p1      NA       GO:0003700,GO:0006355,GO:0003677

Thankyou.

--- Post updated at 06:33 PM ---

--- Post updated at 06:35 PM ---

Thankyou so much for your help, it worked.