Count and merge using common column

empyrean · October 31, 2012, 6:14pm

I have the following records from multiple files.

415     A       G
415     A       G
415     A       T
415     A       .
415     A       .
421     G       A
421     G       A,C
421     G       A
421     G       A
421     G       A,C
421     G       .
427     A       C
427     A       C
427     A       .
427     A       .

1) i wanted to remove the columns which have "." in third column
2) count the columns and merge based on first column

I want output like this

 
      3  2,1    415     A       G/T
      5  3,2    421     G       A/A,C
      2         427     A       C

first column "3 2,1 415 A G/T"

3 - how many times 415 is repeated
2,1 - if i count uniq it is giving two times of " 415 A G" and one time "415 A T" pattern. so i wanted to merge this and get final as " 3 2,1 415 G/T"

I used this command to count unique but unable to merge and combine the columns

cat file | awk '$3 ~/A|T|G|C/{print $0}'| sort | uniq -c

By using above code i am getting the following output

      2 415     A       G
      1 415     A       T
      3 421     G       A
      2 421     G       A,C
      2 427     A       C

Chubler_XL · October 31, 2012, 7:40pm

Think this does what you want

awk '
$3!="." {
    key=$1"\t"$2
    if(!(key in I)) D[++keys]=key
        if(!((key SUBSEP $3) in M)) {
            I[key]++
            K[key,I[key]]=$3
        }
    M[key,$3]++
    S[key]++
}
END {
    for(i=1;i<=keys;i++) {
            key=D
        printf "%d\t", S[key]
        if(I[key]>1)
            for(j=1;j<=I[key];j++)
               printf "%s%d",(j>1?",":""),M[key,K[key,j]]
        printf "\t%s\t", key
        for(j=1;j<=I[key];j++)
               printf "%s%s",(j>1?"/":""),K[key,j]
        printf "\n"
    }
}' infile

pamu · November 1, 2012, 3:12am

I assume if second column is count of occurrences. then third row should contain 2 in second column.

This is lit bit nasty code..

awk 'function prnt() {
    split(A,ARR);
    n=split(ARR[3],ARR_2,"/");
    if(n>1){for(i=1;i<=n;i++){
    var=var?var","UN[ARR[1],ARR[2],ARR_2]:UN[ARR[1],ARR[2],ARR_2]}}
    print CN,var,A;var=""
    }
    {if($3 != "."){CN[$1]++;if(!UN[$1,$2,$3]++){A[$1]=A[$1]?A[$1]"/"$3:$0;}
    if(S != $1 && S){prnt()}
    {P=$0;S=$1}}}END{prnt()}' OFS="\t" file

Chubler_XL · November 1, 2012, 7:53pm

@pamu your solution requires that input file is sorted, the solution I posted didn't assume this, mostly because empyrean's posted code required sort:

cat file | awk '$3 ~/A|T|G|C/{print $0}'| sort | uniq -c