Count of occurrence in particular column of the file.

Hi All,

let's say an input looks like:

C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11
----------------------------------
1|0123452|C501|Z|Z|Z|E|E|E|E|E|E|E
1|0156123|C501|X|X|X|E|E|E|E|E|E|E
1|0178903|C501|Z|Z|Z|E|E|E|E|E|E|E
1|0127896|C501|Z|Z|Z|E|E|E|E|E|E|E
1|0981678|C501|X|X|X|E|E|E|E|E|E|E

Third column having a value like C501. Now, I need to find out when the third column having value like C501,then need to check count of occurrence of value like 'Z' or 'X' or 'Y' value from C4 column to C11 column. Then need to print how many times occurred that particular value in each column.

Output should be like this.

C3,C4,C5,C6,C7,C8,C9,C10,C11
----------------------------------
C501|Z=3|Z=3|Z=3|E=5|E=5|E=5|E=5|E=5|3=5
        X=2 X=2 X=2

Kindly help me how to achieve this result.

A bit strange that the header has comma separators but the data have pipes instead.

---------- Post updated at 10:38 ---------- Previous update was at 10:36 ----------

And, your desired output seems somewhat inconsistent...

Apologies!it was typo error.

And, what is 3=5 in the result?

this is my input

C1|C2|C3|C4|C5|C6|C7|C8|C9|C10|C11
-------------------------------------------
1|0123452|C501|Z|Z|Z|E|E|E|E|E|E|E
1|0156123|C501|X|X|X|E|E|E|E|E|E|E
1|0178903|C501|Z|Z|Z|E|E|E|E|E|E|E
1|0127896|C501|Z|Z|Z|E|E|E|E|E|E|E
1|0981678|C501|X|X|X|E|E|E|E|E|E|E

Expected Output:

C3|C4|C5|C6|C7|C8|C9|C10|C11
----------------------------------
C501|Z=3|Z=3|Z=3|E=5|E=5|E=5|E=5|E=5|3=5
        X=2 X=2 X=2

Try:

awk '
  NR>2 {
    A[$3]
    for(i=4; i<=NF; i++) {
      B[$i]
      T[$3,$i,i]++
    }
  } 
  END {
    for(i in A) 
      for(j in B) {
        s=i
        for(k=4; k<=NF; k++)
          s=s OFS j "=" ((i,j,k) in T ? T[i,j,k] : 0)
        print s
     }
   }
' FS=\| OFS=\| file

Output:

C501|E=0|E=0|E=0|E=5|E=5|E=5|E=5|E=5|E=5|E=5
C501|X=2|X=2|X=2|X=0|X=0|X=0|X=0|X=0|X=0|X=0
C501|Z=3|Z=3|Z=3|Z=0|Z=0|Z=0|Z=0|Z=0|Z=0|Z=0 

--
Note: instead of s=s OFS j "=" ((i,j,k) in T ? T[i,j,k] : 0) one could use s=s OFS j "=" T[i,j,k]+0 but it uses more memory...

1 Like

Thanks Scrutinizer. Have got the expected result. Thanks again for resolve the issue.