Grep a string and count following lines starting with another string

I have a large dataset with the following structure:

  C 0001 Carbon [C]
  D SAR001 methane [CH3]
  D SAR002 ethane
  D SAR003 propane
  D SAR004 butane
  D SAR005 pentane
  C 0002 Hydrogen [H]
  C 0003 Nitrogen [N]
  C 0004 Oxygen [O]
  D SAR011 ozone
  D SAR012 super oxide
  C 0005 Sulphur 
  D SAR013 Hydrogen Sulphide [H2S]
  D SAR014 Sulphuric acid
  .
  .
  .

In this dataset, lines starting with C are headings and lines starting with D are the components of the preceding heading. I want to count the number of components under each heading and would like output like this:

0001 5
0002 0
0003 0
0004 2
0005 2
.
.
.

The pseudocode could be:

grep ^C
count next lines with ^D
print [$2 of ^C] and [count of ^D]
restart loop

Hi, try:

awk '$1=="C"{i=$2; A[i]=0} $1=="D"{A[i]++} END{for(i in A) print i,A[i]}' file

or

awk '$1=="C"{if(i!="") print i, c; i=$2; c=0} $1=="D"{c++} END{print i, c}' file

The second solution, written more verbosely:

awk '
function pr() {if (notfirst++) print heading,dcnt}
$1=="C" {pr(); heading=$2; dcnt=0}
$1=="D" {dcnt++}
END {pr()}
' file

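A different approach, using the difference in line numbers between consecutive headings (this assumes every line that is not a heading is a D line):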

awk '/^ *C/ {if (L) print NR-L-1;  printf "%s\t", $2; L=NR} END {print NR-L}' file
0001    5
0002    0
0003    0
0004    2
0005    2
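
One caveat about the `for(i in A)` loop in the first one-liner: awk does not guarantee the order in which a for-in loop visits array keys, so the headings may not come out in file order. Below is a minimal sketch that records the order of appearance in a second array. The function name `count_components` and the array names `order` and `count` are my own choices for illustration, not part of the answers above.

```shell
# Hypothetical helper: counts D lines under each C heading, in input order.
count_components() {
  awk '
    $1 == "C"      { id = $2; order[++n] = id; count[id] = 0 }  # new heading
    $1 == "D" && n { count[id]++ }                              # component of the current heading
    END            { for (k = 1; k <= n; k++) print order[k], count[order[k]] }
  ' "$@"
}

# Usage on a shortened copy of the sample data (reads stdin when no file is given)
count_components <<'EOF'
  C 0001 Carbon [C]
  D SAR001 methane [CH3]
  D SAR002 ethane
  C 0002 Hydrogen [H]
  C 0004 Oxygen [O]
  D SAR011 ozone
EOF
```

For this shortened input it should print `0001 2`, `0002 0` and `0004 1`, one per line. Lines that start with neither C nor D are ignored.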