Create bins with totals and percentage

I would like to create bins to get histogram with totals and percentage, e.g. starting from 0.

If possible to set the minimum and maximum value in the bins ( in my case value min=0 and max=20 )

Input file

8  5
10 1
11 4
12 4
12 4
13 5
16 7
18 9
16 9
17 7
18 5
19 5
20 1
21 7

output desired

      0 0        0.0%
 0 -  2 0        0.0%
 2 -  4 0        0.0%
 4 -  6 0        0.0%
 6 -  8 0        0.0%
 8 - 10 5        6.8%
10 - 12 5        6.8%
12 - 14 13      17.8%
14 - 16 0        0.0%
16 - 18 23      31.5%
18 - 20 19      26.0%
   > 20 8       11.0%
---------------------
Total: 73

I use this code, it works perfectly but the percentage is missed.

awk 'BEGIN { delta = (delta == "" ? 2 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}' file

Thanks in advance

how's the percentage supposed to be calculated?
percentage of the bin out of the total sum?
then 8-10 5 8.2 is supposed to be 8-10 5 3.5 for your example...
Please elaborate.

How far would this small adaption to your code get you:

awk '
BEGIN   {delta = (delta=="")?2:delta
        }

        {bucketNr = int(($1+delta) / delta)
         cnt[bucketNr] += $2
         TOT           += $2
         numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
        }

END     {for (bucketNr=1; bucketNr<=numBuckets; bucketNr++)     {end = beg + delta
                                                                 printf "%2.0f - %2.0f\t%d\t%4.1f%%\n", beg, end, cnt[bucketNr], cnt[bucketNr] / TOT *100
                                                                 beg = end
                                                                }
        }
' file
 0 -  2    0     0.0%
 2 -  4    0     0.0%
 4 -  6    0     0.0%
 6 -  8    0     0.0%
 8 - 10    5     6.8%
10 - 12    5     6.8%
12 - 14    13    17.8%
14 - 16    0     0.0%
16 - 18    23    31.5%
18 - 20    19    26.0%
20 - 22    8     11.0%
1 Like

Dear vgersh99 , the percentage is calculated based in the total sum .

For your "max bin" question, try

awk -v"MAX=20" '
BEGIN   {delta = (delta=="")?2:delta
         MXBCK = (MAX / delta) + 1
        }

        {bucketNr = int(($1+delta) / delta)
         if (bucketNr > MXBCK) bucketNr = MXBCK
         cnt[bucketNr] += $2
         TOT           += $2
         numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
        }

END     {for (bucketNr=1; bucketNr<numBuckets; bucketNr++)      {end = beg + delta
                                                                 printf "%2.0f - %2.0f\t%d\t%4.1f%%\n", beg, end, cnt[bucketNr], cnt[bucketNr] / TOT *100
                                                                 beg = end
                                                                }
         printf "   > %2.0f\t%d\t%4.1f%%\n", MAX, cnt[MXBCK], cnt[MXBCK] / TOT *100
         print "---------------------"
         print "Total:", TOT
        }
' file
 0 -  2    0     0.0%
 2 -  4    0     0.0%
 4 -  6    0     0.0%
 6 -  8    0     0.0%
 8 - 10    5     6.8%
10 - 12    5     6.8%
12 - 14    13    17.8%
14 - 16    0     0.0%
16 - 18    23    31.5%
18 - 20    19    26.0%
   > 20    8     11.0%
---------------------
Total: 73
1 Like

same as above but total included:

awk 'BEGIN { delta = (delta == "" ? 2 : delta) ; max=20 }
{
if ($1 > max) {
maxf++;
maxc+=$2;
} else {
bucketNr = int(($1+delta) / delta)
cnt[bucketNr]++
cntc[bucketNr]+=$2;
numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
total+=$2;
}
END {
for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
end = beg + delta
printf "%2d-%2d %3d %.2f%\n", beg, end, cntc[bucketNr], (cntc[bucketNr]*100.0)/total;
beg = end
}
if (maxf) printf ">%2d %2d %.2f%\n", max, maxc, (maxc*100.0)/total;
print "-------------";
print "Total: " total;
}' file

The code posted above needs some work. Lucky that $0+delta works somehow. The code posted does not output the columns desired shown, etc.

1 Like

RudiC/rdrtx1,
the binning, the percentage and the totals are somewhat different from the OP's desired output in post #1.
Hence post #2 with the ask to elaborate.

2 Likes

Dear RudiC and rdrtx1

Codes are amazing, thanks a lot

Please a small request more.

It is possible to set the minimum too. As example minimum = 0.

The the results will show something like this.

      0    0     0.0%
  0 - 2    0     0.0%
  2 - 4    0     0.0%
  4 - 6    0     0.0%
 6 -  8    0     0.0%
 8 - 10    5     6.8%
10 - 12    5     6.8%
12 - 14   13    17.8%
14 - 16    0     0.0%
16 - 18   23    31.5%
18 - 20   19    26.0%
   > 20    8    11.0%
--------------------
Total:    73

Dear rdrtx1 , I have updated my post with correct values and requirements.. Thanks a lot for the code.

--- Post updated at 12:51 AM ---

Dear vgersh99, I have updated my output file as it was wrong values. sorry for the confusion..

I think your small request more can easily be satisfied along the lines shown by rdrtx1 and me...

1 Like

Dear RudiC.. Thanks a lot code is great.. I will post a new one similar