Create range of bins counting values and add the percentage

Objective, create a table of range of bins to count the values for each range in column 2.

The purpose is to count at beginning all values in column 2 which contends value =0 or < 0.01
Then create the range of binnes counting values in column 2 > 0.01

The range of bins need to be extended to the MAX value in my case 20, even if these values are not found in the column.

Input file

121 0.98
121 0.98
121 1.47
121 1.47
121 0.00
121 0.00
121 0.00
121 0.10
121    2
121    3
121    7
121    0
121    0
121   10
121   12
121   15

Desired output file

            0    5    31.3%
 0.01 -  2.00    6    37.5%
 2.01 -  4.00    1     6.3%
 4.01 -  6.00    0     0.0%
 6.01 -  8.00    1     6.3%
 8.01 - 10.00    0     0.0%
10.01 - 12.00    2    12.5%
12.01 - 14.00    0     0.0%
14.01 - 16.00    1     6.3%
16.01 - 18.00    0     0.0%
18.01 - 20.00    0     0.0%
      > 20       0     0.0%
----------------------------
      Total:    16

Code Use
This code works well, but dont count values with exactly 0, and dont force the range of bins to the maximun value.

awk -v"MIN=0" -v"MAX=20" '
BEGIN   {delta = (delta=="")?2:delta
         MXBCK = (MAX / delta) + 1
        }

        {bucketNr = int(($2+delta) / delta)
         if (bucketNr > MXBCK) bucketNr = MXBCK
         cnt[bucketNr] ++
         TOT           ++ $2
         numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
        }

END     {
for (bucketNr=1; bucketNr<numBuckets; bucketNr++)      {end = beg + delta
         printf " %05.2f - %05.2f %6d %6.2f%%\n", beg, end, cnt[bucketNr], cnt[bucketNr] / TOT *100
         beg = end
         }
         printf "       > %5.2f %6d %6.2f%%\n", MAX, cnt[MXBCK], cnt[MXBCK] / TOT *100
         print "----------------------------"
         print "        Total:", TOT
        }
' file

Thanks in advance

Hi, What does this word mean (you used it many times):

What do you mean by this word, exactly?

Do you mean "columns" (in an ASCII table)?

Thanks.

1 Like

Just run the for loop up to MXBCK in lieu of numBuckets (whose final value you ignore, BTW, running the loop with < instead of <= )

Where did your request for a MIN value disappear that you posted in your recent thread?

1 Like

I get the maximun like this

NBIR=20 #NUMBER OF RANGES ON BINS

awk -v"MIN=0" -v"MAX=20" -v"NRB=$NBIR" '
BEGIN   {delta = (delta=="")?1:delta
         MXBCK = (MAX / delta) + 1
        }

        {bucketNr = int(($8+delta) / delta)
         if (bucketNr > MXBCK) bucketNr = MXBCK
         cnt[bucketNr] ++
         TOT           ++ $2
         numBuckets = NRB +1

Thanks a lot

Hi RudiC,

Kindly can you help with the desired output. still I am trying to set the min at beginning.

Please check how i modify the code to get the desired output , but still i got troubles here :

00.01 - 02.00      5  31.25%

I want to start the range of bins on 0.01 not exactly in 0.

the code I modify.

NBIR=10 #NUMBER OF RANGES ON BINS
awk -v"MIN=0" -v"MAX=20" -v"NRB=$NBIR" '
BEGIN   {delta = (delta=="")?2:delta
         MXBCK = (MAX / delta) + 1
         MIBCK = (MIN / delta) + 1
        }

        {bucketNr = (($2+delta)/delta)
         if (bucketNr > MXBCK) bucketNr = MXBCK
         cnt[bucketNr] ++
         TOT           ++ $2
         numBuckets = NRB +1
        }
END     {
         printf "       < %5.2f %6d %6.2f%%\n", MIN, cnt[MIBCK], cnt[MIBCK] / TOT *100 
for (bucketNr=1; bucketNr<=numBuckets; bucketNr++)      {end = beg + delta 
         printf " %05.2f - %05.2f %6d %6.2f%%\n", beg+0.01, end, cnt[bucketNr], cnt[bucketNr] / TOT *100
         beg = end
         }
         printf "       > %5.2f %6d %6.2f%%\n", MAX, cnt[MXBCK], cnt[MXBCK] / TOT *100
         print "-----------------------------"
         print "        Total:", TOT
         print dashes
        }
' file

output needed

           < 0    5    31.3%
 0.01 -  2.00    6    37.5%
 2.01 -  4.00    1     6.3%
 4.01 -  6.00    0     0.0%
 6.01 -  8.00    1     6.3%
 8.01 - 10.00    0     0.0%
10.01 - 12.00    2    12.5%
12.01 - 14.00    0     0.0%
14.01 - 16.00    1     6.3%
16.01 - 18.00    0     0.0%
18.01 - 20.00    0     0.0%
      > 20       0     0.0%
----------------------------
      Total:    16

What if a value falls into the gap between e.g. 0 and 0.01? This is not covered here. Make sure none of your values fall into those gaps. Try this

$ awk -v"MIN=0" -v"MAX=20" '
BEGIN   {delta = (delta=="")?2:delta
         MNBCK = (MIN / delta)
         MXBCK = (MAX / delta) + 1
        }

        {if ($2 <= MIN) bucketNr = MNBCK
           else if ($2 >  MAX) bucketNr = MXBCK
                  else bucketNr = int(($2+delta) / delta)
         cnt[bucketNr]++
         TOT++
        }

END     {printf "      <= %5.2f %6d %6.2f%%\n", MIN, cnt[MNBCK], cnt[MNBCK] / TOT *100
         beg = MIN
         for (bucketNr=MNBCK+1; bucketNr<MXBCK; bucketNr++)     {end = beg + delta
                                                                 printf " %5.2f - %5.2f %6d %6.2f%%\n", beg+.01, end, cnt[bucketNr], cnt[bucketNr] / TOT *100
                                                                 beg = end
                                                                }
         printf "       > %5.2f %6d %6.2f%%\n", MAX, cnt[MXBCK], cnt[MXBCK] / TOT *100
         print "----------------------------"
         print "   Total:", TOT
        }
' file

It may need some polishing, and it may not be the optimal approach. YMMV.

1 Like

Hi RudiC,,,

Tks a lot for the update it works fine...