Grouping numbers in a text file into prescribed intervals and counting them

I have a text file that contains numbers (listed from the smallest to the largest).
For example:

  34
  817
  1145
  1645
  1759
  1761
  3368
  3529
  4311
  4681
  5187
  5193
  5199
  5417
  5682

  ...
1000000

I need to count how many values (data points in the text file) fall within each interval of 10,000 (10k), i.e. the number of values in the intervals 0 - 9,999, 10,000 - 19,999, 20,000 - 29,999, and so on up to the last 10k interval.

Please let me know the best way to implement it using either shell scripting or awk.

LA

Something like this maybe?

awk '$1>=t*10000{t++} {A[t]++} END{for (i=1;i<=t;i++) print i*10000"\t"A[i]}' infile

Thanks Scrutinizer,
It worked.

LA

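There is a bug here. Whenever there is a gap in the data, i.e. an interval containing zero values, the accounting will be off: t advances by at most one interval per record, so the values that follow a gap are counted in an interval that is too low.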

Example using most of the sample data above and an interval size of 1000 (instead of 10000):

$ cat data
34
817
1145
1645
1759
1761
3368
3529
4311
4681
5187
5193
5199
5417
5682

$ awk '$1>=t*1000{t++} {A[t]++} END{for (i=1;i<=t;i++) print i*1000"\t"A[i]}' data
1000    2
2000    4
3000    1
4000    1
5000    2
6000    5

The output should be:

1000    2
2000    4
3000    0
4000    2
5000    2
6000    5

Regards,
alister
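One way to cross-check the expected counts independently is to compute each value's interval index directly instead of stepping a counter. The following is only a sketch and not one of the commands posted in this thread: the names C, b, max and expected are made up for illustration, the temporary file stands in for the data file, and it assumes non-negative numbers, one per line. The labels follow the thread's convention of printing each interval's upper bound.

```shell
# Scratch copy of the sample data shown above (temporary file, hypothetical name).
data=$(mktemp)
printf '%s\n' 34 817 1145 1645 1759 1761 3368 3529 4311 4681 5187 5193 5199 5417 5682 > "$data"

# Direct binning: int($1/w) is the zero-based interval index, so gaps in the
# data (and even unsorted input) cannot shift the counts.
expected=$(awk -v w=1000 '
    { b = int($1 / w); C[b]++; if (b > max) max = b }
    END { for (i = 0; i <= max; i++) print (i + 1) * w "\t" (C[i] + 0) }
' "$data")
printf '%s\n' "$expected"
rm -f "$data"
```

On the sample data this should report 0 for the 2000 - 2999 interval (the line labelled 3000), in agreement with the corrected output above.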

---------- Post updated at 01:24 PM ---------- Previous update was at 01:15 PM ----------

A bugfix for Scrutinizer's solution:

$ awk '$1>=t*1000{while($1>=++t*1000);} {A[t]++} END{for (i=1;i<=t;i++) print i*1000"\t"(A[i]+0)}' data

A different solution I had been working on for kicks:

$ awk 'function ge() { return $1>=1000*(i+1) } function p(){print NR-1; NR=1; i++} ge(){p(); while(ge())p()} END {NR++; p()}' data
2
4
0
2
2
5

Take care,
alister

Yep, bit of a hasty solution..., well observed... :b:

Hello. My name is Alister and I have a sickness. Sometimes I can't help but revisit inconsequential, properly functioning commands to make them a tiny bit shorter. :slight_smile:

awk '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print i*w"\t"(A[i]+0)}' w=10000 data

As a bonus, the interval's width can be easily modified on the command line.

Cheers,
alister
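As a quick illustration of changing the width on the command line, here is the last command run with w=1000 against the 15 sample values from earlier in the thread. This is a sketch only: the temporary file and the counts variable are placeholders introduced for the example, not part of the original posts.

```shell
# Recreate the sample data from the thread in a scratch file (hypothetical path).
data=$(mktemp)
printf '%s\n' 34 817 1145 1645 1759 1761 3368 3529 4311 4681 5187 5193 5199 5417 5682 > "$data"

# Same one-liner as above; only the w=... operand changes the interval width.
counts=$(awk '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print i*w"\t"(A[i]+0)}' w=1000 "$data")
printf '%s\n' "$counts"
rm -f "$data"
```

Because w=1000 is given as an operand before the file name, awk assigns it before reading the file, so the value is also available in the END block.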
