Grouping numbers in a text file into prescribed intervals and counting them

Lucky_Ali · January 14, 2010, 12:07pm

I have a text file that contains numbers, listed from smallest to largest.
For example:

.
.
.
..
1000000

I need to count how many values (data points in the text file) fall into each interval of width 10,000 (10k), i.e. how many values lie in 0 - 9,999; 10,000 - 19,999; 20,000 - 29,999; and so on up to the last 10k interval.

Please let me know the best way to implement it using either shell scripting or awk.

LA

Scrutinizer · January 14, 2010, 12:38pm

Something like this maybe?

awk '$1>=t*10000{t++} {A[t]++} END{for (i=1;i<=t;i++) print i*10000"\t"A[i]}' infile

Lucky_Ali · January 14, 2010, 12:49pm

Thanks Scrutinizer,
It worked.

LA

alister · January 14, 2010, 1:24pm

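There is a bug here. Whenever there is a gap in the data, i.e. an interval that contains no values, the accounting will be off: t advances by only one interval per record, so the first value after the gap is counted in the empty interval instead of its own.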

Example using most of the sample data above and an interval size of 1000 (instead of 10000):

$ cat data
34
817
1145
1645
1759
1761
3368
3529
4311
4681
5187
5193
5199
5417
5682

$ awk '$1>=t*1000{t++} {A[t]++} END{for (i=1;i<=t;i++) print i*1000"\t"A[i]}' data
1000    2
2000    4
3000    1
4000    1
5000    2
6000    5

The output should be:

1000    2
2000    4
3000    0
4000    2
5000    2
6000    5

Regards,
alister

---------- Post updated at 01:24 PM ---------- Previous update was at 01:15 PM ----------

A bugfix for Scrutinizer's solution:

$ awk '$1>=t*1000{while($1>=++t*1000);} {A[t]++} END{for (i=1;i<=t;i++) print i*1000"\t"(A[i]+0)}' data
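
For anyone who finds the one-liner dense, here is roughly the same logic laid out over several lines with comments. It is only a sketch: the function name count_sorted is made up for this illustration, the width is passed as the first argument, and it assumes non-negative numbers, one per line, sorted ascending as in the original question.

```shell
#!/bin/sh
# count_sorted WIDTH [FILE...]  (helper name made up for this sketch)
# Assumes one non-negative number per line, sorted ascending.
count_sorted() {
  w=$1; shift
  awk -v w="$w" '
    # Value has reached the current upper edge t*w: keep bumping t
    # until the edge lies above the value, stepping over empty intervals.
    $1 >= t * w { while ($1 >= ++t * w) ; }
    # Count the value in the interval that ends at t*w.
    { A[t]++ }
    END {
      # A[i] + 0 prints 0 rather than an empty string for skipped intervals.
      for (i = 1; i <= t; i++) print i * w "\t" (A[i] + 0)
    }
  ' "$@"
}

# The sample data from the post, interval width 1000.
printf '%s\n' 34 817 1145 1645 1759 1761 3368 3529 4311 4681 \
  5187 5193 5199 5417 5682 | count_sorted 1000
```

With the sample data this should report 0 for the 2000 - 2999 interval (printed on the 3000 line) and 2 for the interval after it.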

A different solution I had been working on for kicks:

$ awk 'function ge() { return $1>=1000*(i+1) } function p(){print NR-1; NR=1; i++} ge(){p(); while(ge())p()} END {NR++; p()}' data
2
4
0
2
2
5

Take care,
alister

Scrutinizer · January 14, 2010, 2:22pm

Yep, a bit of a hasty solution... well observed.

alister · January 14, 2010, 2:41pm

Hello. My name is Alister and I have a sickness. Sometimes I can't help but revisit inconsequential, properly functioning commands to make them a tiny bit shorter.

awk '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print i*w"\t"(A[i]+0)}' w=10000 data

As a bonus, the interval's width can be easily modified on the command line.
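
All of the commands in this thread rely on the input being sorted. If that cannot be guaranteed, a different technique (not the one used above) is to compute each value's interval directly with integer division. The sketch below does that; the function name count_any_order is hypothetical, and it assumes non-negative numbers, one per line. Intervals are labelled by their upper edge, as in the earlier output.

```shell
#!/bin/sh
# count_any_order WIDTH [FILE...]  (helper name made up for this sketch)
# Does not need sorted input. Assumes non-negative numbers, one per line;
# blank lines are skipped.
count_any_order() {
  w=$1; shift
  awk -v w="$w" '
    NF {
      b = int($1 / w)          # 0-based interval index
      A[b]++
      if (b > max) max = b     # remember the highest interval seen
      n++
    }
    END {
      if (!n) exit             # no values read, nothing to print
      # Print every interval up to the highest one, empty ones included.
      for (i = 0; i <= max; i++) print w * (i + 1) "\t" (A[i] + 0)
    }
  ' "$@"
}

# The sample values from earlier in the thread, deliberately shuffled.
values="5417 34 3368 1761 817 5682 1145 4681 5187 1645 3529 5193 1759 4311 5199"
printf '%s\n' $values | count_any_order 1000
printf '%s\n' $values | count_any_order 2000
```

A value that sits exactly on a boundary (say 1000 with a width of 1000) lands in the interval that starts there, which matches how the sorted-input versions treat it.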

Cheers,
alister