I have a text file that contains numbers, listed from smallest to largest.
For example:
34
817
1145
1645
1759
1761
3368
3529
4311
4681
5187
5193
5199
5417
5682
.
.
.
.
1000000
I need to count how many values (data points) the file contains in every 10,000-wide interval (10k): i.e., the number of values in 0 - 9,999; 10,000 - 19,999; 20,000 - 29,999; ... and so on through the last 10K interval.
Please let me know the best way to implement it using either shell scripting or awk.
LA
Something like this maybe?
awk '$1>=t*10000{t++} {A[t]++} END{for (i=1;i<=t;i++) print i*10000"\t"A[i]}' infile
Thanks Scrutinizer,
It worked.
LA
There is a bug here. Whenever there is a gap in the data, wherein there are zero values in an interval, the accounting will be off.
Example using most of the sample data above and an interval size of 1000 (instead of 10000):
$ cat data
34
817
1145
1645
1759
1761
3368
3529
4311
4681
5187
5193
5199
5417
5682
$ awk '$1>=t*1000{t++} {A[t]++} END{for (i=1;i<=t;i++) print i*1000"\t"A[i]}' data
1000 2
2000 4
3000 1
4000 1
5000 2
6000 5
The output should be:
1000 2
2000 4
3000 0
4000 2
5000 2
6000 5
Regards,
alister
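An alternative that sidesteps the gap problem entirely is to index buckets directly with int($1/w) rather than advancing a counter past each value: an empty interval's array entry is simply never set, and the usual (A[i]+0) trick prints it as zero. A minimal sketch, assuming the sample data above is written to a scratch file named data and using w=1000 as in the example:

```shell
# Recreate the sample data from above in a scratch file.
cat > data <<'EOF'
34
817
1145
1645
1759
1761
3368
3529
4311
4681
5187
5193
5199
5417
5682
EOF

# Bucket each value by integer division on the interval width; gaps
# stay at zero because unset array entries evaluate to 0 via (A[i]+0).
awk -v w=1000 '
    { b = int($1/w); A[b]++; if (b > m) m = b }
    END { for (i = 0; i <= m; i++) print (i+1)*w "\t" (A[i]+0) }
' data
# prints:
# 1000    2
# 2000    4
# 3000    0
# 4000    2
# 5000    2
# 6000    5
```

Note this approach does not require the input to be sorted, since each value lands in its bucket independently.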
A bugfix for Scrutinizer's solution:
$ awk '$1>=t*1000{while($1>=++t*1000);} {A[t]++} END{for (i=1;i<=t;i++) print i*1000"\t"(A[i]+0)}' data
A different solution I had been working on for kicks:
$ awk 'function ge() { return $1>=1000*(i+1) } function p(){print NR-1; NR=1; i++} ge(){p(); while(ge())p()} END {NR++; p()}' data
2
4
0
2
2
5
Take care,
alister
Yep, bit of a hasty solution..., well observed...
Hello. My name is Alister and I have a sickness. Sometimes I can't help but revisit inconsequential, properly functioning commands to make them a tiny bit shorter.
awk '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print i*w"\t"(A[i]+0)}' w=10000 data
As a bonus, the interval's width can be easily modified on the command line.
Cheers,
alister
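For reference, running the shortened one-liner against the sample data with w=1000 (the data file name and width here just follow the example earlier in the thread) reproduces the corrected per-interval counts:

```shell
# Recreate the sample data from the thread in a scratch file.
cat > data <<'EOF'
34
817
1145
1645
1759
1761
3368
3529
4311
4681
5187
5193
5199
5417
5682
EOF

# Same logic as above; the interval width is set on the command line,
# so switching to 10k intervals is just w=10000.
awk '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print i*w"\t"(A[i]+0)}' w=1000 data
# prints:
# 1000    2
# 2000    4
# 3000    0
# 4000    2
# 5000    2
# 6000    5
```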