Hi guys
First of all, a salute to this wonderful platform, which has helped me many a time. I'm now faced with something that I cannot solve.
I have data like this
11:14:18 0.46975
11:14:18 0.07558
11:14:18 0.00020
11:14:18 0.00120
11:14:18 0.25879
11:14:19 0.00974
11:14:19 0.05656
11:14:19 0.00030
11:14:19 0.00639
11:14:19 0.01767
11:14:19 0.00215
I need to count the number of times the event happens every second. Here is an example output:
11:14:18 5
11:14:19 6
The first column is a time scale, so I need it to go up to 60 and then increment the minute count, and once minutes reach 60, increment the hour count.
Someone please help me out here :wall:
CarloM
October 28, 2011, 6:20am
2
uniq -c -w8 infile
EDIT: Although you might need to filter the output, come to think of it.
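For instance, uniq -c prints the count first and echoes the whole first line of each group, so (assuming a uniq that has the -w option, as GNU coreutils does) a small awk step can reshape the output into the requested timestamp-count form:

uniq -c -w8 infile | awk '{print $2, $1}'    # "  5 11:14:18 0.46975" -> "11:14:18 5"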
An awk alternative:
awk '{counts[$1]++} END {for (i in counts) {print i, counts[i]}}' infile
getmmg
October 28, 2011, 6:28am
5
perl -lane '$a{$F[0]}++}{foreach (keys %a){print "$_ $a{$_}"}' input
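In case the stray }{ looks like a typo: with -n, perl wraps the code in a while (<>) { ... } loop, so those braces close the loop body early and open a bare block that runs after the loop. A spelled-out equivalent (my expansion, not part of the original post):

perl -lane '$a{$F[0]}++;                          # count occurrences per first field (the timestamp)
            END { print "$_ $a{$_}" for keys %a }' input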
Hi getmmg
The perl script threw the output in random order; is there any way to have it in the same order as the input?
Thanks
Also, could you tell me which of these two you think would be faster? I have to run this on a huge data set.
CarloM
October 28, 2011, 6:48am
7
Time them! Put a subset of the data (500K lines or whatever) in a file and run each command on that with time at the start (e.g. time uniq -c -w8 file > /dev/null).
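Something along these lines, assuming your shell has the time keyword and the file is big enough for a 500K-line slice:

head -n 500000 infile > subset                            # carve off a test slice
time uniq -c -w8 subset > /dev/null                       # discard output so only the counting is timed
time awk '{c[$1]++} END {for (i in c) print i, c[i]}' subset > /dev/null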
fwiw, awk seems to run fastest on my system on a random file - but that probably has the same issue with output order.
$ nawk '{print $1|"sort|uniq -c"}' infile
Statistics - Executed on a file with ~68000 lines of size 1.1 MB
Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
[root@bt /tmp]ls -lrt inputfile
-rw-r--r-- 1 orange orange 1.1M Oct 28 16:56 inputfile
[root@bt /tmp]time uniq -c -w8 inputfile >/dev/null
real 0m0.216s
user 0m0.216s
sys 0m0.000s
[root@bt /tmp]time awk '{counts[$1]++} END {for (i in counts) {print i, counts[i]}}' inputfile >/dev/null
real 0m0.030s
user 0m0.028s
sys 0m0.004s
[root@bt /tmp]time awk '{print $1|"sort|uniq -c"}' inputfile >/dev/null
real 0m1.622s
user 0m1.264s
sys 0m0.312s
[root@bt /tmp]time perl -lane '$a{$F[0]}++}{foreach (keys %a){print "$_ $a{$_}"}' inputfile >/dev/null
real 0m0.198s
user 0m0.196s
sys 0m0.004s
#Winner
[root@bt /tmp]time awk '{counts[$1]++} END {for (i in counts) {print i, counts[i]}}' inputfile >/dev/null
real 0m0.030s
user 0m0.028s
sys 0m0.004s
--ahamed
If you want to preserve the original order without using sort (which is eating your CPU cycles):
awk '!($1 in c) {key[++key[0]]=$1} {c[$1]++} END {for (i=1;i<=key[0];i++) print key[i], c[key[i]]}' myFile
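Spelled out with comments (the same program, just reformatted for reading), key[0] does double duty as the counter of distinct timestamps:

awk '
    !($1 in c) { key[++key[0]] = $1 }    # first sighting of this second: record arrival order
    { c[$1]++ }                          # count every occurrence
    END { for (i = 1; i <= key[0]; i++) print key[i], c[key[i]] }
' myFile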
CarloM
October 28, 2011, 9:13am
11
I was surprised uniq without sort wasn't fastest, to be honest. I always thought it would be more optimised for doing this kind of thing.
Wow!! Thanks guys, thanks to all of you.
ahamed101,
could you take a timing snapshot of my implementation in post #9 - just curious how many more cycles it takes than the other awk solutions.
Thanks!
[root@bt /tmp]time awk '!($1 in c) {key[++key[0]]=$1} {c[$1]++}
END {for (i=1;i<=key[0];i++) print key[i], c[key[i]]}' inputfile >/dev/null
real 0m0.046s
user 0m0.044s
sys 0m0.000s
--ahamed
system
October 28, 2011, 2:13pm
15
IMHO, on such timeframes (0.2s) it's all about the 'loading into memory' step.
perl is much heavier than awk and requires more time to load into memory.
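A quick way to sanity-check that on your own box is to time an empty program in each interpreter, which isolates the startup cost from the actual work:

time awk 'BEGIN { exit }'    # awk startup alone
time perl -e ''              # perl startup alone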