Hi guys
First of all, a salute to this wonderful platform, which has helped me many a time. I'm now faced with something that I cannot solve.
I have data like this
11:14:18 0.46975
11:14:18 0.07558
11:14:18 0.00020
11:14:18 0.00120
11:14:18 0.25879
11:14:19 0.00974
11:14:19 0.05656
11:14:19 0.00030
11:14:19 0.00639
11:14:19 0.01767
11:14:19 0.00215
I need to count the number of times the event happens every second. Here is an example output:
11:14:18 5
11:14:19 6
The first column is a time scale, so I need it to go up to 60 and then increment the minute count, and once minutes reach 60, increment the hour count.
Someone please help me out here :wall:
CarloM
October 28, 2011, 6:20am
2
uniq -c -w8 infile
EDIT: Although you might need to filter the output, come to think of it.
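For instance, uniq -c prints the count first and echoes the whole first line of each group, so (assuming a uniq that has the -w option, as GNU coreutils does) a small awk step can reshape the output into the requested timestamp-count form:

uniq -c -w8 infile | awk '{print $2, $1}'    # "  5 11:14:18 0.46975" -> "11:14:18 5"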
An awk alternative:
awk '{counts[$1]++} END {for (i in counts) {print i, counts[i]}}' infile
getmmg
October 28, 2011, 6:28am
5
perl -lane '$a{$F[0]}++}{foreach (keys %a){print "$_ $a{$_}"}' input
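In case the stray }{ looks like a typo: with -n, perl wraps the code in a while (<>) { ... } loop, so those braces close the loop body early and open a bare block that runs after the loop. A spelled-out equivalent (my expansion, not part of the original post):

perl -lane '$a{$F[0]}++;                          # count occurrences per first field (the timestamp)
            END { print "$_ $a{$_}" for keys %a }' input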
Hi getmmg
The perl script threw the output in random order; is there any way to have it in the same order as the input?
Thanks
Also, could you tell me which of these two you think would be faster? I have to run this on a huge data set.
CarloM
October 28, 2011, 6:48am
7
Time them! Put a subset of the data (500K lines or whatever) in a file and run each command on that with time at the start (e.g. time uniq -c -w8 file > /dev/null).
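Something along these lines, assuming your shell has the time keyword and the file is big enough for a 500K-line slice:

head -n 500000 infile > subset                            # carve off a test slice
time uniq -c -w8 subset > /dev/null                       # discard output so only the counting is timed
time awk '{c[$1]++} END {for (i in c) print i, c[i]}' subset > /dev/null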
fwiw, awk seems to run fastest on my system on a random file - but that probably has the same issue with output order.
$ nawk '{print $1|"sort|uniq -c"}' infile
Statistics - Executed on a file with ~68000 lines of size 1.1 MB
Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
[root@bt /tmp]ls -lrt inputfile
-rw-r--r-- 1 orange orange 1.1M Oct 28 16:56 inputfile
[root@bt /tmp]time uniq -c -w8 inputfile >/dev/null
real 0m0.216s
user 0m0.216s
sys 0m0.000s
[root@bt /tmp]time awk '{counts[$1]++} END {for (i in counts) {print i, counts[i]}}' inputfile >/dev/null
real 0m0.030s
user 0m0.028s
sys 0m0.004s
[root@bt /tmp]time awk '{print $1|"sort|uniq -c"}' inputfile >/dev/null
real 0m1.622s
user 0m1.264s
sys 0m0.312s
[root@bt /tmp]time perl -lane '$a{$F[0]}++}{foreach (keys %a){print "$_ $a{$_}"}' inputfile >/dev/null
real 0m0.198s
user 0m0.196s
sys 0m0.004s
#Winner
[root@bt /tmp]time awk '{counts[$1]++} END {for (i in counts) {print i, counts[i]}}' inputfile >/dev/null
real 0m0.030s
user 0m0.028s
sys 0m0.004s
--ahamed
If you want to preserve the original order without using sort (which is eating your CPU cycles):
awk '!($1 in c) {key[++key[0]]=$1} {c[$1]++} END {for (i=1;i<=key[0];i++) print key[i], c[key[i]]}' myFile
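Spelled out with comments (the same program, just reformatted for reading), key[0] does double duty as the counter of distinct timestamps:

awk '
    !($1 in c) { key[++key[0]] = $1 }    # first sighting of this second: record arrival order
    { c[$1]++ }                          # count every occurrence
    END { for (i = 1; i <= key[0]; i++) print key[i], c[key[i]] }
' myFile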
CarloM
October 28, 2011, 9:13am
11
I was surprised uniq without sort wasn't fastest, to be honest. I always thought it would be more optimised for doing this kind of thing.
Wow!! Thanks guys, thanks to all of you.
ahamed101,
could you take a timing snapshot of my implementation in post #9 - just curious how many more cycles it takes than the other awk solutions.
Thanks!
[root@bt /tmp]time awk '!($1 in c) {key[++key[0]]=$1} {c[$1]++}
END {for (i=1;i<=key[0];i++) print key[i], c[key[i]]}' inputfile >/dev/null
real 0m0.046s
user 0m0.044s
sys 0m0.000s
--ahamed
system
October 28, 2011, 2:13pm
15
IMHO, on such timeframes (0.2s) it's all about the 'loading into memory' step.
perl is much heavier than awk and requires more time to load into memory.
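A quick way to sanity-check that on your own box is to time an empty program in each interpreter, which isolates the startup cost from the actual work:

time awk 'BEGIN { exit }'    # awk startup alone
time perl -e ''              # perl startup alone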