Awk Versus Cut

jaysunn · December 28, 2009, 7:49pm

Hello ALL,

I want to compare two commands, one using awk and one using cut, that both produce the same result as the command below.

This is purely for speed reasons when checking Apache logs for unique IPs.

Contender #1

awk '{!a[$1]++}END{for(i in a) if ( a[i] >10 ) print a[i],i }' access_log

I need a command similar to the awk command above that performs the same check using cut. It should disregard any IP that has fewer than 10 entries in the Apache access log.

I need to modify the command below to achieve this. My coworker and I have a bet.

Contender #2
FILE=/usr/local/apache/access_log

cut -d ' ' -f 1 "$FILE" | sort | uniq -c

Output of the above awk command:

159070 67.72.16.xxx
14 41.223.30.22
159074 67.72.16.xxx
6586 10.4.20.xxx
6614 67.72.16.xxx

Please let me know,

Jaysunn

Scrutinizer · December 28, 2009, 8:32pm

You mean you need to filter out the entries with fewer than 10 occurrences? For example:

 | egrep -v ' {6}'

-or-

 | grep '[0-9][0-9] '
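To see why both filters work, here is a minimal sketch on made-up data. It assumes the layout GNU uniq -c prints, with the count right-aligned in a 7-character column; other uniq implementations may pad differently, in which case the six-space filter would not apply. The addresses and counts are hypothetical.

```shell
# Hypothetical sample in the layout GNU "uniq -c" prints:
# the count right-aligned in 7 columns, then a blank, then the IP.
sample=$(printf '%7d %s\n' 3 192.0.2.1 14 192.0.2.2 159070 192.0.2.3)

# Filter 1: drop lines containing six blanks in a row,
# which under this layout means a single-digit count.
by_spaces=$(printf '%s\n' "$sample" | grep -Ev ' {6}')

# Filter 2: keep lines where two digits are followed by a blank,
# which means a count of 10 or more.
by_digits=$(printf '%s\n' "$sample" | grep '[0-9][0-9] ')

printf '%s\n' "$by_digits"
```

On this sample both filters keep the same two lines. The second one does not depend on the exact column width, so it is probably the safer of the two.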

jaysunn · December 28, 2009, 9:04pm

Yes,
I am trying to see which command is fastest at searching through a large Apache access log.

I have heard differing opinions on whether cut or awk is better at performing the search and reporting the results. These tests are run from bash on RHEL.

I feel that the awk command is superior. However, I need to confirm that, because the cut command I constructed is missing the part that filters out IPs with fewer than 10 entries in the input file. To be completely honest, I cannot construct a cut command that will achieve this.

Hope I have explained this well enough.

Regards,

Jaysunn

Scrutinizer · December 28, 2009, 9:33pm

Hi jaysunn,

You can stick either of those filters at the end of your cut-sort-uniq sequence, for example:

cut -d ' ' -f 1 "$FILE" | sort | uniq -c | grep '[0-9][0-9] '

And that should give you your output.
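One detail worth checking: the awk contender tests for counts greater than 10, while the grep filter keeps counts of 10 or more, so the two differ for an IP with exactly 10 hits. The sketch below shows both behaviours on a small made-up log. The emit helper, the sample addresses and the stricter regex are illustrative assumptions, not part of the original commands.

```shell
# Hypothetical sample log: 12 hits, exactly 10 hits, and 2 hits.
log=$(mktemp)
emit() {  # emit COUNT IP: write COUNT fake log lines for IP
    n=$1
    while [ "$n" -gt 0 ]; do
        printf '%s - - "GET / HTTP/1.1" 200 512\n' "$2"
        n=$((n - 1))
    done
}
{ emit 12 192.0.2.1; emit 10 192.0.2.2; emit 2 192.0.2.3; } > "$log"

# Filter from this thread: keeps counts of 10 or more.
at_least_10=$(cut -d ' ' -f 1 "$log" | sort | uniq -c | grep '[0-9][0-9] ')

# Stricter variant: keeps counts of 11 or more, matching awk's "> 10".
more_than_10=$(cut -d ' ' -f 1 "$log" | sort | uniq -c |
    grep -E '^ *(1[1-9]|[2-9][0-9]|[0-9]{3,}) ')

printf '%s\n' "$more_than_10"
rm -f "$log"
```

With this sample the first filter keeps two addresses and the stricter one keeps only the address with 12 hits. For a log where no IP has exactly 10 entries the two give the same output.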

matrixmadhan · December 28, 2009, 10:41pm

scrutinizer:

Hi jaysunn,

You can stick either of these filters at the end of your cut-sort-uniq sequence:
cut -d ' ' -f 1 "$FILE" | sort | uniq -c | grep '[0-9][0-9] '
And that should give you your output.

just this,

awk '/[0-9][0-9]/ { print $1 }' $FILE | sort -u

Scrutinizer · December 29, 2009, 3:52am

Hi, the OP is looking for an alternative to awk in order to compare it.

jaysunn · December 29, 2009, 9:36am

Hello Scrutinizer,

In case you were wondering, awk destroyed the competition.

[root@radio10 testing]# ls -lah
total 255M
drwxr-xr-x   2 root root 4.0K Dec 29 09:18 .
drwxr-x---  15 root root 4.0K Dec 29 09:18 ..
-rw-r--r--   1 root root 255M Dec 29 09:15 access_log

[root@server1 testing]# FILE=/root/testing/access_log

[root@server1 testing]# time cut -d ' ' -f 1 "$FILE" | sort | uniq -c | grep '[0-9][0-9] '
 598129 10.4.20.236
 179838 67.72.16.134
    215 67.72.16.140
   7470 67.72.16.184
 414332 67.72.16.186
 884701 67.72.16.187
 880528 67.72.16.195
    379 67.86.131.180
    476 68.195.209.195
    166 68.195.209.198
     38 76.19.14.47

real	2m0.744s
user	2m11.299s
sys	0m0.758s

[root@server1 testing]# time awk '{!a[$1]++}END{for(i in a) if ( a[i] >10 ) print a[i],i }' access_log
880528 67.72.16.195
414332 67.72.16.186
884701 67.72.16.187
215 67.72.16.140
476 68.195.209.195
379 67.86.131.180
179838 67.72.16.134
166 68.195.209.198
38 76.19.14.47
7470 67.72.16.184
598129 10.4.20.236

real	0m2.756s
user	0m2.489s
sys	0m0.277s
[root@server testing]#

Thanks for making this test happen.
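For anyone repeating the test, here is a rough sketch of how the two approaches could be checked against each other before timing them. It assumes a small made-up log and leaves the threshold out to keep the sample short. LC_ALL=C is an assumption on my part: sort is often considerably slower in a multibyte locale, so part of the gap above may come from the locale rather than from cut itself. I have not measured this on the log in this thread.

```shell
# Hypothetical sample log with three distinct addresses.
log=$(mktemp)
printf '%s - - "GET / HTTP/1.1" 200 512\n' \
    192.0.2.1 192.0.2.2 192.0.2.1 192.0.2.3 192.0.2.1 > "$log"

# cut contender, sorted in the C locale; the trailing awk only
# strips the padding that uniq -c adds so the outputs can be compared.
via_cut=$(cut -d ' ' -f 1 "$log" | LC_ALL=C sort | uniq -c |
    awk '{print $1, $2}')

# awk contender; "for (i in a)" returns keys in no fixed order,
# so sort by the IP column before comparing.
via_awk=$(awk '{a[$1]++} END {for (i in a) print a[i], i}' "$log" |
    LC_ALL=C sort -k2)

if [ "$via_cut" = "$via_awk" ]; then echo "outputs match"; fi
rm -f "$log"
```

On the real log, each pipeline can be prefixed with time, as in the runs above, once the outputs are confirmed to match.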

Regards,

Jaysunn