How to select only the most frequent instances of a variable string in a file?

I've got a web access log that I want to run through grep (or awk or perl, or whatever will work!) to pull out the most frequent instances of unique IP entries. Meaning the file looks something like this:

I'd like to run a sort or grep (or whatever) that will select only the lines from the IPs that had the most hits, which in this example would've been the 1.1.1.1 and 4.4.4.4 entries.

So something that sorts the entire file numerically, counts the instances of lines that start with the exact same (IP) number, and then outputs the results of only the MOST frequent occurrences. So obviously the matching IP string is going to change each time it's run, based on who is hitting the web server. Is this possible??

#!/bin/sh

for ip in `awk '{print $1}' access_log | sort -u`
do
  ip_count=`grep -c "^$ip " access_log`   # anchor the match so 1.1.1.1 does not also count 11.1.1.1
  echo $ip $ip_count
done | sort -rn -k2 | head -1
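One caveat with the loop above: an unanchored `grep -c $ip` overcounts, because the dots in the IP are regex wildcards and substrings also match. A small demonstration (the sample file name is made up for illustration):

```shell
# Show why unanchored grep overcounts: the dots in "1.1.1.1" match any
# character, and the pattern also matches inside "11.1.1.1".
printf '%s\n' '1.1.1.1 - GET /' '11.1.1.1 - GET /' '1x1.1.1 - GET /' > /tmp/demo_log

grep -c '1.1.1.1' /tmp/demo_log        # unanchored: counts all 3 lines
grep -c '^1\.1\.1\.1 ' /tmp/demo_log   # anchored + escaped dots: counts 1
```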

Thanks for the quick suggestion.......but I can't seem to get that to work. I replaced the access.log file name with my file name. But when I run it, it just hangs with no output. I tried moving the "echo $ip" up higher in the script to be right after the awk (and before the 'do'), but it still wouldn't print out that variable either.

And I can already get the file to sort by IP, since the IP address is the leading entry on every line ('sort -n' works).

So now I just need it to scan the entire log, count the number of entries that start with the same IP number, and print out the lines for let's say the Top-5 IP's that appear the most times in the file (5 highest "hitters" of the webserver). Can you provide any further help or advice?? Please.....??

How big is your access file? If it is very big, note that the number of times the input file gets read is proportional to the number of unique IP addresses, which may take a long time for very large files.

This script will only read the file once:

#!/bin/ksh
typeset -A ACCESS                   # associative array: IP address -> hit count
while read ipaddr dummy; do
  (( ACCESS[$ipaddr]++ ))
done < access_log
for ip in "${!ACCESS[@]}"; do
  echo $ip ${ACCESS[$ip]}
done | sort -rn -k2 | head -10

or the equivalent awk:

awk '{access[$1]++} END { for ( i in access ) print i " " access[i] }' access_log | sort -rn -k2 | head -10

Note that the sort options "-rn -k2" stand for "reverse numerical sort on the second field". The syntax may vary per Unix platform; use "man sort" to find the appropriate options. The head command determines the number of IP addresses listed.

awk '{print $1}' urfile |sort |uniq -c |sort -n
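For instance, against a toy log the pipeline above sorts the counts ascending, so the heaviest hitters come last; `sort -rn | head -5` flips that to a Top-5 list. A sketch (the real file would be `urfile`; a temp file stands in here):

```shell
# Toy demonstration of the count pipeline. Each line starts with an IP,
# as in the access log; 1.1.1.1 appears three times.
printf '%s\n' '1.1.1.1 a' '1.1.1.1 b' '2.2.2.2 c' '1.1.1.1 d' '3.3.3.3 e' > /tmp/urfile

# Extract the IP column, count duplicates, then list the Top-5 by count.
awk '{print $1}' /tmp/urfile | sort | uniq -c | sort -rn | head -5
```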

RDCWAYX....?? Shouldn't there be a closing single-quote somewhere in that awk line??

---------- Post updated at 10:12 AM ---------- Previous update was at 09:42 AM ----------

SCRUTINIZER...?? Your 'awk' line worked well for obtaining the Top-10 "heavy hitter" IP's and listing them out (with counts). Thanks for that.

But instead of just the IP and its number of instances in the log file.........I need to return/save the entire log file entry line for each and every hit. So if 1.1.1.1 has 10 entries in the file, and 2.2.2.2 has 8 entries, instead of output that looks like this:

1.1.1.1 10
2.2.2.2 8

I instead need output that looks like this:

1.1.1.1 - [23/Sep/2009:14:18:41 -0700] "GET /home.do"
1.1.1.1 - [23/Sep/2009:14:18:51 -0700] "GET /home.do/category1"
1.1.1.1 - [23/Sep/2009:14:18:55 -0700] "GET /home.do/category2"
2.2.2.2 - [23/Sep/2009:14:19:31 -0700] "GET /home.do"
2.2.2.2 - [23/Sep/2009:14:19:33 -0700] "GET /home.do/file1"

Etc., etc. I.e., the entire line from the original file, including the date/time stamp, URL, etc., and not just the IP and a summary count.

Can your awk line be easily modified to save all that info....??
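One way to do this (a sketch, not from the thread) is a two-pass approach: pass 1 uses the counting pipeline to find the Top-5 IPs, and pass 2 greps their full lines back out of the log. Sample data stands in for the real access_log here:

```shell
# Sample log entries standing in for the real access_log.
printf '%s\n' \
  '1.1.1.1 - [23/Sep/2009:14:18:41 -0700] "GET /home.do"' \
  '1.1.1.1 - [23/Sep/2009:14:18:51 -0700] "GET /home.do/category1"' \
  '2.2.2.2 - [23/Sep/2009:14:19:31 -0700] "GET /home.do"' > /tmp/access_log

# Pass 1: Top-5 IPs by hit count; pass 2: print every full line for each.
for ip in $(awk '{print $1}' /tmp/access_log | sort | uniq -c | sort -rn | head -5 | awk '{print $2}')
do
  grep "^$ip " /tmp/access_log   # anchored so 1.1.1.1 does not match 11.1.1.1
done
```

The output keeps the heaviest hitter's lines grouped first, then the next IP's lines, and so on, which matches the grouped format asked for above.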

updated.

Your requirements have changed over time; you should reformulate your request: post new sample data and the new required output.

And don't forget to use [code] tags when you post sample data or required output.