faster way to loop?

tads98 · May 9, 2006, 6:46am

Sample Log file

IP.address Date&TimeStamp GET/POST URL ETC
123.45.67.89 MMDDYYYYHHMM GET myURL http://ABC.com
123.45.67.90 MMDDYYYYHHMM GET myURL http://XYZ.com

I have a very large web server log file (about 1.3GB) that contains entries like the ones above. I need to get the last entry for each distinct IP that has myURL in it. Is there a quick way of looping? My idea was:

# Get all the unique IP addresses and then proceed to check each
awk '{print $1}' weblog | sort -u > ip.list

for i in `cat ip.list`
do
# keep only the last myURL line for this IP
grep "^$i " weblog | grep myURL | tail -1
done > lastpages.lis

Each day has around 3000+ unique IP entries and a day's log is about 48MB. With this process, it takes around 30 mins to process a day's worth of data. Is there a faster way to do this?

Perderabo · May 9, 2006, 11:24am

This requires ksh and a lot of memory, but if it runs, it will be rather fast.

#! /usr/bin/ksh

exec < weblog
IFS=""
while read line ; do
        ip=${line%% *}
        octet4=${ip##*.}
        ip=${ip%.$octet4}
        octet3=${ip##*.}
        ip=${ip%.$octet3}
        octet2=${ip##*.}
        octet1=${ip%.$octet2}
        ip=${octet1}_${octet2}_${octet3}_${octet4}
        var=array_$ip
        eval $var=\$line
done
IFS="\="
set | while read  variable value ; do
        if [[ $variable = array_+([0-9])_+([0-9])_+([0-9])_+([0-9]) ]] ; then
                echo "$value"
        fi
done
exit 0

jim_mcnamara · May 9, 2006, 9:03pm

Unless I misunderstand, you want the last entry for each distinct ip, and since it is a log file it is already in date order with the last entry for an ip=last time it appears. Correct? try:

awk '{arr[$1]=$0 }
        END{for (i in arr )
                  print arr[i] } '  myweblog > somefile
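
To see why this works, here is a small sketch on made-up data. The sample lines, the temporary file and the variable name last_per_ip are assumptions for illustration only, not part of the original log.

```shell
# Hypothetical three-line log in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.log" <<'EOF'
123.45.67.89 050920060101 GET myURL http://ABC.com
123.45.67.90 050920060102 GET myURL http://XYZ.com
123.45.67.89 050920060103 GET otherURL http://DEF.com
EOF

# arr is keyed by field 1 (the IP). Every new line for an IP overwrites
# the line stored earlier, so once the whole file has been read only the
# newest line per IP is left. END runs once after the last input line and
# prints each stored line. The order of for-in is unspecified in awk,
# hence the sort to make the output stable.
last_per_ip=$(awk '{ arr[$1] = $0 }
    END { for (i in arr) print arr[i] }' "$tmp/sample.log" | sort)

printf '%s\n' "$last_per_ip"
rm -r "$tmp"
```

The file is read once, and memory use grows with the number of distinct IPs rather than with the size of the log.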

tads98 · May 10, 2006, 3:03am

thanks! it worked!

Here's a followup question...
I have a file that contains around 1500+ IPs, and I want to get the last 5 entries for each of these IPs from the huge web log. How can I modify the script to get only the last 5 entries of a specific IP address?

thanks for your help!
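
One possible approach to this follow-up, as a sketch only: read the IP list first, then keep a small ring buffer of the last n lines per wanted IP while scanning the log once. The file names ip.list and sample.log, the sample lines, and the variable names (n, want, cnt, buf, last_n) are assumptions for illustration. The demo uses n=2 to keep the data short; n=5 would match the question.

```shell
tmp=$(mktemp -d)
# Hypothetical list of IPs of interest, one per line.
printf '%s\n' 10.0.0.1 > "$tmp/ip.list"
# Hypothetical log in the same column layout as the sample above.
cat > "$tmp/sample.log" <<'EOF'
10.0.0.1 050920060101 GET myURL http://ABC.com
10.0.0.2 050920060102 GET myURL http://XYZ.com
10.0.0.1 050920060103 GET myURL http://DEF.com
10.0.0.1 050920060104 GET myURL http://GHI.com
10.0.0.3 050920060105 GET myURL http://JKL.com
EOF

last_n=$(awk -v n=2 '
    NR == FNR { want[$1] = 1; next }   # first file: remember wanted IPs
    $1 in want {
        cnt[$1]++                      # how many lines seen for this IP
        buf[$1, cnt[$1] % n] = $0      # overwrite the oldest of n slots
    }
    END {
        for (ip in cnt) {
            # first entry still held in the buffer for this IP
            start = (cnt[ip] > n) ? cnt[ip] - n + 1 : 1
            for (k = start; k <= cnt[ip]; k++)
                print buf[ip, k % n]   # oldest to newest
        }
    }' "$tmp/ip.list" "$tmp/sample.log")

printf '%s\n' "$last_n"
rm -r "$tmp"
```

Only n lines are stored per wanted IP, so memory stays small even for a very large log. Note that the NR == FNR test assumes the IP list is not empty.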

sumitpandya · May 10, 2006, 3:31am

If your original solution is working, then I'm proposing an optimization which should reduce your time by a third.

# Get all the unique IP addresses and then proceed to check each
awk '{print $1}' weblog | sort -u > ip.list

while read i
do
# one grep per IP: the line must start with the IP and contain myURL
grep "^$i .*myURL" weblog | tail -1
done < ip.list > lastpages.lis

tads98 · May 10, 2006, 6:09am

jim mcnamara:

Unless I misunderstand, you want the last entry for each distinct ip, and since it is a log file it is already in date order with the last entry for an ip=last time it appears. Correct? try:
awk '{arr[$1]=$0 }
   END{for (i in arr )
   print arr[i] } '  myweblog > somefile

Thanks! Could you kindly explain how this works? This gets the last entry for each IP. Is there a way to include a grep in this? I want to get the last entry containing myURL for each IP. Thanks!

sumitpandya · May 10, 2006, 6:53am

Dear tads98,
People are here to help you out; they are not here to work for you. You got some good hints on how to achieve this and on shell scripting best practices.
Please respond after doing some extra work on your side.
Good Luck & Happy messaging!!!

tads98 · May 10, 2006, 8:30am

I do apologize if it appeared that way. Honestly, I spent the whole day reading about arrays and doing some scripting. Unfortunately, I am still trying to understand all of it. I just want to understand how the suggested script works. I can pick it up from there.
thank you all for your time and ideas!

jim_mcnamara · May 10, 2006, 11:36am

awk '{ if(match($0,/myURL/)>0) {arr[$1]=$0 } }
        END{for (i in arr )
                  print arr[i] } '  myweblog > somefile

match() does grep-like regular expressions in awk. There is also a ~ (match operator)
which makes the code harder to read if you are not used to it.
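
For comparison, the same filter written with the ~ operator might look like the sketch below. The sample lines, the temporary file and the variable name last_hits are assumptions for illustration only.

```shell
tmp=$(mktemp -d)
# Hypothetical log: the second line does not contain myURL.
cat > "$tmp/sample.log" <<'EOF'
123.45.67.89 050920060101 GET myURL http://ABC.com
123.45.67.89 050920060102 GET otherURL http://ABC.com
123.45.67.90 050920060103 GET myURL http://XYZ.com
EOF

# $0 ~ /myURL/ is true when the whole line matches the regular expression.
# Used as a pattern in front of the action it replaces the if(match(...)).
# A bare /myURL/ pattern is shorthand for the same test.
# Sorted only because for-in order is unspecified in awk.
last_hits=$(awk '$0 ~ /myURL/ { arr[$1] = $0 }
    END { for (i in arr) print arr[i] }' "$tmp/sample.log" | sort)

printf '%s\n' "$last_hits"
rm -r "$tmp"
```

Lines without myURL never reach the array, so for 123.45.67.89 the stored line is its last myURL line rather than its last line overall.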

tads98 · May 10, 2006, 4:07pm

jim mcnamara:

awk '{ if(match($0,/myURL/)>0) {arr[$1]=$0 } }
   END{for (i in arr )
   print arr[i] } '  myweblog > somefile
match() does grep-like regular expressions in awk. There is also a ~ (match operator)
which makes the code harder to read if you are not used to it.

thanks jim!