How to search backwards in a log file by timestamp of entries?

Hello. I'm not nearly good enough with awk/perl to create the logfile scraping script that my boss is insisting we need immediately. Here is a brief 3-line excerpt from the access.log file in question (actual URL domain changed to 'aaa.com'):

209.253.130.36 - - [23/Sep/2009:12:55:44 -0700] "GET /images/products/en_us/pc/detail/273595_dt.jpg HTTP/1.1" 200 28520 "http://www.aaa.com/product/holiday+parties/halloween+party+supplies.do?" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; FunWebProducts; .NET CLR 1.1.4322)" 22134 "__utma=8470452.136497171.1253643073.1253655989.1253731688.3; __utmb=8470452.4.10.1253731688; __utmz=8470452.1253643073.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_cc=true"
99.60.55.157 - - [23/Sep/2009:12:55:45 -0700] "GET /mod/productquickview/includes/themes/default.css HTTP/1.1" 200 767 "http://www.aaa.com/home.do?" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)" 14097 "customer=none; basket=none; __utma=8470452.1058319807.1252542208.1252547047.1252713609.3; __utmz=8470452.1252542208.1.1.utmcsr=yahoo|utmccn=(organic)|utmcmd=organic|utmctr=aaa; JSESSIONID=j0d7VJsXNBv6ztnpOp"
198.7.255.226 - - [23/Sep/2009:12:55:46 -0700] "GET /images/products/en_us/gateways/costumes_R_01_C_01.jpg HTTP/1.1" 200 30097 "http://www.aaa.com/category/costumes+%26+accessories.do" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)" 12334 "s_cc=true"

So the lines start with an IP address, followed by the date and then the time. We want to search only the last 10 minutes of the file (say, if the current time is 11:40, we only want to look at lines going back to 11:30). I've got the code to convert the current time into an epoch scalar, subtract 600 seconds, and store that time as single-character variables (i.e. $a = 1, $b = 1, $c = 3, $d = 0).
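(For reference, a minimal shell sketch of that cutoff computation; the variable names are made up, and it assumes GNU date with its -d option:)

```shell
# Sketch: current time minus 600 seconds, as epoch seconds and as HH:MM:SS.
# Assumes GNU date (the -d "@epoch" form).
now=$(date +%s)                         # current time, epoch seconds
cutoff=$((now - 600))                   # 10 minutes ago
hhmmss=$(date -d "@$cutoff" +%H:%M:%S)  # same instant formatted as HH:MM:SS
echo "$cutoff $hhmmss"
```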

But I need help with an awk (or other?) line of code that will parse each entry in the log file, skip over the IP and the date, and match against the timestamp only. What's more, we'd like it to start from the bottom of the file (i.e. with the most recent entry) and work backwards... and then hopefully stop the search when it hits the first entry that does NOT fall within the past 10 minutes (because the log file is very, very large!).

Any and all help or suggestions would be monumentally appreciated.

Use File::ReadBackwards.

See the how-to: reading a file backwards | Perl HowTo

#!/usr/bin/perl
use strict;
use warnings;
use File::ReadBackwards;
use Time::Local;

my %mon = (Jan=>0, Feb=>1, Mar=>2, Apr=>3,  May=>4,  Jun=>5,
           Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11);
my $cutoff = time() - 600;    # 10 minutes ago, in epoch seconds
my @lines;

my $fh = File::ReadBackwards->new('access.log')
    or die "can't read file: $!\n";

while ( defined(my $line = $fh->readline) )
{
    # capture [dd/Mon/yyyy:hh:mm:ss from each entry
    if ($line =~ /\[(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2})/)
    {
        my $ts = timelocal($6, $5, $4, $1, $mon{$2}, $3);
        last if $ts < $cutoff;    # older than 10 min: exit the loop
        push @lines, $line;       # within range: add to the array
    }
}
foreach my $line (reverse @lines)
{
    # process each line as needed (oldest first)
    print $line;
}

For the backwards part, you can use tac. I had some doubts about its efficiency on large files, but I just ran some tests and, to my great surprise, it is almost as efficient as cat.

Now the parse and time-test part. Prerequisites:

  • the sample file is formatted exactly like the one you provided. Otherwise you can adjust the field offsets by playing around with the $i's
  • you have GNU awk at hand; that's needed for the systime() and mktime() functions. If not, see the remark below.
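As a quick sanity check on those field offsets, you can feed one sample line through the same field separator and print just the timestamp fields (a throwaway one-liner, not part of the script):

```shell
# Show which fields FS="[ /:[]" puts the timestamp parts in:
echo '209.253.130.36 - - [23/Sep/2009:12:55:44 -0700] "GET / HTTP/1.1" 200 1' |
awk -F'[ /:[]' '{ printf "day=%s mon=%s year=%s time=%s:%s:%s\n", $5,$6,$7,$8,$9,$10 }'
# prints: day=23 mon=Sep year=2009 time=12:55:44
```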

parselog.awk

BEGIN{
    FS="[ /:[]"                 # split on space, slash, colon and bracket
    now=systime()               # current time in epoch seconds
    str="Jan_Feb_Mar_Apr_May_Jun_Jul_Aug_Sep_Oct_Nov_Dec"
    split(str, m, "_")
    for (i in m) mm[m[i]]=i     # month name -> month number
}
{
    # $5=day $6=month $7=year $8=hour $9=min $10=sec
    timestamp=mktime(sprintf("%s %s %s %s %s %s", $7,mm[$6],$5,$8,$9,$10))
    if (timestamp < (now-600)){
        exit                    # older than 10 minutes: stop scanning
    }
    print
}

To run that snippet:

$ tac your.log | awk -f parselog.awk

The awk program will stop and exit as soon as it hits a line with a timestamp that is more than 10 minutes old. That exit switch is there to prevent awk from continuing to scan the remaining lines, which we know will never satisfy the timestamp condition.

If you don't have GNU awk, let us know. There is a workaround using awk's pipe I/O and the shell's date command.
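For what it's worth, here is a rough sketch of that kind of workaround: it shells out to date(1) via getline from a pipe instead of calling mktime(), so it assumes GNU date's -d option (and an English locale), and it spawns one date process per line, which is much slower. The two sample entries below are made up, one old and one stamped right now:

```shell
# Sketch: convert each log timestamp with date(1) instead of gawk's mktime().
# Assumes GNU date (-d) and a C/English locale for the %b month name.
out=$( {
  printf '1.2.3.4 - - [23/Sep/2009:12:55:44 -0700] "GET /old HTTP/1.1" 200 1\n'
  printf '1.2.3.4 - - [%s -0700] "GET /new HTTP/1.1" 200 1\n' \
         "$(date '+%d/%b/%Y:%H:%M:%S')"
} | tac | awk -v now="$(date +%s)" -F'[ /:[]' '
{
    # rebuild "23 Sep 2009 12:55:44" and let date(1) parse it
    cmd = "date -d \"" $5 " " $6 " " $7 " " $8 ":" $9 ":" $10 "\" +%s"
    cmd | getline ts
    close(cmd)
    if (ts < now - 600) exit   # older than 10 minutes: stop scanning
    print
}' )
printf '%s\n' "$out"           # only the fresh entry survives
```

On a real log you would replace the printf block with `tac your.log`; the awk part is unchanged.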