Parse apache log file with three different time formats

Hi,

I want to parse the file below and write a function that extracts the log entries between two given timestamps.

Apache (Unix) Log Samples - MonitorWare

The challenge is that the logs use three different date and time formats.

First :- 07/Mar/2004:16:05:49
Second :- Sun Mar 7 16:02:00 2004
Third :- 29-Mar 15:18:20.54

I have sed commands that can do this, but they force the user to specify the format. I want this to be general. How can I achieve this? My idea is to parse the log file and create a new file with one consistent time format; after that, using sed or grep is pretty simple.

sed -n '/07\/Mar\/2004:16:05:49/,/07\/Mar\/2004:16:31:48/p' log

sed -n '/Sun Mar 7 16:02:00 2004/,/Mon Mar 8 00:11:22 2004/p' log
sed -n '/29-Mar 15:18:20.50/,/29-Mar 15:18:20.54/p' log

Please let me know a good way to achieve this. Any pointers would also help.

There are two basic approaches: one for Linux, another for non-Linux systems. So which one do you have? Knowing your shell would be helpful, too.

Also, utilities like GoAccess have nice analytics and reporting tools that handle several types of timestamp formats.

Try this to prefix the date/time to every log line:

awk -vDM="$(LC_ALL=C locale abday abmon)" '
BEGIN           {gsub (/;/, "|", DM)
                 split (DM, T)
                 MStr1 = "(" T[1] ") (" T[2] ") *[0-9]* [0-9:]* [0-9]*"
                 MStr2 = "[0-9]*/(" T[2] ")/[0-9:]* -[0-9]*"
                 MStr3 = "[0-9]*-(" T[2] ") [0-9:.]*"
                 MStr  = "(" MStr1 ")|(" MStr2 ")|(" MStr3 ")"
                }
match ($0, MStr)        {print substr ($0, RSTART, RLENGTH), $0
                        }
 ' /tmp/*log 

EDIT: or, somewhat simplified,

awk -vDM="$(LC_ALL=C locale abday abmon)" '
BEGIN           {gsub (/;/, "|", DM)
                 split (DM, T)
                 MStr1 = "(" T[1] ") (" T[2] ") *[0-9]* [0-9:]* [0-9]*"
                 MStr2 = "[0-9]*[-/](" T[2] ")(/[0-9:]* -| )*[0-9:.]*"
                 MStr  = "(" MStr1 ")|(" MStr2 ")"
                }
match ($0, MStr)        {print substr ($0, RSTART, RLENGTH), $0
                        }
' /tmp/*log 
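
Building on the idea of prefixing each line with its timestamp, the sketch below goes one step further and rewrites all three formats into a single sortable key (YYYYMMDDhhmmss), so that a plain string comparison can select a range. It is only an illustration, not part of the answer above: the function name normalize_log, the key layout, and the default year used for the year-less third format are assumptions, and it sticks to features found in any POSIX awk.

```shell
# Hypothetical helper: prefix each log line read from stdin with a sortable key.
# $1 is the year to assume for the year-less third format (default: current year).
normalize_log() {
    awk -v default_year="${1:-$(date +%Y)}" '
    # Jan starts at position 1, Feb at 4, ... so (pos + 2) / 3 gives 1..12
    function mon(name) {
        return (index("JanFebMarAprMayJunJulAugSepOctNovDec", name) + 2) / 3
    }
    # Build YYYYMMDDhhmmss from year, month name, day and an hh:mm:ss string
    function key(y, m, d, hms,    t) {
        split(hms, t, ":")
        return sprintf("%04d%02d%02d%02d%02d%02d", y, mon(m), d, t[1], t[2], t[3])
    }
    {
        k = ""
        if (match($0, /[0-9]+\/[A-Za-z]+\/[0-9]+:[0-9]+:[0-9]+:[0-9]+/)) {
            # first format: 07/Mar/2004:16:05:49
            split(substr($0, RSTART, RLENGTH), p, "[/:]")
            k = key(p[3], p[2], p[1], p[4] ":" p[5] ":" p[6])
        } else if (match($0, /[A-Za-z]+ +[0-9]+ +[0-9]+:[0-9]+:[0-9]+ +[0-9]+/)) {
            # second format, without the weekday: Mar 7 16:02:00 2004
            split(substr($0, RSTART, RLENGTH), p, " ")
            k = key(p[4], p[1], p[2], p[3])
        } else if (match($0, /[0-9]+-[A-Za-z]+ +[0-9]+:[0-9]+:[0-9]+/)) {
            # third format, fraction of a second dropped: 29-Mar 15:18:20
            split(substr($0, RSTART, RLENGTH), p, "[- ]+")
            k = key(default_year, p[2], p[1], p[3])
        }
        # lines without a recognizable timestamp are skipped
        if (k != "") print k, $0
    }'
}
```

A range could then be selected with something like normalize_log 2004 < log | awk '$1 >= "20040307160549" && $1 <= "20040307163148"', since keys of equal length compare correctly as strings.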

Classic approach: convert the dates to epoch seconds and simply compare them (classic as in unexciting, not extraordinarily short, simple logic).

#!/bin/sh

awk -vstart="$1" -vend="$2" ' 

BEGIN {
        start_epoch = mktime(start)
        end_epoch   = mktime(end)
}

function monthnumber(monthname) {
        # the name starts at position 1, 4, 7, ... so (position + 2) / 3 gives 1..12
        return (match("JanFebMarAprMayJunJulAugSepOctNovDec",monthname)+2)/3
}

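# Format 1: 07/Mar/2004:16:05:49 (not anchored, because the timestamp follows the host fields)
match($0,/([0-9]+)\/([a-zA-Z]+)\/([0-9]{4}):([0-9]{2}):([0-9]{2}):([0-9]{2})/,r) { 
        current=mktime( sprintf("%s %s %s %s %s %s", r[3],monthnumber(r[2]),r[1],r[4],r[5],r[6])); }

# Format 2: Sun Mar 7 16:02:00 2004 (one or more spaces, to allow padded single-digit days)
match($0,/[a-zA-Z]+ +([a-zA-Z]+) +([0-9]+) +([0-9]+):([0-9]+):([0-9]+) +([0-9]{4})/,r) { 
        current=mktime( sprintf("%s %s %s %s %s %s", r[6],monthnumber(r[1]),r[2],r[3],r[4],r[5])); }

# Format 3: 29-Mar 15:18:20.54 (no year in the log, so the current year is assumed)
match($0,/([0-9]+)-([a-zA-Z]+) +([0-9]+):([0-9]+):([0-9]+)/,r) { 
        current=mktime( sprintf("%s %s %s %s %s %s", strftime("%Y"),monthnumber(r[2]),r[1],r[3],r[4],r[5])); } 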

(current < start_epoch) { next }
(current > end_epoch  ) { exit }

1
' "$3"

run like this:

# call is: ./logsearch "YYYY mm dd HH MM SS" "YYYY mm dd HH MM SS" logfile

./logsearch "2010 10 24 16 34 00" "2020 10 25 23 59 00" my.log

 

Notes

  • This needs GNU awk
  • I assume the missing year in format #3 is the current year, which may not be the case. If the log and the search range both fall within the current year, this does not matter.
  • I do not take care of fractions of a second in format #3, so you get a bit more out of the log than you specify.
  • Not locale-aware (look at Rudi's post above for a possible method).
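
Regarding the first note: the GNU awk requirement comes from mktime() (and the array form of match()). Where only a plain POSIX awk is available, the epoch value can be computed arithmetically instead. The following is a minimal sketch under stated assumptions, not a drop-in replacement: the function name to_epoch is made up here, the input is treated as UTC with no time zone or DST handling, and only years from 1970 onwards are considered.

```shell
# Hypothetical helper: seconds since 1970-01-01 00:00:00 UTC using plain awk arithmetic.
# Arguments: year month day hour minute second (all numeric, interpreted as UTC).
to_epoch() {
    awk -v y="$1" -v m="$2" -v d="$3" -v H="$4" -v M="$5" -v S="$6" '
    BEGIN {
        # Let the year start in March, so that the leap day is the last day of the year
        if (m <= 2) { y -= 1; m += 9 } else { m -= 3 }
        era = int(y / 400)                      # 400-year Gregorian cycle
        yoe = y - era * 400                     # year within that cycle
        doy = int((153 * m + 2) / 5) + d - 1    # day within the March-based year
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        days = era * 146097 + doe - 719468      # days since 1970-01-01
        printf "%d\n", days * 86400 + H * 3600 + M * 60 + S
    }'
}
```

With this, to_epoch 2004 3 7 16 5 49 prints 1078675549, which can be compared numerically just like the mktime() results in the script.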

It's always best in my view to convert date and time strings to unixtime and do all calculations in unixtime and then convert the results back to a time string based on locale (local time information, timezone information, etc.).

It's kinda "nutty" in my view to try to manipulate / process time using formatted strings which are only a string representation of a "time" in the local time format.

That is why we store "time" in databases as unix timestamps. We do not, generally speaking, store "time" as a formatted time string.
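
As a small illustration of that round trip, the sketch below converts two timestamps to unixtime, does the arithmetic on the integers, and only formats the result as a string at the end. It assumes GNU awk is installed as gawk, since mktime() and strftime() are gawk extensions; the function name elapsed_and_back is made up for this example, and TZ=UTC is forced only to make the output reproducible.

```shell
# Hypothetical helper; requires GNU awk (gawk).
# Arguments: two timestamps in the "YYYY MM DD HH MM SS" form that mktime() expects.
elapsed_and_back() {
    TZ=UTC gawk -v a="$1" -v b="$2" '
    BEGIN {
        ta = mktime(a)                            # string -> epoch seconds
        tb = mktime(b)
        print tb - ta                             # calculation on plain integers
        print strftime("%Y-%m-%d %H:%M:%S", tb)   # epoch seconds -> string, for display only
    }'
}
```

Called as elapsed_and_back "2004 03 07 16 05 49" "2004 03 07 16 31 48", it prints the difference in seconds on the first line and the formatted end time on the second.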

If the logs are very large, it can be a good trick to read them backwards: the interesting part is more likely to be near the end of the file, so this may save reading tons of old lines.

#!/bin/sh
logfile="$3"

# reverse at the beginning to read from end to start
tac "$logfile" | awk -vstart="$1" -vend="$2" ' 

BEGIN {
        start_epoch = mktime(start)
        end_epoch   = mktime(end)
}

function monthnumber(monthname) {
        # the name starts at position 1, 4, 7, ... so (position + 2) / 3 gives 1..12
        return (match("JanFebMarAprMayJunJulAugSepOctNovDec",monthname)+2)/3
}

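# Format 1: 07/Mar/2004:16:05:49 (not anchored, because the timestamp follows the host fields)
match($0,/([0-9]+)\/([a-zA-Z]+)\/([0-9]{4}):([0-9]{2}):([0-9]{2}):([0-9]{2})/,r) { 
        current=mktime( sprintf("%s %s %s %s %s %s", r[3],monthnumber(r[2]),r[1],r[4],r[5],r[6])); }

# Format 2: Sun Mar 7 16:02:00 2004 (one or more spaces, to allow padded single-digit days)
match($0,/[a-zA-Z]+ +([a-zA-Z]+) +([0-9]+) +([0-9]+):([0-9]+):([0-9]+) +([0-9]{4})/,r) { 
        current=mktime( sprintf("%s %s %s %s %s %s", r[6],monthnumber(r[1]),r[2],r[3],r[4],r[5])); }

# Format 3: 29-Mar 15:18:20.54 (no year in the log, so the current year is assumed)
match($0,/([0-9]+)-([a-zA-Z]+) +([0-9]+):([0-9]+):([0-9]+)/,r) { 
        current=mktime( sprintf("%s %s %s %s %s %s", strftime("%Y"),monthnumber(r[2]),r[1],r[3],r[4],r[5])); } 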

# we have to swap the actions here!
(current < start_epoch) { exit }
(current > end_epoch  ) { next }

1
' | tac 
# and reverse again at the end to return to chronological order

Script call stays the same.