Need Time Stamp Range On Log Files

I have created this script

#!/bin/sh
FILES=/data/log/access_*.log
for f in $FILES
do
echo "Processing $f file"
cat $f | awk '{print $1}' | sort | uniq -c | sort -n | tail
done

It produces this output

Processing /data/log/access_abc.log file
     
    114 1.1.1.1
    167 2.2.2.2
   2258 3.3.3.3
Processing /data/log/access_def.log file
    102 1.1.1.1
    102 2.2.2.2
   2040 3.3.3.3

This is the log file it comes from

1.1.1.1 - - [01/Jul/2015:10:59:29 -0400] 
         "GET /themes/warehouse/img/arrow_right_2.png HTTP/1.1" 200 149 
         "http://www.abcd.com/content/152-Tea_Tree_Oil_Uses_sp_153" 
         "Mozilla/5.0 (iPad; CPU OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F69 Safari/600.1.4"

(edited to show whole line - was a single long line)

I need to be able to specify the date and time range in the script. For example, I need to know what IP's were active 02:20 -02:30 June 30th, 2015.

Please help me to figure out the script that will permit this range of times reporting.

Thanks,

Randal

What operating system and shell are you using? (The tools available to process timestamps vary considerably from OS to OS, and in some cases from shell to shell.)

Are your log files sorted in increasing timestamp order?

Will the timestamp range you want to select ever cross day boundaries or will the start and end points always be on the same date?

Can you specify the timestamp format that the user has to enter for the start and end points (e.g., YYYYMMDDhhmmss , YYYYMMDD:hh:mm:ss , or DD/Mon/YYYY:hh:mm:ss )? If so, have you chosen a format, or can we select one that makes processing easier? If not, what format(s) do(es) your script need to be able to interpret?

Will all of the offsets from GMT (i.e., -0400 ) be the same for the timestamp ranges you're interested in? (Your script will be much more complex if log entries are coming from various timezones around the world or if your script has to make adjustments for timestamp ranges crossing shifts to and from daylight savings time.)

Is the output supposed to change at all (other than reducing the number of selected entires being processed)? If so, how?

Hi Don,

Thanks for getting back to me. I am using Centos 6 and Apache as an operating system.

My logs are sorted in increasing time stamp order.

1.1.1.1 - - [01/Jul/2015:10:59:29 -0400]
1.1.1.1 - - [01/Jul/2015:10:59:29 -0400] 
2.2.2.2 - - [01/Jul/2015:10:59:30 -0400] 

Yes, the range can cross day boundaries.

The time stamp format is
[01/Jul/2015:10:59:29 and it needs to be in this format DD/Mon/YYYY:hh:mm:ss

all of the offsets are the same which is GMT -0400

The output is the same as I posted earlier. I just need to find the most active IP's during a specific time stamp range.

Thanks,

I can't even guess how many times similar/identical requests have been posted and answered in these forums. Nevertheless, try (as a starter, NO error checking, day boundary crossing) and adapt

awk ' 
NR==1           {split (START, T, ":")
                 START=T[1]*60+T[2]
                 split (STOP, T, ":")
                 STOP=T[1]*60+T[2]
                }

FNR==1          {print "processing " FILENAME " file"}

                {sub (/\[/, _, $4)
                 split ($4, T, ":")
                 TIME=60*T[2]+T[3]
                }

DAY=T[1] &&
TIME >= START &&
TIME <= STOP    {print FILENAME, $1}
' DAY="01/Jul/2015" START="10:50" STOP="10:59" $FILES

Hi RudiC,

I appreciate the code but since the file location wildcards aren't specified, I need to know where in your script they should be specified.

I am sure you are correct about the many times this question has been asked but unless you know the keywords that were used they are hard to find. I looked up wildcards and range on the unix.com site as well as on the net and never could get an understandable answer I could use.

So I really appreciate you volunteering this code.

The files are specified in the FILES variable, as you did in post#1.

For other posts on the same topic, look at the bottom of this page.

How about this, using GNU date and gawk as they should be available on Centos 6:

#!/bin/bash

if (( $# < 3 || $# > 4 ))
then
   printf "Usage: $0 from_date from_time [to_date] to_time\n" >&2
   exit 2
fi

FDAY=$1
FTIME=$2

if (( $# == 3 ))
then
    TDAY=$FDAY
    TTIME=$3
else
    TDAY=$4
    TTIME=$3
fi

FROM=$(date -d "$FDAY $FTIME" +%s)
(($? != 0 )) && exit 3
TO=$(date -d "$TDAY $TTIME" +%s)
(($? != 0 )) && exit 4

if (( $# == 3 && TO < FROM ))
then
   #FROM time later that TO time so add a day
   (( TO+=3600*24))
fi

if (( TO < FROM ))
then
    echo "$0: FROM date must be before TO date" >&2
    exit 5
fi


echo "Examining from $(date -d @$FROM) ($FROM)"
echo "            to $(date -d @$TO) ($TO)"
echo
FILES=/data/log/access_*.log

gawk -v F=$FROM -v T=$TO -v debug=0 '
FNR==1 {
    for(ip in C) printf "%7d %s\n", C[ip], ip
    delete C
    print "Processing " FILENAME " file"
}
$NF == "-0400]" {
  split($(NF-1),v,"[[/: ]")
  mnum=int(index("JanFebMarAprMayJunJulAugSepOctNovDec", v[3])/3)
  tm=mktime(v[4] " " mnum " " v[2] " " v[5] " " v[6] " " v[7])

  if (tm >= F && tm <= T) C[$1]++
  else if(debug) print tm " not between " F " and " T
}
END {for(ip in C) printf "%7d %s\n", C[ip], ip} ' $FILES

You can use it like this:

$ ./logcount "1 June 2015" 10:30 2:00
Examining FROM Mon, Jun 01, 2015 10:30:00 AM (1433118600)
            TO Tue, Jun 02, 2015  2:00:00 AM (1433174400)

Processing /data/log/access_abc.log file
    114 1.1.1.1
    167 2.2.2.2
   2258 3.3.3.3
Processing /data/log/access_def.log file
    102 1.1.1.1
    102 2.2.2.2
   2040 3.3.3.3
1 Like

Hi Chubler_XL,

This looks really good especially being able to pass the time stamps to the program. However, it seems to only work part way. It is not givng any ip numbers or their counts.

[root@ip-3.3.3.3 log]# ./gawk.sh "30 June 2015" 02:20 02:30
Examining from Tue Jun 30 02:20:00 EDT 2015 (1435645200)
            to Tue Jun 30 02:30:00 EDT 2015 (1435645800)

Processing /data/log/access_bhp.log file
Processing /data/log/access_hpc.log file
Processing /data/log/access_tfl.log file
Processing /data/log/access_thp.log file

Those are the actual file names but looking at your code I couldn't see that would cause the problem. Am I wrong?

It's probably the format of your input file. This piece of code $NF == "-0400]" { looks for a line ending in "-0400]",

Perhaps your actual file isn't split on multi lines like the posted example?

The following may work better for 1 line per log entry:

$5 == "-0400]" {
  split($4,v,"[[/: ]")

Also try debug=1 to debug time range mismatches.

I tried your suggested code change but it still isn't producing any output.

Here is the code after the change

$5 == "-0400]" {
  split($4,v,"[[/: ]")
  mnum=int(index("JanFebMarAprMayJunJulAugSepOctNovDec", v[3])/3)
  tm=mktime(v[4] " " mnum " " v[2] " " v[5] " " v[6] " " v[7])

  if (tm >= F && tm <= T) C[$1]++
  else if(debug) print tm " not between " F " and " T
}
END {for(ip in C) printf "%7d %s\n", C[ip], ip} ' $FILES

If you need more detail on the log file let me know how to do it. I have never used gawk so I don't know where to insert the debug=1 statement. I tried it right under the shebang but no output was generated.

If you look in Chubbier_XL's suggested code, the gawk script started with:

gawk -v F=$FROM -v T=$TO -v debug=0 '

To enable debugging, change the -v debug=0 to -v debug=1 .

1 Like

Well there are 3 possible causes of the records not matching.

  1. mktime is not seeing a valid date string and it will set tm to -1
  2. The tm value is correct but logfile date/time is not falling in between the F and T range
  3. the records are not being processed because field 5 is not "-0400]"

Changing debug=0 to debug=1 will help you identify problem 1 and 2

For problem 3 try examining your logfile awk splits fields up using space as the separator so for the expected logfile format of:

1.1.1.1 - - [01/Jul/2015:10:59:29 -0400] 
         "GET /themes/warehouse/img/arrow_right_2.png HTTP/1.1" 200 149 
         "http://www.abcd.com/content/152-Tea_Tree_Oil_Uses_sp_153" 
         "Mozilla/5.0 (iPad; CPU OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F69 Safari/600.1.4"

We should get

$1=1.1.1.1
$2=-
$3=-
$4=01/Jul/2015:10:59:29
$5=-0400]

If you can't spot the problem try and change the code to:

gawk -v F=$FROM -v T=$TO -v debug=1 '
{for(i=1;i<=NF;i++) printf "$%d=%s\n", i, $i }
FNR==1 {
    for(ip in C) printf "%7d %s\n", C[ip], ip
    delete C
    print "Processing " FILENAME " file"
}
...

This will produce a lot of output so get ready to try CTRL-C or pipe it to head -50

Thanks to Don he pointed me in the right direction of debug=1

It seems that when I run the script I am getting this output

1432820036 not between 1433169000 and 1433224800
1432820037 not between 1433169000 and 1433224800
1432820037 not between 1433169000 and 1433224800
1432820037 not between 1433169000 and 1433224800
1432820037 not between 1433169000 and 1433224800
1432820037 not between 1433169000 and 1433224800
1432820037 not between 1433169000 and 1433224800
1432820060 not between 1433169000 and 1433224800
1432820061 not between 1433169000 and 1433224800
1432820061 not between 1433169000 and 1433224800
1432820061 not between 1433169000 and 1433224800
1432820061 not between 1433169000 and 1433224800

I just used your sample input of

./gawk.sh "1 June 2015" 10:30 2:00

I also ran it with out the e on June since my log uses Jun

./gawk.sh "1 Jun 2015" 10:30 2:00

but I got the same results.

I ran

./gawk.sh "02 Jul 2015" 01:50 02:30

with debug=0 I get this

 ./gawk.sh "02 Jul 2015" 01:30 3:20
Examining from Thu Jul  2 01:30:00 EDT 2015 (1435815000)
            to Thu Jul  2 03:20:00 EDT 2015 (1435821600)

Processing /data/log/access_bhp.log file
Processing /data/log/access_hpc.log file
Processing /data/log/access_tfl.log file
Processing /data/log/access_thp.log file

with debug=1 I get

1432822160 not between 1435816200 and 1435818600
1432822160 not between 1435816200 and 1435818600
1432822160 not between 1435816200 and 1435818600
1432822160 not between 1435816200 and 1435818600
1432822160 not between 1435816200 and 1435818600
1432822160 not between 1435816200 and 1435818600
1432822160 not between 1435816200 and 1435818600
1432822161 not between 1435816200 and 1435818600
1432822161 not between 1435816200 and 1435818600
1432822165 not between 1435816200 and 1435818600
1432822175 not between 1435816200 and 1435818600
1432822178 not between 1435816200 and 1435818600

By the way field 5 is -0400]

I'll reply again with the latest suggested code change output.

---------- Post updated at 05:43 PM ---------- Previous update was at 05:28 PM ----------

You're correct it does produce a lot of output. I will include the actual ip on this one entry because it is bingbot

$1=157.55.39.187
$2=-
$3=-
$4=[28/Jun/2015:16:27:29
$5=-0400]
$6="GET
$7=/content/10-customer-testimonials
$8=HTTP/1.1"
$9=200
$10=12025
$11="-"
$12="Mozilla/5.0
$13=(compatible;
$14=bingbot/2.0;
$15=+http://www.bing.com/bingbot.htm)"
1432844849 not between 1435816200 and 1435818600

If you want me to include something specific let me know.

Here is the entire script just to make sure I don't have something wrong

#!/bin/bash

if (( $# < 3 || $# > 4 ))
then
   printf "Usage: $0 from_date from_time [to_date] to_time\n" >&2
   exit 2
fi

FDAY=$1
FTIME=$2

if (( $# == 3 ))
then
    TDAY=$FDAY
    TTIME=$3
else
    TDAY=$4
    TTIME=$3
fi

FROM=$(date -d "$FDAY $FTIME" +%s)
(($? != 0 )) && exit 3
TO=$(date -d "$TDAY $TTIME" +%s)
(($? != 0 )) && exit 4

if (( $# == 3 && TO < FROM ))
then
   #FROM time later that TO time so add a day
   (( TO+=3600*24))
fi

if (( TO < FROM ))
then
    echo "$0: FROM date must be before TO date" >&2
    exit 5
fi


echo "Examining from $(date -d @$FROM) ($FROM)"
echo "            to $(date -d @$TO) ($TO)"
echo
FILES=/data/log/access_*.log

gawk -v F=$FROM -v T=$TO -v debug=1 '
{for(i=1;i<=NF;i++) printf "$%d=%s\n", i, $i }
FNR==1 {
    for(ip in C) printf "%7d %s\n", C[ip], ip
    delete C
    print "Processing " FILENAME " file"
}

$5 == "-0400]" {
  split($4,v,"[[/: ]")
  mnum=int(index("JanFebMarAprMayJunJulAugSepOctNovDec", v[3])/3)
  tm=mktime(v[4] " " mnum " " v[2] " " v[5] " " v[6] " " v[7])

  if (tm >= F && tm <= T) C[$1]++
  else if(debug) print tm " not between " F " and " T
}
END {for(ip in C) printf "%7d %s\n", C[ip], ip} ' $FILES

I noticed this output is including June 28 entries but the code I used was

./gawk.sh "02 Jul 2015" 01:50 02:30

I picked that time frame because there are over 30 entries with the same IP since that is the time I run my sitemap program. It seems it is ignoring what date I am putting in.

Not much need to add the detailed debugging as debug=1 is revealing the proglem... the times in your logfiles are not within the specified date range:

$ date -d@1432820036 
Thu, May 28, 2015  9:33:56 AM

$ date -d @1432820061
Thu, May 28, 2015  9:34:21 AM

Why not try a small test file with a few entries that are within the specified date/time range, you are probably being swamped by thousands of entries that are genuinely being skipped.

Looking at the debug ouput everything seems to be working fine, Field 4 has been rendered into the correct julian timestamp:

date -d @1432844849 
Thu, May 28, 2015  4:27:29 PM

And this in not between 1:50AM and 2:30AM on Jul-2

Just to be sure, you are filtering on 1:50AM to 2:30AM on 2 July, if you want PM use 24-hour notation e.g. 13:40 to 15:30

I changed the test to the same times you specified

./gawk.sh "28 May 2015" 09:33:56 09:34:21

and it is still putting this out with debug=1

1432814821 not between 1432820036 and 1432820061
1432814822 not between 1432820036 and 1432820061
1432814855 not between 1432820036 and 1432820061
1432814856 not between 1432820036 and 1432820061
1432814856 not between 1432820036 and 1432820061
1432814877 not between 1432820036 and 1432820061
1432814878 not between 1432820036 and 1432820061
1432814886 not between 1432820036 and 1432820061
1432814887 not between 1432820036 and 1432820061
1432814896 not between 1432820036 and 1432820061
1432814897 not between 1432820036 and 1432820061
1432814899 not between 1432820036 and 1432820061
1432814901 not between 1432820036 and 1432820061
1432814915 not between 1432820036 and 1432820061
1432814951 not between 1432820036 and 1432820061
1432814952 not between 1432820036 and 1432820061
1432814953 not between 1432820036 and 1432820061
1432814953 not between 1432820036 and 1432820061
1432814954 not between 1432820036 and 1432820061
1432814954 not between 1432820036 and 1432820061
1432814954 not between 1432820036 and 1432820061
1432814954 not between 1432820036 and 1432820061

I hope this is how you wanted me to test it.

I tested for gnu date and it is installed and so is Gawk.

Thanks for your continued help.

Everything appears to be working as intended I'd suggest turning off debug and removing the extra logging code I got you to add as the actual output is probably getting lost in all the debug.

Ensure your date/time range is correct (use 24hour times or AM/PM). If still nothing try a big date/time range eg:

./gawk.sh "1 Jan 2015" 3am "today" 10pm

When I run

 ./gawk.sh "28 May 2015" 09:33:56 09:34:21
Examining from Thu May 28 09:33:56 EDT 2015 (1432820036)
            to Thu May 28 09:34:21 EDT 2015 (1432820061)

Processing /data/log/access_bhp.log file
Processing /data/log/access_hpc.log file
Processing /data/log/access_tfl.log file
     25 2.2.2.2
     27 3.3.3.3
Processing /data/log/access_thp.log file

If I change it I get

./gawk.sh "02 Jul 2015" 01:50 02:30
Examining from Thu Jul  2 01:50:00 EDT 2015 (1435816200)
            to Thu Jul  2 02:30:00 EDT 2015 (1435818600)

Processing /data/log/access_bhp.log file
Processing /data/log/access_hpc.log file
Processing /data/log/access_tfl.log file
Processing /data/log/access_thp.log file

If I add the seconds I get the same

./gawk.sh "02 Jul 2015" 01:50:15 02:30:15Examining from Thu Jul  2 01:50:15 EDT 2015 (1435816215)
            to Thu Jul  2 02:30:15 EDT 2015 (1435818615)

Processing /data/log/access_bhp.log file
Processing /data/log/access_hpc.log file
Processing /data/log/access_tfl.log file
Processing /data/log/access_thp.log file

The big time range worked as needed. So what does that tell us?

Thanks,

It tells us that your awk script did not find any timestamps for July 2, 2015 between 1:50am and 2:30:15am in the files /data/log/access_bhp.log , /data/log/access_hpc.log , /data/log/access_tfl.log , and /data/log/access_thp.log . Can you show us any lines from any of those files with timestamps in that range that would lead you to believe that the script is working incorrectly?

Hi Don,

Yes, since I can't do a range easily I picked 01:55 AM on July 2nd. During that time for access_thp.log there are 51 entries.

I won't include them all since I need to change the ip addresses.

1.1.1.1 - - [02/Jul/2015:01:55:57 -0400] "GET /content/421-ahmtrust HTTP/1.0" 200 58071 "-" "Sphider"
207.46.13.135 - - [02/Jul/2015:01:55:57 -0400] "GET /Liquid_Herbs_page_1_c_11.html HTTP/1.1" 302 25 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
207.46.13.135 - - [02/Jul/2015:01:55:58 -0400] "GET /index.php?controller=category&id_category=21 HTTP/1.1" 301 25 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
1.1.1.1 - - [02/Jul/2015:01:55:57 -0400] "HEAD /content/422-ahmunbelief HTTP/1.1" 200 - "-" "Sphider"
2.2.2.2 - - [02/Jul/2015:01:55:59 -0400] "GET /themes/warehouse/js/script.js HTTP/1.1" 200 1313 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 5.0; SM-N900T Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
2.2.2.2 - - [02/Jul/2015:01:55:59 -0400] "GET /themes/warehouse/cache/50ca4d40aa6b13dfe15d7583bbe75eea.js HTTP/1.1" 200 69947 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 5.0; SM-N900T Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
2.2.2.2 - - [02/Jul/2015:01:55:59 -0400] "GET /themes/warehouse/cache/9f19013204b5f3ce3d256dea73bb91e5_all.css HTTP/1.1" 200 42230 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 5.0; SM-N900T Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
2.2.2.2 - - [02/Jul/2015:01:55:59 -0400] "GET /content/152-Tea_Tree_Oil_Uses_sp_153 HTTP/1.1" 200 17579 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 5.0; SM-N900T Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
207.46.13.135 - - [02/Jul/2015:01:55:59 -0400] "GET /21-Liquid_Herbs_page_1_c_11 HTTP/1.1" 200 16273 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

I left bingbot's ip but the 1.1.1.1 represents our ip running the sitemap cron and the 2.2.2.2 is a visitor of some kind. I picked that specific time because using cloudwatch I could see my cpu was getting hammered by bots.

Thanks for staying with me on this.

The awk on my system doesn't have the gawk date and time processing extensions you're using, so I can't take your code and your sample data and try it on my own.

Please save the data you showed us in post #19 in this thread in a file named /tmp/test.log and save the following slightly modified version of your script (the changes are marked in red):

#!/bin/bash

if (( $# < 3 || $# > 4 ))
then
   printf "Usage: $0 from_date from_time [to_date] to_time\n" >&2
   exit 2
fi

FDAY=$1
FTIME=$2

if (( $# == 3 ))
then
    TDAY=$FDAY
    TTIME=$3
else
    TDAY=$4
    TTIME=$3
fi

FROM=$(date -d "$FDAY $FTIME" +%s)
(($? != 0 )) && exit 3
TO=$(date -d "$TDAY $TTIME" +%s)
(($? != 0 )) && exit 4

if (( $# == 3 && TO < FROM ))
then
   #FROM time later that TO time so add a day
   (( TO+=3600*24))
fi

if (( TO < FROM ))
then
    echo "$0: FROM date must be before TO date" >&2
    exit 5
fi


echo "Examining from $(date -d @$FROM) ($FROM)"
echo "            to $(date -d @$TO) ($TO)"
echo
FILES=/data/log/access_*.log

gawk -v F=$FROM -v T=$TO -v debug=1 '
{for(i=1;i<=NF;i++) printf "$%d=%s\n", i, $i }
FNR==1 {
    for(ip in C) printf "%7d %s\n", C[ip], ip
    delete C
    print "Processing " FILENAME " file"
}

$5 == "-0400]" {
  split($4,v,"[[/: ]")
  mnum=int(index("JanFebMarAprMayJunJulAugSepOctNovDec", v[3])/3)
  tm=mktime(v[4] " " mnum " " v[2] " " v[5] " " v[6] " " v[7]) + 0
  if(debug) print "mtkime(" v[4] " " mnum " " v[2] " " v[5] " " v[6] " " v[7] "): " tm
  if (tm >= F && tm <= T) C[$1]++
  else if(debug) print tm " not between " F " and " T
}
END {for(ip in C) printf "%7d %s\n", C[ip], ip} ' /tmp/test.log

in a file named modified_gawk.sh , make it executable, and run it using the command line:

./modified_gawk.sh "02 Jul 2015" 01:55:58 01:56

and show us (in CODE tags) the output it produces.

This will allow us to see the actual seconds since the Epoch that gawk is calculating for known timestamps and hopefully will allow us to see what is going wrong without being overwhelmed with debugging data from large files containing lots of entries we don't care about.