Checking a pattern in a file and the count of characters

I have a gzipped log file containing lines like the following -

98.70.217.222 - - [08/Jul/2012:09:14:29 +0000] "GET /liveupdate-aka.symantec.com/1340071490jtun_nav2k8enn09m25.m25?h=abcdefgh HTTP/1.1" 200 159229484 "-" "hBU1OhDsPXknMepDBJNScBj4BQcmUz5TwAAAAA" "-"

In this line we only need to consider the following components:
/liveupdate-aka.symantec.com/1340071490jtun_nav2k8enn09m25.m25?h=abcdefgh : is called the URL

200: is called the response code.
h=abcdefgh : is called the query string.
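
For reference, when a line is split on spaces the URL is the 7th field and the response code is the 9th (assuming every line follows the same layout as the sample above), e.g.

gunzip -c * | awk '{print $7, $9}' | head -1

prints

/liveupdate-aka.symantec.com/1340071490jtun_nav2k8enn09m25.m25?h=abcdefgh 200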

I am trying to write a script which does the following:
1.) Count of each URL that occurs 10000 or more times with a non-successful response code (basically anything other than a 200, 206 or 304 response code) and does not contain the following patterns in the URL: '/F200%5E*', '/F0%5E*' and '/F100%5E*'

2.) Count of each URL (excluding the query string) that is 800 or more characters in length and does not contain the following patterns in the URL: '/F200%5E*', '/F0%5E*' and '/F100%5E*'

I tried to do this with the following command:

gunzip -c * |cut -d ' ' -f7|sort -n|uniq -c|grep '^.*\/[^?]*'|grep '.\{800,\}'

It needs some changes to get the desired output.

Your help is appreciated.
Thx

This is for the first one:

awk '$9~/(200|206|304)/&&$7!~/(F200%5E|F0%5E|F100%5E)/{a[$7]++}END{for(i in a)if(a[i]>=10000){print a[i],i}}' file

You can do something similar for the second one
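
A sketch of that second one, assuming the same space-separated layout and reading "800 characters" as 800 or more once the query string is stripped, could be:

awk '$7!~/(F200%5E|F0%5E|F100%5E)/{u=$7;sub(/\?.*/,"",u);if(length(u)>=800)a[u]++}END{for(i in a)print a[i],i}' file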

Last time I read the HTTP spec, that part was called the URI: the portion of the URL after the host.

Is this one task, not two?

sed can do the work of both grep and cut, so that only the desired URLs come out clean on the sed output, ready to be sorted into a most-popular list of anything over 9999:

sed '
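    # reduce a GET request line to "URL code"; on no match, fall through to d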
    s/.*+0000\] "GET \(\/[^ ]*\) HTTP\/[0-9.]*" \([1-9][0-9]*\) .*/\1 \2/
    t n
    d
    :n
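    # delete successful responses and the excluded URL prefixes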
    / 20[06]$/d
    / 304$/d
    /^\/F[21]00%5E/d
    /^\/F0%5E/d
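    # delete URLs of 800 (8x99+8) or more characters, then strip the response code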
    /[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{8\}/d
    s/ .*//
  ' | sort | uniq -c | grep '^ *[1-9][0-9][0-9][0-9][0-9]' | sort -nr

Sometimes, for speed, I break up the sed into a long pipe of mixed sed and grep so the work spreads across processes, since the /.../d commands in this sed are essentially "grep -v"; putting the best eliminator first speeds things up. For many gzipped files, on UNIX systems with bash and /dev/fd/#, you can go parallel by dividing the files into (#cores x 2) lists (assuming 50% I/O-bound processing) and replacing the first 'sort' with:

sort -m <(
    gzcat $list1 | ... |sort
 ) <(
    gzcat $list2 | ... |sort
 ) <(
    gzcat $list3 | ... |sort
 ) <(
    gzcat $list4 | ... |sort
 ) 
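
The mixed sed and grep version of the filter might look roughly like this (only a sketch; the gunzip -c * front end is assumed from the original attempt, and each grep -v mirrors one of the /.../d commands of the sed above):

gunzip -c * |
  sed -n 's/.*+0000\] "GET \(\/[^ ]*\) HTTP\/[0-9.]*" \([1-9][0-9]*\) .*/\1 \2/p' |
  grep -v ' 20[06]$' |
  grep -v ' 304$' |
  grep -v '^/F[21]00%5E' |
  grep -v '^/F0%5E' |
  grep -v '[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{99\}[^ ]\{8\}' |
  sed 's/ .*//' |
  sort | uniq -c | grep '^ *[1-9][0-9][0-9][0-9][0-9]' | sort -nr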

Thx for your reply Subbeh, but when I tried this command with a[i]>=10 it's not giving me any result.

awk '$9~/(200|206|304)/&&$7!~/(F200%5E|F0%5E|F100%5E)/{a[$7]++}END{for(i in a)if(a[i]>=10){print a[i],i}}' url.log
$ cat url.log
98.70.217.222 - - [08/Jul/2012:09:14:29 +0000] "GET /liveupdate-aka.symantec.com/1340071490jtun_nav2k8enn09m25.m25?h=abcdefgh HTTP/1.1" 200 159229484 "-" "hBU1OhDsPXknMepDBJNScBj4BQcmUz5TwAAAAA" "-"

Try changing the number (10 in your case) to 1 and see what happens. The first column of the output should show the total per URL.

If you only need the first part of the URL, without "?h=abcdefgh", use this:

awk '$9~/(200|206|304)/&&$7!~/(F200%5E|F0%5E|F100%5E)/{gsub(/\?.*/,"",$7);a[$7]++}END{for(i in a)if(a[i]>=1){print a[i],i}}' file