Pythonic Parsing

Experts and All,

Hello!

I am trying to write a simple script in Python, and it has taken me almost 5 hours to complete. I am using Python 3.6.

So, I am trying to read a log file, parse it, and answer this basic question: how many GETs and how many POSTs are there? I also want the counts sorted in ascending order.

I pieced everything together here and it works fine, but I am sure I have made it more complicated than it needs to be.

  1. Why should I push the data into a list (wordstring)?
  2. Why am I not able to parse out whether it is a GET or POST method from the httpd log file?

Please show me the way and, if you can, explain it to me in detail, or at least point me to the correct documentation.

manoharmahostav@ma-host:~/files$ python  log_file_analyse.py 
Stuff

GET: 1595922
PUT:      30
POST:      26

manoharmahostav@ma-host:
manoharmahostav@ma-host:~/files$ cat log_file_analyse.py 
#!/usr/bin/env python

import collections
from collections import Counter
from collections import defaultdict

#fname = 'testfile.txt'
fname = 'apache.log'

wordstring = []
c = collections.Counter()

with open(fname, 'r') as fh:
    for line in fh:
       if len(line.strip()):
           splitlines = line.split('"')[1]
           another = splitlines.split()[0]
           wordstring.append(another) 
           

c = Counter(wordstring)
print("Stuff")

for letter, count in c.most_common(30):
    print( '%s: %7d' % (letter, count))

manoharmahostav@ma-host:~/files$ 
manoharmahostav@ma-host:~/files$ head testfile.txt
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

---------- Post updated at 02:41 AM ---------- Previous update was at 12:41 AM ----------

After some more effort, here is what I have. It looks a bit cleaner, but it is not any faster than the previous version that I posted here.

Any help in getting a performance improvement would be much appreciated.
Sincerely,
Manohar.



manoharmahostav@ma-host:~/files$ cat abc.py 
#!/usr/bin/env python

import collections
from collections import Counter


somelist = []

with open('apache.log', 'r') as f:
     for line in f:
         splitlines = line.split('"')
         pat = splitlines[1]
         pat2 = pat.split(' ')[0]
         somelist.append(pat2)         


a = Counter(somelist)

print('Most Common:')
for d, b in a.most_common(10):
    print('%s: %10d' %(d, b))
manoharmahostav@ma-host:~/files$ 
manoharmahostav@ma-host:~/files$ python abc.py 
Most Common:
GET:    1595922
PUT:         30
POST:         26

Try using a dictionary instead of a list. A dictionary keeps one entry per unique key, so you store a single running count per method instead of appending every method to a list and counting afterwards.

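from collections import Counter

# Map each HTTP method to the number of times it appears.
method_counts = {}

with open('apache.log', 'r') as f:
    for line in f:
        fields = line.split('"')
        if len(fields) < 2:
            continue  # skip blank or malformed lines (no quoted request)
        request = fields[1].split()
        if not request:
            continue  # skip lines whose quoted request is empty
        method = request[0]
        method_counts[method] = method_counts.get(method, 0) + 1

# Counter is used here only for its most_common() ordering.
a = Counter(method_counts)

print('Most Common:')
for d, b in a.most_common(10):
    print('%s: %10d' % (d, b))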
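
On the first question (whether the list is needed at all): Counter accepts any iterable, so the methods can be fed to it straight from a generator with no intermediate list or hand-built dictionary. Below is a minimal sketch of that idea, not a tuned implementation. The function names request_methods, count_methods, ascending and print_report are made up for this example, and it assumes the log lines look like the testfile.txt sample above, with the request as the first double-quoted field.

```python
from collections import Counter


def request_methods(lines):
    """Yield the HTTP method from each log line that has a quoted request."""
    for line in lines:
        parts = line.split('"')
        # Assumption: the request is the first double-quoted field,
        # as in the sample lines from testfile.txt.
        if len(parts) < 3:
            continue  # blank or malformed line: skip it
        request = parts[1].split()
        if request:
            yield request[0]


def count_methods(lines):
    """Count methods directly from the generator, with no intermediate list."""
    return Counter(request_methods(lines))


def ascending(counts):
    """Return (method, count) pairs sorted from least to most frequent."""
    return sorted(counts.items(), key=lambda item: item[1])


def print_report(path):
    """Read the log at `path` and print the counts in ascending order."""
    with open(path) as f:
        counts = count_methods(f)
    for method, count in ascending(counts):
        print('%s: %10d' % (method, count))
```

Calling print_report('apache.log') would print the least frequent method first, which is the ascending order asked for in the question; most_common() gives the opposite order. This is unlikely to be dramatically faster than the versions above, since reading and splitting every line of the file probably accounts for most of the run time, but it does avoid holding one list entry per request in memory.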