Parse a Web Server Access Log

  1. The problem statement, all variables and given/known data:

Write a parser for a web server access log that will provide the statistics outlined below. Remember to format your output in a neat form. You may complete this assignment with one Awk script or a shell script using a combination of Awk scripts.

Obtain the file located at http://users.csc.tntech.edu/~elbrown/access_log.bz2. For full credit, you must not save this data file to disk. You must process the file by reading directly from the URL above using bash commands.

Please submit this problem's script(s) and output combined as a separate zip file. (15 points)

Your script should address each of the following items:

  1. List the top 10 web sites from which requests came (non-404 status, external addresses looking in).

  2. List the top 10 local web pages requested (non-404 status).

  3. List the top 10 web browsers used to access the site. It is not necessary to get fancy and parse out all of the browser string. Simply print out the information that is there. Display the percentage of all browser types that each line represents.

  4. List the number of 404 errors that were reported in the log.

  5. List the number of 500 errors that were reported in the log.

  6. Add any other important information that you deem appropriate.
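Each of the "top 10" items boils down to the same awk pattern: count one field in an associative array, then sort the counts. A minimal sketch with made-up sample lines, assuming the common NCSA layout (client address in field 1, status code second-to-last); the real log's fields may differ:

```shell
# Count requests per client address (skipping 404s), print top 10 with percentages.
# The sample lines and field positions are assumptions; adjust to the actual log.
printf '%s\n' \
  '10.0.0.1 - - [date] "GET /index.html HTTP/1.0" 200 512' \
  '10.0.0.1 - - [date] "GET /about.html HTTP/1.0" 200 256' \
  '10.0.0.2 - - [date] "GET /index.html HTTP/1.0" 404 0' |
awk '$(NF-1) != 404 { count[$1]++; total++ }
     END { for (host in count)
               printf "%7d  %6.2f%%  %s\n", count[host], 100*count[host]/total, host }' |
sort -rn | head -10
```

Swapping `$1` for the request path or the user-agent field gives items 2 and 3 with the same skeleton.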

  2. Relevant commands, code, scripts, algorithms:

Awk will be used.

  3. The attempts at a solution (include all code and scripts):

I don't have a problem at all with parts 1-6. I understand how to use awk. The problem I'm having is how to parse a .bz2 file without downloading and decompressing it to disk. I don't even have an idea how to begin accessing the file without decompressing it first.

  4. Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):

Tennessee Technological University, Cookeville, TN, USA, Eric Brown, CSC 2500 Unix Laboratory

Note: Without school/professor/course information, you will be banned if you post here! You must complete the entire template (not just parts of it).

look into 'man wget'

I tried man wget and it said no manual entry for wget. Just running wget said command not found.

I'm using bash on Mac OS 10.6.2 with the latest version of the Apple Developer Tools installed.

I'm not really familiar with what Mac OS provides, but what you need is a utility (wget, curl, lftp, lynx, etc.) that can download a file via HTTP.
Maybe others will have better ideas.

You use a script/program with a regex parser.

Why are you using awk?

If I had to do this, I would use PHP or Perl.

The assignment says we have to use awk. I don't know anything about PHP or Perl.

To download a file without saving it, you've got two options:

  1. Use a utility like wget or curl. If you don't have it, install it.
  2. Use the network ability of bash itself.

Once you've got that, just pipe it into bzip2 with the appropriate switches to decompress to the console.
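Putting that together, assuming curl (which ships with Mac OS X) and the stock bzip2 tools, the whole pipeline runs without touching the disk; `stats.awk` here is just a placeholder name for whatever script handles items 1-6:

```shell
# curl -s fetches the file to stdout quietly; bunzip2 -c decompresses
# its stdin to stdout, so nothing is ever written to disk.
curl -s http://users.csc.tntech.edu/~elbrown/access_log.bz2 |
bunzip2 -c |
awk -f stats.awk
```

`bzcat` or `bzip2 -dc` work the same as `bunzip2 -c` in the middle stage.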