Deleting duplicate entries in a log file

Hi Folks,

I have an Apache log file that has duplicate entries (however, not all lines appear twice).

How can I automatically delete the first line of each duplicate pair?

Your help is greatly appreciated.

Thanks,

Klaus

Here is what the log file looks like:

217.81.190.164 - - [28/Aug/2002:00:16:33 +0200] "GET /rmg/w4w/1000689.htm HTTP/1.1" 200 2409
217.81.190.164 - - [28/Aug/2002:00:16:33 +0200] "GET /rmg/w4w/1000689.htm HTTP/1.1" 200 2409 "http://www.opusforum.org/rmg/w4w/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
217.81.190.164 - - [28/Aug/2002:00:17:01 +0200] "GET /rmg/vec/ HTTP/1.1" 200 2631
217.81.190.164 - - [28/Aug/2002:00:17:01 +0200] "GET /rmg/vec/ HTTP/1.1" 200 2631 "http://www.opusforum.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
217.81.190.164 - - [28/Aug/2002:00:17:03 +0200] "GET /rmg/vec/1000868.htm HTTP/1.1" 200 2386
217.81.190.164 - - [28/Aug/2002:00:17:03 +0200] "GET /rmg/vec/1000868.htm HTTP/1.1" 200 2386 "http://www.opusforum.org/rmg/vec/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
213.23.52.237 - - [28/Aug/2002:00:17:10 +0200] "GET / HTTP/1.0" 200 16327

How about:
uniq <inputfile >outputfile

In this situation, I like to use a Perl hash for doing the dirty work for me.

Something like this:


#!/usr/bin/perl

# Print each line of the log only the first time it is seen.
open(LOG, "myLogFile") || die "$!";

my %logHash;

while (my $inputLine = <LOG>) {
  # The hash remembers every line already printed, so repeats are skipped.
  if (!exists($logHash{$inputLine})) {
    $logHash{$inputLine} = 1;
    print $inputLine;
  }
}

close(LOG);

That should remove the dupe entries. Just redirect the output to a new log.
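
For example (dedupe.pl is just an assumed name for the script saved above; it reads the hard-coded "myLogFile" and prints the survivors to stdout):

perl dedupe.pl > myLogFile.clean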

Oh sure, do it the eeaaasssy way! :slight_smile:

Hi Folks,

thanks a lot for your suggestions. Unfortunately, neither suggestion works.

The "uniq" solution needs a "-w 50" in order to come up with the double entry. However, it gives me the first line but I need the second (the line with add. information).

The Perl script doesn't give me the result either, because it compares whole lines, and the lines are not really "exact" duplicates (only the first 50 characters or so match).

Any refinements so that the solution works? I am sure we are close :wink:

Thanks

Klaus

I got it working :slight_smile:

Here is what worked for me:

perl -e 'print reverse <>' logfile | uniq -w 50 | perl -e 'print reverse <>' > logfile.done

So first, the logfile is reversed (line by line), then the dupes are removed, and finally the line order is reversed back again.

The reversal is needed so that the first line of each duplicate pair is the one removed: uniq keeps the first line of a run of duplicates, and after reversing, that is the second (longer) line of the pair.
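
For anyone who prefers a single Perl pass instead of the reverse/uniq/reverse pipeline, here is a rough sketch keyed on the first 50 characters (the same width assumption as the -w 50 above). Note that, unlike uniq, it collapses matching lines even when they are not adjacent, and it keeps the later (longer) line of each pair; the script name below is just for illustration:

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: remember the last line seen for each 50-character prefix,
# then print the survivors in their original order.
my (%lineByKey, @keyOrder);
while (my $line = <>) {
  my $key = substr($line, 0, 50);    # same width as uniq -w 50
  push @keyOrder, $key unless exists $lineByKey{$key};
  $lineByKey{$key} = $line;          # the later (longer) entry wins
}
print $lineByKey{$_} for @keyOrder;

Run as: perl dedupe50.pl logfile > logfile.done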

Thanks for your contributions, folks. They pointed me in the right direction.

Klaus :slight_smile:

uniq -c <file1 >file2 would give you the number of duplicate entries, by writing each unique line to the second file prefixed with its count of consecutive occurrences.
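
Against a log like the one above it would still need the width option (the lines only match on their first 50 or so characters), e.g. with GNU uniq:

uniq -c -w 50 <file1 >file2

Each line written to file2 then starts with a count of how many consecutive input lines shared that 50-character prefix.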

Regards,
uniesh

Try using tail -r before the uniq command...
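
i.e., on systems whose tail supports -r (a BSD option; GNU systems provide tac instead), the pipeline could be shortened to something like:

tail -r logfile | uniq -w 50 | tail -r > logfile.done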