Parsing a large log

I need to parse a large log, say 300-400 MB.
Commands like awk and cat are taking a long time.
Please advise how to process it.
I need to process the log for certain values of the current date,
but I am unable to do so.

What about good old grep?

It is also not working

I personally doubt that grep would be more efficient than awk for large files. Please post sample input and output files.

And show us the regex you are using. Simple greps would take a few seconds max unless your disks are very slow. It's reading the file linearly so you can't get much faster performance than that. If cat is too slow, there really isn't much hope in making it fast enough, other than replacing the disk, or managing the file in a different way (split into smaller chunks? Import into a DBMS?)
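If, for instance, each line starts with a date field, a one-pass split into per-day chunks might look roughly like this. It is only a sketch, not tested against your data, and the log.* output names are placeholders:

awk '{ d = $1; gsub("/", "-", d); print > ("log." d) }' logfile   # one output file per distinct date

After that you only need to search the chunk(s) for the dates you care about.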

The log is of the form:

05/29/08 01:56:53 nsrexecd: select() error: Invalid argument
05/29/08 01:56:53 nsrexecd: select() error: Invalid argument
05/29/08 01:56:53 nsrexecd: select() error: Invalid argument

I need to take the log from 13:00 of the previous day to 13:00 of the current date.
Please help me get it using tail... none of grep, cat, awk etc. are working. :frowning:

egrep '^05/(29/08 (1[3-9]|2[0-3])|30/08 (0|1[0-2]))' logfile

For automation, the regular expression can be generated from date or by a simple Perl script, as sketched below. It would be much easier if you could simply select by date, though.
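A rough sketch of that idea, assuming GNU date for the --date=yesterday option and the MM/DD/YY timestamps shown above:

yday=$(date --date=yesterday +'%D')      # e.g. 05/29/08
today=$(date +'%D')                      # e.g. 05/30/08
egrep "^($yday (1[3-9]|2[0-3])|$today (0|1[0-2]))" logfile

The generated pattern covers the same 13:00-to-13:00 window as the hand-written one above, just built from the current date.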

The fastest solution would be to write a program in a compiled language like C.

With awk you can do something like this:

awk -v from="$(date --date=yesterday +'%D')" \
    -v   to="$(date +'%D')"  '
$1 == from {                  # yesterday: keep lines from 13:00 onwards
   if (int($2) >= 13)
      print;
   next;
}
$1 == to  {                   # today: keep lines before 13:00, then stop reading
   if (int($2) < 13) {
      print;
      next;
   } else
      exit;
}
' inputfile

Input file:

05/29/08 01:56:53 nsrexecd: select() error: Invalid argument
05/29/08 01:56:53 nsrexecd: select() error: Invalid argument
05/29/08 01:56:53 nsrexecd: select() error: Invalid argument
05/29/08 12:59:50 not selected
05/29/08 13:00:00 selected 1
05/29/08 23:59:59 selected 2
05/30/08 00:00:01 selected 3
05/30/08 12:59:59 selected 4
05/30/08 13:00:00 not selected
06/01/08 00:00:01 not selected

Output (current date is 05/30/08):

05/29/08 13:00:00 selected 1
05/29/08 23:59:59 selected 2
05/30/08 00:00:01 selected 3
05/30/08 12:59:59 selected 4

Jean-Pierre.

Thanks a lot.
But my problem is that my log is large, 300-400 MB.
I am unable to use awk, sed, grep, etc.
I need a solution in Perl or shell for parsing the log for the current date (24 hours)
and then searching for the string.

None of the tools you mentioned are sensitive to the file size. Other things being equal, they read the file one line at a time and print that line if certain conditions are met. (Of course you can write an awk or sed script which consumes memory for every line; but for this case, I don't think you need to.)

Please help :frowning:
But my problem is that my log is large, 300-400 MB.
I am unable to use awk, sed, grep, etc.
I need a solution in Perl or shell for parsing the log for the current date (24 hours)
and then searching for the string.

I'm sorry, no offense, but I cannot type this any slower than this: grep and sed and awk do not care what size the file is. They only read it one line at a time, just like cat.

Perl is unlikely to be any faster than grep. Here is a Perl script anyway.

perl -ne 'print if m{^05/(29/08 (1[3-9]|2[0-3])|30/08 (0|1[0-2]))}' file

Notice the similarity to the egrep solution I posted before. This one is probably going to be slower, and in any event will not be much faster.

Please answer the following questions:

  • What have you tried?
  • Have you tried the solutions various people have posted to this thread?
  • How long did it take to complete?
  • How long would you like it to take?
  • How quickly can you simply cat the file?
  • If you extract just one day's worth from the file, how long does that take to cat?

What have you tried? -- I have tried "cat file | /bin/awk '$1 ~ /^$date/'"
Have you tried the solutions various people have posted to this thread? -- Yes, but as I mentioned, even a simple cat is timing out.
How long did it take to complete? -- More than 5 minutes; I quit before it completed.
How long would you like it to take? -- A normal time, like it takes for cat or grep.
How quickly can you simply cat the file? -- I am unable to cat the file; it is not opening at all.
If you extract just one day's worth from the file, how long does that take to cat? -- I am unable to extract with awk or grep; I am only able to use the tail and head commands.
As for the earlier suggestion to use chunks of the file: I am unable to come up with the logic to chunk it and find the last 24 hours of the log.

The cat is useless; simply run awk '$1 ~ /^06\/01\//' file

You have not mentioned this very explicitly. I think there may be an unrelated problem here.

So if you, say, run tail -n 10000 file | grep '^06/01/', do you get roughly what you want? How long does it take? Too long still?

Another thing: do you have very limited memory and/or hard disk space? Grepping a file that size should not be a problem on even relatively modest hardware.

vnix$ dd if=/dev/urandom of=/tmp/randomfile bs=65536 count=65536
^C  # interrupted when I got bored
23196+2 records in
23196+1 records out
1520214016 bytes (1.5 GB) copied, 317.015 s, 4.8 MB/s

vnix$ time grep '^06/01/' /tmp/randomfile 

real    0m51.461s
user    0m1.940s
sys     0m1.768s

This is a basic PATA disk which should be easy to beat if you have SCSI or SATA.

:smiley:

@asth
If it times out, you have another problem. Does it time out with an error, just come back to the prompt, or do you have to hit Ctrl+C after some minutes of boredom?
I doubt the tools you are using are the problem. It looks like you have a performance issue: something is blocking resources (disks, CPU, memory, whatever).

300-400 MB of log is nothing if you simply parse it without nested loops and the like, which none of the posted examples use; they are not complex.

If I use the tail command, it doesn't take much time and I get the required output.
I have a space issue; that's why it is taking so long.
But I have to work on these sites as they are, so I need to find a way to divide the file into chunks and then find the chunk covering, say, the last 24 hours.

Thanks

If you cannot read the file from the beginning, there is really no way to know for sure. But based on experimentation you can probably find a value for tail which is likely to cover more than the last 24 hours by a good margin, along the lines of the sketch below.
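As a sketch only: the line count is a guess you would tune to your log's volume, GNU date is assumed for --date=yesterday, and the egrep only narrows by date (the awk posted earlier handles the 13:00 cut-off).

tail -n 2000000 logfile > /tmp/chunk     # pull a generous chunk off the end; tune the count
head -1 /tmp/chunk                       # check the first line is older than yesterday 13:00
egrep "^($(date --date=yesterday +'%D')|$(date +'%D'))" /tmp/chunk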

Sounds like you ought to be running some sort of rotation script in your nightly cron job to force the log file into smaller chunks. Which platform are you on? Does the application which generates this log support log rotation?
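If it cannot, a minimal copy-truncate rotation run from cron around 13:00 might look something like the sketch below. The log path is a placeholder, it assumes the daemon keeps its file descriptor open across the truncation, and it has the usual copy-truncate caveat that lines written between the two commands can be lost.

# crontab entry, e.g.:  0 13 * * * /path/to/rotate.sh
LOG=/var/log/daemon.log                      # placeholder; use your actual log path
cp "$LOG" "$LOG.$(date +'%Y%m%d')"           # keep the last 24-hour window as a dated copy
: > "$LOG"                                   # truncate in place so the writer keeps logging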