Faster way to use this awk command

awk "/May 23, 2012 /,0" /var/tmp/datafile

The above command pulls information out of the datafile, everything from the date specified to the end of the file.

Now, how can I make this faster if the datafile is huge? Even if it weren't huge, I feel there's a better/faster way to get what I want.

Short command lines don't necessarily equal speed.

Your code does what all code has to do: read each line. The regular expression search is extra overhead on top of that, and the only speedup possible is to turn it off after the first match; every line still has to be read regardless. See what you can do with that logic: skip the regexp after the first find and just print.
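A minimal sketch of that idea, using a flag variable so the regexp is only tested until it first matches (the flag name is arbitrary):

awk '!found && /May 23, 2012 /{found=1} found' /var/tmp/datafile

Once found is set, the short-circuiting && means the regexp is never evaluated again; every remaining line just prints.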

I need to grep for certain strings from the point specified to the end of the file, and I need to know the number of lines containing those strings.

That's why I'm concerned about speed.

What's your system? What's your shell?

If you're on Linux I'd be tempted to use ( grep -m 1 "myregex" ; cat ) < inputfile > outputfile, using GNU grep just to get to the right place in the file and cat-ing the rest, which should be about as fast as anything can get.
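This works because both commands read from the same open file descriptor, and GNU grep documents that, for seekable input, -m leaves the input positioned just after the last matching line when it exits. Combined with the line-counting requirement above (ERROR is just a placeholder pattern), it might look like:

( grep -m 1 'May 23, 2012 ' > /dev/null ; grep -c 'ERROR' ) < /var/tmp/datafile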

If that's not fast enough, the bottleneck may not be your program.

2 Likes

Shell is bash, OS is Linux and SunOS.

Your command appears to do the trick, thank you so much. I think awk is one of the top 5 best languages out there; just sad it couldn't do what you just did with this grep command.

awk 'index($0,"May 23, 2012 "),0' would do a fixed string search without using regex.

If you need to search for more terms *after* that, GNU grep is rather fast I hear, and -F might speed it up (fixed-string search). Or try using awk for it too. Use time and see how the different variants weigh in (there's a timing sketch after the examples below).

awk 'index($0,"May 23, 2012 "),0{if (index($0,"ERROR")) {c++;print}} END {print "Total errors after date: " c}' file
awk '!start && index($0,"May 23, 2012 ") {start=1} !start {next}
index($0,"ERROR"){c++;print} END {print "Total errors after date: " c}' file

Internally, awk probably implements ranges about the same as that, though.
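A quick way to weigh the variants against each other with time (the bash keyword), discarding output so the terminal doesn't dominate the measurement:

time awk '/May 23, 2012 /,0' /var/tmp/datafile > /dev/null
time awk 'index($0,"May 23, 2012 "),0' /var/tmp/datafile > /dev/null
time ( grep -m 1 'May 23, 2012 ' ; cat ) < /var/tmp/datafile > /dev/null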

1 Like

Not under SunOS, it won't, unless you install GNU grep. -m is a GNU extension.

As you wish to go from a line that contains a date string to the end of the file, it's probably safe to assume the lines in your file are in date order.

Knowing this, it should be possible to write a program that uses a binary chop to seek to the starting line and then process from there. If the file is large, this solution will be orders of magnitude faster than a sequential search.

This perl example seems to be pretty close to what I mean. The downside is that seeking into files in this manner is pretty low level, and I can't really think of any elegant solution using unix scripting, so it will most likely require a proper programming language like perl, python or C. You also have the added complexity of needing to compare date strings instead of straight text.
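That said, a rough (admittedly inelegant) shell sketch of the idea is possible. It assumes GNU dd and date, lines in date order that start with a date like "May 23, 2012", and lines shorter than 4096 bytes; all names are illustrative:

#!/bin/bash
# Binary chop over byte offsets; narrows to a 4 KB window, then finishes
# the last stretch sequentially.  GNU date/dd assumed.
file=/var/tmp/datafile
target=$(date -d 'May 23 2012' +%s)

lo=0
hi=$(wc -c < "$file")
while [ $(( hi - lo )) -gt 4096 ]; do
    mid=$(( (lo + hi) / 2 ))
    # first complete line after the midpoint (line 1 of the chunk is partial)
    line=$(dd if="$file" bs=1 skip="$mid" count=4096 2>/dev/null | sed -n 2p)
    ts=$(date -d "$(echo "$line" | cut -d' ' -f1-3 | tr -d ',')" +%s 2>/dev/null)
    if [ -n "$ts" ] && [ "$ts" -lt "$target" ]; then
        lo=$mid    # probe is before the date: discard the bottom half
    else
        hi=$mid    # probe is on or after the date: discard the top half
    fi
done

# lo is now within 4096 bytes of the first matching line
tail -c +$(( lo + 1 )) "$file" | awk '/May 23, 2012 /,0'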

---------- Post updated at 01:47 PM ---------- Previous update was at 01:24 PM ----------

Another thought, if the file only has new data appended on the end: keep another text file with each date and the line number it starts on:

Jan-01-2001 1
Jan-02-2001 7311
Jan-03-2001 15779
...
May-25-2012 574983989

You can then read the file in and start processing from the line number you require:

LINE=$(grep "May-16-2012" indexfile.txt | awk '{print $2}')
sed -n "$LINE"',$p' bigfile.log    # | <your processing here>
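Building (or rebuilding) the index itself is a single pass; a sketch, assuming the date is the first whitespace-delimited field of every log line:

awk '$1 != prev { print $1, NR; prev = $1 }' bigfile.log > indexfile.txt

Appending just the new dates on a schedule works the same way, picking up from the last recorded line number.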
1 Like

Another alternative:

awk '!p{if(/May 23, 2012 /)p=1}p' infile

Further to Corona688's approach, this should work cross-platform:

{ sed '/May 23, 2012 /!d;q' ; cat ;} < infile

On Solaris you would probably need to use /usr/xpg4/bin/sed, so set your PATH variable in your script.

Yet another way to speed things up might be to use mawk, which is a faster awk in most cases.

1 Like