Help with faster grepping

Hi,

I have a log file of 8 GB in size, and I need to count the occurrences of a word that I give as input, i.e. to find how many times that word occurs in the file.

grep is taking too much time. Can you please give me a command so that I can grep for the word more quickly?

thanks,

senthil

What do you have right now?
grep is usually very fast, provided you don't slow it down with (too many) .* constructs.
Word boundaries (-w or \< \>) are also slow in some grep implementations.
Further, grep is slow if the lines are very long, e.g. if your file is not a text file.
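
If you want to see where the time goes, you can time a few variants of the search. This is only a sketch; "logfile" and "word" are placeholders for your actual file and search term:

time grep -c 'word' logfile           # plain substring match
time grep -c 'word.*other' logfile    # .* makes the regex engine do far more work
time grep -cw 'word' logfile          # -w adds word-boundary checks, slow in some greps

The "user" time is the CPU spent matching; if it is small compared to "real", the run is dominated by reading the file, not by the pattern.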

hi

thanks for the response.

I am using this command: grep "string" filename | wc -l

The string will be 10 to 12 words long.

Is there any alternative command?

Could you please give an example of "string"?
What do the following commands give?:

file "logfile"
uname
type grep

Hi,

file "logfile"
App.log: ASCII English text, with very long lines

uname
Linux

type grep
grep is /bin/grep


One example of the string is below:

"'FLOW - Unable to read the message from MQ within Time Out Period "

Does the word FLOW start at the very beginning of the line?

If true, and FLOW occurs rarely, then

grep '^FLOW' filename | grep 'FLOW - Unable to read the message from MQ within Time Out Period'

is faster, because the first grep only has to look at the first 4 characters of each line.

grep -c "string" App.log

is faster than

grep "string" App.log | wc -l

hi,

thanks for the help :slight_smile: I will let you know if I need any further help.

thanks,
senthil

If your string is not a regex but a fixed string, try grep -F, which switches off expensive regex matching.
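
For example, combined with the -c suggestion above (using the message text quoted earlier in the thread):

grep -cF 'FLOW - Unable to read the message from MQ within Time Out Period' App.log

-F treats the pattern as a literal string rather than a regular expression, and -c prints only the count.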

No more suggestions.
E.g. grep '^FLOW' isn't noticeably faster - a search takes about the same time as finding the next line.
And an RE search for a simple "string" has zero overhead compared to a plain search.
An 8GB file should take about 2 minutes - this is fast.
Everything else - sed, awk, perl - is slower.
That amount of log data should be written to a database; at the very least, a text log file should be rotated more often!
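
As a rough illustration of the rotation point, a manual rotation might look like the sketch below. The path is hypothetical, and in practice you would use a tool such as logrotate or make the application reopen its log:

stamp=$(date +%Y%m%d)
mv /var/log/App.log /var/log/App.log.$stamp   # move the current log aside
: > /var/log/App.log                          # start a new, empty log (the app must reopen it or be restarted)
gzip /var/log/App.log.$stamp                  # compress the rotated copy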

I disagree.
On a Solaris 10 M4000 under ksh, ^ is faster. The files are 50MB, ~510000 lines, +/- 40 lines between them:

contents of t.shl:

cd $BANNER_HOME/logs
time grep -c '^2013-05-22 11:04' uzpplpl_ng02cprd.log.54
time grep -c '2013-05-22 11:04'  uzpplpl_ng02cprd.log.56

Results (I ran it twice to show the effect of filesystem and disk controller caching):

$> ./t.shl
791

real    0m1.86s
user    0m0.23s
sys     0m0.29s
774

real    0m2.05s
user    0m1.07s
sys     0m0.36s

appworx> ./t.shl
791

real    0m0.39s
user    0m0.20s
sys     0m0.18s
774

real    0m1.15s
user    0m0.93s
sys     0m0.21s

Note the user mode times. I know the OP is on a different box, so this may not be a fair comparison. However, scale the user times by a factor of 8GB/50MB ~ 160 (about 20 * 8), so:

 0.93 * 160 = ~148
 0.20 * 160 =  ~32
 difference   ~116

Two points:

If you ran a timing comparison of '^FLOW' vs 'FLOW' (in that order) on the same file, your results were confounded by caching.

The user time is independent of caching and reflects the work the regex does.
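
One way to reduce the caching effect is to read the file once beforehand and then time both patterns against the same file, comparing the user times (reusing one of the file names from the test above):

cat uzpplpl_ng02cprd.log.54 > /dev/null                    # read once to warm the filesystem cache
time grep -c '^2013-05-22 11:04' uzpplpl_ng02cprd.log.54   # anchored pattern
time grep -c '2013-05-22 11:04'  uzpplpl_ng02cprd.log.54   # unanchored pattern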

Henry Spencer wrote a white paper on this kind of thing; I cannot find it, so I cannot cite it.

YMMV.

No offense intended, but all of those unqualified statements are worthless. I have done some work with NFA (cached to DFA) regular expression engines, and the nature and quality of implementations vary massively.

While jim's implementation performs better with an anchor, GNU grep 2.5.1 does much worse; it takes more than twice as long. (The tests were repeated multiple times in differing order on obsolete hardware, and there was never a discrepancy.)

$ yes 'FLOWWWWWWWW' | head -n1000000 | time -p grep -c 'FLOW'
1000000
real    1.84
user    0.71
sys     0.07
$ yes 'FLOWWWWWWWW' | head -n1000000 | time -p grep -c '^FLOW'
1000000
real    4.83
user    3.70
sys     0.10

As an aside, some implementations will silently optimize depending on the contents of the pattern. A BSD example, from OpenBSD's grep.c:

	for (i = 0; i < patterns; ++i) {
		/* Check if cheating is allowed (always is for fgrep). */
#ifndef SMALL
		if (Fflag) {
			fgrepcomp(&fg_pattern, pattern);
		} else
#endif
		{
			if (fastcomp(&fg_pattern, pattern)) {
				/* Fall back to full regex library */
				c = regcomp(&r_pattern, pattern, cflags);

My point: regular expression performance is highly implementation-dependent, and unqualified statements are seldom valid.

Regards,
Alister

No offense, but on an 8 GB file, optimizing the pattern may be pointless; disk speed may be the limiting factor...
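
One quick check is to compare the time needed just to read the file with the time needed to grep it:

time cat App.log > /dev/null    # raw read speed
time grep -c 'FLOW' App.log     # read plus pattern matching

If the two are close, the disk is the limiting factor and tuning the pattern will not buy much.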

I thought GNU grep was the crème de la crème of speed.

On my system (GNU grep 2.6.3) it behaves better with an anchor than without one.

The original author (who no longer maintains it) has thoroughly defended it. I'm not sure whether, after all these years, it has been beaten by anything else.


I'm not sure how much stock you can put in a 2-year-old message written by someone who states that it has been 15+ years since they maintained the code (so 17+ at this time), and who points out that the implementation has dropped what was once a default optimization (mmap). Aside from changes in the implementation with which they are (or were?) familiar, there may have been changes to other implementations in the past 2 years.

GNU grep may very well be the best performing grep. I did not compare grep implementations and I wasn't suggesting that any implementation is faster than another. My point was only that implementations can differ quite a bit (even between different versions of the same implementation).

That said, GNU Anything is usually known for being quite slow (sometimes very slow). To be fair, one of GNU's primary goals (politics aside) is to support many platforms; while a BSD implementation is only concerned with its native platform. This leads to many more instances of conditional inclusion and abstractions in GNU implementations. Further, GNU projects tend to be much more liberal with feature extensions, which further complicates their code. On the flipside, the performance of some BSD implementations benefits from their lack of support for multibyte characters and locales, which some may consider a drawback.

A representative example:
GNU head (about a thousand lines)
OpenBSD head (about a hundred)

Also, proactively, I'd like to highlight the distinction between a BSD tool on a BSD kernel and its GNU counterpart on a Linux kernel. Such a comparison is apples and oranges. Both utilities would need to be run on the same kernel to achieve a meaningful comparison.

Please don't misconstrue this post as a salvo in a BSD-GNU war. It is no such thing. It is simply my view on the relative performance and complexity of BSD and GNU tools. Nothing more. Whichever you prefer, it is of no importance to me.

Regards,
Alister

The GNU utilities can be markedly slower than alternatives when working in the UTF-8 character set... This is because they actually support it, which is unusual in itself.

Forcing the locale to C can avoid this problem.
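
For example, combined with the earlier suggestions (the file name and string are the ones from this thread):

LC_ALL=C grep -cF 'FLOW - Unable to read the message from MQ within Time Out Period' App.log

Setting LC_ALL=C only for this command lets grep work on bytes instead of decoding multibyte characters.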


You are right:

But on a newer system it seems improved: