I have an 8 GB log file, and I need to count the occurrences of a word that I give as input, i.e. to find how many times a word occurs in that file.
grep is taking too much time. Can you please give me a command so that I can search for the word more quickly?
What do you have right now?
grep is usually very fast, if you don't slow it down with (too many) .* things.
Also, word boundaries (-w or \< \>) are slow in some grep implementations.
Further, grep is slow if the lines are very long, e.g. if your file is not a text file.
No more suggestions.
E.g. grep '^FLOW' isn't noticeably faster - a search takes about the same time as finding the next line.
And an RE search of a simple "string" has zero overhead compared to plain search.
An 8GB file should take about 2 minutes - this is fast.
Everything else - sed, awk, perl - is slower.
Such an amount of log data should be written to a DB; at the very least, a text log file that size should be rotated more often!
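For the original question, here is a minimal sketch of a fast count. The word "foo" and the file name sample.log are hypothetical stand-ins; LC_ALL=C avoids locale overhead, and -F treats the word as a fixed string so the regex engine is skipped entirely.

```shell
# Hypothetical word ("foo") and file (sample.log); substitute your own.
printf 'foo bar foo\nbaz foo\n' > sample.log

# Count every occurrence: -o prints each match on its own line,
# wc -l counts them. Gives 3 for this sample.
LC_ALL=C grep -oF 'foo' sample.log | wc -l

# Count matching *lines* only, which is faster still. Gives 2 here.
LC_ALL=C grep -cF 'foo' sample.log

rm sample.log
```

Note the difference between occurrences and matching lines: a line containing the word twice counts once with -c but twice with -o.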
I disagree -
Solaris 10 M4000, ksh; ^ is faster here. Files are 50 MB, ~510000 lines, +/- 40 lines between them:
contents of t.shl:
cd $BANNER_HOME/logs
time grep -c '^2013-05-22 11:04' uzpplpl_ng02cprd.log.54
time grep -c '2013-05-22 11:04' uzpplpl_ng02cprd.log.56
Results ( I ran it twice to show the effect of filesystem and disk controller caching):
$> ./t.shl
791
real 0m1.86s
user 0m0.23s
sys 0m0.29s
774
real 0m2.05s
user 0m1.07s
sys 0m0.36s
appworx> ./t.shl
791
real 0m0.39s
user 0m0.20s
sys 0m0.18s
774
real 0m1.15s
user 0m0.93s
sys 0m0.21s
Note the user-mode times. I know the OP is on a different box, so this may not be a fair comparison. However, scale the user times by a factor of 8 GB / 50 MB
~(20*8) gives 160
so:
.93 * 160 = ~149
.20 * 160 = ~32
diff ~117
two points:
If you ran a times comparison of '^FLOW' vs 'FLOW' (in that order) on the same file your results were confounded by caching.
The user time is independent of caching and reflective of the work a regex does.
Henry Spencer wrote a white paper on this kind of thing; I cannot find it, so I cannot cite it.
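The caching point above can be sketched as follows. The file name and line count are arbitrary; the idea is to read the file once up front so every timed run hits a warm cache, then repeat with the pattern order swapped and check that the ranking holds.

```shell
# Build a synthetic test file (name and size are arbitrary choices),
# then prime the page cache before timing anything.
yes 'FLOWWWWWWWW' | head -n 100000 > /tmp/flow.log
cat /tmp/flow.log > /dev/null           # prime the page cache

time grep -c '^FLOW' /tmp/flow.log
time grep -c 'FLOW'  /tmp/flow.log
# ...then run again with the order swapped. Compare the *user* times:
# they measure the regex engine's CPU work and are largely
# independent of disk and cache effects.
rm /tmp/flow.log
```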
No offense intended, but all of those unqualified statements are worthless. I have done some work with NFA (cached to DFA) regular expression engines, and the nature and quality of implementations varies massively.
While jim's implementation performs better with an anchor, GNU grep 2.5.1 does much worse: it takes more than twice as long. (The tests were repeated multiple times in differing order on obsolete hardware, and there was never a discrepancy.)
$ yes 'FLOWWWWWWWW' | head -n1000000 | time -p grep -c 'FLOW'
1000000
real 1.84
user 0.71
sys 0.07
$ yes 'FLOWWWWWWWW' | head -n1000000 | time -p grep -c '^FLOW'
1000000
real 4.83
user 3.70
sys 0.10
As an aside, some implementations will silently optimize depending on the contents of the pattern. A BSD example from OpenBSD's grep.c:
for (i = 0; i < patterns; ++i) {
        /* Check if cheating is allowed (always is for fgrep). */
#ifndef SMALL
        if (Fflag) {
                fgrepcomp(&fg_pattern, pattern);
        } else
#endif
        {
                if (fastcomp(&fg_pattern, pattern)) {
                        /* Fall back to full regex library */
                        c = regcomp(&r_pattern, pattern, cflags);
My point: regular-expression performance is highly implementation-dependent, and unqualified statements are seldom valid.
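Given that variance, the only reliable answer is to probe your own installation. A minimal sketch (file name and size are arbitrary): time an unanchored pattern, an anchored one, and a fixed-string search against the same synthetic file, and see which your version favors.

```shell
# Identify the implementation first (--version is a GNU-style flag;
# some traditional greps, e.g. Solaris, may not support it).
grep --version | head -n 1

f=/tmp/probe.log
yes 'FLOWWWWWWWW' | head -n 1000000 > "$f"

time grep -c  'FLOW'  "$f"              # regex path, unanchored
time grep -c  '^FLOW' "$f"              # regex path, anchored
time grep -cF 'FLOW'  "$f"              # fixed-string path, no regex engine
rm "$f"
```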
I thought GNU grep was the crème de la crème of speed.
In my system (GNU grep 2.6.3) it behaves better using an anchor than without it.
The original author (who does not maintain it any longer) has thoroughly defended it. Not sure if after all these years it has been beaten by something else.
I'm not sure how much stock you can put in a two-year-old message written by someone who states that it's been 15+ years since they maintained the code (so 17+ by now), and who points out that the implementation has dropped what was once a default optimization (mmap). Aside from changes in the implementation with which they are (or were) familiar, there may have been changes to other implementations in the past two years.
GNU grep may very well be the best performing grep. I did not compare grep implementations and I wasn't suggesting that any implementation is faster than another. My point was only that implementations can differ quite a bit (even between different versions of the same implementation).
That said, GNU Anything is usually known for being quite slow (sometimes very slow). To be fair, one of GNU's primary goals (politics aside) is to support many platforms, while a BSD implementation is only concerned with its native platform. This leads to many more instances of conditional inclusion and abstraction in GNU implementations. Further, GNU projects tend to be much more liberal with feature extensions, which further complicates their code. On the flip side, the performance of some BSD implementations benefits from their lack of support for multibyte characters and locales, which some may consider a drawback.
A representative example: GNU head is about a thousand lines; OpenBSD head is about a hundred.
Also, proactively, I'd like to highlight the distinction between a BSD tool on a BSD kernel and its GNU counterpart on a Linux kernel. Such a comparison is apples and oranges. Both utilities would need to be run on the same kernel to achieve a meaningful comparison.
Please don't misconstrue this post as a salvo in a BSD-GNU war. It is no such thing. It is simply my view on the relative performance and complexity of BSD and GNU tools. Nothing more. Whichever you prefer, it is of no importance to me.
The GNU utilities can be markedly slower than alternatives when working in the UTF-8 character set... This is because they actually support it, which is unusual in itself.
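The locale cost is easy to see directly: run the same search once under a UTF-8 locale and once under the byte-oriented C locale. The locale name en_US.UTF-8 below is an assumption; use `locale -a` to find one available on your system.

```shell
# Synthetic file; name and size are arbitrary.
f=/tmp/loc.log
yes 'FLOWWWWWWWW' | head -n 1000000 > "$f"

time env LC_ALL=en_US.UTF-8 grep -c 'FLOW' "$f"   # locale-aware matching
time env LC_ALL=C           grep -c 'FLOW' "$f"   # plain byte matching
rm "$f"
```

On GNU grep the C-locale run is often markedly faster for large ASCII-only inputs, which is why LC_ALL=C is a common prefix for big log searches.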