I did some Google searching for what I'm trying to do, but I cannot get my code to work correctly. I have two very large files. I want to read text from file1, search for it in file2, and keep the records that are present in both.
I've tried fgrep -f file1 file2 but it is too slow.
OK. For the first file, with a[$1]++ we just set a flag for each value of the first field, using an array. In fact we can just use a[$1]=1, because all we need to know is that the value occurred (the amount of memory/resources needed is very low). Then we use next to stop processing the current record and go on to the next one.
The rest of the flow is quite simple: if NR!=FNR (i.e. we are in the second file), awk goes directly to the a[$1] expression. If the flag is set (>= 1), the expression evaluates true and awk prints the whole line.
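To illustrate the two-pass flow described above, here is a minimal, self-contained sketch. The sample data and field values are made up for demonstration; they stand in for the OP's real pipe-delimited files:

```shell
# Tiny stand-ins for the OP's large files.
printf 'alpha|1\nbeta|2\n' > file1
printf 'alpha|x\ngamma|y\nbeta|z\n' > file2

# Pass 1 (NR==FNR, i.e. first file): flag each first field, then skip to next record.
# Pass 2 (second file): a[$1] is true only for flagged keys, so those lines print.
awk -F'|' 'NR==FNR { a[$1] = 1; next } a[$1]' file1 file2
# prints:
#   alpha|x
#   beta|z
```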
Yes, fgrep is slower here, but don't forget the influence of I/O buffering when comparing the two. Also, please consider using grep in lieu of fgrep. I did a little test on a somewhat bigger file, redirecting stdout to eliminate its influence and accounting for I/O buffering:
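One way to reduce the I/O-buffering skew mentioned above is to warm the filesystem cache before timing anything, and then to time several runs of each command, comparing only the warm runs. A generic sketch; file1/file2 stand in for the real inputs:

```shell
# Warm the cache so no contender pays the cold-read penalty.
cat file1 file2 > /dev/null

# Time several runs of each candidate; compare the later (warm) runs.
for i in 1 2 3; do time grep    -f file1 file2 > /dev/null; done
for i in 1 2 3; do time grep -F -f file1 file2 > /dev/null; done
```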
$ time grep -f file1 file2 >/dev/null
real 0m0.022s
user 0m0.008s
sys 0m0.012s
$ time fgrep -f file1 file2 >/dev/null
real 0m0.092s
user 0m0.088s
sys 0m0.004s
$ time awk -F"|" '
FNR==NR {f1[$1];next}
($1 in f1)
' file1 file2 >/dev/null
real 0m0.090s
user 0m0.084s
sys 0m0.004s
There's not much point in finishing faster if the results are incorrect. fgrep (and the awk solutions provided) are not equivalent to grep (without -F).
Regards,
Alister
---------- Post updated at 01:04 PM ---------- Previous update was at 12:31 PM ----------
I'm curious. What implementation are you using? The simpler string comparison of fgrep should be faster than any regular expression engine (including Ken Thompson's NFA approach).
I ran a few tests on an old machine -- P2 350 MHz -- using a quarter million line ~2.5 MB text file with text that repeats every 10 lines. When trying to match two patterns (stored in a second file), fgrep was approximately 16 times faster, ~0.2 seconds versus ~3.3 seconds.
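For anyone wanting to reproduce a test of this shape, a quarter-million-line file with text repeating every 10 lines can be generated along these lines. The exact line contents and pattern values are assumptions, not the data actually used above:

```shell
# ~250,000 lines cycling through 10 distinct prefixes (a few MB of text).
awk 'BEGIN { for (i = 0; i < 250000; i++)
                 printf "line-%d some filler text\n", i % 10 }' > file2

# Two patterns to look up, as in the test described above.
printf 'line-3\nline-7\n' > file1
```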
linux with
fgrep (GNU grep) 2.12
grep (GNU grep) 2.12
mawk 1.3.3
What you say is in line with what I expected (fixed-string lookup vs. pattern matching), but I enlarged the test file and found results very similar to what I posted before:
time grep -f file1 file2 >/dev/null
real 0m0.085s
time grep -F -f file1 file2 >/dev/null
real 0m0.226s
time fgrep -f file1 file2 >/dev/null
real 0m0.245s
time awk -F\| 'NR==FNR{a[$1]++;next}a[$1]' file1 file2 >/dev/null
real 0m0.217s
Please explain your statement that the results of fgrep and grep/awk differ or are incorrect, respectively, other than the pipe symbol in file1 ORing patterns.
I observed fgrep to be over 16 times faster than grep with GNU [f]grep 2.5.1 on an ancient Debian install.
Without assurances that the OP's "patterns" won't contain any regular expression metacharacters (e.g. a dot), it's possible for fgrep/awk (using string comparison) and grep (using regular expression pattern matching) to produce different output.
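A tiny demonstration of that divergence, using made-up data:

```shell
printf 'a.c\n' > patterns           # "." is a regex metacharacter
printf 'a.c\nabc\n' > data

grep    -f patterns data   # regex: "." matches any character, so both lines print
grep -F -f patterns data   # fixed string (same as fgrep): only "a.c" prints
```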
By default, the pipe symbol isn't special to grep, so that wasn't my concern. However, since that possibility occurred to you, you were obviously already aware of the pitfalls mentioned in the previous paragraph. You simply didn't think it was worth worrying about in this case. I'm not so lenient (probably because I've seen simplified post data waste time in the past).
Honestly, for me, the slowness of your fgrep is the most interesting aspect of this thread. I seldom use GNU and Linux, so I doubt I'll investigate it myself. But, if you are similarly curious and discover the cause, I'd love to know.
Thanks for explaining your point.
I did a similar test on a FreeBSD system, although in a VM. So just consider the relative times.
fgrep (GNU grep) 2.5.1-FreeBSD
grep (GNU grep) 2.5.1-FreeBSD
$ time grep -f file file2 >/dev/null
real 0m0.222s
$ time fgrep -f file file2 >/dev/null
real 0m0.205s
$ time awk -F"|" ' ...
real 0m0.863s
Here grep (with and without -F) and fgrep play in the same league whilst awk drags behind by a factor of four (which supports my first assumption).
BUT - the requestor posted that awk was way faster than grep... difficult to understand.
Although I'm curious as well, I'm not sure how to delve into that problem. I don't think it's system-related (I/O etc.); more likely it's down to the algorithms used. Being a bit thick when it comes to analysing C source, I'm afraid I'm giving up.