I did some Google searching for what I'm trying to do, but I cannot get my code to work correctly. I have two very large files. I want to read text from file1, search for it in file2, and keep the records that are present in both.
I've tried fgrep -f file1 file2 but it is too slow.
OK. For the first file, with a[$1]++ we just set a flag for each value of the first field, using an array. In fact we can just use a[$1]=1, because all we need to know is that the value occurred (the amount of memory/resources needed is very low). Then we use next to stop processing the current record and go on to the next one.
The rest of the flow is quite simple: if NR!=FNR (i.e. we are in the second file), awk goes directly to the a[$1] expression. If the flag is set (>= 1), the expression evaluates true and awk prints the whole line.
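To illustrate the two-pass flow described above, here is a minimal, self-contained sketch. The sample data and field values are made up for demonstration; they stand in for the OP's real pipe-delimited files:

```shell
# Tiny stand-ins for the OP's large files.
printf 'alpha|1\nbeta|2\n' > file1
printf 'alpha|x\ngamma|y\nbeta|z\n' > file2

# Pass 1 (NR==FNR, i.e. first file): flag each first field, then skip to next record.
# Pass 2 (second file): a[$1] is true only for flagged keys, so those lines print.
awk -F'|' 'NR==FNR { a[$1] = 1; next } a[$1]' file1 file2
# prints:
#   alpha|x
#   beta|z
```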
Yes, fgrep is slower here, but don't forget the influence of I/O buffering when comparing the two. Also, please consider using grep in lieu of fgrep. I did a little test on a somewhat bigger file, redirecting stdout to eliminate its influence and accounting for I/O buffering:
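One way to reduce the I/O-buffering skew mentioned above is to warm the filesystem cache before timing anything, and then to time several runs of each command, comparing only the warm runs. A generic sketch; file1/file2 stand in for the real inputs:

```shell
# Warm the cache so no contender pays the cold-read penalty.
cat file1 file2 > /dev/null

# Time several runs of each candidate; compare the later (warm) runs.
for i in 1 2 3; do time grep    -f file1 file2 > /dev/null; done
for i in 1 2 3; do time grep -F -f file1 file2 > /dev/null; done
```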
$ time grep -f file1 file2 >/dev/null
real 0m0.022s
user 0m0.008s
sys 0m0.012s
$ time fgrep -f file1 file2 >/dev/null
real 0m0.092s
user 0m0.088s
sys 0m0.004s
$ time awk -F"|" '
FNR==NR {f1[$1];next}
($1 in f1)
' file1 file2 >/dev/null
real 0m0.090s
user 0m0.084s
sys 0m0.004s
There's not much point in finishing faster if the results are incorrect. fgrep (and the awk solutions provided) are not equivalent to grep (without -F).
Regards,
Alister
---------- Post updated at 01:04 PM ---------- Previous update was at 12:31 PM ----------
I'm curious. What implementation are you using? The simpler string comparison of fgrep should be faster than any regular expression engine (including Ken Thompson's NFA approach).
I ran a few tests on an old machine -- P2 350 MHz -- using a quarter million line ~2.5 MB text file with text that repeats every 10 lines. When trying to match two patterns (stored in a second file), fgrep was approximately 16 times faster, ~0.2 seconds versus ~3.3 seconds.
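For anyone wanting to reproduce a test of this shape, a quarter-million-line file with text repeating every 10 lines can be generated along these lines. The exact line contents and pattern values are assumptions, not the data actually used above:

```shell
# ~250,000 lines cycling through 10 distinct prefixes (a few MB of text).
awk 'BEGIN { for (i = 0; i < 250000; i++)
                 printf "line-%d some filler text\n", i % 10 }' > file2

# Two patterns to look up, as in the test described above.
printf 'line-3\nline-7\n' > file1
```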
linux with
fgrep (GNU grep) 2.12
grep (GNU grep) 2.12
mawk 1.3.3
What you say is in line with what I expected (fixed-string lookup vs. pattern matching), but I enlarged the test file and found results very similar to what I posted before:
time grep -f file1 file2 >/dev/null
real 0m0.085s
time grep -F -f file1 file2 >/dev/null
real 0m0.226s
time fgrep -f file1 file2 >/dev/null
real 0m0.245s
time awk -F\| 'NR==FNR{a[$1]++;next}a[$1]' file1 file2 >/dev/null
real 0m0.217s
Please explain your statement that the results of fgrep and grep/awk differ or are incorrect, respectively, other than the pipe symbol in file1 ORing patterns.
I observed fgrep to be over 16 times faster than grep with GNU [f]grep 2.5.1 on an ancient Debian install.
Without assurances that the OP's "patterns" won't contain any regular expression metacharacters (e.g. a dot), it's possible for fgrep/awk (using string comparison) and grep (using regular expression pattern matching) to produce different output.
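A tiny demonstration of that divergence, using made-up data:

```shell
printf 'a.c\n' > patterns           # "." is a regex metacharacter
printf 'a.c\nabc\n' > data

grep    -f patterns data   # regex: "." matches any character, so both lines print
grep -F -f patterns data   # fixed string (same as fgrep): only "a.c" prints
```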
By default, the pipe symbol isn't special to grep, so that wasn't my concern. However, since that possibility occurred to you, you were obviously already aware of the pitfalls mentioned in the previous paragraph. You simply didn't think it was worth worrying about in this case. I'm not so lenient (probably because I've seen simplified post data waste time in the past).
Honestly, for me, the slowness of your fgrep is the most interesting aspect of this thread. I seldom use GNU and Linux, so I doubt I'll investigate it myself. But, if you are similarly curious and discover the cause, I'd love to know.
Thanks for explaining your point.
I did a similar test on a FreeBSD system, although in a VM. So just consider the relative times.
fgrep (GNU grep) 2.5.1-FreeBSD
grep (GNU grep) 2.5.1-FreeBSD
$ time grep -f file file2 >/dev/null
real 0m0.222s
$ time fgrep -f file file2 >/dev/null
real 0m0.205s
$ time awk -F"|" ' ...
real 0m0.863s
Here grep (with and without -F) and fgrep play in the same league whilst awk drags behind by a factor of four (which supports my first assumption).
BUT - the requestor posted that awk was way faster than grep... difficult to understand.
Although I'm curious as well, I'm not sure how to delve into that problem. I don't think it's system-related (I/O etc.); more likely it's down to the algorithms used. Being a bit thick when it comes to analysing C source, I'm afraid I'm giving up.