Best way to search for patterns in huge text files

I have the following situation:

a text file with 50000 string patterns:

abc2344536
gvk6575556
klo6575556
....

and 3 text files each with more than 1 million lines:

...
000000 abc2344536      46575 0000
000000 abc2344536      46575 4444
000000 abc2344555      46575 1234
...

I have to extract all lines from the 3 files that match all patterns from the pattern file :)

Any ideas, please!!!

Andy

If I understand correctly ("all patterns" is a bit confusing) ...
Try this:

awk 'NR == FNR { p[$1]; next } $2 in p' pattern_file input_file1 input_file2 ...

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
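
The idea: while reading the first file on the command line (NR == FNR), store each pattern as an array key; for the data files, print a line only if its second field is one of those keys. The same one-liner, spelled out with comments:

awk '
  NR == FNR {     # still reading the pattern file (first file given)
    p[$1]         # remember the pattern as an array index
    next          # skip the rule below for pattern lines
  }
  $2 in p         # data files: print lines whose 2nd field is a known pattern
' pattern_file input_file1 input_file2 ...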

Thank you for your quick reply... but it does not work.

I need something faster than the following solution:

for i in `cat pattern_file`
do
  grep $i  input_file1 >> output_file1
  grep $i  input_file2 >> output_file2  
  grep $i  input_file3 >> output_file3
done

Any ideas please!!!

Hi.

Is:

grep -f pattern_file input_file[123] > output_fileX

quicker?

Also, please say WHAT doesn't work with the awk solution.

Like radoulov, I don't know what you mean by "all patterns".
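
Since the patterns look like fixed strings rather than regular expressions, fgrep (or grep -F) may also be worth a try; with 50000 patterns it typically needs far less memory than plain grep -f, e.g.:

fgrep -f pattern_file input_file1 > output_file1
fgrep -f pattern_file input_file2 > output_file2
fgrep -f pattern_file input_file3 > output_file3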

scottn,

grep -f pattern_file input_file[123] > output_fileX 

I get out of memory :)

"all patterns" means: returns every line that contains any string pattern from the pattern file. And the results of each file should be separatly..excuse me my english :frowning:

Hi Andy.

Your English is not an issue - it's rather good :).

You didn't say what was wrong with the original awk (why it didn't work).

You can try:

cat input_file1 input_file2 etc | grep -f pattern_file

A slight change to the awk:

awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*

(changing the filenames to suit your requirements)

scottn,

thank you for your patience.

The first solution:

cat input_file1 input_file2 etc | grep -f pattern_file

returns virtual memory exhausted :(

The second one:

awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*

It does not work!! It returns all the lines of "input_file". A second problem is that the searched pattern is NOT always in the second field, e.g.

...
000000 abc2344536      46575 0000
000000 89798798798    abc2344536
000000 abc2344555      46575 1234
000000 7777777777     abc2344536
...

:(

Well,
you had to be more specific in your first post ...

I'd use Perl, given the file size:

perl -e'
  my %p;
  open my $PH, "<", shift or die "$!\n";
  while (<$PH>) {
    chomp;                 # drop the newline, or the patterns never match mid-line
    $p{$_} = 1;
  }
  close $PH or warn "$!\n";

  while (my $line = <>) {
    for (keys %p) {
      print $line and last if $line =~ /$_/;
    }
  }' pattern_file input_file1 input_file2 ...

Or Python:

python -c'
from sys import argv

pf = open(argv.pop(1), "r")
pd = {}
for l in pf:
    pd[l.rstrip("\n")] = 1    # drop the newline, or the patterns never match mid-line
pf.close()

for fn in argv[1:]:
    f = open(fn, "r")
    for line in f:
        for p in pd:
            if p in line:
                print line,
                break
    f.close()
' pattern_file input_file1 input_file2 ...
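
If you'd rather stay with awk and the string can sit in any field, a sketch along the same lines (assuming the pattern always makes up a whole whitespace-separated field, as in your sample data):

awk '
  NR == FNR { p[$1]; next }          # load the pattern file
  {
    for (i = 1; i <= NF; i++)        # check every field of the data line
      if ($i in p) { print; next }   # print on the first hit and move on
  }
' pattern_file input_file1 input_file2 ...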

scottn,

it works fine with fgrep:

cat input_file1 input_file2 etc | fgrep -f pattern_file

the ugly message "virtual memory exhausted" no longer appears.

How could you solve the problem with awk:

awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*

It does not work!! It returns all the lines of "input_file", and the searched pattern is NOT always in the second field, e.g.

...
000000 abc2344536      46575 0000
000000 89798798798    abc2344536
000000 abc2344555      46575 1234
000000 7777777777     abc2344536
...

Thank you