I have the following situation:
a text file with 50000 string patterns:
abc2344536
gvk6575556
klo6575556
....
and 3 text files each with more than 1 million lines:
...
000000 abc2344536 46575 0000
000000 abc2344536 46575 4444
000000 abc2344555 46575 1234
...
I have to extract all lines from the 3 files that match all patterns from the pattern file
Any ideas, please!!!
Andy
If I understand correctly ("all patterns" is a bit confusing) ...
Try this:
awk 'NR == FNR { p[$1]; next } $2 in p' pattern_file input_file1 input_file2 ...
Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
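A quick sanity check of that one-liner on tiny sample files (the file contents here are illustrative, mirroring the formats shown above):

```shell
# Hypothetical sample data in the formats described in the question.
printf 'abc2344536\ngvk6575556\n' > pattern_file
printf '000000 abc2344536 46575 0000\n000000 abc2344555 46575 1234\n' > input_file1

# First pass (NR == FNR) loads the patterns into array p;
# then any line whose second field is a key of p is printed.
awk 'NR == FNR { p[$1]; next } $2 in p' pattern_file input_file1
# prints: 000000 abc2344536 46575 0000
```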
Thank you for your quick reply... but it does not work.
I need something faster than the following solution:
for i in `cat pattern_file`
do
    grep "$i" input_file1 >> output_file1
    grep "$i" input_file2 >> output_file2
    grep "$i" input_file3 >> output_file3
done
Any ideas please!!!
Scott
January 8, 2010, 5:31pm (#4)
Hi.
Is:
grep -f pattern_file input_file[123] > output_fileX
quicker?
Also, please say WHAT doesn't work with the awk solution.
Like radoulov, I don't know what you mean by "all patterns".
scottn ,
grep -f pattern_file input_file[123] > output_fileX
I get an "out of memory" error.
"all patterns" means: returns every line that contains any string pattern from the pattern file. And the results of each file should be separatly..excuse me my english
Scott
January 8, 2010, 6:08pm (#6)
Hi Andy.
Your English is not an issue - it's rather good :).
You didn't say what was wrong with the original awk (why it didn't work).
You can try:
cat input_file1 input_file2 etc | grep -f pattern_file
A slight change to the awk:
awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*
(changing the filenames to suit your requirements)
scottn,
thank you for your patience.
The first solution:
cat input_file1 input_file2 etc | grep -f pattern_file
returns virtual memory exhausted
The second one:
awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*
It does not work! It returns all the lines of "input_file". A second problem is that the searched pattern is NOT always in the second position, e.g.
...
000000 abc2344536 46575 0000
000000 89798798798 abc2344536
000000 abc2344555 46575 1234
000000 7777777777 abc2344536
...
andy2000:
[...]
[...]
A second problem is that the searched pattern is NOT always in the second position, e.g.
...
000000 abc2344536 46575 0000
000000 89798798798 abc2344536
000000 abc2344555 46575 1234
000000 7777777777 abc2344536
...
Well,
you had to be more specific in your first post ...
I'd use Perl, given the file size:
perl -e'
    # load the patterns (one per line) into hash %p
    my %p;
    open my $PH, "<", shift or die "$!\n";
    while (<$PH>) { chomp; $p{$_} = 1 }    # chomp, or the trailing newline prevents mid-line matches
    close $PH or warn "$!\n";
    # print every input line that contains any pattern
    while (my $line = <>) {
        for (keys %p) {
            print $line and last if $line =~ /\Q$_\E/;
        }
    }' pattern_file input_file1 input_file2 ...
Or Python:
python -c'
from sys import argv

# load the patterns (newlines stripped) into a dict used as a set
pf = open(argv.pop(1), "r")
pd = {}
for l in pf:
    pd[l.rstrip("\n")] = 1    # strip the newline, or mid-line matches fail
pf.close()

# print every line of every input file that contains any pattern
for fn in argv[1:]:
    f = open(fn, "r")
    for line in f:
        for p in pd:
            if p in line:
                print line,
                break
    f.close()
' pattern_file input_file1 input_file2 ...
scottn,
it works fine with fgrep:
cat input_file1 input_file2 etc | fgrep -f pattern_file
The ugly "virtual memory exhausted" message no longer appears.
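Since the results for each input file are wanted separately, the same fgrep -f command can simply be run once per file - a sketch using the file names from the earlier loop:

```shell
# One fgrep pass per input file keeps the outputs separate.
# -f loads all the fixed-string patterns at once, so each file
# is scanned a single time instead of once per pattern.
fgrep -f pattern_file input_file1 > output_file1
fgrep -f pattern_file input_file2 > output_file2
fgrep -f pattern_file input_file3 > output_file3
```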
How could you solve the problem with awk:
awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*
It does not work! It returns all the lines of "input_file", and the searched pattern is NOT always in the second position, e.g.
...
000000 abc2344536 46575 0000
000000 89798798798 abc2344536
000000 abc2344555 46575 1234
000000 7777777777 abc2344536
...
Thank you
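For completeness, one way to make the awk approach position-independent is to test every field of each line against the pattern array - a sketch, assuming the pattern always appears as a whole whitespace-delimited field:

```shell
# Load the patterns into p, then print a line as soon as ANY of its
# fields is a key of p - the pattern's position no longer matters.
awk 'NR == FNR { p[$1]; next }
     { for (i = 1; i <= NF; i++) if ($i in p) { print; break } }' \
    pattern_file input_file1 input_file2 input_file3
```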