In-memory grep of a large file.

I have two files:

  1. A huge 8 GB text file (big_file.txt).
  2. A huge list of approximately 8 million words (words_file.txt), one word per line.

What I intend to do is read each word "w" from words_file.txt, search for it in big_file.txt, and extract up to two words before and after "w" from big_file.txt.

A naive way is to simply run the command below for each word.

while read name
do
	grep -owP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}" big_file.txt
done < words_file.txt

But the above code is far too slow: big_file.txt is read from disk once per word, i.e. roughly 8 million times.

I then tried to hold the entire big_file.txt in memory by modifying the code as shown below, but it is still slow. Using the "top" command, I can see memory usage rise and fall as if big_file.txt were being processed again and again for each "w". I want big_file.txt to be read just once.

dump=$(<big_file.txt)
while read name
do
	echo "$dump" | grep -owP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}"
done < words_file.txt

The big_file.txt looks something like this (posting just a small sample of the file):

in 2004 obama received national attention during his campaign to represent 
illinois in the united states senate with his victory in the march democratic 
party primary his keynote address at the democratic national convention in july 
and his election to the senate in november he began his presidential campaign
in 2007 and after a close primary campaign against hillary rodham clinton 
in 2008 he won sufficient delegates in the democratic party primaries to receive 
the presidential nomination he then defeated republican nominee john mccain 
in the general election and was inaugurated as president on january 20 2009
nine months after his inauguration obama was named the 2009 nobel peace prize laureate

The words_file.txt looks like this (just a sample):

obama
primaries
water
laureate
computer

The output that the code gives:

in 2004 obama received national
his inauguration obama was named
democratic party primaries to receive
peace prize laureate

Any suggestions on how I can speed up the search and extraction? I am using bash on Linux.

Please post small but meaningful samples of your two files. And please use code, not icode, tags.

---------- Post updated at 13:02 ---------- Previous update was at 11:32 ----------

Would this help as a first step?

grep -C2 -f words_file <(tr ' ' '\n' < big_file)

in
2004
obama
received
national
--
democratic
party
primaries
to
receive
--
his
inauguration
obama
was
named
--
peace
prize
laureate

One single read of each file.


Wouldn't it be easiest to put big_file.txt on a RAM disk and read it from there? How to create a RAM disk depends on your system, but I'm sure there is one. It might require a reboot, though, to create one.
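On Linux, for instance, a tmpfs mount can serve as a RAM disk. A minimal sketch (the mount point and size are just examples, and root privileges are required):

mkdir -p /mnt/ramdisk                            # example mount point
mount -t tmpfs -o size=10G tmpfs /mnt/ramdisk    # size is an example; needs root
cp big_file.txt /mnt/ramdisk/

The greps would then read /mnt/ramdisk/big_file.txt instead.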

I hope this helps.

bakunin


Thanks for your responses. When I ran RudiC's script, I did not see any output for five minutes and memory usage crossed 60%, so I had to stop it, but I thank RudiC for his script. I have not tried a RAM disk yet. I did, however, stumble upon a grep idiom that can search for multiple words in one statement. I am sure this will speed things up, because logically it requires only one read of my large file. A simple grep command for just three words goes like this:

grep -w 'obama\|primaries\|water' big_file.txt

But when I try this in my command:

grep -owP "(?:\w+\s){0,2}obama\|primaries\|water(?:\s\w+){0,2}" big_file.txt

I do not see anything happening, i.e. no output and also no error message. I only want to reduce the number of file reads, and I am sure that will speed up my script considerably.

Hi.

I'm not going to post everything because I'm still thinking about it, but this version of the grep pattern seems to produce the expected output. (With -P, \| matches a literal "|" character, which is why your attempt matched nothing; PCRE alternation needs a plain | inside a group.)

grep -owP "(\w+\s){0,2}(obama|primaries|water)(\s\w+){0,2}" <your_input_file>

producing:

in 2004 obama received national
democratic party primaries to receive
his inauguration obama was named

Best wishes ... cheers, drl


Thanks, drl. I can see some improvement using your command, as all the words in "obama|primaries|water" are searched for in a single pass. This can surely help reduce the number of iterations needed.

I can also improve upon your command a little bit by forcing the C locale, which avoids multibyte character handling and typically makes grep faster:

LC_ALL=C grep -owP "(\w+\s){0,2}(obama|primaries|water)(\s\w+){0,2}" big_file.txt
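In principle, the alternation could even be built straight from words_file.txt rather than typed in by hand. A rough sketch (untested; with all 8 million words the resulting pattern would almost certainly exceed grep's limits, so it would have to be applied to smaller chunks of the word list):

# join all words with "|" into one alternation (only feasible for small word lists)
pat=$(paste -sd'|' words_file.txt)
LC_ALL=C grep -owP "(\w+\s){0,2}($pat)(\s\w+){0,2}" big_file.txt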

I also found another way using GNU Parallel. I have tried several variations, but the variation I am interested in is this one (given here: GNU Parallel):

parallel --pipepart --block 100M -a big_file.txt --fifo cat words_file.txt | parallel --pipe -L1000 --round-robin grep -f - {}

But the above code does not do what I really want, so I tried modifying it to this:

parallel --pipepart --block 100M -a big_file.txt --fifo cat words_file.txt | parallel --pipe -L1000 --round-robin grep -fowP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}" - {}

There are still some issues with how to include the word patterns from words_file.txt in this command.
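One idea might be to split words_file.txt into chunks, turn each chunk into a single alternation pattern, and let parallel run one grep per chunk. A rough sketch (untested on the full data; the chunk size is a guess, and too many words per chunk may exceed grep's pattern-size limits):

# split the word list into chunks (-a 4 allows enough output file suffixes)
split -l 1000 -a 4 words_file.txt chunk_
# build one alternation pattern per chunk and hand each to a parallel grep
for f in chunk_*
do
	printf '(\\w+\\s){0,2}(%s)(\\s\\w+){0,2}\n' "$(paste -sd'|' "$f")"
done | parallel LC_ALL=C grep -owP {} big_file.txt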

Did you try the proposals on a reduced data set (just to prove the applicability)?

Yes, I used the split command to split big_file.txt into smaller chunks of 350 MB each. I then tried your command and others on one of these smaller chunks. I noticed that your suggested approach takes some time to read and then run. I can post the memory and CPU statistics for these smaller chunks if you want. In fact, I am thinking of using these smaller chunks so that I can run many greps in parallel myself, instead of using GNU Parallel.

But the reason I pointed to GNU Parallel is that the GNU Parallel webpage referenced above has this statement below the command:

cat regexp.txt | parallel --pipe -L1000 --round-robin grep -f - bigfile

If a line matches multiple regexps, the line may be duplicated. The command will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM.

I'm not clear on what output you hope to produce. With the sample data you provided in post #1, if you're looking for receive, are you hoping to get the output:

primaries to receive

or the output:

primaries to receive the presidential

Or, in other words, are searches limited to words on a single line, or do you want searches to cross line boundaries?

Suppose w is a word, and let us take w='receive', as you did. My objective is to extract two words before and after 'receive'. So what I want is this:

primaries to receive the presidential

But the above was just for simplicity; you have a good point, and ideally the search should not cross line boundaries. The ideal output for 'receive' should therefore be:

primaries to receive

I had a much more complicated awk script that ignores line boundaries when looking for the two words before and after the words found in words_file.txt, but this simple script seems to do what you want while only needing to hold words_file.txt in memory along with a single line of input from big_file.txt (so it doesn't need huge amounts of memory to run):

awk '
# First file (words_file.txt): record each word as an index in array W.
FNR == NR {
	W[$1]
	next
}
# Second file (big_file.txt): for each field that is a recorded word,
# print it with up to two fields of context on either side.
{	for(i = 1; i <= NF; i++) {
		if($i in W) {
			low = i > 2 ? i - 2 : 1
			high = i < NF - 2 ? i + 2 : NF
			for(j = low; j <= high; j++)
				printf("%s%s", $j, (j == high ? ORS : OFS))
		}
	}
}' words_file.txt big_file.txt

With your sample input files from post #1 in this thread, the above produces the output:

in 2004 obama received national
democratic party primaries to receive
his inauguration obama was named
peace prize laureate

And, if additional words are added to words_file.txt:

obama
primaries
water
laureate
computer
peace
in
receive

it produces the output:

in 2004 obama
in 2004 obama received national
illinois in the united
his victory in the march
national convention in july
the senate in november he
in 2007 and
in 2008 he
sufficient delegates in the democratic
democratic party primaries to receive
primaries to receive
in the general
his inauguration obama was named
2009 nobel peace prize laureate
peace prize laureate

If this doesn't run fast enough, you can run multiple copies of this script with distinct subsets of words_file.txt concurrently and concatenate the resulting output files when they are all done. (This will give you the same output, but the order of lines in the output will be different.)
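A rough sketch of that idea, assuming the script above is saved as extract.awk (a hypothetical file name) and using GNU split; the choice of 8 chunks is arbitrary:

# split the word list into 8 roughly equal parts (GNU split)
split -n l/8 words_file.txt words_part_
# run one awk per part in the background; extract.awk holds the script above
for f in words_part_*
do
	awk -f extract.awk "$f" big_file.txt > "$f.out" &
done
wait                                   # wait for all background jobs to finish
cat words_part_*.out > all_results.txt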

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.


I just tried your code, and this is perhaps the best I have tried so far in terms of efficiency. I am currently running multiple instances of a much slower version on my machine using GNU Parallel and grep, and when I compared the performance of your code with the ones currently running, yours is lightning fast.

The REs required to match two words of context before and after a specified word using grep are much more complex than the exact-word matches I'm using in the awk field-in-array tests, so I was hoping it would be significantly faster; but without your actual data as a testbed, there was no way to be sure. I'm glad it is working well for you.
