Performance issue with 'grep' command for huge file size

arb_1984 · November 17, 2011, 12:24pm

I have 2 files; one file (say, details.txt) contains the details of employees and another file (say, emp.txt) has some selected employee names. I am extracting employee details from details.txt by using emp.txt and the corresponding code is:

while read line
do
emp_name=`echo $line`
grep -e $emp_name details.txt >> output.txt
done < emp.txt

The code above works and gives the expected result, but it takes far too long on huge files (I don't have an exact time; it ran for more than 6 hours before I cancelled the script). For example, details.txt is around 2.5GB with about 7.5 lakh (750,000) records, and emp.txt has 55K employee names. Can you please suggest another option or command that handles such huge files better? Thanks.

vgersh99 · November 17, 2011, 12:27pm

Could you show snippets of both files? (Please use code tags.)

Corona688 · November 17, 2011, 12:43pm

What's your locale set to? GNU grep has to do a lot more work in a UTF-8 locale than in the C locale.

emp_name=`echo $line`

I'm trying to understand the purpose of this line... Flattening whitespace?
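
On the locale question above, a minimal sketch of checking the current settings and overriding them for a single grep call. The temporary sample file and its two records are illustrative only, and whether the override speeds anything up depends on the grep implementation and the data, so treat it as something to measure rather than a guaranteed fix.

```shell
#!/bin/sh
# Illustrative sample file (a stand-in for details.txt).
sample=$(mktemp)
printf '%s\n' \
  'DTL|Kevin|EMP0000004|Analyst|IL' \
  'DTL|John|EMP0000184|Manager|CA' > "$sample"

# Show the locale-related variables currently in effect.
echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"

# Override the locale for this one command only; the rest of the
# shell session keeps its original settings.
matches=$(LC_ALL=C grep -e 'Kevin' "$sample")
echo "$matches"
```

Setting LC_ALL on the command line like this affects only that grep process, so it is easy to time the same command with and without it.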

rwuerth · November 17, 2011, 12:51pm

Don't use a loop to get this done; you're processing the whole 2.5GB details.txt file once for each name in emp.txt. So if you had 2 names in emp.txt, you'd be processing 5GB of details.txt; 10 names = 25GB. With the 55K names you mentioned, that comes to roughly 137,500GB. It doesn't scale well that way.

Try this:

 
grep -F -f emp.txt details.txt

That way you process details.txt only once, plus a single read of emp.txt, however big it is.

Using -F might also save some time. If you don't have the '-F' option look for 'fgrep'.
But being on HP-UX the standard 'grep' should have the -F option available.
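
To make the difference concrete, here is a small sketch that runs both approaches on a few illustrative records in a scratch directory (the file contents and the `work` directory are assumptions for the demo, not the real data). Both find the same lines; the loop scans details.txt once per name, while the `-f` form scans it once in total.

```shell
#!/bin/sh
# Scratch directory with tiny stand-ins for emp.txt and details.txt.
work=$(mktemp -d)
printf '%s\n' John Kevin > "$work/emp.txt"
printf '%s\n' \
  'DTL|Kevin|EMP0000004|Analyst|IL' \
  'DTL|Maria|EMP0000777|Analyst|NY' \
  'DTL|John|EMP0000184|Manager|CA' \
  > "$work/details.txt"

# Loop version: one full scan of details.txt per name in emp.txt.
while IFS= read -r name; do
    grep -e "$name" "$work/details.txt"
done < "$work/emp.txt" > "$work/loop_out.txt"

# Single-pass version: all names are loaded as fixed strings and
# details.txt is scanned once.
grep -F -f "$work/emp.txt" "$work/details.txt" > "$work/single_out.txt"

cat "$work/single_out.txt"
```

One difference to be aware of: the loop writes matches grouped in emp.txt order (and repeats a line if it matches more than one name), whereas the single-pass form writes each matching line once, in details.txt order.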

arb_1984 · November 17, 2011, 3:07pm

Thank you all for your quick responses! Thanks a lot, rwuerth; the '-F' option is working and I am able to extract the required data in much less time.

However, the files are like:

emp.txt
------------
John
Kevin
Prakash
Susan
Ken

details.txt
-------------
HDR|Prakash D
DTL|Prakash|EMP0000010|Sr Associate|FL
HDR|Kevin T
DTL|Kevin|EMP0000004|Analyst|IL
HDR|John M
DTL|John|EMP0000184|Manager|CA

Thanks again
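
Given the pipe-delimited layout shown above, one thing worth knowing is that `grep -F` matches a name anywhere in a line, so a short name such as Ken would also match a longer one that contains it. If that ever matters, a possible alternative is a whole-field comparison in awk. The sketch below assumes that only DTL lines are wanted and that the name is always the second field; the "Kennedy" records are made up purely to show the difference.

```shell
#!/bin/sh
# Scratch copies of the sample files, plus a made-up "Kennedy" record.
work=$(mktemp -d)
printf '%s\n' John Kevin Prakash Susan Ken > "$work/emp.txt"
printf '%s\n' \
  'HDR|Prakash D' \
  'DTL|Prakash|EMP0000010|Sr Associate|FL' \
  'HDR|Kennedy P' \
  'DTL|Kennedy|EMP0000999|Analyst|TX' \
  > "$work/details.txt"

# Substring match: "Ken" also hits both Kennedy lines.
grep -F -f "$work/emp.txt" "$work/details.txt" > "$work/grep_out.txt"

# Whole-field match: load the names from the first file into an array,
# then print only DTL lines whose second field is exactly one of them.
awk -F'|' 'NR == FNR { names[$0] = 1; next }
           $1 == "DTL" && ($2 in names)' \
    "$work/emp.txt" "$work/details.txt" > "$work/awk_out.txt"

cat "$work/awk_out.txt"
```

Like the `-f` form of grep, this reads details.txt only once; the lookup per line is a single array membership test.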

rwuerth · November 18, 2011, 6:02pm

What was the time savings?

Also, you said,

"However, the files are like:"

and proceeded to show your input files.

Was there a question there that you wanted to ask?

Is it working as you'd expect it to?

arb_1984 · November 21, 2011, 5:12pm

No, I don't have any further questions right now. I posted the file details because someone else had asked for the file structure.

Thanks, rwuerth, for your suggestion. It is working fine. I will let you know about the time saving in a couple of days, as full-volume testing is still pending.

arb_1984 · December 5, 2011, 11:10am

It's a huge time saving with this command. It takes less than 5 minutes to extract the details of around 60K employees from a details file of around 2.8GB.
Thanks again for your suggestion.