Performance issue with 'grep' command for huge file size

I have 2 files; one file (say, details.txt) contains the details of employees and another file (say, emp.txt) has some selected employee names. I am extracting employee details from details.txt by using emp.txt and the corresponding code is:

while read line
do
emp_name=`echo $line`
grep -e $emp_name details.txt >> output.txt
done < emp.txt
 

The above code works and gives the expected result, but it takes far too long when the files are huge (I don't have an exact time; it ran for more than 6 hours before I cancelled the script). For example, details.txt is around 2.5GB with about 7.5 lakh (750,000) records, and emp.txt has 55K employee names. Can you please suggest another option or command that handles such large files better? Thanks.

Could you show snippets of both files (using code tags)?

What's your locale set to? GNU grep has to do a lot more work for UTF-8 than for C.
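
A minimal sketch of how the locale can be checked and overridden for a single command. The file name sample.txt and its contents are stand-ins for illustration, not the poster's real data:

```shell
# Print the current locale settings (standard POSIX utility).
locale

# sample.txt is a hypothetical stand-in for the real details file.
printf '%s\n' \
  'DTL|John|EMP0000184|Manager|CA' \
  'DTL|Kevin|EMP0000004|Analyst|IL' > sample.txt

# Setting LC_ALL=C for this one command makes grep compare bytes
# instead of multibyte characters; the session's locale is unchanged.
result=$(LC_ALL=C grep -e 'Kevin' sample.txt)
echo "$result"
```

Whether this helps depends on the grep implementation and on the locale currently in use.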

emp_name=`echo $line`

I'm trying to understand the purpose of this line... Flattening whitespace?

Don't use a loop for this: you're processing the whole 2.5GB details.txt file once for each name in emp.txt. With 2 names in emp.txt you're reading 5GB of details.txt; with 10 names, 25GB. It doesn't scale well that way.
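
To put a rough number on that for the sizes quoted in the question (55K names, 2.5GB file), here is a small back-of-the-envelope calculation. It ignores caching and assumes every grep call reads the whole file:

```shell
names=55000          # entries in emp.txt, as quoted in the question
size_tenths_gb=25    # 2.5GB written in tenths of a GB (shell arithmetic is integer-only)

# One full pass over details.txt per name.
total_gb=$(( names * size_tenths_gb / 10 ))
echo "The loop reads roughly ${total_gb} GB in total"
```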

Try this:

 
grep -F -f emp.txt details.txt

That way details.txt is read only once, no matter how many names emp.txt contains.

Using -F might also save some time, since the names are then matched as fixed strings rather than regular expressions. If you don't have the '-F' option, look for 'fgrep'.
Since you're on HP-UX, though, the standard 'grep' should have the -F option available.
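
Here is a small self-contained run of that command. The files emp_sample.txt and details_sample.txt are made-up stand-ins, so the real emp.txt and details.txt are not touched:

```shell
# Tiny illustrative inputs (not the real data).
printf '%s\n' John Kevin > emp_sample.txt
printf '%s\n' \
  'HDR|Prakash D' \
  'DTL|Prakash|EMP0000010|Sr Associate|FL' \
  'HDR|Kevin T' \
  'DTL|Kevin|EMP0000004|Analyst|IL' > details_sample.txt

# -F  treat every pattern as a fixed string, not a regular expression
# -f  read the patterns from a file, one per line
# A line is printed if it contains any of the names.
grep -F -f emp_sample.txt details_sample.txt > output_sample.txt
cat output_sample.txt
```

Watch out for blank lines in the pattern file: an empty pattern matches every line, so a stray empty line in emp.txt would make grep print the whole of details.txt.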


Thank you all for your quick response! Thanks a lot rwuerth; the '-F' option is working and I am able to extract the required data in much less time.

However, the files are like:

emp.txt
------------
John
Kevin
Prakash
Susan
Ken

details.txt
-------------
HDR|Prakash D
DTL|Prakash|EMP0000010|Sr Associate|FL
HDR|Kevin T
DTL|Kevin|EMP0000004|Analyst|IL
HDR|John M
DTL|John|EMP0000184|Manager|CA

Thanks again :)
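
One caveat with data shaped like the sample above: fixed-string matching finds a name anywhere in a line, so "Ken" would also match a record for "Kenneth". If only exact names are wanted, a sketch along these lines with awk might work. It assumes the name is always the second '|'-separated field of a DTL record, which is only a guess from the sample; the Kenneth record and its ID are invented for the demonstration:

```shell
# Illustrative inputs; file names and the Kenneth record are hypothetical.
printf '%s\n' John Ken > emp_sample.txt
printf '%s\n' \
  'HDR|Kenneth B' \
  'DTL|Kenneth|EMP0000999|Analyst|NY' \
  'HDR|John M' \
  'DTL|John|EMP0000184|Manager|CA' > details_sample.txt

# First file (NR == FNR): remember each name as an array key.
# Second file: print DTL records whose second field is exactly one of the names.
matches=$(awk -F'|' '
  NR == FNR { names[$1]; next }
  $1 == "DTL" && ($2 in names)
' emp_sample.txt details_sample.txt)
echo "$matches"
```

This still reads the detail file only once. Unlike the grep version, it prints only DTL records and skips the HDR lines.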

What was the time savings?

Also, you said, "However, the files are like:" and proceeded to show your input files.

Was there a question there that you wanted to ask?

Is it working as you'd expect it to?

Nope, I don't have any further questions right now. I posted the file details because someone else asked for the file structure.

Thanks rwuerth for your suggestion. It is working fine. I'll let you know about the time savings in a couple of days, as full-volume testing is still pending.

It's a huge time saving with this command: it takes less than 5 minutes to extract details for around 60K employees from a detail file of around 2.8GB.
Thanks again for your suggestion.