egrep is very slow: how to improve performance

We have an egrep search in a while loop.

egrep -w "$key" ${PICKUP_DIR}/new_update >> ${PICKUP_DIR}/update_record_new

${PICKUP_DIR}/new_update is a 210 MB file.

In each iteration, the egrep takes around 50-60 seconds on average. There's nothing significant in the loop other than egrep, and when we checked the timestamps, egrep is what is slowing it down.

Is it possible to improve egrep's performance? Or do we need to use perl or some other pattern search?

Could you please help?

Do the values of "key" and "PICKUP_DIR" change with each iteration?

Look into the -f flag of grep.

The value of $key changes on each iteration, but ${PICKUP_DIR}/new_update doesn't change.

So look into the -f flag.

egrep -f <file containing the different values of $key> ${PICKUP_DIR}/new_update
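To make the -f suggestion concrete, here is a minimal sketch. It assumes the loop can write each $key to a file instead of running egrep on it; keys.txt, the temporary directory and the sample data are made up for the demo, and only new_update and update_record_new come from the original post.

```shell
#!/bin/sh
# Sketch only: the directory, keys.txt and the sample lines are hypothetical.
PICKUP_DIR=$(mktemp -d)

# Tiny stand-in for the 210 MB file.
printf '%s\n' 'alpha 1' 'beta 2' 'gamma 3' 'alphabet 4' > "$PICKUP_DIR/new_update"

# Stand-in for the existing loop: collect every key, one per line,
# instead of scanning the big file once per key.
for key in alpha gamma; do
    printf '%s\n' "$key"
done > "$PICKUP_DIR/keys.txt"

# One pass over the data with all keys at once; -w keeps whole-word matching.
grep -E -w -f "$PICKUP_DIR/keys.txt" "$PICKUP_DIR/new_update" >> "$PICKUP_DIR/update_record_new"

cat "$PICKUP_DIR/update_record_new"
```

Note that this is not byte-for-byte the same as the loop: results come out in the order they appear in new_update rather than grouped by key, and a line that matches two keys is written once instead of twice.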

In addition to the above, can you post an example of this $key? Perhaps a regex optimizer will help. If the $key is assembled externally and its readability doesn't matter, you could do something like this as well.

grep -E -w "`regexopt $key`" ...

I have uploaded the $key as a screenshot, as I don't have the text version right now. It's a big string concatenated with "|".

Can you please tell me which is better than egrep: grep, perl, sed?
And why should egrep take around 50-60 seconds per iteration?
And will splitting the ${PICKUP_DIR}/new_update file into multiple files, and searching each file until a match is found, help in any way?

Are the keys separated by a '|'? Or is the whole thing a single key?

If the keys are separated by '|', then change the file so that each key is on its own line. Then:

egrep -f key.txt ${PICKUP_DIR}/new_update
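A rough sketch of that conversion, assuming the '|' in $key only ever separates alternatives and the keys themselves are plain strings with no regex metacharacters. Under that assumption -F (fixed strings) can be used; if the keys do contain regex syntax, keep -E instead. The sample data and temporary directory are invented for the demo.

```shell
#!/bin/sh
# Sketch only: sample data and paths are hypothetical.
workdir=$(mktemp -d)

# Tiny stand-in for new_update.
printf '%s\n' \
    'row1 TP-CAP-P123456789-103456789 a' \
    'row2 TP-CAP-P999999999-999999999 b' \
    'row3 TP-CAP-P124456789-103456789 c' \
    > "$workdir/new_update"

# A key in the shape described above: alternatives joined by '|'.
key='TP-CAP-P123456789-103456789|TP-CAP-P124456789-103456789'

# Put each alternative on its own line.
printf '%s\n' "$key" | tr '|' '\n' > "$workdir/key.txt"

# -F treats every line of key.txt as a literal string; -w keeps whole-word matching.
matches=$(grep -F -w -f "$workdir/key.txt" "$workdir/new_update")
printf '%s\n' "$matches"
```

With this sample data, row1 and row3 are printed and row2 is skipped.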

I don't know whether splitting up the file will give you any advantage.

I had a feeling it was a bloated regular expression; a regex optimizer is what you need.

You are giving egrep (which is just grep -E) a pile of 'check for this or this or this or this'. The form you have it in is quite unwieldy. See if it could be reduced to something like this:

TP-CAP-P[0-9]{9}-[0-9]{9}
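As a quick check of what such a reduced pattern does, here is a small sketch run against made-up lines. It only makes sense if every ID of that shape is wanted; if only specific IDs should match, the pattern is too broad and a key file is the safer route. The sample file and its contents are hypothetical.

```shell
#!/bin/sh
# Sketch only: sample lines are invented to show what matches and what doesn't.
workdir=$(mktemp -d)
printf '%s\n' \
    'rec1 TP-CAP-P123456789-103456789 ok' \
    'rec2 TP-CAP-P12345-103456789 short' \
    'rec3 XX-CAP-P123456789-103456789 other' \
    > "$workdir/sample"

# One compact pattern instead of a long list of alternatives.
matches=$(grep -E -w 'TP-CAP-P[0-9]{9}-[0-9]{9}' "$workdir/sample")
printf '%s\n' "$matches"
```

Only rec1 is printed: rec2 has too few digits after the P, and rec3 has a different prefix.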

If you're not keen on the regular expression thing, you can use a program like RegexBuddy to load your data in (or a dozen MB or so of it) and then test it.

TP-CAP-P123456789-103456789
TP-CAP-P124456789-103456789
TP-CAP-P123458789-123456709
TP-CAP-P123456789-123056719
TP-CAP-P123459989-123406789

Feed it lines like these and you get a sense of the regex back (this is from mkregexp, from just the lines above).

qr/(?=[1CPT])(?:1(?:23(?:4(?:5670|0678)9|056719)|03456789)|P12(?:345(?:[68]7|99)89|4456789)|(?:T|CA)P)/

My guess is that you want to do something else, but for what you're doing, 30 seconds isn't that bad for huge files.