How to make grep stop at first match

mpoblete · April 1, 2010, 11:25am

I use grep to check for a string that validates data in a file. It works well, but the file is becoming very big and grep has started to hurt response time for users. Since I only need the first occurrence, I have been looking for a way to stop grep from scanning the rest of the file (6 GB), but I haven't had much luck.
One way is the -m parameter; unfortunately that is not available in the grep version we have on our AIX (not sure if I can update it). I also found a small script that worked fine at first, but in the long run it takes even more time than grep:

while read line
do
   echo "$line" | grep mystring
   if [ "$?" -eq "0" ]; then
      break
   fi
done < mytextfile

This works great if the match is found near the beginning of the file, but when I tested with a string that is in the middle or at the end, it took even more time than grep alone to return an answer and stop.

I've searched in several places but I haven't found anything else (in a simple way) that could do the trick. I tried awk and it also takes longer than grep when the match is in the middle of the file.

Any suggestion would be great.

Franklin52 · April 1, 2010, 11:38am

Have you tried something like this with awk?

awk '$0 ~ var{print;exit}' var="$string" mytextfile
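
A minimal sketch of the same idea on a throwaway file, assuming a POSIX awk. The sample path and its contents are made up for illustration. Note that the value passed through var= is used as a regular expression, so regex metacharacters in it are not matched literally.

```shell
# Hypothetical sample data standing in for the real (much larger) file.
sample=/tmp/awk_first_match.$$
printf '%s\n' 'alpha 1' 'beta 2' 'beta 3' 'gamma 4' > "$sample"

string='beta'

# Print the first line that matches the regex held in var, then stop reading.
first=$(awk '$0 ~ var { print; exit }' var="$string" "$sample")

rm -f "$sample"
echo "$first"
```

Only the first of the two "beta" lines is printed, because exit ends the run before the next record is read.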

Scott · April 1, 2010, 11:38am

Hi.

GNU grep has the -m option for that, but AIX grep doesn't have such a feature, I think.

You can use sed, if that's quick enough. Something like

$ sed -n "/something/{p;q;}" mytextfile

or

$ sed '/something/!d;q' mytextfile

Your original grep solution would probably be OK if you ran it as:

grep "mystring" mytextfile

instead of in a while loop.

bakunin · April 1, 2010, 11:47am

It might help to suppress grep's output if you just want to know whether a certain string is in your file or not. Use the "-q" option of grep for this and check the exit code (0 means "found", >0 means "not found").

You might consider using sed instead of grep. You can tell sed to stop at the first occurrence. Here is an example:

standard grep way:

if [ -n "$(grep "blabla" /path/to/bigfile)" ] ; then
     echo "found it"
else
     echo "not found it"
fi

grep with "-q" option:

if grep -q "blabla" /path/to/bigfile ; then
     echo "found it"
else
     echo "not found it"
fi
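
As a side note on exit codes: POSIX grep returns 0 when a match is found, 1 when none is found, and a value greater than 1 on errors such as an unreadable file. A minimal sketch that keeps those cases apart follows. The name check_string and the sample file path are hypothetical, used only for illustration.

```shell
# check_string is a hypothetical helper name, not part of grep.
# $1 = pattern, $2 = file to search
check_string() {
    grep -q "$1" "$2" 2>/dev/null
    case $? in
        0) echo "found it" ;;
        1) echo "not found it" ;;
        *) echo "grep failed" ;;   # e.g. the file does not exist
    esac
}

# Made-up sample file for the demonstration.
sample=/tmp/check_string.$$
printf '%s\n' 'foo' 'blabla here' 'bar' > "$sample"

found=$(check_string "blabla" "$sample")
missing=$(check_string "nomatch" "$sample")
failed=$(check_string "blabla" "$sample.does-not-exist")

rm -f "$sample"
echo "$found / $missing / $failed"
```

Treating every non-zero status as "not found" would hide the third case, which may matter if the data file can be missing or unreadable.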

sed instead of grep:

if [ -n "$(sed -n '/blabla/ {;p;q;}' /path/to/bigfile)" ] ; then
     echo "found it"
else
     echo "not found it"
fi

The sed command applies only to lines containing "blabla" (the search string) and will print this line ("p"), then quit the processing ("q").
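
To see the quit behaviour on a small scale, here is a sketch on a made-up four-line file (the path and contents are assumptions for illustration only):

```shell
# Hypothetical sample file with two matching lines.
sample=/tmp/sed_first_match.$$
printf '%s\n' 'foo' 'blabla one' 'blabla two' 'bar' > "$sample"

# -n suppresses the default output; on the first matching line,
# p prints it and q stops sed before it reads any further input.
first=$(sed -n '/blabla/{p;q;}' "$sample")

rm -f "$sample"
echo "$first"
```

The second matching line never appears in the output, since sed has already quit by then.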

I'd be really grateful if you could provide the runtime statistics of all the three variants. It would be very interesting to see how these different approaches compare.

I hope this helps.

bakunin

mpoblete · April 1, 2010, 1:00pm

I tried all the suggestions with the same string, which should locate a record about 17% of the way into the file (around record number 170,000). The best response time still goes to grep:

grep took 17 seconds: 3 to show the match, and the remaining 14 scanning the rest of the file.

grep mystring myfile

awk took 20 seconds

awk '/mystring{1}/' myfile.txt

then sed, as scottn suggested, took 28 seconds

then awk, as franklin52 suggested, took 18 seconds

So in summary I'm still best off with grep, but my users are not happy to wait 17 seconds once in a while to get their screen back with the response (and neither is my server if many users run the same lookup).
It would be so nice to find a way to stop grep after that first match; response time would go down to 3 seconds! I guess I'll have to find another source for grep that supports -m and see if I can install it.

TonyLawrence · April 1, 2010, 2:17pm

You could pipe grep to "head -1"; it might even improve the speed.

TRB · April 1, 2010, 2:36pm

Try (notice the -l ):

while read line
do
   echo "$line" | grep -l mystring
   if [ "$?" -eq "0" ]; then
      break
   fi
done < mytextfile

EAGL · April 1, 2010, 2:41pm

I'm not sure if this code is fast enough, but it finds the first occurrence of the desired string, and it's generic (set occ to pick a later occurrence):

nawk -v occ="1" '$0~/mystring/{i++} i==occ{printf "occurrence=%-6d, line=%-s\n",NR,$0;exit}' infile

Use plain awk if you are not on Solaris.

Regards

mpoblete · April 1, 2010, 3:09pm

head -1 works just like grep alone, since it waits for grep to finish before the process stops. The -l option only shows the name of the file containing the first match and then stops... the problem is that I need the output with the data.