I use grep to check for a string that validates data in a file. It works great, but the file is becoming too big and grep has started hurting the response time for users. Since I only need to find the first occurrence, I have been looking for ways to stop grep from scanning the rest of the file (6 GB), but I haven't had much luck.
One way is the -m parameter; unfortunately that is not available in the grep version we have on our AIX system (not sure if I can update it). I also found a small script that worked fine at the beginning, but in the long run it takes even more time than grep:
while IFS= read -r line
do
    echo "$line" | grep mystring
    if [ "$?" -eq 0 ]; then
        break
    fi
done < mytextfile
This works great if the match is found at the beginning of the file, but when I tested with a string that is in the middle or at the end, it took even more time than grep alone to return an answer and stop.
I've researched in several places but I haven't found anything else (in a simple way) that could do the trick. I tried awk, and it also takes longer than grep when the match is in the middle of the file.
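A likely reason the loop above gets slower than plain grep is that it starts a new grep process for every line it reads. A minimal sketch that keeps the same idea but does the matching inside the shell with a case pattern is shown here. The function name first_line_containing is made up for illustration, it looks for a literal string rather than a regular expression, and a shell read loop may still be slow on a file of several GB, so it would need to be timed on the real data.

```shell
# Hypothetical helper: print the first line of a file that contains a fixed string.
# $1 = literal string to look for (not a regular expression), $2 = file to read
first_line_containing() {
    while IFS= read -r line
    do
        case $line in
            *"$1"*)
                # Match found: print the line and stop reading the file
                printf '%s\n' "$line"
                return 0
                ;;
        esac
    done < "$2"
    # Reached end of file without a match
    return 1
}
```

It would be called as: first_line_containing mystring mytextfile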
It might help to suppress grep's output if you just want to know whether a certain string is in your file or not. Use grep's "-q" option for this and check the exit code (0 means "found", 1 means "not found", >1 means an error occurred).
You might also consider using sed instead of grep, since you can tell sed to stop at the first occurrence. Here are some examples:
standard grep way:
if [ -n "$(grep "blabla" /path/to/bigfile)" ] ; then
echo "found it"
else
echo "not found it"
fi
grep with "-q" option:
if grep -q "blabla" /path/to/bigfile ; then
echo "found it"
else
echo "not found it"
fi
sed instead of grep:
if [ -n "$(sed -n '/blabla/ {;p;q;}' /path/to/bigfile)" ] ; then
echo "found it"
else
echo "not found it"
fi
The sed command acts only on lines containing "blabla" (the search string): it prints the first such line ("p"), then quits processing ("q").
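For comparison, the same print-and-quit idea can be sketched in awk. This is only an illustration under a few assumptions: the function name first_match is made up here, the local awk is assumed to support the -v option, and whether it beats grep or sed on a given system has to be measured.

```shell
# Hypothetical helper: print the first line matching a pattern, then stop reading.
# $1 = extended regular expression, $2 = file to search
first_match() {
    # "exit" in the main rule stops reading input after the first match;
    # the END rule then turns "found" into the exit status (0 = found, 1 = not found).
    awk -v pat="$1" '$0 ~ pat { print; found = 1; exit } END { exit !found }' "$2"
}
```

It would be called as: first_match blabla /path/to/bigfile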
I'd be really grateful if you could provide the runtime statistics for all three variants. It would be very interesting to see how these different approaches compare.
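One rough way to collect such numbers is a small wrapper that notes the clock before and after each command. The sketch below is only an assumption about how this could be done: the function name elapsed is made up, it relies on the local date command supporting +%s (seconds since the epoch), and it only resolves whole seconds.

```shell
# Hypothetical helper: run a command, discard its output, and report elapsed seconds.
# Assumes "date +%s" is supported by the local date command.
elapsed() {
    start=$(date +%s)
    # Run the command given as arguments; only its exit status is kept
    if "$@" > /dev/null 2>&1; then
        result="found"
    else
        result="not found"
    fi
    end=$(date +%s)
    echo "$((end - start))s ($result): $*"
}
```

Each variant could then be timed the same way, for example: elapsed grep -q blabla /path/to/bigfile, followed by: elapsed sed -n '/blabla/ {;p;q;}' /path/to/bigfile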
I tried all the suggestions with the same string, which should locate a record about 17% of the way into the file (around record number 170,000). The best response time still goes to grep:
grep took 17 seconds: 3 to show the match, the remaining 14 scanning the rest of the file.
grep mystring myfile
awk took 20 seconds
awk '/mystring{1}/' myfile.txt
Then sed, as scottn suggested, took 28 seconds.
I tried awk as franklin52 suggested, and it took 18 seconds.
So in summary I'm still good with grep, but my users are not happy to occasionally wait 17 seconds to get their screen back with the response (and it doesn't help my server's performance if many users run the same lookup).
It would be so nice to find a way to stop grep after that first match; response time would go down to 3 seconds! I guess I'll have to find another source for a grep that supports -m and see if I can install it.
head -1 works just the same as grep alone, since it waits for grep to finish before the process stops. The -l option stops at the first match but only shows the name of the file where it was found... the problem is that I need the output with the data.