Can I speed up my grep command?

I've got at least 30,000 XML files and I'm using the grep command to get the names of the ones that match. Can I use the head command to grab just the first 8 lines and compare those instead of parsing the whole document? It would speed things up! Or maybe grep -m?

If all of those files are in the current directory, then this might work:

for file in *.xml; do
  head -8 "$file" | grep '<whatever you need to grep>' >/dev/null && echo "$file"
done

If you're looking for a fixed string (rather than a match against a regular expression), use grep -F or fgrep (depending on which operating system you're using).
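For instance, if the literal text you were after happened to be Status (that string and file.xml are just placeholders here), the fixed-string form would look like this; -F makes grep treat the pattern as plain text rather than a regular expression:

# -F searches for the pattern as a literal string, not a regular expression
grep -Fl "Status" file.xml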

If you're trying to match the first few characters of a file, the files are relatively large, and a lot of them won't match, then using read or head to get just the start of each file might make sense. But firing up both grep and head for every input file is going to be slower than letting grep read each file in its entirety, unless the file being processed is large.
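If you want to see which way wins on your own data, a rough comparison (Status here stands in for whatever you're really matching) is just to time both approaches on the same directory:

# head + grep started once per file
time sh -c 'for f in *.xml; do head -8 "$f" | grep "Status" >/dev/null && echo "$f"; done' >/dev/null
# one grep reading each file in full (with very many files, *.xml may hit the argument-length limit)
time sh -c 'grep -l "Status" *.xml' >/dev/null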


If you show what you're doing right now, maybe we can speed it up.


I've got 30,000 files to process using

xmlFileNames=$(find . -name "*.xml" -exec grep -l "Status" {} \; 2>/dev/null)

The RE is in the header of the XML files. Is there a way to grab just the header for the pattern match instead of reading the whole files?

Try:

xmlFileNames=$(find . -name "*.xml" -exec grep -l "Status" {} '+' 2>/dev/null)

This should put more filenames into a single grep and make it more efficient.
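The difference is in how many grep processes get started; roughly:

find . -name "*.xml" -exec grep -l "Status" {} \;    # one grep per file: about 30,000 invocations
find . -name "*.xml" -exec grep -l "Status" {} +     # many files handed to each grep: only a few invocations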


It didn't work, Corona688...

In what way did it "not work"?

'+' didn't work in my statement. It didn't return any data then.

Please try the following commands:

uname -a
xmlFileNames=$(find . -name "*.xml" -exec grep -Fl "Status" {} '+')
printf "find w/ grep -F exit code: %d\n" $?
xmlFileNames=$(find . -name "*.xml" -exec fgrep -l "Status" {} '+')
printf "find w/ fgrep exit code: %d\n" $?

and show us the output (including any diagnostic messages that were produced).


The error message when I tried this other code is:

find: non-terminated '-exec' argument list
Usage: find directory ... expression
find w/ grep -F exit code: 2
find: non-terminated '-exec' argument list
Usage: find directory ... expression
find w/ fgrep exit code: 2

So, uname -a produced no output???

What operating system are you using?

Now even when I roll back the version of my script it doesn't work... And I can see the data is all there...

---------- Post updated at 11:40 AM ---------- Previous update was at 11:17 AM ----------

I know we're using MKS. Other than that I don't know... I'm new here...

---------- Post updated at 11:50 AM ---------- Previous update was at 11:40 AM ----------

Now I can't even run the find command from the command line...

find: non-terminated '-exec' argument list
Usage: find directory ... expression

How can I get that back?

---------- Post updated at 12:27 PM ---------- Previous update was at 11:50 AM ----------

How can I get back the ability to use the find command please?

---------- Post updated at 12:29 PM ---------- Previous update was at 12:27 PM ----------

It must be a problem with the -exec part....idk?

---------- Post updated at 01:34 PM ---------- Previous update was at 12:29 PM ----------

Why did you put the uname -a in the code you posted back to me?

---------- Post updated at 01:56 PM ---------- Previous update was at 01:34 PM ----------

I'm on a Windows NT server....

He asked for uname -a because that command displays your operating system and version numbers. Something is strange, as the posted find command should work on most systems. What output do you get from that command?

Building on Don Cragun's solution, another slight improvement is to use awk to bail out if the string isn't in the first 8 lines, like this
(the example searches for the string Status):

find . -name "*.xml" -exec awk 'NR>8{exit}/Status/{print FILENAME}' {} '+'

You should be careful with NR>8: since '+' hands many files to a single awk, it will exit after 8 lines of total input and never look at the subsequent files.

EDIT: Not even sure if quitting after FNR>8 would be safe...

I assume that by now you have figured out how to find where the MKS toolkit utilities are on your system again... If not, we probably can't help you.

With a Windows operating system that old and an MKS toolkit that is even older, you probably have a 16-bit shell (which leaves you with a relatively limited address space). With 30,000+ files to process, there is a fair chance that you'll overflow the shell size limit on the contents of a shell variable even if less than 10% of your xml files contain the string you're looking for. Therefore, I'd strongly suggest saving the list of files found in a file instead of in a shell variable.

find . -name "*.xml" -print | xargs fgrep -l Status > StatusXF.txt

Run that and show us any diagnostic output produced. With any luck, StatusXF.txt will contain the list of files you want and won't take forever to run.
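Once StatusXF.txt exists, you can work through it one filename at a time instead of ever loading the whole list into a shell variable; the printf below is only a placeholder for whatever you actually do with each file:

# Process the matching files one at a time; printf stands in for the real work.
while IFS= read -r file; do
  printf 'processing %s\n' "$file"
done < StatusXF.txt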

The GNU awk nextfile statement with FNR>8 works as I intended:

find . -name "*.xml" -exec awk 'FNR>8{nextfile}/Status/{print FILENAME}' {} '+'

Of course, nextfile is a GNU extension, so the less elegant compromise is to use getline:

find . -name "*.xml" -exec awk 'FNR==1{while ($0 !~ "Status" && FNR < 8 && (getline) > 0) continue; if ($0 ~ "Status") print FILENAME}' {} '+'
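Either variant can be combined with the file-based capture suggested above, so the list of names never has to fit in a shell variable (StatusXF.txt is the filename used earlier; the nextfile version needs GNU awk):

# Write the matching filenames to a file instead of a shell variable (needs gawk for nextfile).
find . -name "*.xml" -exec awk 'FNR>8{nextfile}/Status/{print FILENAME}' {} '+' > StatusXF.txt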

How do I get it to work again?

---------- Post updated at 07:20 AM ---------- Previous update was at 07:19 AM ----------

Can I simply restart the server to get my find to work as usual?

My comment is just for that command / pipe. No need to restart the server.

Then what do I need to do, please, in layman's terms?

---------- Post updated at 07:26 AM ---------- Previous update was at 07:25 AM ----------

If I restart the server would that fix my problem? It's a test server.