Performance penalty for case-insensitive grep

I just found out there is a big performance penalty for case-insensitive "grep" on big files.

It would be understandable, except that the penalty seems out of all proportion.

A real example: if I grep for the single letter "V" (or "v"), without "-i", on a big file (doctored so that only a few "V" or "v" characters exist), it takes 0.157 seconds of user time to finish.

Then I grep for the same letter "V", but with the "-i" option, and it takes 32.0 seconds of user time to finish. That is 200 times longer than without "-i", for a single character.
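For reference, the timings can be reproduced with something like this (bigfile is a stand-in name for the doctored file):

time grep V bigfile > /dev/null
time grep -i V bigfile > /dev/null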

Can someone provide some insight into why this is the case?

Thanks.

NB Phil

Probably because it has to convert the case of every line. Try this and see what happens:

grep -e V -e v ...

Is this GNU grep? The GNU utilities, unusually, try to handle your character set as appropriate. This means that when you tell it to be case-insensitive, grep busts out some pretty heavy-duty routines in order to do so properly.


It is GNU grep. Wow, 200 times longer for case-insensitive grep. Those routines sound very heavy.

Is this a known fact?

Is 200x the upper bound of the performance penalty, regardless of the length of the input string? (I'm thinking an n-character string has 2^n case permutations.)

Again, it's the fault of I18N. Set LC_ALL to C, and you will see that the -i run is only twice as long.
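For example, for a single command (bigfile is a stand-in name):

time LC_ALL=C grep -i V bigfile > /dev/null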


This problem has cropped up a few times recently.

The default locale on many recent systems has changed to UTF-8, which affects GNU utility programs including "grep" and "sort". In UTF-8, one "character" can take more than one byte. If your file definitely does not contain multi-byte UTF-8 characters, you can massively improve performance by changing your locale back to the basic value of "C".

To check how your system is currently set, run this enquiry and check the output:

locale

Imagine every possible language UTF-8 supports, including ones where a letter's "case" has strict and complicated rules. All of that is what you're asking grep to check for when doing a case-insensitive search on UTF-8. :)


Right on. Thank you very much.

Here are our locale settings:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

What can I do to set "LC_ALL=C" on a per-command basis, without changing it system-wide?

Also, even with the "LC_ALL=C" setting, would it still take 2^n times longer to grep an n-character string case-insensitively, or does the length of the string not matter that much?

LC_ALL="C" grep parameters ... Or maybe, at the top of your script:

LC_ALL="c"
export LC_ALL

I suspect that depends on the regex, but I don't think the penalty would be anywhere near 2^n: matching is done character by character, so grep never has to enumerate all the case permutations of the string.
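As an illustration, a case-insensitive match for an n-character literal behaves roughly like a pattern of n two-character classes, so the cost grows with n rather than 2^n (file is a stand-in name):

grep '[aA][bB][cC]' file

is roughly the work that "grep -i abc" has to do.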


Is the above a typical locale setting at hardware/software companies in the US? (Our source code and scripts don't contain UTF-8 characters, but I'm not sure about the tools.)

If it is, then what is the typical approach to addressing this case-insensitive search penalty system-wide? (I am assuming there must be a valid reason to use UTF-8 here.)

It's a typical locale setting for fairly recent Linux in general. It solves a lot of interchange problems at one blow while still letting you use programs that demand plain ASCII. Of course it can cause other problems, but is mostly considered a very good thing.

It's not really a widespread penalty for everything. It's not big enough to be noticed for most uses, and the GNU utilities are unusual in obeying UTF-8 as profoundly as they do anyway.

A couple of ways to deal with it:

  1. Optimize your regex: '[vV]' is probably much faster than -i (see the sketch after this list).
  2. Set LC_ALL="C" in the user profiles of those who need it.
  3. Use other programming languages or tools for the job.
  4. Change your whole system's character set (not recommended).
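A quick sketch of the first two options (bigfile is a stand-in name; the profile lines assume a Bourne-style shell):

# 1. A character class instead of -i:
grep '[vV]' bigfile

# 2. A per-user default, e.g. at the end of ~/.profile:
LC_ALL="C"
export LC_ALL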