Performance penalty for case-insensitive grep

I just found out there is a big performance penalty for case-insensitive "grep" on big files.

It would be understandable, except that the penalty seems out of all proportion.

A real example: if I grep for the single letter "V" (or "v"), without "-i", on a big file (doctored so that only a few "V" or "v" characters exist), it takes 0.157 seconds of user time to finish.

Then I grep for the same letter "V", but with the "-i" option, and it takes 32.0 seconds of user time to finish. That is 200 times longer than without "-i", for a single character.
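For reference, the timings can be reproduced with something like this (bigfile is a stand-in name for the doctored file):

time grep V bigfile > /dev/null
time grep -i V bigfile > /dev/null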

Can someone provide some insight into why this is the case?

Thanks.

NB Phil

Probably because it has to convert the case of every line. Try this and see what happens:

grep -e V -e v ...

Is this GNU grep? The GNU utilities, unusually, try to handle your character set as appropriate. This means that when you tell it to be case-insensitive, grep busts out some pretty heavy-duty routines in order to do so properly.


It is GNU grep. Wow, 200 times longer for case-insensitive grep. Those routines sound very heavy.

Is this a known fact?

Is 200x the upper bound of the performance penalty, regardless of the length of the input string? (I'm thinking an n-character string has 2^n case permutations.)

Again, it's the fault of I18N. Set LC_ALL to C, and you will see that the -i run is only twice as long.
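For example, for a single command (bigfile is a stand-in name):

time LC_ALL=C grep -i V bigfile > /dev/null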


This problem has cropped up a few times recently.

The default locale on many recent systems has changed to UTF-8, which affects GNU utility programs including "grep" and "sort". In UTF-8, one "character" can take more than one byte. If your file definitely does not contain multi-byte UTF-8 characters, you can massively improve performance by changing your locale back to the basic value of "C".

To check how your system is currently set, run this enquiry and check the output:

locale

Imagine every possible language UTF-8 supports, including ones where a letter's "case" has strict and complicated rules. All of that is what you're asking grep to check for when doing a case-insensitive search on UTF-8. :)


Right on. Thank you very much.

Here are our locale settings:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

What can I do to set "LC_ALL=C" on a per-command basis, without changing it system-wide?

Also, even with the "LC_ALL=C" setting, would it still take 2^n times longer to grep an n-character string case-insensitively, or does the length of the string not matter that much?

LC_ALL="C" grep parameters ... Or maybe, at the top of your script:

LC_ALL="c"
export LC_ALL

I suspect that depends on the regex, but I don't think the penalty would be anywhere near 2^n: matching is done character by character, so grep never has to enumerate all the case permutations of the string.
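As an illustration, a case-insensitive match for an n-character literal behaves roughly like a pattern of n two-character classes, so the cost grows with n rather than 2^n (file is a stand-in name):

grep '[aA][bB][cC]' file

is roughly the work that "grep -i abc" has to do.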


Is the above a typical locale setting at hardware/software companies in the US? (Our source code and scripts don't contain UTF-8 characters, but I'm not sure about the tools.)

If it is, then what is the typical approach to addressing this case-insensitive search penalty system-wide? (I am assuming there must be a valid reason to use UTF-8 here.)

It's a typical locale setting for fairly recent Linux in general. It solves a lot of interchange problems at one blow while still letting you use programs that demand plain ASCII. Of course it can cause other problems, but is mostly considered a very good thing.

It's not really a widespread penalty for everything. It's not big enough to be noticed for most uses, and the GNU utilities are unusual in obeying UTF-8 as profoundly as they do anyway.

A couple of ways to deal with it:

  1. Optimize your regex: '[vV]' is probably much faster than -i (see the sketch after this list).
  2. Set LC_ALL="C" in the user profiles of those who need it.
  3. Use other programming languages or tools for the job.
  4. Change your whole system's character set (not recommended).
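A quick sketch of the first two options (bigfile is a stand-in name; the profile lines assume a Bourne-style shell):

# 1. A character class instead of -i:
grep '[vV]' bigfile

# 2. A per-user default, e.g. at the end of ~/.profile:
LC_ALL="C"
export LC_ALL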