Grep text matching problem with script which checks if web page contains text.

gencon · June 6, 2013, 3:14pm

I wrote a Bash script which checks to see if a text string exists on a web page and then sends me an email if it does (or does not e.g. "Out of stock"). I run it from my crontab, it's quite handy from time to time and I've been using it for a few years now.

The script uses wget to download a URL and then uses grep to match the text string, which I copy from the original HTML in case it contains markup or newlines (though the latter has never actually occurred).
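For readers who want to follow along, here is a minimal sketch of the kind of check described above. The original script was not posted, so the function name `page_contains`, the URL and the email address are all hypothetical placeholders.

```shell
# Hypothetical helper: succeed if the HTML read from stdin contains the
# literal string given as $1.
page_contains() {
    # -F: treat the search text as a fixed string, not a regex.
    # -q: print nothing, report the result through the exit status only.
    grep -qF -- "$1"
}

# Hypothetical cron usage (URL and address are placeholders):
#   wget -qO- "https://example.com/item" | page_contains "Out of stock" \
#       && echo "Text found" | mail -s "Page check" me@example.com

# Offline demonstration using inline HTML instead of a download.
if printf '<p>Out of stock</p>\n' | page_contains "Out of stock"; then
    echo "found"
else
    echo "not found"
fi
```

Reading the page from stdin keeps the matching step separate from the download, which makes it easier to test against a saved copy of the HTML.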

Today I added a new text string to look for and it was not matched, even though it was present in the HTML. When I copied and pasted the identical-looking text from the downloaded HTML file into the script and tried that, it worked perfectly. [Problem solved in this case but I'd like to fix things properly.]

So the character encoding seems to be the problem. Or so I thought! Grep uses UTF-8, but it turned out the source HTML was UTF-8 as well. What a pain: there is no easy fix from using iconv to convert all downloaded files to UTF-8.

Anyone know what might be happening here and what I need to do to fix this?

Many thanks.

Scrutinizer · June 7, 2013, 12:49am

It might be that there is some special character involved? Have you tried using grep -F ?

Otherwise, could you provide a sample of what goes wrong?

gencon · June 7, 2013, 2:31pm

Thanks for the suggestion but using -F with grep did not work.

However I have diagnosed the problem...

The web page in question (url below) is being searched for the string "Currently out of stock". I had a look at the HTML in a hex editor and discovered that the first space (between 'Currently' and 'out') was not 0x20 but a pair of values: 0xC2 0xA0. A web search revealed that this is known as a non-breaking space, which is used as a typesetting aid (in compatible standards such as HTML) to prevent an automatic line break. For instance, it might be used instead of the space in the string "100 KM" to make certain that "KM" does not get pushed onto the line below by the HTML renderer. The HTML entity is "&nbsp;", thus "100&nbsp;KM" could be used in the HTML.
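The same diagnosis can be made without a hex editor. The sketch below builds a sample line with the two NBSP bytes (written as octal escapes so any POSIX printf can produce them) and inspects it with od. The helper name `has_nbsp` is made up for illustration.

```shell
# The UTF-8 non-breaking space: bytes 0xC2 0xA0 (octal 302 240).
nbsp=$(printf '\302\240')

# Sample text with an NBSP between 'Currently' and 'out'.
sample=$(printf 'Currently\302\240out of stock')

# Dump the bytes in hex; look for "c2 a0" where a plain space would be "20".
printf '%s' "$sample" | od -An -tx1

# Hypothetical helper: succeed if stdin contains the NBSP byte pair.
# LC_ALL=C makes grep compare raw bytes regardless of the caller's locale.
has_nbsp() {
    LC_ALL=C grep -q -- "$nbsp"
}

if printf '%s\n' "$sample" | has_nbsp; then
    echo "non-breaking space present"
fi
```

Running the same check against the downloaded HTML file shows quickly whether an apparently ordinary space is the culprit.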

An imperfect fix involves using '.' (any char match) in my search string. So the following regex works: "Currently.out.of.stock". The single '.' matches the non-breaking space of 0xC2 0xA0 between 'Currently' and 'out'.

However, neither [[:space:]] nor [[:blank:]] matches the non-breaking space.

Non-breaking space on Wikipedia:
Non-breaking space - Wikipedia, the free encyclopedia
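An alternative to loosening the regex is to normalise the page before searching it. This is only a sketch of that idea, not something tested on the page in question: it replaces the UTF-8 NBSP bytes and the literal "&nbsp;" entity with a plain space, after which the ordinary search string matches. The function name `normalize_spaces` is an assumption made here.

```shell
# The UTF-8 non-breaking space: bytes 0xC2 0xA0 (octal 302 240).
nbsp=$(printf '\302\240')

# Hypothetical filter: turn NBSP bytes and "&nbsp;" entities into plain spaces.
# LC_ALL=C makes sed work on raw bytes, so the result does not depend on the
# locale of the calling environment (terminal or cron).
normalize_spaces() {
    LC_ALL=C sed -e "s/$nbsp/ /g" -e 's/&nbsp;/ /g'
}

# Demonstration: both kinds of non-breaking space become ordinary spaces,
# so a fixed-string search for the visible text succeeds.
printf 'Currently\302\240out&nbsp;of stock\n' \
    | normalize_spaces \
    | grep -F 'Currently out of stock'
```

This only handles the UTF-8 form of the character; a page in another encoding would need converting first.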

MadeInGermany · June 7, 2013, 3:27pm

Use [^[:ascii:]] instead!
To match a space or a non-breaking space, try [^[:graph:]] .

gencon · June 9, 2013, 10:00am

Good idea, [[:print:]] works too.

The problem now is that all my tests were run from a terminal; when the script is run from crontab, the regexes stop working (though everything else works). I tried setting my $PATH in both the script and my crontab, setting $SHELL to bash in the crontab, and even using the absolute path to grep in the script. No joy: the regexes still did not work. Finally, in desperation, I pasted my entire 'env' variable list into the top of my crontab, and the grep regex finally worked when the script was called by crontab. Can anyone think which of my 39 environment variables might be the key one making the difference? I can see why $SHELL and $PATH might matter, but on their own they did not fix the issue, so there must be something else as well.

Here's the sorted output of env (I've trimmed the 2 long lines of $LS_COLORS and $PS1):

$ env | sort
COLORTERM=gnome-terminal
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-igzDw7xhFQ,guid=69aae5cf956c630c3ef4d55151b43fb4
DEFAULTS_PATH=/usr/share/gconf/gnome.default.path
DESKTOP_SESSION=gnome
DISPLAY=:0.0
GDM_KEYBOARD_LAYOUT=gb
GDM_LANG=en_GB.UTF-8
GDMSESSION=gnome
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
GNOME_KEYRING_CONTROL=/tmp/keyring-0Mlwje
GNOME_KEYRING_PID=2114
GTK_MODULES=canberra-gtk-module
HOME=/home/ms
LANG=en_GB.UTF-8
LESSCLOSE=/usr/bin/lesspipe %s %s
LESSOPEN=| /usr/bin/lesspipe %s
LOGNAME=ms
LS_COLORS=rs=0:di=01;34:ln=01;...
MANDATORY_PATH=/usr/share/gconf/gnome.mandatory.path
ORBIT_SOCKETDIR=/tmp/orbit-ms
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/ms/Scripts
PS1=\[\033[01;31m\]\h\[\033[00m\]...
PWD=/home/ms
SESSION_MANAGER=local/ubuntupc:@/tmp/.ICE-unix/2132,unix/ubuntupc:/tmp/.ICE-unix/2132
SHELL=/bin/bash
SHLVL=1
SPEECHD_PORT=7560
SSH_AGENT_PID=2166
SSH_AUTH_SOCK=/tmp/keyring-0Mlwje/ssh
TERM=xterm
TMPDIR=/tmp/
USER=ms
USERNAME=ms
_=/usr/bin/env
WINDOWID=67108869
XAUTHORITY=/var/run/gdm/auth-for-ms-TrT4XY/database
XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
XDG_DATA_DIRS=/usr/share/gnome:/usr/local/share/:/usr/share/
XDG_SESSION_COOKIE=caa8612e0ce52df979b3de354c360d7c-1370767284.97021-283724276
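One way to narrow this down, rather than pasting everything, is to dump the environment from inside cron and compare it with the terminal's. The sketch below assumes each variable sits on a single line. The helper name `env_only_in` and the sample values are made up for the demonstration.

```shell
# Hypothetical helper: print the NAME=value lines that appear in the first
# environment dump but not in the second.
env_only_in() {
    a=$(mktemp) b=$(mktemp)
    # comm needs both inputs sorted the same way, so sort in the C locale.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Real use would be along these lines (file names are placeholders):
#   in the crontab:   * * * * * env > /tmp/cron-env.txt
#   in the terminal:  env > /tmp/term-env.txt
#                     env_only_in /tmp/term-env.txt /tmp/cron-env.txt

# Small demonstration with made-up dumps.
term=$(mktemp); cron=$(mktemp)
printf 'HOME=/home/ms\nLANG=en_GB.UTF-8\nSHELL=/bin/bash\n' > "$term"
printf 'HOME=/home/ms\nSHELL=/bin/sh\n' > "$cron"
env_only_in "$term" "$cron"
rm -f "$term" "$cron"
```

The lines printed are the candidates to add to the crontab one at a time until the script behaves.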

Thanks all.

MadeInGermany · June 9, 2013, 10:24am

LANG sets locale, and has an impact on character sets like [a-z] or [[:print:]] .
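To illustrate the effect, here is a small sketch that runs the same grep under different locales. Cron usually starts jobs with a minimal environment and no LANG, which leaves grep in the C locale. There a '.' matches a single byte, so one dot cannot cover the two-byte non-breaking space. The helper name `matches_in_locale` is invented for this example, and the en_GB.UTF-8 result assumes that locale is installed (if it is not, grep falls back to C behaviour).

```shell
# Sample line with a UTF-8 non-breaking space (bytes 0xC2 0xA0, octal 302 240).
sample=$(printf 'Currently\302\240out of stock')

# Hypothetical helper: succeed if grep, run under locale $1, finds regex $2
# in the sample line.
matches_in_locale() {
    printf '%s\n' "$sample" | LC_ALL=$1 grep -q -- "$2"
}

# Report the result of one locale/pattern combination.
report() {
    if matches_in_locale "$1" "$2"; then
        echo "$1: '$2' matches"
    else
        echo "$1: '$2' does not match"
    fi
}

report C 'Currently.out'            # one dot, byte-oriented locale
report C 'Currently..out'           # two dots cover the two bytes
report en_GB.UTF-8 'Currently.out'  # one dot, one multibyte character
```

In practice the fix is to put a line such as LANG=en_GB.UTF-8 at the top of the crontab, or to export it at the start of the script, so grep sees the same locale under cron as in the terminal.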

gencon · June 10, 2013, 12:57pm

That's it!! All is now working. Well done; it never occurred to me that LANG would have an impact on grep.

Thanks so much.