I wrote a Bash script that checks whether a text string exists on a web page and then emails me if it does (or does not, e.g. "Out of stock"). I run it from my crontab; it's quite handy from time to time and I've been using it for a few years now.
The script uses wget to download a URL and then uses grep to match the text string, which I lift from the original HTML in case of markup or newlines (though the latter has never actually occurred).
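For anyone who wants to picture the setup, here is a minimal sketch of that kind of script, not my actual one. The function names, the mail subject and the arguments passed to check_url are placeholders, and it assumes wget and a mailx-style mail command are installed.

```shell
# Succeeds (exit 0) if the file named by $1 contains a match for pattern $2
page_has_text() {
    grep -q -e "$2" -- "$1"
}

# Hypothetical wrapper: url, pattern and recipient are all placeholders
check_url() {
    local url=$1 pattern=$2 recipient=$3
    local page
    page=$(mktemp) || return 1
    # wget -q: quiet, -O: write the download to the given file
    if wget -q -O "$page" "$url" && page_has_text "$page" "$pattern"; then
        echo "Found '$pattern' at $url" | mail -s "Page match" "$recipient"
    fi
    rm -f "$page"
}
```

Nothing runs until check_url is called, e.g. from the line that crontab executes.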
Today I added the text to look for and it did not get matched, even though it was present in the HTML. When I copied and pasted the identical-looking text from the downloaded HTML file into the script and tried that, it worked perfectly. [Problem solved in this case, but I'd like to fix things properly.]
So the character encoding seems to be the problem. Or so I thought! Grep uses UTF-8, but it turned out the source HTML was UTF-8 as well. What a pain: no easy fix by using iconv to convert all downloaded files to UTF-8.
Anyone know what might be happening here and what I need to do to fix this?
Thanks for the suggestion but using -F with grep did not work.
However I have diagnosed the problem...
The web page in question (URL below) is being searched for the string "Currently out of stock". I had a look at the HTML in a hex editor and discovered that the first space (between 'Currently' and 'out') was not 0x20 but a pair of bytes: 0xC2 0xA0. A web search revealed that this is a non-breaking space (0xC2 0xA0 is the UTF-8 encoding of U+00A0), which is used as a typesetting aid (in compatible standards such as HTML) to prevent an automatic line break. For instance, it might be used instead of the space in the string "100 KM" to make certain that "KM" does not get pushed onto the line below by the HTML renderer. The HTML entity is "&nbsp;", thus "100&nbsp;KM" could be used in the HTML.
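The same check can be done without a hex editor. This is a rough sketch using od on a made-up sample line rather than the real page; the variable and function names are only for illustration.

```shell
# Sample line with a UTF-8 non-breaking space (octal 302 240 = 0xC2 0xA0)
sample=$(printf 'Currently\302\240out of stock')

# Dump the bytes in hex: the NBSP appears as "c2 a0", a normal space as "20"
printf '%s' "$sample" | od -An -tx1

# Succeeds if stdin contains the NBSP byte pair; LC_ALL=C compares raw bytes
has_nbsp() {
    LC_ALL=C grep "$(printf '\302\240')" >/dev/null
}
```

Piping the downloaded HTML file into has_nbsp is a quick way to tell whether a page uses these at all.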
An imperfect fix involves using '.' (any char match) in my search string. So the following regex works: "Currently.out.of.stock". The single '.' matches the non-breaking space of 0xC2 0xA0 between 'Currently' and 'out'.
However, neither [[:space:]] nor [[:blank:]] matches the non-breaking space.
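One way around both the '.' workaround and the character classes might be to turn every non-breaking space into an ordinary space before grep sees the text. This is only a sketch of that idea: the function names are invented here, and it assumes the page is UTF-8 so that the non-breaking space is always the byte pair 0xC2 0xA0.

```shell
# The two bytes of a UTF-8 non-breaking space, built with portable octal escapes
nbsp=$(printf '\302\240')

# Replace every NBSP with a plain space; LC_ALL=C makes sed work on raw bytes
normalize_spaces() {
    LC_ALL=C sed "s/$nbsp/ /g"
}

# Succeeds if the normalized text on stdin contains the literal string in $1.
# Output goes to /dev/null (rather than using -q) so grep reads all its input.
contains_text() {
    normalize_spaces | grep -F -- "$1" >/dev/null
}
```

With this, something like `contains_text 'Currently out of stock' < page.html` can use the search string exactly as it appears on screen (page.html being whatever file wget wrote).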
The problem now is that all my tests have been run from a terminal; running the script from crontab stops the regexes from working (though everything else works). I tried setting my $PATH in both the script and my crontab, setting $SHELL to bash in the crontab, and even using the absolute path to grep in the script. No joy: the regexes still do not work. Finally, in desperation, I pasted my entire 'env' variable list into the top of my crontab, and the grep regex finally worked when the script was called by crontab. Can anyone think which of my 39 environment variables might be the key one making the difference? I can see why $SHELL and $PATH might matter, but on their own they did not fix the issue, so there must be something else as well.
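One candidate worth testing is the locale (LANG, LC_ALL or LC_CTYPE). Cron usually starts jobs with a very small environment, and in a byte-oriented locale such as C a single '.' can only cover one of the two bytes of the non-breaking space. The sketch below tries the same regex under a chosen locale; the function name is made up, and en_US.UTF-8 is only an assumed locale name (`locale -a` lists the ones actually installed).

```shell
# Sample line with a UTF-8 non-breaking space between 'Currently' and 'out'
sample=$(printf 'Currently\302\240out of stock')

# Print how many lines match the regex when grep runs under locale $1.
# "|| true" hides grep's non-zero exit status when the count is 0.
count_matches() {
    printf '%s\n' "$sample" | LC_ALL="$1" grep -c 'Currently.out.of.stock' || true
}

# Byte-oriented locale: '.' covers one byte, so the two-byte NBSP cannot match
count_matches C

# Assumed UTF-8 locale name; uncomment after checking `locale -a`
# count_matches en_US.UTF-8
```

If the count is 0 under C and 1 under the terminal's own locale, then copying just that locale setting into the top of the crontab (or exporting it at the top of the script) should be enough, instead of all 39 variables.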
Here's the sorted output of env (I've trimmed the 2 long lines of $LS_COLORS and $PS1):