Regex to identify illegal characters in a Perso-Arabic database

gimley · August 25, 2017, 11:01pm

I am working on Sindhi, a language written in the Perso-Arabic script. Since it shares its Unicode block with over 400 other languages, the database quite often contains characters which are not wanted: illegal characters.
I have identified the character set of Sindhi which is given below:
For clarity's sake, each character is demarcated by an apostrophe followed by a comma

['','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','']

I wrote a regex in Unix to identify all words where the Sindhi character set does not exist.

^[^]+$

The syntax being: find all strings where the given characters are not found.
However, the regex does not work. What went wrong? Do I need to put a comma after every character?
I am giving below a small sample database where there are words having both legal and illegal characters





















*





-

Any help given would be greatly appreciated. Many thanks in advance

jim_mcnamara · August 25, 2017, 11:44pm

Try egrep or an equivalent (extended regex or PCRE) and use alternation. I am not going to change locale here, so the example uses ASCII letters:

if [  grep -e  '(a|b|c|d|e|f)' $search_string ] ; then
  echo 'bad character'
fi

Here a, b, c, d, e and f each stand for a bad character, so the presence of any of them is a bad-character error.

Don_Cragun · August 26, 2017, 1:44am

Hi gimley,
Your problem statement is not clear to me. Are you trying to identify: a file that contains only Sindhi characters (plus <newline> characters); a file that contains one or more characters that are neither Sindhi characters nor <newline> characters; a list of lines that contain only Sindhi characters (plus <newline> characters); a list of lines that contain one or more characters that are neither Sindhi characters nor <newline> characters; a list of words that contain only Sindhi characters; or a list of words that contain one or more characters that are not Sindhi characters?

Hi Jim,
The standard way to invoke grep with an ERE is:

grep -E 'extended_regular_expression' file_pathname

not:

grep -e 'extended_regular_expression' string_to_be_matched
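
If the text to check is held in a shell variable rather than in a file, one way to apply the corrected invocation is to pipe the string into grep. This is only a minimal sketch: a-f stand in for the bad characters as in Jim's example, and the function name has_bad_char and the sample string are hypothetical.

```shell
# has_bad_char STRING
# Succeeds (exit status 0) if STRING contains any of the stand-in
# "bad" characters a-f. grep -E selects extended regular expressions;
# -q suppresses output so that only the exit status is used.
has_bad_char() {
  printf '%s\n' "$1" | grep -Eq '(a|b|c|d|e|f)'
}

search_string='xyzb'   # hypothetical sample input
if has_bad_char "$search_string"; then
  echo 'bad character'
fi
```

The if statement tests the exit status of grep directly, so no [ ... ] brackets are needed around the command.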

RudiC · August 26, 2017, 3:58am

Wildly guessing at what you're after, regretting that there's no output sample to test against, and hoping that UTF-8 will cover all your needs and will be correctly handled by the tools, I came up with

echo "[^'','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','']" | tr -d "'," | grep -f- file

, which greatly reduces the file's line count (i.e. identifies lines with non-Sindhi chars, lets all-Sindhi lines pass), as does

grep "[^]" file

Is that close to what you need?

If there are "illegal" (why illegal, by the way?) characters in the DB, someone must have put them in. Mayhap an upfront discriminator would be helpful here?

MadeInGermany · August 26, 2017, 4:32am

The + has its special meaning only in an ERE, that is, with grep -E or egrep.
Also, you have one ^ negation too many.

egrep '^[]+$'

gimley · August 27, 2017, 11:46am

Sorry for the delay in responding; I was unwell. I would like to thank all who took the trouble to answer my query.
The issue is that the Arabic code block is huge and has characters that look alike. Very often data entry operators use one or more characters which are not part of the character set of a language (in this case Sindhi), and these are what we term illegal. When these invalid/illegal characters are part of a dictionary, the results are disastrous, especially in storage and natural language processing.
This is the reason for my query. I tried the solutions provided and they all work, and I am really thankful to all of you for your help.
I needed a simple regex because my text editors, UltraEdit and Notepad++, both support Perl and Unix regexes, and instead of "grepping" the strings, a macro based on a regex would help me identify all such invalid strings. I am still curious why my regex did not work. Any light on this would really help.
Many thanks once again.

MadeInGermany · August 27, 2017, 12:29pm

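As I said, if a-z are the allowed characters then you need
^[a-z]+$ to ensure that all characters are allowed (or check that [^a-z] does not match, which ensures that there is no illegal character),
but not ^[^a-z]+$, which ensures that there are only illegal characters.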
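
The difference between these patterns can be tried out on a few test words. The following is a rough sketch only: a-z stands in for the legal character set as above, the function name classify and the sample words are hypothetical, and grep -E is used because + is a repetition operator only in an ERE.

```shell
# classify WORD
# Reports whether WORD consists only of legal characters (a-z as a
# stand-in), only of illegal characters, or a mixture of both.
classify() {
  if printf '%s\n' "$1" | grep -Eq '^[a-z]+$'; then
    echo 'all legal'
  elif printf '%s\n' "$1" | grep -Eq '^[^a-z]+$'; then
    echo 'only illegal'
  else
    echo 'mixed'
  fi
}

classify 'abc'    # every character is in a-z
classify '123'    # no character is in a-z
classify 'ab1c'   # legal and illegal characters mixed
```

Only the mixed case is missed by ^[^a-z]+$, and that is the case a data-entry check usually cares about most.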

Don_Cragun · August 27, 2017, 5:23pm

To bring what MadeInGermany said directly into your problem statement...

If the following characters are the only legal characters on a line written in Sindhi:

(note that there are no punctuation characters and no <space> or <tab> characters), then the basic regular expression (abbreviated BRE):

^[^]+$

will match a line that contains one or more non-Sindhi characters and contains no Sindhi characters.

If you want to find a non-Sindhi character, you just want the BRE:

[^]

If you want to find a line that contains one or more non-Sindhi characters, you could use the BRE:

^.*[^].*$

If you want to find a line that just contains one or more Sindhi characters, you could use the BRE:

^[]+$
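
As a rough illustration of the "line contains a non-legal character" case, the bracket expression can be handed to grep directly; for selecting lines, the unanchored form picks the same lines as the ^.*[^...].*$ form. This sketch makes several assumptions: a-z stands in for the legal character set, the word list in the variable words is hypothetical, and the -o option is an extension found in GNU and BSD grep rather than a POSIX requirement.

```shell
# Hypothetical sample word list, one word per line; a-z stands in
# for the legal character set.
words='abc
ab1c
xyz
q-r*'

# Lines containing at least one illegal character, with line numbers.
printf '%s\n' "$words" | grep -n '[^a-z]'

# The offending characters themselves, one per line, duplicates removed.
printf '%s\n' "$words" | grep -o '[^a-z]' | sort -u
```

The second command can help when building a list of the stray characters that actually occur in the data.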

If you have a list of non-Sindhi characters that are incorrectly typed into a file and the corresponding Sindhi character that should have been used instead, you might want to look at the tr utility instead of trying to use an editor to manually make all of the changes.
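
A minimal sketch of that idea follows, using ASCII stand-ins and a hypothetical mapping held in the variables wrong and right. One caveat, stated as an assumption to verify on your own system: some tr implementations (GNU tr, as far as I know) work on bytes rather than characters, so a small test with real multibyte UTF-8 data is advisable before running it on the whole database.

```shell
# Hypothetical mapping: each character in "wrong" is replaced by the
# character at the same position in "right".
wrong='01'
right='ol'

# Translate the stand-in illegal characters in a sample line.
printf '%s\n' 'hell0 w0r1d' | tr "$wrong" "$right"
```

Both strings must list the characters in the same order, since tr pairs them by position.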

gimley · August 27, 2017, 8:26pm

Many thanks for your kind reply and your detailed solutions. I tested the regular expressions which you provided and they work perfectly.