ksh check for non printable characters in a string

Hi All,

I am trying to find non-printable characters in a string. The string could contain alphanumerics, punctuation and characters like (*&%$#.') but not non-printable characters (or that is what I think they are called), which are introduced when you copy text from DOS to a Unix box.

Input string1:

 TEXT="This is a sample text with supposedly non-printable character^Y."

Input string2:

TEXT="This is a sample text with supposedly non-printable character"

I found some code for bash on a website, but it is not working.

if ! [[ "$TEXT" =~ ^[a-zA-Z0-9]+$ ]];then echo "invalid"; fi

Note: I've removed the dot also "." from "Input string2" just to check if the command works or not.

This prints "invalid" in both cases?! Also, is there an equivalent command that works in ksh? Please suggest.

-dips

Your pattern is treating <space> as a non-printable character and, since it is present in both strings, you are getting invalid for both.

For a more portable test that should work with any POSIX conforming shell, try:

if [ "${TEXT#*[![:print:]]}" = "$TEXT" ];then echo 'no non-printables found';else echo 'non-printable found';fi
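If you want to try this against a string that really contains a control character (rather than the two characters ^ and Y), here is a minimal sketch. The function name has_nonprintable is just something I made up for illustration, and printf '\031' is used to produce a real <CTRL>Y byte:

```shell
# has_nonprintable is a hypothetical helper name, not a standard utility.
# It succeeds (returns 0) when stripping "everything up to and including
# the first non-printable character" changes the string.
has_nonprintable() {
    [ "${1#*[![:print:]]}" != "$1" ]
}

clean='This is a sample text.'
dirty="character$(printf '\031')."   # \031 is octal for 0x19, <CTRL>Y

if has_nonprintable "$clean"; then echo 'clean: non-printable found'; else echo 'clean: no non-printables found'; fi
if has_nonprintable "$dirty"; then echo 'dirty: non-printable found'; else echo 'dirty: no non-printables found'; fi
```

Note that <space> counts as printable in [:print:], so the first string passes.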

Hi Don,

Thanks for telling me that <space> is being treated as a non-printable character. But I actually want this search to look for all non-UTF-8 characters, and I don't have any inkling of how to check for those.

The XML file of the application takes only UTF-8 characters and anything other than this will not let the jobs run through this application. Hence is there any way to check for UTF-8 characters? Can you please suggest?

For e.g.

TEXT="This is a sample text with supposedly non-printable character^Y."

The highlighted character shown in the file is what I've in my application which when seen in unix appears to be ^Y. How to identify such characters?

-dips

If what you show as ^Y represents <CTRL> Y (0x19, "EM"), it is a member of the ASCII character set, which in turn is a subset of UTF-8. Although there exist byte sequences that are not valid UTF-8 characters, they should not show up in text or HTML files unless created by a failed transmission or conversion.
Please show us a hexdump (od -ctx1 file) of your problematic file.
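If you really want to test for byte sequences that are not valid UTF-8, one common approach is to let iconv try to decode the string. This is only a sketch: it assumes an iconv utility is installed that exits non-zero on an invalid input sequence, and is_valid_utf8 is a name I made up for illustration:

```shell
# is_valid_utf8 is a hypothetical helper name, not from this thread.
# It succeeds when iconv can decode its argument as UTF-8 without error.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

ctrl_y=$(printf '\031')   # a real <CTRL>Y byte (0x19)
bad=$(printf '\377')      # byte 0xFF never appears in valid UTF-8

if is_valid_utf8 "text with ${ctrl_y} inside"; then echo 'CTRL-Y string: valid UTF-8'; fi
if ! is_valid_utf8 "text with ${bad} inside"; then echo '0xFF string: not valid UTF-8'; fi
```

The first string passes, which illustrates the point above: <CTRL> Y is perfectly valid UTF-8, so a UTF-8 validity check alone will not catch it.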

This works in ksh93 or bash3:

if [[ $TEXT =~ [^[:print:]] ]] ; then echo invalid; fi

Hi,

I will not be able to convert it to a hex file.

But in the simplest manner, can I check for only alphanumeric characters plus a few punctuation marks which I know will get passed?

if [ "${TEXT#*[![:alnum:]][.,;:'"/\()-_+=~@&*]}" = "$TEXT" ];then echo 'no non-printables found';else echo 'non-printable found';fi

This is clearly not working. Can you please help me?
-dips

You have the syntax off a little bit, but that is close (and I assume you don't want <space> to cause a "non-printable found" either). Try:

if [ "${TEXT#*[![:alnum:] .,;:'"/\()_+=~@&*-]}" = "$TEXT" ];then echo 'no non-printables found';else echo 'non-printable found';fi

Note that a space was added, one pair of square brackets was removed, and the minus sign was moved to the end of the non-matching bracket expression element list.

And, yes you can convert your string to hex:

printf '%s' "$TEXT" | od -t co1x1

will display your string as characters, octal bytes, and hex bytes. (If your version of od doesn't have a -t option, just use:

printf '%s' "$TEXT" | od -cb

to get character and octal byte output.)
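As a small illustration of what to look for in the dump, the sketch below builds a test string with a real <CTRL>Y byte (using printf '\031', which is my assumption about what your ^Y stands for) and prints only the hex bytes:

```shell
# Build a string containing a real 0x19 byte, then dump it as hex bytes.
# -An suppresses the offset column; -t x1 prints one hex byte per value.
TEXT="character$(printf '\031')."
printf '%s' "$TEXT" | od -An -t x1
```

In the output, the control character shows up as the byte 19, between 72 (the final r) and 2e (the period). If your dump instead shows 5e 59, the string contains the two ordinary characters ^ and Y.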

Hi Don,

I think there is some problem with this syntax.

test.ksh

if [ "${TEXT#*[![:alnum:] .,;:'"/\()_+=~@&*-]}" = "$TEXT" ];then echo 'no non-printables found';else echo 'non-printable found';fi
 
+ TEXT='This is a sample text with supposedly non-printable character^Y.'
./test.ksh: line 5: syntax error at line 47: `'' unmatched

Then I introduced a \ before '

if [ "${TEXT#*[![:alnum:] .,;:\'"/\()_+=~@&*-]}" = "$TEXT" ];then echo 'no non-printables found';else echo 'non-printable found';fi

then the error was -

+ TEXT='This is a sample text with supposedly non-printable character^Y.'
./test.ksh: line 5: syntax error at line 47: `{' unmatched

Then I introduced a \ before "

if [ "${TEXT#*[![:alnum:] .,;:\'\"/\()_+=~@&*-]}" = "$TEXT" ];then echo 'no non-printables found';else echo 'non-printable found';fi

error was -

 
+ TEXT='This is a sample text with supposedly non-printable character^Y.'
./test.ksh: line 5: syntax error at line 47: `)' unexpected

Then finally I introduced \ before all the shell special characters -

 
echo ${TEXT#*[![:alnum:] .,;:\'\"/\(\)\_+=~@&\*-]}

which resulted in

Y.

But I think that's wrong because it should have resulted in ^, as that's the only punctuation mark not included in the list?!

-dips

I apologize for not trying this out before I posted it.

In a BRE or an ERE special RE characters lose their special meaning when inside a bracket expression, but that is not true in a shell pattern matching expression. I'm glad you were able to figure out what was needed to make it work for you. Even here, the underscore and the asterisk do not need to be escaped.

The output from:

echo ${TEXT#*[![:alnum:] .,;:\'\"/\(\)_+=~@&*-]}

is correct. The * matched (and discarded):

This is a sample text with supposedly non-printable character

and the:

[![:alnum:] .,;:\'\"/\(\)\_+=~@&\*-]

matched and discarded the ^ just leaving

Y.

in that expansion of $TEXT . The whole point of that expansion is to find and remove one character that is not in the set of characters that you are declaring to be acceptable (the list inside the non-matching bracket expression), along with everything before it. The if statement comparing the original string with the result of that expansion finds them equal if and only if there are no unacceptable characters in the string.
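To see the two steps separately, here is a minimal sketch. It keeps the pattern in a variable (pat is just an illustrative name) so that the quote characters only have to be escaped once; I have also assumed that a literal backslash is not meant to be in the allowed list:

```shell
# Pattern: anything (*), then ONE character that is not in the allowed set.
# The '\'' sequence puts a literal single quote into the string.
pat='*[![:alnum:] .,;:'\''"/()_+=~@&*-]'

TEXT='This is a sample text with supposedly non-printable character^Y.'

# Remove the shortest leading match of the pattern.
rest=${TEXT#$pat}
printf '%s\n' "$rest"    # prints: Y.

# If nothing was removed, every character was in the allowed set.
if [ "$rest" = "$TEXT" ]; then echo 'no non-printables found'; else echo 'non-printable found'; fi
```

Because $pat is not quoted inside the expansion, the shell treats its contents as a pattern rather than as a literal string.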

Thank you so much Don for explaining in detail! Despite that I've one more doubt (please bear with me!)

But Y is a letter, so wouldn't [:alnum:] match that? And a dot . is already present in the list of allowable punctuation?

-dips

The first character in the bracket expression ( [![:alnum:] .,;:\'\"/\(\)_+=~@&*-] ) is ! so this is a NON-matching bracket expression. This bracket expression matches any single character that is NOT alphanumeric, NOT a <space>, NOT a <period>, NOT a <comma>, NOT a <semicolon>, NOT a <colon>, NOT a <single-quote>, NOT a <double-quote>, NOT a <slash>, NOT an <open-parenthesis>, NOT a <closing-parenthesis>, NOT an <underscore>, NOT a <plus_sign>, NOT an <equal-sign>, NOT a <tilde>, NOT an <at-sign>, NOT an <ampersand>, NOT an <asterisk>, and NOT a <hyphen-dash> (in this case it matches the circumflex). So, if there is a string of characters starting with any zero or more characters followed by one character in that non-matching expression, the ${var#expression} will expand to the contents of the variable var with the string up to and including the first character that matches the non-matching expression removed.

If there aren't any characters in the variable that match the non-matching expression, there is no match for the entire expression; so the variable is expanded without removing anything. And, if ${var#expression} expands to the same thing as $var , we know that no character was found in $var that you consider to be non-printable.
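If the quoting inside the test gets in the way, the same check can be sketched with a case statement and the pattern held in a variable. The names disallowed_pat and is_clean are made up for illustration, and I have assumed a literal backslash is not in the allowed list. Keep in mind that what [:alnum:] matches depends on the locale, so you may want to set LC_ALL=C if only ASCII letters and digits should pass:

```shell
# Matches any string that contains at least one character outside the
# allowed set (alphanumerics, space, and the listed punctuation).
disallowed_pat='*[![:alnum:] .,;:'\''"/()_+=~@&*-]*'

# is_clean is a hypothetical helper: it succeeds when no disallowed
# character is present in its argument.
is_clean() {
    case $1 in
        $disallowed_pat) return 1 ;;
        *) return 0 ;;
    esac
}

TEXT='This is a sample text with supposedly non-printable character^Y.'
if is_clean "$TEXT"; then echo 'no non-printables found'; else echo 'non-printable found'; fi
```

With the sample string this reports a non-printable, because the circumflex is not in the allowed list.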
