ksh hidden characters in variables

Hi. I'm getting the following hidden characters

\u[2013]

at the start of a string after I pass in variables from the command line. I only noticed this when I set -x in my KSH script. Can anybody tell me how this happens and how to remove them?
Many thanks.

+ STR=$'\u[2013]username testuser1'
+ print $'STR Variable: \u[2013]username testuser1'
STR Variable: –username testuser1

I think it may be a UTF-16 character.

What is the output of the following command on your command line:

locale

Looks like it's UTF-8. Any idea how I can get rid of this hidden character?

+ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

How did you call the script? What exactly is in the script?
Did you mayhap use a strange dash character?

U+2013 EN DASH

System	Representation
Nº	8211
UTF-8	E2 80 93
UTF-16	20 13
UTF-32	00 00 20 13
URL-Quoted	%E2%80%93
HTML-Escape	&#8211;
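
As a cross-check on the UTF-8 row above, here is a minimal sketch of how the three bytes follow from the code point. It assumes a POSIX-style shell with hex constants in arithmetic expansion; utf8_3byte is a made-up helper name for illustration, and it only covers code points from U+0800 to U+FFFF.

```shell
# Hypothetical helper: encode one code point (U+0800..U+FFFF) as 3 UTF-8 bytes.
# Layout: 1110xxxx 10xxxxxx 10xxxxxx
utf8_3byte() {
    cp=$(($1))                            # accepts decimal or 0x-prefixed hex
    b1=$(( 0xE0 | (cp >> 12) ))           # top 4 bits of the code point
    b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))   # middle 6 bits
    b3=$(( 0x80 | (cp & 0x3F) ))          # low 6 bits
    printf '%02X %02X %02X\n' "$b1" "$b2" "$b3"
}

utf8_3byte 0x2013   # prints: E2 80 93
```

E2 80 93 in hex is 342 200 223 in octal, which is the form od -b prints.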

Yes indeed there is some old code preceding this that reads in the params from the command line:

while [ -n "$1" ]; do
        case $1 in
                -username)       if [ "`echo $2 | grep -e '^-[a-z]'`" ]; then { echo "missing value for '$1' (seen '$2')"; usage; exit 1; } else { shift; USERNAME=$1;   } fi ;;
                -surname)    if [ "`echo $2 | grep -e '^-[a-z]'`" ]; then { echo "missing value for '$1' (seen '$2')"; usage; exit 1; } else { shift; SURNAME=$1;     } fi ;;
                -address)    if [ "`echo $2 | grep -e '^-[a-z]'`" ]; then { ADDRESS=NO; }    else { shift; ADDRESS=$1;     } fi ;;
                -startdate)  if [ "`echo $2 | grep -e '^-[a-z]'`" ]; then { STARTDATE=NO; UNTIL=NO;}   else { shift; STARTDATE=$1; UNTIL=$1;  } fi ;;
                -req)        if [ "`echo $2 | grep -e '^-[a-z]'`" ]; then { REQ=Y; }    else { shift; REQ=$1;    } fi ;;
                -opt)
                        if [ -n "$2" ]; then
                                shift; OPT=$*
                                break
                        else
                                OPT=NOTSET
                        fi
                        ;;
                *)             echo "Invalid argument '$1'"; exit 1;;
        esac
        shift
done

I just want to parse the OPT variable after this.

These hidden characters are added on the line:

shift; OPT=$*

If the characters are stored in OPT by the command OPT=$*, then those characters were typed on the command line when your script was invoked.

One would not expect this if your script was invoked by typing commands into a shell.

One would not expect this if your script was invoked by a shell script edited with ed, ex, vi or another common UNIX system text editor.

One would expect this if your script was invoked by a shell script edited with a text editor designed to produce pretty-printed text (such as Microsoft Word or Apple Pages).

I second Don Cragun in his third expectation; that's why I asked how you called the script (which unfortunately you did not answer). It looks like your parent script scrambles character (set)s as the case statement evaluates and recognizes -OPT correctly but then obviously fails when providing the option values themselves.

It's caused when the first character in the options field is a dash. I'm calling the script from the command line. The script was edited using vi. Here's the full output with debug when I use '-options -debug':

./test.ksh -username user1 -surname user2 -address abc123 -startdate 123 -req 4 -options –debug
+ print 'INFO - Process arguments ...'
INFO - Process arguments ...
+ [ -n -username ]
+ grep -e '^-[a-z]'
+ echo user1
+ [ '' ]
+ shift
+ SourceTarget=user1
+ shift
+ [ -n -surname ]
+ grep -e '^-[a-z]'
+ echo user2
+ [ '' ]
+ shift
+ DestTarget=user2
+ shift
+ [ -n -address ]
+ grep -e '^-[a-z]'
+ echo abc123
+ [ '' ]
+ shift
+ ADDRESS=abc123
+ shift
+ [ -n -startdate ]
+ grep -e '^-[a-z]'
+ echo 123
+ [ '' ]
+ shift
+ STARTDATE=123
+ UNTIL=123
+ shift
+ [ -n -req ]
+ grep -e '^-[a-z]'
+ echo 4
+ [ '' ]
+ shift
+ REQ=4
+ shift
+ [ -n -options ]
+ [ -n $'\u[2013]debug' ]
+ shift
+ OPT=$'\u[2013]debug'
+ break
+ print 'INFO - Processed inputted arguments...'
INFO - Processed inputted arguments...
+ STR=$'\u[2013]debug'
+ print $'STR Options field: \u[2013]debug'
STR Options field: –debug
+ exit 0

And now if I run the same command but remove the dash from before 'debug' in the options field:

./test.ksh -username user1 -surname user2 -address abc123 -startdate 123 -req 4 -options debug 
+ print 'INFO - Process arguments ...'
INFO - Process arguments ...
+ [ -n -username ]
+ grep -e '^-[a-z]'
+ echo user1
+ [ '' ]
+ shift
+ SourceTarget=user1
+ shift
+ [ -n -surname ]
+ grep -e '^-[a-z]'
+ echo user2
+ [ '' ]
+ shift
+ DestTarget=user2
+ shift
+ [ -n -address ]
+ grep -e '^-[a-z]'
+ echo abc123
+ [ '' ]
+ shift
+ ADDRESS=abc123
+ shift
+ [ -n -startdate ]
+ grep -e '^-[a-z]'
+ echo 123
+ [ '' ]
+ shift
+ STARTDATE=123
+ UNTIL=123
+ shift
+ [ -n -req ]
+ grep -e '^-[a-z]'
+ echo 4
+ [ '' ]
+ shift
+ REQ=4
+ shift
+ [ -n -options ]
+ [ -n debug ]
+ shift
+ OPT=debug
+ break
+ print 'INFO - Processed inputted arguments...'
INFO - Processed inputted arguments...
+ STR=debug
+ print 'STR Options field: debug'
STR Options field: debug
+ exit 0

Let us take a close look at the command line you typed into your shell:

./test.ksh -username user1 -surname user2 -address abc123 -startdate 123 -req 4 -options –debug

If you look closely, you'll see that the character before "username", "surname", "address", "startdate", "req", and "options" is a <hyphen> or <minus-sign>, but the character before "debug" is wider. It is what Unicode calls an <en-dash>. If we feed that line through od and look at it as octal bytes and characters we can easily see the difference:

echo "./test.ksh -username user1 -surname user2 -address abc123 -startdate 123 -req 4 -options –debug" | od -bc

which produces the output:

0000000   056 057 164 145 163 164 056 153 163 150 040 055 165 163 145 162
           .   /   t   e   s   t   .   k   s   h       -   u   s   e   r
0000020   156 141 155 145 040 165 163 145 162 061 040 055 163 165 162 156
           n   a   m   e       u   s   e   r   1       -   s   u   r   n
0000040   141 155 145 040 165 163 145 162 062 040 055 141 144 144 162 145
           a   m   e       u   s   e   r   2       -   a   d   d   r   e
0000060   163 163 040 141 142 143 061 062 063 040 055 163 164 141 162 164
           s   s       a   b   c   1   2   3       -   s   t   a   r   t
0000100   144 141 164 145 040 061 062 063 040 055 162 145 161 040 064 040
           d   a   t   e       1   2   3       -   r   e   q       4    
0000120   055 157 160 164 151 157 156 163 040 342 200 223 144 145 142 165
           -   o   p   t   i   o   n   s       –  **  **   d   e   b   u
0000140   147 012                                                        
           g  \n                                                        
0000142

Compare the last hyphen and the en-dash in the od output (both were marked in red in the original post). Each of the hyphens is a single byte with octal value 055 while the en-dash is three bytes with the octal values 342, 200, and 223, respectively.
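
If you only want to compare two characters rather than read a whole od dump, something like the following sketch may be easier. octal_bytes is a made-up helper name; it assumes an od that accepts the POSIX -A n and -t o1 options, and it builds the en-dash from its octal bytes so that nothing depends on how this post is copied.

```shell
# Hypothetical helper: print the octal value of every byte in a string.
octal_bytes() {
    # The unquoted substitution is deliberate: word splitting drops od's padding.
    set -- $(printf '%s' "$1" | od -An -to1)
    printf '%s\n' "$*"
}

endash=$(printf '\342\200\223')   # U+2013 as UTF-8 bytes

octal_bytes "-"         # prints: 055
octal_bytes "$endash"   # prints: 342 200 223
```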

When you give ksh an en-dash as an input character, it will give it back to you as an en-dash (and not convert it to a hyphen). If you want the string -debug with a hyphen to be assigned to OPT, give ksh a command line argument containing -debug with a hyphen, not –debug with an en-dash.
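
Typing a real hyphen is the proper fix. If you would still like the script to tolerate a stray en-dash, a minimal sketch along these lines could be applied to OPT after the argument loop. normalize_dashes is a made-up name, and whether silently rewriting user input is acceptable is a judgment call for your script.

```shell
endash=$(printf '\342\200\223')   # U+2013 as UTF-8 bytes

# Hypothetical helper: replace every en-dash in a string with an ASCII hyphen.
normalize_dashes() {
    printf '%s\n' "$1" | sed "s/$endash/-/g"
}

# Stand-in for the value the argument loop stored in OPT.
OPT="${endash}debug"
OPT=$(normalize_dashes "$OPT")
printf '%s\n' "$OPT"   # prints: -debug
```

Strings that already contain only hyphens pass through unchanged.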

That's brilliant. I hadn't seen the en-dash but it explains everything perfectly. Thank you so much for clearing this up.

There remains the question of how you managed to create an en-dash in vi or on the command line.

A common cause is copy/pasting from a web page or a document where a word processor has replaced two consecutive minus-hyphen characters with a wide single one (short dash).
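
Since pasted text is the usual source, one way to catch it early is to have the script warn about any argument that contains bytes outside printable ASCII. This is only a sketch: has_non_ascii is a made-up name, and the check also flags tabs and other control characters, which may or may not be what you want.

```shell
# Hypothetical helper: succeed if the string has any byte outside 0x20..0x7E.
has_non_ascii() {
    # Delete all printable ASCII bytes; anything left over is suspicious.
    rest=$(printf '%s' "$1" | LC_ALL=C tr -d ' -~')
    [ -n "$rest" ]
}

# Warn about each suspicious command line argument before parsing it.
for arg in "$@"; do
    if has_non_ascii "$arg"; then
        printf 'warning: non-ASCII or control character in argument: %s\n' "$arg" >&2
    fi
done
```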

This is done by default at least with MS Word 2010.

I can only imagine I somehow pressed shift alt (on a mac) along with - when typing that in. It was the only place it occurred. It's great to now have an approach for working this out. Thanks guys.