Windows to DOS file name conversion x'ad'

We are running a Java client-server application on Solaris 10. External users from around the country attach Windows files through a client, and these files are stored on a Unix server. Recently I've started getting files that have a hex value of ad in their names. This causes a tar command to fail in an archive process that usually runs a couple of months after the file arrives.

I've searched all over and have not been able to find anyone else with this issue. The x'ad' value is in the file's actual name, not inside the file. Does anyone know how my users are doing this to me?

It may be part of a multibyte UTF-8 character sequence. This glyph has that in the middle, for example. What bytes come before and after it?

I'm not sure why this would cause it to fail. 0xad may be unusual in a filename but not really verboten...

tar throws a UTF-8 conversion error and fails because we have the e option specified. Here's one of the file names:

AMCINH�*_DataSheet_Confidential_Data_Highly_Restricted.xls

And here's the hex dump:

0000000 414d 4349 4e48 c2ad 5f44 6174 6153 6865
0000020 6574 5f43 6f6e 6669 6465 6e74 6961 6c5f
0000040 4461 7461 5f48 6967 686c 795f 5265 7374
0000060 7269 6374 6564 2e78 6c73 0a00

The character sequence in its entirety is c2 ad, which is the valid UTF-8 encoding of U+00AD, 'soft hyphen'. I think it's a fancy formatting-hint character suggesting where to split a word when it must be line-wrapped. That is of course meaningless to a tape archive, and it probably doesn't have much business being in file names in the first place.
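To double-check that reading, here is a minimal Python sketch that decodes the bytes from the hex dump above as UTF-8 and lists every non-ASCII character it finds. It assumes the dump shows the bytes in file order and that the trailing 0a 00 is just a newline plus od padding, so those two bytes are left out.

```python
import unicodedata

# Bytes of the file name as shown in the hex dump, minus the trailing
# newline (0a) and the padding byte (00).
HEX_DUMP = (
    "414d 4349 4e48 c2ad 5f44 6174 6153 6865 "
    "6574 5f43 6f6e 6669 6465 6e74 6961 6c5f "
    "4461 7461 5f48 6967 686c 795f 5265 7374 "
    "7269 6374 6564 2e78 6c73"
)

raw = bytes.fromhex(HEX_DUMP)
name = raw.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not valid UTF-8

# Report position, code point and Unicode name of each non-ASCII character.
non_ascii = [
    (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
    for index, ch in enumerate(name)
    if ord(ch) > 127
]
print(non_ascii)

# The same name with the soft hyphen stripped out.
cleaned = name.replace("\u00ad", "")
print(cleaned)
```

The only non-ASCII character reported is the soft hyphen, sitting right after AMCINH.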

Well, that's interesting. I have never looked into this whole Unicode thing before. So it encodes most things using one byte and the standard ASCII characters, but it can also use two bytes to encode another 100,000 or so symbols. I guess that ought to last them a while.

I take it various displays can't really handle them, as they print differently when I cut and paste them from Unix to Windows, and in different applications under each system.

I noticed that Vista has a lot of new locales defined. I have around 1000 clients out there. So far four of them have attached files with this string. I don't normally deal with them directly, but I did email two of them to see what operating system they are using. I'm waiting for their replies. I bet they have Vista.
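If it helps to find out which stored files are affected before the archive job trips over them, something along these lines could walk the upload directory and flag suspect names. This is only a sketch: the function names `flag_name` and `scan_tree` are made up for illustration, and it assumes the names on disk are meant to be UTF-8.

```python
import os
import unicodedata


def flag_name(raw_name: bytes) -> list:
    """Return a list of reasons why a raw file name looks suspicious (empty if none)."""
    try:
        name = raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return ["invalid UTF-8"]
    # Cc = control characters, Cf = format characters (soft hyphen is Cf).
    return [
        f"U+{ord(ch):04X}"
        for ch in name
        if unicodedata.category(ch) in ("Cc", "Cf")
    ]


def scan_tree(root: str):
    """Yield (raw path, reasons) for every file under root with a suspicious name."""
    # Walking with a bytes path keeps the names as raw bytes, so nothing
    # is decoded behind our back.
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(root)):
        for raw_name in filenames:
            reasons = flag_name(raw_name)
            if reasons:
                yield os.path.join(dirpath, raw_name), reasons


if __name__ == "__main__":
    import sys

    for path, reasons in scan_tree(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path, reasons)
```

Running it against the storage directory ahead of the tar step would give a list of names to rename or query with the client who sent them.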

One byte for ASCII, two to four bytes for everything else. It should last them a very long time. Unfortunately UTF-8 doesn't just include new glyphs; it includes new control characters too. Most things ignore them, a few throw up on them, and a precious few actually process them properly.
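A quick way to see the one-to-four-byte rule in action is to encode a few characters and look at the bytes. The sample characters below are my own picks for illustration, apart from the soft hyphen from the file name.

```python
# Encode characters from different Unicode ranges and show how many
# bytes UTF-8 needs for each one.
samples = [
    ("A", "plain ASCII letter"),
    ("\u00ad", "soft hyphen from the file name"),
    ("\u20ac", "euro sign"),
    ("\U0001F600", "emoji outside the Basic Multilingual Plane"),
]

lengths = {}
for ch, description in samples:
    encoded = ch.encode("utf-8")
    lengths[f"U+{ord(ch):04X}"] = len(encoded)
    print(f"U+{ord(ch):04X}  {encoded.hex(' ')}  {len(encoded)} byte(s)  {description}")
```

The soft hyphen comes out as the two bytes c2 ad, matching the hex dump.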

Depends whether these various displays are variously displaying UTF-8. You can't dump UTF-8 things in another code set and have it understood. One glorious day, everything will be UTF-8. Until then, we have this. :wink: