Change encoding without removing special chars (iconv)

mrreds · January 13, 2018, 11:58am

Hi all,

I'm using the iconv command to change file encoding to UTF-8.

If my input file has chars such as

they are removed, and the output file is created without those special chars.

I tried using iconv -c, but the chars are still removed.

Is there a way to keep those special chars and change just the encoding?

The final goal is to implement a script that changes the encoding of files that are not already UTF-8.

Thank you all!!

RudiC · January 13, 2018, 5:16pm

Characters that don't exist in the target char set are difficult to convert. The -c option would not necessarily help as it just silently deletes inconvertible chars.
Not sure what your OS / shell / iconv versions are. Does the latter offer this option ( man iconv )

? Would this come close to what you need?
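
To illustrate the point about -c silently deleting inconvertible chars, here is a minimal sketch. The sample string and the helper name strip_to_ascii are made up for the demonstration and are not from the original poster's files. It assumes an iconv that knows the UTF-8 and ASCII codeset names, as the GNU one does.

```shell
#!/bin/sh
# Hypothetical helper: convert UTF-8 on stdin to ASCII, dropping whatever
# has no ASCII equivalent. Some iconv versions exit non-zero when -c had
# to drop something, hence the "|| true".
strip_to_ascii() {
    iconv -c -f UTF-8 -t ASCII || true
}

# "café" encoded as UTF-8: the é is the two bytes \303\251 (octal).
printf 'caf\303\251\n' | strip_to_ascii    # prints "caf": the é is gone
```

So -c makes the error go away, but only by throwing the character away, which is the opposite of what was asked for.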

drysdalk · January 13, 2018, 5:25pm

Hi,

I'm thinking that perhaps there is no direct or equivalent character to translate these characters to in your destination character set, and that's why they're being dropped, maybe?

Some testing of my own. Firstly, all I did here was copy and paste the string you provided:

$ cat test
�, �,
$ file test
test: UTF-8 Unicode text
$

and it was picked up as UTF-8, as you can see. Full disclosure: this was on a Slackware Linux 14.2 system.

So here's what happens when I try converting this to ASCII, and as mentioned I think it fails since these characters simply don't exist in any way in normal ASCII:

$ iconv -f=utf8 -t=ascii -o new.txt test.txt
iconv: illegal input sequence at position 0
$

However, if I tell iconv to transliterate only what it can, and drop what it can't, things seem to work, although I end up with question marks in the output (since there's nothing to transliterate to):

$ iconv -f=utf8 -t=ascii//TRANSLIT -o new.txt test.txt
$ cat new.txt
?, ?,
$

So I think that's the issue: they're being dropped or giving errors because there isn't anything in your destination character set that iconv regards as an acceptable replacement.

Hope this helps.

mrreds · January 13, 2018, 9:40pm

Thank you RudiC, drysdalk!

command is just displaying:

I need to convert any encoding to UTF8.

A customer is sending me files that are not UTF-8 (they seem to be ANSI). I just need to convert all files coming into my system to UTF-8.

RudiC · January 14, 2018, 9:01am

I don't know an ANSI char set but would be surprised if it contained codes that UTF-8 could not represent. Should you mean "ASCII", chars �, � will NOT exist in that source char set; mayhap in what is called "extended ASCII". Howsoever, your problem now seems a bit strange to me...

Don_Cragun · January 14, 2018, 4:12pm

You need to figure out whether the file you are trying to convert from is encoded in ISO 8859-1, ISO 8859-15, Windows 1252, or some other codeset. All three of the ones listed here have the lower 128 characters with the same encodings as US ASCII and all of them contain the � and � characters, but I'm not sure if they are encoded the same way in the three listed codesets. The only way iconv can work correctly is if you correctly tell it in what codeset the file it is reading is encoded and tell it to what codeset you want the output file to be written.
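
A rough sketch of how the "convert only when not UTF-8" script could look once the source codeset is known. Everything named here is an assumption for illustration: the helper names is_utf8 and to_utf8, the SRC_CODESET variable, and above all the default of WINDOWS-1252, which has to be confirmed with whoever produces the files. The validity check relies on iconv exiting non-zero when the input is not valid in the declared source codeset.

```shell
#!/bin/sh
# Assumed source codeset for files that are not UTF-8. This is a guess;
# override it, e.g. SRC_CODESET=ISO-8859-15, once the real one is known.
SRC_CODESET=${SRC_CODESET:-WINDOWS-1252}

# Succeeds if the file is already valid UTF-8 (plain ASCII counts too).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Convert one file to UTF-8 in place, leaving valid UTF-8 files untouched.
to_utf8() {
    file=$1
    if is_utf8 "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    if iconv -f "$SRC_CODESET" -t UTF-8 "$file" >"$tmp"; then
        # Write back through the original name to keep its permissions.
        cat "$tmp" >"$file"
        rm -f "$tmp"
    else
        # Conversion failed: keep the original file as it was.
        rm -f "$tmp"
        return 1
    fi
}

# Usage (hypothetical directory):
#   for f in /path/to/incoming/*; do to_utf8 "$f" || echo "failed: $f" >&2; done
```

Note that no -c is needed in this direction: the special chars are kept because the source codeset is declared correctly and UTF-8 can represent every character in it. The check is not foolproof, since a file in another codeset can by chance also be a valid UTF-8 byte sequence, though that is rare for text with accented letters.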

Corona688 · January 15, 2018, 11:04am

Problem solved then, as ANSI can be used in UTF-8 directly without conversion.