File encoding and conversion

Fundix · May 29, 2013, 8:48am

Hi

I am not a specialist in file encoding.
On AIX 5.2.0.0, I need to check the encoding of some files and convert some of them to UTF-8.

I've used the following command, and I think it tells me that all files are encoded in ISO8859-1:

%locale charmap
ISO8859-1

I've also used the iconv command this way:

iconv -f ISO8859-1 -t UTF-8 test.txt > test2.txt

Is there a command to check that test.txt is in ISO8859-1 and test2.txt is in UTF-8?

Thank You

MichaelFelt · May 29, 2013, 11:55am

The procedure looks correct. However, I do not know of any verification process.

Don_Cragun · May 29, 2013, 3:15pm

The command locale charmap is telling you (by convention) that the character mapping defining the characters in your current locale is related to ISO standard 8859-1. It says absolutely nothing about what codeset was used to encode text found in any particular file.

If a file contains only ASCII text, its ISO8859-1 and UTF-8 encodings will be identical. If there are characters in a file with the high order bit set on one or more bytes, there are various heuristics you could try to use to determine if a given file was encoded using a particular codeset, but heuristics that could distinguish between the various ISO 8859-* standard encodings would require more knowledge than just the contents of the file. Even determining that a file was encoded using UTF-8 would be impossible unless you know that the file contains only text (i.e., no binary data such as an integer or floating point value has been written into the file without converting it to text first).
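One of the heuristics mentioned above can be sketched in shell: ask iconv to decode the file as UTF-8 and look only at its exit status. This is a rough sketch, not a definitive test. It assumes the local iconv accepts a UTF-8 to UTF-8 conversion and exits nonzero on malformed input (GNU iconv does; this has not been checked against the AIX 5.2 iconv). The function names is_ascii_only and is_valid_utf8 are made up for illustration.

```shell
# Hypothetical helper: succeeds if no byte in the file has the high-order bit set.
# Deletes every byte in the range 0-127 and counts what is left.
is_ascii_only() {
    high=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    [ "$high" -eq 0 ]
}

# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
# Assumes iconv exits nonzero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example use (file names taken from the question):
#   is_ascii_only test.txt  && echo "test.txt: ASCII only, same bytes in either codeset"
#   is_valid_utf8 test2.txt && echo "test2.txt: well-formed UTF-8"
```

A successful is_valid_utf8 only shows that the bytes are well-formed UTF-8, not that the file was intended as UTF-8. The same trick cannot confirm ISO8859-1, because iconv will typically accept any byte sequence as ISO8859-1.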

The only way to use iconv to reliably convert a file from one codeset to another is to know (independently) what codeset was used when the file was created and what transformations have occurred to that file since then. If the file being converted contains some binary values and some text, you will have to know where the binary data is and convert only the text surrounding it. You can't do this with iconv alone, but you could use something like dd to extract the text and binary data into separate files, use iconv to convert the text files, and then create the converted output by putting the converted text files and the binary files back together. Of course, converting between 8859-* and UTF-8 can also significantly change the number of bytes needed to represent a string of text. If the file contains binary data specifying the length of some of its text, you would also have to be aware of that and modify the binary portions of the file as you reconstruct the output file.
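The dd approach can be sketched as follows for the simplest case. The layout is purely an assumption for illustration: a binary header of known length followed by ISO8859-1 text to the end of the file. The function name convert_mixed and its arguments are hypothetical, and a real file format would need its own offsets.

```shell
# Hypothetical helper: convert_mixed INPUT OUTPUT HEADER_LEN
# Assumed layout: HEADER_LEN bytes of binary data, then ISO8859-1 text.
convert_mixed() {
    in=$1
    out=$2
    header_len=$3
    tmp=${TMPDIR:-/tmp}/convert_mixed.$$
    mkdir "$tmp" || return 1

    # Split the file: the first header_len bytes are copied untouched,
    # and the remainder is treated as text.
    dd if="$in" of="$tmp/header.bin" bs=1 count="$header_len" 2>/dev/null
    dd if="$in" of="$tmp/text.latin1" bs=1 skip="$header_len" 2>/dev/null

    # Convert only the text portion.
    if ! iconv -f ISO8859-1 -t UTF-8 "$tmp/text.latin1" > "$tmp/text.utf8"; then
        rm -rf "$tmp"
        return 1
    fi

    # Reassemble: unchanged binary header followed by the converted text.
    cat "$tmp/header.bin" "$tmp/text.utf8" > "$out"
    rm -rf "$tmp"
}
```

Each ISO8859-1 byte with the high order bit set becomes two bytes in UTF-8, so the output is usually longer than the input. This sketch does not touch the header, so any length field stored there would still have to be updated separately.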