Converting Unicode file to UTF8 format

Hi,

I have a file on my desktop that is in Unicode format. After this file is transferred to Unix using FTP, we see a special character (a rectangular box) at the start of the first line. If the same file is first saved as UTF-8 on my desktop (using the TextPad tool, selecting the encode to UTF-8 option) and then sent to Unix by FTP, we can see the proper data content.

I tried the following command on the Unix server (we use IBM AIX), but I got the error message "Converter can not open":
iconv -f <iso89> -t <utf8> oldfile > newfile

Could you please let us know how we can automatically convert the file to UTF-8 on the Unix server itself?

Thanks in advance for any help.

Thanks,
Venkat

See this thread:
http://www.unix.com/aix/18810-dos2unix-equivalent-aix.html

You may try this:
create a one-line script 'd2u' using vi, as follows:

sed -i "s/^M//g" "$1"

(^M - Ctrl+v Ctrl+m)
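
Note that `-i` is not part of POSIX sed, so the one-liner above may not work with every sed. A minimal sketch of a portable alternative follows, assuming that writing through a temporary file is acceptable. The function form of `d2u` and the temp-file naming are illustrative assumptions, not something from this thread.

```shell
#!/bin/sh
# d2u: strip carriage returns (^M) from a file in place.
# Hypothetical helper; the temp-file name is an arbitrary choice.
d2u() {
    tmp="$1.tmp.$$"
    # tr -d '\r' deletes every CR byte.
    # The original is replaced only if tr succeeded.
    tr -d '\r' < "$1" > "$tmp" && mv "$tmp" "$1"
}
```

Usage would be `d2u somefile.csv` after sourcing the function, or the body could be placed in a one-line script with "$1" as the file argument.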

Hi,

Thanks for the reply. I have already handled replacing the ^M characters in a shell script. The issue is with another special character (a rectangle-shaped one). This character appears only in the first position of the first line.

Let me explain what I am doing. We receive a CSV file from the SAP server on our Unix server. This file is tab-delimited, and we need to replace the tabs with commas.
We have a script that replaces tabs with commas.

Before changing tabs to commas, we opened the file over telnet and found the rectangular box in the first position of the first line.

After changing tabs to commas using the shell script, we opened the file over telnet again and still saw the rectangular box in the first position of the first line.

When we download this file to our Windows box and open it in Excel by double-clicking, we see only small boxes and no content. If we open it in Notepad, we are able to see the content.

We manually removed the rectangle-shaped character on Unix, downloaded the file to Windows, and opened it in Excel. This time, we were able to see the content.

We searched Google for help. The suggestions say it has something to do with encoding the file as UTF-8 before FTPing it to the Unix server. We have this capability in Notepad, but we want to do it programmatically, without user intervention.

Please help me.

Thanks in advance.
Venkat
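
One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the box is a byte order mark (BOM): bytes EF BB BF for UTF-8, or FF FE / FE FF for UTF-16. A minimal sketch for checking the first bytes with `od` and dropping a UTF-8 BOM follows. The function names `bom_type` and `strip_utf8_bom` are hypothetical.

```shell
#!/bin/sh
# bom_type: print which byte order mark (if any) starts the file.
# Hypothetical helper name, not from the thread.
bom_type() {
    # Dump the first three bytes as hex with no separators, e.g. "efbbbf".
    first=$(od -A n -t x1 -N 3 "$1" | tr -d ' \n')
    case "$first" in
        efbbbf) echo "UTF-8 BOM" ;;
        # UTF-32 BOMs are not distinguished from UTF-16 in this sketch.
        fffe*)  echo "UTF-16LE BOM" ;;
        feff*)  echo "UTF-16BE BOM" ;;
        *)      echo "no BOM" ;;
    esac
}

# strip_utf8_bom: write the file to stdout without a leading UTF-8 BOM.
strip_utf8_bom() {
    if [ "$(bom_type "$1")" = "UTF-8 BOM" ]; then
        tail -c +4 "$1"   # start output at byte 4, skipping the 3 BOM bytes
    else
        cat "$1"
    fi
}
```

If a UTF-16 BOM is reported, the file is probably UTF-16 and would need converting from that encoding rather than from an ISO8859 one.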

What is the output of

iconv -l

Hi fpmurphy,

I got the following

$iconv -l
ASCII-GR
CNS11643.1986-1
CNS11643.1986-2
GB18030
GBK
IBM-1046
IBM-1124
IBM-1129
IBM-1251
IBM-1252
IBM-1390
IBM-1394
IBM-1399
IBM-850
IBM-856
IBM-921
IBM-922
IBM-932
IBM-943
IBM-eucCN
IBM-eucJP
IBM-eucKR
IBM-eucTW
IBM-sbdTW
IBM-udcJP
IBM-udcTW
ISCII.1991
ISO8859-1
ISO8859-1-GL
ISO8859-1-GR
ISO8859-15
ISO8859-15-GL
ISO8859-15-GR
ISO8859-2
ISO8859-2-GL
ISO8859-2-GR
ISO8859-3
ISO8859-3-GL
ISO8859-3-GR
ISO8859-4
ISO8859-4-GL
ISO8859-4-GR
ISO8859-5
ISO8859-5-GL
ISO8859-5-GR
ISO8859-6
ISO8859-6-GL
ISO8859-6-GR
ISO8859-7
ISO8859-7-GL
ISO8859-7-GR
ISO8859-8
ISO8859-8-GL
ISO8859-8-GR
ISO8859-9
ISO8859-9-GL
ISO8859-9-GR
JISX0201.1976-0
JISX0208.1983-0
KSC5601.1987-0
TIS-620
UCS-2
UNICODE-2
UTF-16
UTF-16le
UTF-32
UTF-8
big5
ct
fold7
fold8
uucode
$

Thanks.
Venkat

Hi,

I was able to convert the file to UTF-8 format successfully using the following command:

iconv -f ISO8859-9 -t UTF-8 <input_file> > <output_file>

I still have one issue. We will receive files encoded as ANSI and, in some cases, as UTF-8.

If the file arrives encoded as ANSI, the above command converts it to UTF-8. This is not an issue.

But if the file arrives already encoded as UTF-8 and we run the above command, the special characters in the file do not come out properly.

We need to run the iconv command only if the file encoding is ANSI. If it is UTF-8, we should not run iconv. How do we identify the encoding of a file in UNIX? Please help me find this.

Thanks.
Venkat

file <input file>

should do it.
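
The output of `file` varies between systems, so as an alternative, here is a minimal sketch that uses iconv itself as the detector: converting from UTF-8 to UTF-8 should fail on byte sequences that are not valid UTF-8. It assumes that iconv exits non-zero on invalid input (worth verifying on AIX) and that non-UTF-8 files are in ISO8859-9, as in the command earlier in this thread. The function names `is_utf8` and `to_utf8` are hypothetical.

```shell
#!/bin/sh
# is_utf8: succeed if the whole file decodes cleanly as UTF-8.
# Relies on iconv returning a non-zero status for invalid input (assumption).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 FILE [FROM]: write FILE to stdout as UTF-8, converting only if needed.
# FROM defaults to ISO8859-9, an assumption taken from the thread.
to_utf8() {
    if is_utf8 "$1"; then
        cat "$1"                              # already UTF-8, pass through
    else
        iconv -f "${2:-ISO8859-9}" -t UTF-8 "$1"
    fi
}
```

A call such as `to_utf8 input.csv > output.csv` would then be safe to run on both kinds of file. Pure ASCII files pass the check unchanged, and a non-UTF-8 file whose bytes happen to form valid UTF-8 would be misdetected, although that is unlikely for ordinary text.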

I found the command
enca
available on Linux.
In my case, I installed the following rpm on CentOS 4:
enca-1.9-4.el4.rf.i386.rpm
and the command gives the following output:
a) on a file supposed to be in UTF-8
[root@mini figari]# enca -L none ./fr/texte_titre.html
Universal transformation format 8 bits; UTF-8
b) on a file supposed to be in ISO8859-1
[root@mini figari]# enca -L none ./en/texte_accueil.html
7bit ASCII characters

(Ideally, enca should be configured with the French language to help format recognition; unfortunately, French does not seem to be included.)
My 2 cents.