Problem identifying charset of a file

sridhar_423 · March 8, 2009, 3:08pm

Hi all,

My objective is to find out which charset a file is encoded in. (The OS is SunOS.)
I set NLS_LANG to AR8MSWIN1256 and spooled the file.

When I viewed the file in vi, I saw the following:
\307\341\321\355\307\326

I then inserted the line containing these codes into a table with NLS_LANG set to AL32UTF8, and saw the Arabic text.

Now, what are these numbers (307, 341, ...)? Are they code points? If so, they should be Windows-1256 code points, since I set NLS_LANG to AR8MSWIN1256. Also, are they decimal, hex, or octal?

Can anyone tell me how I can get from those numbers to the Arabic text?
I tried the following in an HTML page, without any luck:
& #307;& #341;& #321;& #355;& #307;& #326;
& #775;& #833;& #801;& #853;& #775;& #806; (I have kept a space between & and # to avoid the browser rendering them as symbols/characters)

Thanks,
Sridhar

Yogesh_Sawant · March 9, 2009, 1:53pm

try:

$ file filename.txt

for example:

yogeshs@yogesh-laptop:~/temp$ 
yogeshs@yogesh-laptop:~/temp$ cat chars.txt 

yogeshs@yogesh-laptop:~/temp$ file chars.txt 
chars.txt: UTF-8 Unicode text
yogeshs@yogesh-laptop:~/temp$

sridhar_423 · March 9, 2009, 3:45pm

Hi Yogesh,

Thanks a lot for the reply.
I tried the "file" command as well, but on my Unix box it displays only "text", and I don't know why. It's not as descriptive as the output you showed in your post.

Can you please try this with a file generated using the Windows-1256 code page?

Also, do you have any idea about those numbers? I found on some site that these numbers are octal, so I converted them to decimal and then tried &#DECIMAL; in an HTML page, without any luck.

You can check this in your example by doing "vi chars.txt"

Any pointers in this direction would be very helpful

Thanks again
Sridhar

sridhar_423 · March 28, 2009, 4:00pm

I think I found what I was looking for after a series of tests.
file -- This may not give the correct output. In the post above, chars.txt was reported as UTF-8 because it was saved to disk as UTF-8. A UTF-8 file may begin with an optional 3-byte byte order mark (EF BB BF) that marks it as UTF-8, but those bytes are not reserved or required; non-ASCII text stored as valid UTF-8 multibyte sequences can be recognized without it.

In my case, the file was generated using cp1256. If the first 512 bytes are ASCII characters, file reports the file as ASCII, because the cp1256 code points are the same as ASCII for values <= 127. (I guess file checks only the first 512 bytes; I'm not 100% sure, though. I simply added 1000 English characters to the beginning of the file.)
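
To illustrate why content sniffing can report "ascii" or plain "text" for a cp1256 file, here is a minimal Python sketch. It is not how file is actually implemented: guess_encoding is a hypothetical helper, and the 512-byte window is only the guess made above, not a documented limit.

```python
def guess_encoding(data: bytes) -> str:
    """Rough guess based only on byte patterns (illustrative, not file's real logic)."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (with BOM)"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any single-byte code page (cp1256, iso-8859-6, ...) ends up here:
        # the bytes alone do not say which one was used.
        return "unknown 8-bit"


cp1256_bytes = b"\xc7\xe1\xd1\xed\xc7\xd6"   # the six bytes vi showed as \307\341...
utf8_bytes = "\u0627\u0644\u0631\u064a\u0627\u0636".encode("utf-8")

print(guess_encoding(utf8_bytes))      # utf-8
print(guess_encoding(cp1256_bytes))    # unknown 8-bit

# 1000 English characters in front, then look only at an assumed 512-byte window.
padded = b"a" * 1000 + cp1256_bytes
print(guess_encoding(padded[:512]))    # ascii
```

The last call mirrors the experiment above: once the inspected prefix contains only bytes <= 127, nothing distinguishes the file from plain ASCII.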

Coming to the numbers shown when the file is opened in vi: they are the octal (base 8) values of the code points. I performed the test below to confirm it:

Opened the file in vi and copied some of those numbers.
Wrote a PHP program to convert the octal values to decimal and print the corresponding characters.
As my computer uses cp1256 to represent characters outside the ASCII range, it displayed the Arabic data. So these numbers are nothing but the cp1256 code points, that is, the byte values in the file. They are not Unicode code points, which is what &#...; references in HTML expect.
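
The same test can be sketched in Python instead of PHP. This is only an illustration under the assumption that the file really is cp1256; decode_vi_octal is a made-up helper name, and the sample string is the line from the first post.

```python
import re


def decode_vi_octal(escaped: str, encoding: str = "cp1256") -> str:
    """Turn vi-style \\NNN octal escapes into text using the given code page."""
    # Each \NNN is one byte written in base 8, e.g. \307 -> 199 -> 0xC7.
    octal_values = re.findall(r"\\([0-3][0-7]{2})", escaped)
    raw = bytes(int(value, 8) for value in octal_values)
    return raw.decode(encoding)


sample = r"\307\341\321\355\307\326"
text = decode_vi_octal(sample)          # text now holds the Arabic string

# Byte values (cp1256 code points) in decimal.
print([int(v, 8) for v in re.findall(r"\\([0-7]{3})", sample)])
# [199, 225, 209, 237, 199, 214]

# Unicode code points of the decoded characters.
print([ord(ch) for ch in text])
# [1575, 1604, 1585, 1610, 1575, 1590]

# Numeric character references that an HTML page would need.
html_refs = "".join(f"&#{ord(ch)};" for ch in text)
print(html_refs)
```

The two lists differ, which is why putting the raw octal, decimal, or hex values of the bytes into &#...; did not render Arabic: the byte 199 is Alef only within cp1256, while HTML wants the Unicode value 1575.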

Thanks,
Sridhar