Perl script to sort an Excel file

degoor · January 13, 2010, 5:27am

Hello!

I need to sort a file that is partly in English partly in Bulgarian.

The original file is an Excel file but I converted it to a tab-delimited text file. The encoding of the tab delimited file is UTF-8.

To sort the text, the script should test every line of the text file to see if there are any Cyrillic UTF-8 characters.

Even if a single Cyrillic character is found, it will be quite enough to conclude that the line is in Bulgarian. If no Cyrillic character is found, then the text should be in English.

The UTF-8 range of Cyrillic characters is U+0400 to U+04FF

After finding a Cyrillic character the script should add at the end the same line Tab+'bg'; when no Cyrillic character is found in the line, Tab+'en' should be added to the end of the line. I need the tabs so that I can open the file back in Excel and sort the lines.

I believe the script should be best written in Perl.

My system is MS Windows XP Professional. I have cygwin+Perl and Strawberry Perl installed.

Thanks in advance!

alister · January 13, 2010, 10:28am

awk '{ print $0 "\t" (/[\320-\323]/ ? "bg": "en")}' data

The range U+0400 to U+04FF when encoded in UTF-8 will always be a two byte sequence, the first of which must have a value between 0320-0323 (octal), 0xD0-0xD3 (hex), 208-211 (decimal). Since this byte range serves no other purpose, its presence indicates a character in the relevant range.

I realize you requested a perl solution, but I'm not much of a perl hacker. If you have cygwin, you should have awk (gawk, most likely).

Regards,
alister

degoor · January 13, 2010, 12:09pm

alister, thank you very much for you help!

It seems though that the range of the Cyrillic characters is different because I am getting all lines marked with "en" which is not true.

I tried your code having coded the text in my editor 'UTF-8', 'UTF-8+BOM' and 'Unicode' but neither works - all lines got marked with "en".

Actually as far as I have this job done I don't mind any mean.

I requested Perl because it is considered to have an excellent support for Unicode and regular expressions.

Can I try your code for some other range if I change the encoding to ANSI?

alister · January 13, 2010, 12:32pm

Hello, degoor:

I tested the code on a text file containing bulgarian (utf-8 encoded) and english, and the code worked correctly. I inspected the byte values using hexdump (od would do) and they confirmed the presence of two byte sequences beginning with 0320-0323 octal. I also double checked your byte ranges using wikipedia/google and U+0400 through U+04FF are indeed cyrillic.

I suggest taking a look at the file using od or hexdump (or a windows hexeditor) to see exactly what's in the file you intend to process with perl/awk.

My sample utf-8 encoded text with cyrillic came from the Columbia UTF-8 sampler, top result when googling "utf-8 sampler". You can test my code against that to confirm that there isn't some other malfunction taking place (perhaps copy-paste mangling, etc).

alister

degoor · January 13, 2010, 3:25pm

alister:

awk '{ print $0 "\t" (/[\320-\323]/ ? "bg": "en")}' data
The range U+0400 to U+04FF when encoded in UTF-8 will always be a two byte sequence, the first of which must have a value between 0320-0323 (octal), 0xD0-0xD3 (hex), 208-211 (decimal). Since this byte range serves no other purpose, its presence indicates a character in the relevant range.

I realize you requested a perl solution, but I'm not much of a perl hacker. If you have cygwin, you should have awk (gawk, most likely).

Regards,
alister

Actually I run your code in cygwin bash, like this:

gawk '{ print $0 "\t" (/[\320-\323]/ ? "bg": "en")}' source_txt_ file > result_txt_file

Is that what you had in mind?

alister · January 13, 2010, 4:29pm

That looks fine, except for what I assume is a typo: the space between "source_txt_" and "file"

To be certain, I tested using nawk/osx and gawk/cygwin-winxp; both worked as expected.

It would help if you would share a reduced version of your source file, one english line and one bulgarian line (obfuscating any sensitive info, of course). Also, given the strange things that can happen in transit, a paste of a hex/octal dump of those bytes would be very helpful.

Example:

$ cat utf8-data 
English: I can eat glass
Bulgarian:    ,  

$ od -c utf8-data 
0000000    E   n   g   l   i   s   h   :       I       c   a   n       e
0000020    a   t       g   l   a   s   s  \n   B   u   l   g   a   r   i
0000040    a   n   :     320 234 320 276 320 263 320 260     320 264 320
0000060  260     321 217 320 274     321 201 321 202 321 212 320 272 320
0000100  273 320 276   ,     321 202 320 276     320 275 320 265  \n    
0000117

$ awk '{ print $0 "\t" (/[\320-\323]/ ? "bg": "en")}' utf8-data
English: I can eat glass        en
Bulgarian:    ,       bg

Regards,
alister

degoor · January 13, 2010, 11:07pm

alister:

Here is a sample file:

TextId	TextDescription	Explanation	Signature	Translator Question	Customer Answer	Fake
3	 !					1
10	  ,   					1
14	   ? 					1
4655	Add key		MaJ			1
4656	Module		MaJ			1
4657	Available		MaJ			1

The file is encoded UTF-8 by my editor EditPlus.

There are 7 lines in the file:
The 1st line consists of the column titles (so it is actually in English),
The next 3 line are in Bulgarian,
The last 3 lines are in English.

I got all lines marked as "en" :o

The problem might be in the settings of cygwin (I use defaults) so I shall try to set myself up some Linux system (Ubuntu probably ;)) and test the code on it.

@Franklin52: Sorry I didn't know that I can't post direct links here. I do apologise for that.

@alister: Please use the file from the link that I sent you as a PM. It should be same as the above but still ...

alister · January 14, 2010, 10:55am

Hey, degoor:

In the octal dump that follows, you can clearly tell the difference between the bulgarian and the english. The bulgarian is rendered as mostly octal byte values, and as expected there are plenty of pairs beginning with 320 and 321.

$ od -c text_sample.txt 
0000000    T   e   x   t   I   d  \t   T   e   x   t   D   e   s   c   r
0000020    i   p   t   i   o   n  \t   E   x   p   l   a   n   a   t   i
0000040    o   n  \t   S   i   g   n   a   t   u   r   e  \t   T   r   a
0000060    n   s   l   a   t   o   r       Q   u   e   s   t   i   o   n
0000100   \t   C   u   s   t   o   m   e   r       A   n   s   w   e   r
0000120   \t   F   a   k   e  \r  \n   3  \t 320 232 320 276 320 274 321
0000140  203 320 275 320 270 320 272 320 260 321 206 320 270 320 276 320
0000160  275 320 265 320 275     320 264 320 265 321 204 320 265 320 272
0000200  321 202   !  \t  \t  \t  \t  \t   1  \r  \n   1   0  \t 320 237
0000220  321 200 320 276 320 262 320 265 321 200 320 272 320 260 321 202
0000240  320 260     320 265     320 277 321 200 320 270 320 272 320 273
0000260  321 216 321 207 320 270 320 273 320 260   ,     321 201 320 270
0000300  321 201 321 202 320 265 320 274 320 260 321 202 320 260     320
0000320  265     320 236 320 232  \t  \t  \t  \t  \t   1  \r  \n   1   4
0000340   \t 320 230 321 201 320 272 320 260 321 202 320 265     320 273
0000360  320 270     320 264 320 260     320 270 320 267 321 202 321 200
0000400  320 270 320 265 321 202 320 265   ?      \t  \t  \t  \t  \t   1
0000420   \r  \n   4   6   5   5  \t   A   d   d       k   e   y  \t  \t
0000440    M   a   J  \t  \t  \t   1  \r  \n   4   6   5   6  \t   M   o
0000460    d   u   l   e  \t  \t   M   a   J  \t  \t  \t   1  \r  \n   4
0000500    6   5   7  \t   A   v   a   i   l   a   b   l   e  \t  \t   M
0000520    a   J  \t  \t  \t   1  \r  \n                                
0000530

I did however notice a little "problem" when looking at an octal dump of your file: your data uses \r\n line endings (which is expected from a DOS/Windows machine). Under the cygwin environment, when reading a line of input, \r\n pairs are consumed. By default, on output, AWK will separate records using a sole \n. Your file is being converted from DOS/Windows format to UNIX. Perhaps that is causing your postprocessing to fail?

I tested the following under Cygwin using a WinXP SP2 system and it works as expected (an octal dump of the script's output confirmed that the \r\n terminators are intact):

$ awk '{print $0 "\t" (/[\320-\323]/ ? "bg" : "en")}' ORS='\r\n' text_sample.txt 
TextId  TextDescription Explanation     Signature       Translator Question     Customer Answer Fake    en
3        !                                  1       bg
10        ,                                    1       bg
14         ?                                  1       bg
4655    Add key         MaJ                     1       en
4656    Module          MaJ                     1       en
4657    Available               MaJ                     1       en

NOTE: The AWK above will NOT give correct results on a UNIX/Linux system. It will insert a tab followed by either "en" or "bg" and a "\r" between incoming \r\n pairs.

Hopefully, we've arrived at the end of this odyssey

Regards,
alister

P.S.

In case you aren't familiar with C string escape sequences I used above, and the differing ways that unix and windows terminate a line in a text file:

\r = carriage return
\n = line feed
\t = tab

windows uses \r\n to terminate lines.
unix uses \n

degoor · January 14, 2010, 1:19pm

alister:

Thank you very much for your effort and persistence.

Your code from the very beginning was quite all right.

The problem was the cygwin locale setting. Once I set it to bg_BG.CP1251 it worked OK.

(Technically that is done through adding a line to C:\cygwin\Cygwin.bat so that it looks like this:

@echo off

C:
chdir C:\cygwin\bin
set LANG=bg_BG.CP1251
bash --login -i

)

When I used your second command line

$ awk '{print $0 "\t" (/[\320-\323]/ ? "bg" : "en")}' ORS='\r\n' text_sample.txt

I got an extra empty line after every original line so I used your initial command.

Thanks again! Great job, alister!

alister · January 14, 2010, 1:29pm

You're very welcome