Is there a way to extract Chinese words from a text written in a European language? I want to create a glossary, and finding a way would save me time!
Thank you!
I suppose if you knew which part of the code set the Chinese characters occupy, you could try deleting the other characters. For starters, see what something like this produces:
tr -d '[:alnum:][:punct:]' < file
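For example, on a mixed line the deletion leaves only the multibyte text and whitespace; tr works byte by byte, and under the C locale the character classes only cover ASCII (the sample string here is just an illustration):

```shell
# ASCII letters, digits, and punctuation are deleted; multibyte UTF-8
# sequences (and whitespace) pass through untouched, since tr is byte-oriented.
printf 'Hello, 你好 world! 123\n' | LC_ALL=C tr -d '[:alnum:][:punct:]'
```

Forcing LC_ALL=C avoids surprises from locales where high bytes count as alphanumeric.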
The range of the main CJK Unified Ideographs block is U+4E00 through U+9FFF (UTF-8 bytes in octal: 344 270 200 through 351 277 277), so the test on the lead byte should be >"\343" and <"\352" (this also avoids picking up any 4-byte UTF-8 sequences):
{
    for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) > "\343" && substr($0, i, 1) < "\352") {
            printf "%s", substr($0, i, 3)    # print the whole 3-byte character
            i += 2                           # skip its two continuation bytes
        }
    print ""
}
My first draft had a few errors: the closing braces were missing, and print $f (with f set to 0) printed the whole input line instead of the matched character; the lead byte alone is also not enough, the two continuation bytes have to be copied as well. Run it with LC_ALL=C so awk compares raw bytes rather than multibyte characters.
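A quick sanity check of that range: dump the octal bytes of a known CJK character with od (POSIX, so portable) and confirm the lead byte falls between 344 and 351:

```shell
# "你" (U+4F60) encodes to the UTF-8 bytes E4 BD A0,
# which od prints in octal as 344 275 240 — lead byte 344 is in range.
printf '你' | od -An -to1
```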
---------- Post updated at 04:59 AM ---------- Previous update was at 04:38 AM ----------
Well, I tried this on a download of google.hk, and it spat out a line that is unreadable (to Western eyes):
awk '{for (i=1; i<=NF; i++) if ($i >= "\344" && $i <= "\351") {printf "%s", $i$(i+1)$(i+2); i+=2}}' FS="" file
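To get from extracted characters to a glossary, each contiguous run of CJK characters can go on its own line, with duplicates dropped afterwards. A sketch along the same lines as the one-liner above (the sample input and the names sample.txt / glossary.txt are placeholders; LC_ALL=C keeps awk and sort working on raw bytes):

```shell
# A tiny sample input: Latin text with two CJK words.
printf 'price 价格 is 100 元\n' > sample.txt

# Emit each contiguous run of CJK characters on its own line, then de-duplicate.
LC_ALL=C awk '
{
    run = ""
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c >= "\344" && c <= "\351") {   # CJK lead byte (U+4E00-U+9FFF, roughly)
            run = run substr($0, i, 3)      # copy the whole 3-byte character
            i += 2                          # skip its continuation bytes
        } else if (run != "") {
            print run; run = ""             # a non-CJK byte ends the run
        }
    }
    if (run != "") print run                # flush a run that ends the line
}' sample.txt | LC_ALL=C sort -u > glossary.txt
```

Note this splits only on non-CJK bytes, so consecutive Chinese characters stay together as one "word"; real word segmentation would need a dictionary-based tool.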