Is there a way to extract Chinese words from a text written in a European language? I want to create a glossary, and finding a way would save me time!
Thank you!
I suppose if you knew which part of the code set the Chinese characters occupy, you could try deleting the other characters. For starters, see what something like this produces:
tr -d '[:alnum:][:punct:]' < file
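For example, on a mixed line the deletion leaves only the multibyte text and whitespace; tr works byte by byte, and under the C locale the character classes only cover ASCII (the sample string here is just an illustration):

```shell
# ASCII letters, digits, and punctuation are deleted; multibyte UTF-8
# sequences (and whitespace) pass through untouched, since tr is byte-oriented.
printf 'Hello, 你好 world! 123\n' | LC_ALL=C tr -d '[:alnum:][:punct:]'
```

Forcing LC_ALL=C avoids surprises from locales where high bytes count as alphanumeric.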
The range of the main CJK Unified Ideographs block is U+4E00 through U+9FFF (UTF-8 bytes in octal: 344 270 200 through 351 277 277), so the test on the lead byte should be >"\343" and <"\352" (this also avoids picking up any 4-byte UTF-8 sequences):
{
    for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) > "\343" && substr($0, i, 1) < "\352") {
            printf "%s", substr($0, i, 3)    # print the whole 3-byte character
            i += 2                           # skip its two continuation bytes
        }
    print ""
}
My first draft had a few errors: the closing braces were missing, and print $f (with f set to 0) printed the whole input line instead of the matched character; the lead byte alone is also not enough, the two continuation bytes have to be copied as well. Run it with LC_ALL=C so awk compares raw bytes rather than multibyte characters.
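A quick sanity check of that range: dump the octal bytes of a known CJK character with od (POSIX, so portable) and confirm the lead byte falls between 344 and 351:

```shell
# "你" (U+4F60) encodes to the UTF-8 bytes E4 BD A0,
# which od prints in octal as 344 275 240 — lead byte 344 is in range.
printf '你' | od -An -to1
```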
---------- Post updated at 04:59 AM ---------- Previous update was at 04:38 AM ----------
Well, I tried this on a download of google.hk, and it spat out a line that is unreadable (to Western eyes):
awk '{for (i=1; i<=NF; i++) if ($i >= "\344" && $i <= "\351") {printf "%s", $i$(i+1)$(i+2); i+=2}}' FS="" file
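To get from extracted characters to a glossary, each contiguous run of CJK characters can go on its own line, with duplicates dropped afterwards. A sketch along the same lines as the one-liner above (the sample input and the names sample.txt / glossary.txt are placeholders; LC_ALL=C keeps awk and sort working on raw bytes):

```shell
# A tiny sample input: Latin text with two CJK words.
printf 'price 价格 is 100 元\n' > sample.txt

# Emit each contiguous run of CJK characters on its own line, then de-duplicate.
LC_ALL=C awk '
{
    run = ""
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c >= "\344" && c <= "\351") {   # CJK lead byte (U+4E00-U+9FFF, roughly)
            run = run substr($0, i, 3)      # copy the whole 3-byte character
            i += 2                          # skip its continuation bytes
        } else if (run != "") {
            print run; run = ""             # a non-CJK byte ends the run
        }
    }
    if (run != "") print run                # flush a run that ends the line
}' sample.txt | LC_ALL=C sort -u > glossary.txt
```

Note this splits only on non-CJK bytes, so consecutive Chinese characters stay together as one "word"; real word segmentation would need a dictionary-based tool.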