Detect lines beginning with double-byte characters (Japanese) and delete them

Greetings,

I want to use a script (preferably awk) that determines whether the first character of a line is double-byte (as in Japanese or Chinese) and, if so, deletes that line.

For example:

(In the quoted example above, I see Japanese on my screen for two lines - 2 characters on the first line and 3 on the second - you may see random symbols.)

becomes:

If you want to keep only the lines that contain English letters, you can use this:

 awk '$0 ~ /[A-Za-z]/ {print $0}' abc.txt

Note: this will eliminate lines in other languages as well, not just Japanese.

HTH,
PL

Thanks daptal - but that's not what I need. I need exactly what I stated: detecting only those lines that have a double-byte character in the first position.
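
For example, with a hypothetical two-line file test.txt, where the first line begins with Japanese characters but also contains English words:

 日本語 mixed with English
 plain English line

your filter prints both lines, since each contains at least one Latin letter:

 awk '$0 ~ /[A-Za-z]/ {print $0}' test.txt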

Do you by chance know the character set the files are written in?

I am the one who wrote the file, so I know where the character sets came from.

Not sure I understand your question, though. The non-Japanese characters are all single-byte characters (I am using vim). The Japanese characters use the "Double Byte Character Set" (DBCS).

I want to keep it general so that Chinese and Korean characters are also recognized - which should work by detecting DBCS characters. There must be a straightforward way ... ?

Try...

awk 'substr($0,1,1) < "\200"' file1
perl -lne 'print if ord $_ <= 127' file

tyler_durden

Thanks guys, that worked great! I ran "diff" and the outputs of both the awk and perl commands are identical for my working file.
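
In case it helps anyone else checking the same thing, the comparison can be done like this (out.awk and out.perl are just scratch filenames):

 awk 'substr($0,1,1) < "\200"' file1 > out.awk
 perl -lne 'print if ord $_ <= 127' file1 > out.perl
 diff out.awk out.perl

diff prints nothing when the two outputs are identical.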

I'm new at this - but I used the command like this:

awk 'substr($0,1,1) < "\200"' file1 > file2
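
If you also want to see which lines were removed, inverting the comparison should capture them (removed.txt is a hypothetical output name):

 awk 'substr($0,1,1) >= "\200"' file1 > removed.txt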

Question: How to interpret the "\200"?

The string "\200" represents a single character with octal value 200, which in binary is 10000000, i.e. the most significant bit is set to 1.
So, the supplied awk code prints lines where the first character's most significant bit is not set.
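
You can verify this with a quick test, assuming a POSIX shell where printf and od are available:

 printf '\200' | od -An -tu1

This prints 128: octal 200 is decimal 128 (hex 0x80), the smallest byte value with the high bit set. In the common Japanese encodings (EUC-JP, Shift-JIS) as well as in UTF-8, the first byte of every multibyte character has this bit set, so the test keeps the lines that begin with a plain single-byte ASCII character.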