Detect lines beginning with double-byte characters (Japanese) and delete them

Greetings,

I want to use a script (preferably awk) that determines whether the first character of a line is double-byte (as in Japanese or Chinese) and, if so, deletes that line.

For example:

(In the quoted example above, I see Japanese on my screen for two lines - 2 characters on the first line and 3 on the second - you may see random symbols.)

becomes:

If you want to keep only the lines that contain English letters, you can use this:

 awk '$0 ~ /[A-Za-z]/ {print $0}' abc.txt

Note: this will eliminate lines in other languages as well, not just Japanese.

HTH,
PL

Thanks daptal - but that's not what I need. I need exactly what I stated: detecting only those lines that have a double-byte character in the first position.
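
For example, with a hypothetical two-line file test.txt, where the first line begins with Japanese characters but also contains English words:

 日本語 mixed with English
 plain English line

your filter prints both lines, since each contains at least one Latin letter:

 awk '$0 ~ /[A-Za-z]/ {print $0}' test.txt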

Do you by chance know the character set the files are written in?

I am the one who wrote the file, so I know where the character sets came from.

Not sure I understand your question, though. The non-Japanese characters are all single-byte characters (I am using vim). The Japanese characters use the "Double Byte Character Set" (DBCS).

I want to keep it general so that Chinese and Korean characters are also recognized - which should work by detecting DBCS characters. There must be a straightforward way ... ?

Try...

awk 'substr($0,1,1) < "\200"' file1
perl -lne 'print if ord $_ <= 127' file

tyler_durden

Thanks guys, that worked great! I ran "diff" and the outputs of both the awk and perl commands are identical for my working file.
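
In case it helps anyone else checking the same thing, the comparison can be done like this (out.awk and out.perl are just scratch filenames):

 awk 'substr($0,1,1) < "\200"' file1 > out.awk
 perl -lne 'print if ord $_ <= 127' file1 > out.perl
 diff out.awk out.perl

diff prints nothing when the two outputs are identical.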

I'm new at this - but I used the command like this:

awk 'substr($0,1,1) < "\200"' file1 > file2
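
If you also want to see which lines were removed, inverting the comparison should capture them (removed.txt is a hypothetical output name):

 awk 'substr($0,1,1) >= "\200"' file1 > removed.txt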

Question: How to interpret the "\200"?

The string "\200" represents a single character with octal value 200, which in binary is 10000000, i.e. the most significant bit is set to 1.
So, the supplied awk code prints lines where the first character's most significant bit is not set.
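
You can verify this with a quick test, assuming a POSIX shell where printf and od are available:

 printf '\200' | od -An -tu1

This prints 128: octal 200 is decimal 128 (hex 0x80), the smallest byte value with the high bit set. In the common Japanese encodings (EUC-JP, Shift-JIS) as well as in UTF-8, the first byte of every multibyte character has this bit set, so the test keeps the lines that begin with a plain single-byte ASCII character.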