grep high bit char

ayyo1234 · February 9, 2008, 10:49am

Hi -

I have a file which contains high-bit Unicode characters like © etc. How can I use grep to find the lines which contain the copyright symbol ©?

I tried using

grep \x{00A9}
grep \x\{00A9\}

Thanks-

ayyo1234 · February 9, 2008, 2:44pm

Any suggestions?

I need to use grep only.

ilak1008 · February 9, 2008, 3:44pm

Try this:

grep '©' filename

ayyo1234 · February 9, 2008, 5:49pm

How would you type '©' in Unix? I am not sure whether it can be typed in Unix at all.

In Windows I can type it using 'Alt+0169'.

ilak1008 · February 10, 2008, 3:46pm

'©' in Unix is:

Press Shift+Alt+0 simultaneously.

ayyo1234 · February 11, 2008, 12:47pm

Thanks for your reply.

However, I am not able to type © in Unix.

I tried Shift+Alt+0.
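
If the character cannot be typed at the terminal, the pattern can be built from its bytes instead. A minimal sketch, assuming the file is encoded as UTF-8 (the thread does not say which encoding it uses), where U+00A9 is stored as the two bytes 0xC2 0xA9. The sample file and the variable names are made up for illustration.

```shell
# Hypothetical sample file; its name and contents are for illustration only.
sample=$(mktemp)
printf 'plain line\nCopyright \302\251 2008\n' > "$sample"

# \302\251 (octal) is 0xC2 0xA9, the UTF-8 encoding of U+00A9.
copyright=$(printf '\302\251')

# LC_ALL=C makes grep compare bytes; -n prints the line number of each match.
matches=$(LC_ALL=C grep -n "$copyright" "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

For a real file, only the two middle steps are needed: build the pattern with printf and pass it to grep in place of a typed '©'.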

jim_mcnamara · February 11, 2008, 3:01pm

POSIX grep does not look past a nul character. 00A9 is the Unicode code point for the character you want. Stored as a two-byte value, its first byte is 00, the nul character.

grep will not do what you need. Consider writing something in C that reads short integers (2-byte integers) from the file and compares each one with 169. When you find 169, that is the character offset in the file where the symbol is.

You are probably better off using a Windows editor.

Found a version of grep from mkssoftware that claims to support Unicode:
grep, egrep, fgrep -- match patterns in a file

GNU grep has a -U (--binary) switch that treats files as binary; it does not convert UTF-16 or other Unicode encodings.
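
As an alternative to writing a C program, the file can be converted to UTF-8 first and then searched. A rough sketch, assuming the file is UTF-16 little-endian (the thread does not confirm this) and that iconv is installed and accepts the encoding name UTF-16LE. Note that this puts iconv in front of grep rather than using grep alone. The sample file and the variable names are made up for illustration.

```shell
# Hypothetical UTF-16LE sample file, built here only for illustration.
sample=$(mktemp)
printf 'plain line\nCopyright \302\251 2008\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# 0xC2 0xA9 is the UTF-8 encoding of U+00A9, written in octal for printf.
copyright=$(printf '\302\251')

# Convert to UTF-8 on the fly so no nul bytes reach grep, then search by bytes.
matches=$(iconv -f UTF-16LE -t UTF-8 "$sample" | LC_ALL=C grep -n "$copyright")
printf '%s\n' "$matches"

rm -f "$sample"
```

If the file turns out to be big-endian, or starts with a byte order mark, the -f argument would need to change accordingly (for example UTF-16BE or UTF-16).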

ayyo1234 · February 11, 2008, 4:01pm

Thanks Jim.

I will try to see what I can do to implement it...