regular expression foreign language

wakatana · October 13, 2009, 8:45am

Hello all,
I read somewhere that regular expressions work with the ASCII table, so when I type

grep "[a-z][a-z]*" file_name

it uses the values from ASCII decimal 97 (a) to decimal 122 (z), right?
But if I have a file containing diacritics, let's say (ordinary Slovak characters):

marek@cepi:~$ cat diakritika 
���������������
����ڊ���������

marek@cepi:~$ grep -o "[a-z][a-z]*" diakritika 
������
��������
��
��

Why does this regexp recognize diacritics? And why does it recognize only lowercase and not "�"? This seems strange to me. A friend told me it could have something to do with $LANG. So my $LANG is:

marek@cepi:~$ echo $LANG
en_US.UTF-8

I would also like to ask: if I want to uppercase a file with diacritics, I type:

marek@cepi:~$ cat diakritika | tr "[:lower:]" "[:upper:]"
���������������
����ڊ���������

Why does it not change lowercase to uppercase?
Thanks a lot for any reply.
PS: I hope the characters display properly.

wakatana · October 17, 2009, 9:34am

Maybe I know why "�" is different: it comes "after" z, so the regex [a-z] did not match "�". But many things are still unclear.

Scrutinizer · October 17, 2009, 9:56am

If you want uppercase as well you have to specify

grep "[a-zA-Z][a-zA-Z]*" file_name

fpmurphy · October 17, 2009, 5:59pm

What you are discussing is called a collating sequence. Do a web search for "POSIX collating sequence" for further information.

To be language-neutral, your example would be written as

grep "[[:alpha:]][[:alpha:]]*"  filename

or if you only want lowercase characters

grep "[[:lower:]][[:lower:]]*"  filename
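
To see how the locale affects a bracket range, here is a minimal sketch. The helper name ascii_lower_runs and the sample text are made up for illustration, and a reasonably recent grep is assumed. Setting LC_ALL=C for a single command makes [a-z] mean exactly the ASCII bytes 97 to 122, so accented letters are no longer picked up. What the same range matches under en_US.UTF-8 depends on the grep version and the locale's collation data, so that case is not shown.

```shell
# Match runs of ASCII lowercase letters only, whatever $LANG is set to.
# LC_ALL=C applies to this one grep invocation and overrides LANG and LC_*.
ascii_lower_runs() {
    LC_ALL=C grep -o '[a-z][a-z]*'
}

# "\303\241" is the UTF-8 encoding of a-acute; printf writes the raw bytes.
printf 'abc \303\241bc XYZ\n' | ascii_lower_runs
```

In the C locale the accented letter is skipped, so only the ASCII runs abc and bc are printed.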

cfajohnson · October 17, 2009, 11:32pm

wakatana:

Hello all,
I read somewhere that regular expressions work with the ASCII table, so when I type
grep "[a-z][a-z]*" file_name
it uses the values from ASCII decimal 97 (a) to decimal 122 (z), right?
But if I have a file containing diacritics, let's say (ordinary Slovak characters):
marek@cepi:~$ cat diakritika 
���������������
����ڊ���������

marek@cepi:~$ grep -o "[a-z][a-z]*" diakritika 
������
��������
��
��
Why does this regexp recognize diacritics? And why does it recognize only lowercase and not "�"? This seems strange to me. A friend told me it could have something to do with $LANG.

� comes after z, so it is not in the range you gave.

So my $LANG is:
marek@cepi:~$ echo $LANG
en_US.UTF-8
I would also like to ask: if I want to uppercase a file with diacritics, I type:
marek@cepi:~$ cat diakritika | tr "[:lower:]" "[:upper:]"
���������������
����ڊ���������
Why does it not change lowercase to uppercase?

Probably because those characters are not part of the en_US.UTF-8 definition of [:lower:] and [:upper:].
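
Another possible factor, assuming the tr in question is the GNU coreutils one, which works on single bytes: in UTF-8 an accented letter takes more than one byte, so a byte-by-byte tr cannot map it as one character. A small sketch to inspect the bytes (hex_bytes is a made-up helper name; the same pipeline can be pointed at the real file to check how it is encoded):

```shell
# Print the bytes of standard input as space-separated hex values.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

printf 'a' | hex_bytes; echo          # one byte:  61
printf '\303\241' | hex_bytes; echo   # two bytes: c3 a1 (a-acute in UTF-8)
```

If the file shows a single byte per accented letter instead, it is probably in a single-byte encoding such as ISO-8859-2 rather than UTF-8, and then it does not match the en_US.UTF-8 locale at all.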

wakatana · October 18, 2009, 12:42pm

Thank you for the reply. Is there a way to convert lowercase letters with diacritics to uppercase?

cfajohnson · October 18, 2009, 4:42pm

Use a locale in which they are defined. (I guess; I haven't tried it.)
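
A rough sketch of how one might check which locales are installed before trying that. The helper name locale_installed is made up, and sk_SK is only an assumed name for the Slovak locale; the exact spelling of the entries (for example a utf8 suffix) varies between systems.

```shell
# Succeed if "locale -a" lists at least one locale starting with the given prefix.
locale_installed() {
    locale -a 2>/dev/null | grep -q "^$1"
}

if locale_installed sk_SK; then
    # Show the exact names, which are what LC_ALL expects.
    locale -a | grep '^sk_SK'
else
    echo "no sk_SK locale installed"
fi
```

A listed name can then be set for one command only by putting LC_ALL=name in front of it, without changing $LANG for the whole session. Whether tr then converts the accented letters still depends on the file's encoding matching the locale and on the tr implementation handling that encoding.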

wakatana · October 18, 2009, 5:04pm

How? I am a newbie. Please post a how-to or point me to somewhere on the internet. Thank you.

Scrutinizer · October 18, 2009, 6:39pm

Perhaps the quirky "y" command in sed might come to the rescue?

sed 'y/abcdefghijklmnopqrstuvwxyz�������������/ABCDEFGHIJKLMNOPQRSTUVWXYZ�����ڊ������/' diakritika

You would have to specify all the possible upper and lower case special characters.

wakatana · October 20, 2009, 6:48am

Yes, that works, and it is probably the only possibility.