Regular expressions and foreign-language characters

Hello all,
I read somewhere that regular expressions work with the ASCII table, so when I type

grep "[a-z][a-z]*" file_name

it uses the values from ASCII decimal 97 (a) to decimal 122 (z), right?
But if I have a file containing diacritics, let's say (ordinary Slovak characters):

marek@cepi:~$ cat diakritika 
���������������
����ڊ���������

marek@cepi:~$ grep -o "[a-z][a-z]*" diakritika 
������
��������
��
��

Why does this regexp match the diacritics? And why does it match only the lowercase ones and not "�"? This seems strange to me. A friend told me it could have something to do with $LANG. My $LANG is:

marek@cepi:~$ echo $LANG
en_US.UTF-8

I would also like to ask: if I want to uppercase a file with diacritics, I type:

marek@cepi:~$ cat diakritika | tr "[:lower:]" "[:upper:]"
���������������
����ڊ���������

Why doesn't it change lowercase to uppercase?
Thanks a lot for any reply.
PS: I hope the characters display properly.

Maybe I know why "�" is different: it comes "behind" z, so the regex [a-z] does not match it. But many things are still unclear.

If you want uppercase as well, you have to specify

grep "[a-zA-Z][a-zA-Z]*" file_name

What you are discussing is called a collating sequence. Do a web search for "POSIX collating sequence" for further information.
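
As a rough illustration of how the collating sequence changes what a range like [a-z] covers: if you force the C locale, the range is just the ASCII codes 97 to 122, so the accented letters drop out of the match. The sample words and the function name ascii_runs below are made up for the example; they are not taken from the file in the question.

```shell
# Sample text is invented for illustration; it is not the poster's file.
sample='žaba čaj'

# In the C locale, [a-z] covers only ASCII a..z (decimal 97-122),
# so the bytes of the accented letters are never part of a match.
ascii_runs() {
    printf '%s\n' "$1" | LC_ALL=C grep -o '[a-z][a-z]*'
}

ascii_runs "$sample"
```

This prints "aba" and "aj" on separate lines. Without LC_ALL=C, the result depends on the collation rules of whatever locale $LANG selects, which is the effect seen in the question.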

To be language-neutral, your example would be written as

grep "[[:alpha:]][[:alpha:]]*"  filename

or if you only want lowercase characters

grep "[[:lower:]][[:lower:]]*"  filename

� comes after z, so it is not in the range you gave.

Probably because those characters are not part of the en_US.UTF-8 definition of [:lower:] and [:upper:].

Thank you for the reply. Is there a way to convert lowercase diacritics to uppercase?

Use a locale in which they are defined. (I guess; I haven't tried it.)

How? I am a newbie. Please post how to do it or point me to somewhere on the internet. Thank you.

Perhaps the quirky "y" command in sed might come to the rescue?

sed 'y/abcdefghijklmnopqrstuvwxyz�������������/ABCDEFGHIJKLMNOPQRSTUVWXYZ�����ڊ������/' diakritika

You would have to specify all the possible upper and lower case special characters. ;)

Yes, that works, and it is probably the only possibility :)
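
One more option that may be worth trying, so you don't have to list every character by hand. As far as I know, GNU tr works on single bytes, which would explain why the accented letters (multi-byte in UTF-8) came through unchanged. GNU sed has a \U escape that uppercases the replacement text and should respect the current locale. This is only a sketch: it assumes GNU sed and an active UTF-8 locale such as en_US.UTF-8, and the function name to_upper and the sample words are made up for the example.

```shell
# Assumption: GNU sed (the \U case-conversion escape is a GNU extension)
# and a UTF-8 locale in effect, e.g. LANG=en_US.UTF-8.
to_upper() {
    # Replace each whole line with its uppercased form.
    sed 's/.*/\U&/'
}

# Sample words are invented for illustration.
printf '%s\n' 'žltý kôň' | to_upper
```

For the file from the question that would be something like to_upper < diakritika. If the accented letters still come out in lowercase, check what "locale" prints, since the conversion depends on the locale's character tables.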