Hello all,
I read somewhere that regular expressions work with the ASCII table, so when I type
grep "[a-z][a-z]*" file_name
it uses the values from ASCII dec 97 (a) to dec 122 (z), right?
But if I have a file containing diacritics, let's say (ordinary Slovak characters):
marek@cepi:~$ cat diakritika
���������������
����ڊ���������
marek@cepi:~$ grep -o "[a-z][a-z]*" diakritika
������
��������
��
��
Why does this regexp know about diacritics? And why does it know only lowercase, and not "�"? This is strange to me. A friend told me it could have something to do with $LANG. So my $LANG is:
marek@cepi:~$ echo $LANG
en_US.UTF-8
I would also like to ask: if I want to uppercase a file with diacritics, I type:
marek@cepi:~$ cat diakritika | tr "[:lower:]" "[:upper:]"
���������������
����ڊ���������
Why does it not change lowercase to uppercase?
Thanks a lot for any reply.
PS: I hope the characters display properly.
Maybe I know why "�" is different: it sorts "behind" z, so the regex [a-z] did not match "�". But many things are still unclear.
wakatana:
why does it know only lowercase
If you want uppercase as well, you have to specify it:
grep "[a-zA-Z][a-zA-Z]*" file_name
What you are discussing is called a collating sequence. Do a web search for "POSIX collating sequence" for further information.
To be language-neutral, your example would be written as
grep "[[:alpha:]][[:alpha:]]*" filename
or if you only want lowercase characters
grep "[[:lower:]][[:lower:]]*" filename
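A minimal sketch of the difference between a range and a character class, assuming a grep that supports -o (as used in the question). The function names are made up for illustration. Forcing LC_ALL=C makes the range [a-z] behave the way the question expected, while the class [[:lower:]] leaves the decision to the current locale:

```shell
# Hypothetical helper: extract runs of ASCII lowercase letters only.
# With LC_ALL=C, [a-z] means the byte values 97..122, whatever $LANG says.
ascii_lower_runs() {
    LC_ALL=C grep -o '[a-z][a-z]*'
}

# Hypothetical helper: same idea with the POSIX class. Which characters
# count as lowercase is decided by the locale in effect (LC_CTYPE).
locale_lower_runs() {
    grep -o '[[:lower:]][[:lower:]]*'
}

# Prints "abc" and "ghi" on separate lines.
printf 'abc DEF ghi\n' | ascii_lower_runs
```

Running both helpers on the same file and comparing the output is a quick way to see which characters the current locale adds.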
wakatana:
Why does this regexp know about diacritics? And why does it know only lowercase, and not "�"? This is strange to me. A friend told me it could have something to do with $LANG.
� comes after z, so it is not in the range you gave.
I would also like to ask: if I want to uppercase a file with diacritics, I type:
marek@cepi:~$ cat diakritika | tr "[:lower:]" "[:upper:]"
���������������
����ڊ���������
Why does it not change lowercase to uppercase?
Probably because those characters are not part of the en_US.UTF-8 definition of [:lower:] and [:upper:].
Thank you for the reply. Is there a way to convert lowercase diacritics to uppercase?
Use a locale in which they are defined. (I guess; I haven't tried it.)
How? I am a newbie. Please post a how-to or point me to somewhere on the internet, thank you.
Perhaps the quirky "y" command in sed might come to the rescue?
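A rough sketch of how one might try a different locale, under several assumptions: that a Slovak UTF-8 locale is installed (the name sk_SK.UTF-8 is a guess; check with locale -a), that the file is really encoded in UTF-8, and that the awk in use is multibyte-aware (gawk is, as far as I know; other awks may convert only ASCII letters). The function name upper_in_locale is hypothetical.

```shell
# See which Slovak locales are installed (may print nothing):
#   locale -a | grep '^sk_'
#
# If the file is in a single-byte encoding instead of UTF-8, it would
# need converting first. ISO-8859-2 here is only an assumption:
#   iconv -f ISO-8859-2 -t UTF-8 diakritika > diakritika.utf8

# Hypothetical helper: upper-case stdin under the locale given as $1.
upper_in_locale() {
    LC_ALL="$1" awk '{ print toupper($0) }'
}

# ASCII input works in any locale; this prints HELLO.
printf 'hello\n' | upper_in_locale C

# Intended use on the file from the question:
#   upper_in_locale sk_SK.UTF-8 < diakritika
```

If the accented letters still come out unchanged, the encoding of the file and the awk implementation are the first things to check.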
sed 'y/abcdefghijklmnopqrstuvwxyz�������������/ABCDEFGHIJKLMNOPQRSTUVWXYZ�����ڊ������/' diakritika
You would have to specify all the possible upper and lower case special characters.
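For illustration, here is the same approach with a short, readable list. This is only a sketch: it covers the ASCII letters plus three Slovak pairs (á, č, ž) picked as examples, not the full alphabet, and it assumes the input uses the same encoding as the script itself. The function name upper_sk_subset is hypothetical.

```shell
# Hypothetical helper: map each listed lowercase character to the
# uppercase character at the same position in the second list.
# Both lists must contain the same number of characters.
upper_sk_subset() {
    sed 'y/abcdefghijklmnopqrstuvwxyzáčž/ABCDEFGHIJKLMNOPQRSTUVWXYZÁČŽ/'
}

# Prints ABC.
printf 'abc\n' | upper_sk_subset
```

Any character missing from the first list passes through unchanged, which is why every accented letter has to be listed.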
Yes, that works, and it is probably the only possibility.