Hello:
I can't get equivalence classes to work in globs or when passing them to tr. If I understood correctly, [=e=] matches e, é, è, ê, etc. But when I use them with utilities like tr, they don't work. Here's an example found in the POSIX standard:
I decided to create the aforementioned files in order to show the results. Here are the contents of file1:
Estrés
Miraré
And these are the results in a GNU/Linux and a Solaris machine:
In UTF-8, é is the code point U+00E9 (encoded as the two bytes 0xC3 0xA9).
There should be a command called localedef.
There should also be a Spanish UTF-8 locale; you are calling it correctly.
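To confirm that the locale really is installed, something like the following might help. It is only a sketch: locale_listed is a hypothetical helper name of mine, and it assumes the only differences between spellings are letter case and the hyphen (es_MX.UTF-8 versus es_MX.utf8).

```shell
# Hypothetical helper: succeed if the locale named in $1 appears in the
# list of locale names read from standard input (one name per line).
locale_listed() {
    # Lowercase and drop "-" so "es_MX.UTF-8" and "es_MX.utf8" compare equal.
    want=$(printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    # -x: whole-line match, -F: fixed string (the "." is not a wildcard).
    tr 'A-Z' 'a-z' | tr -d '-' | grep -qxF -- "$want"
}
```

Typical use would be: locale -a | locale_listed es_MX.UTF-8 && echo installed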
Please post the output of this, which lists the character classes:
for class in $(
locale -v LC_CTYPE |
sed 's/combin.*//;s/;/\n/g;q'
) ; do
printf "\n\t%s\n\n" "$class"
done
If you get correct output, then character classes exist correctly in your locale. You may need to set the environment variable POSIXLY_CORRECT on Linux.
$ for class in $(
> locale -v LC_CTYPE |
> sed 's/combin.*//;s/;/\n/g;q'
> ) ; do
> printf "\n\t%s\n\n" $class
> done
upper
lower
alpha
digit
xdigit
space
print
graph
blank
cntrl
punct
alnum
I don't see any equivalence classes, just character classes. So it means there are none defined in the locale, right?
I was not clear. You thought your locale was messed up somehow, so I started at the beginning to debug it.
Looks okay. Next, tr has problems with equivalence classes.
[aªáàâãäå]
This is the long form of an equivalence class. Try it (use whatever letter is handy):
echo "aªáàâãäå" | sed 's/[aªáàâãäå...]/a/g'
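If the bracket expression itself misbehaves with multibyte characters, the long form can also be spelled out one character at a time. A minimal sketch, assuming the script and the input are both UTF-8 encoded; fold_a is a hypothetical name, and the list of variants is only the one from the example above:

```shell
# Hypothetical helper: replace each listed variant of "a" with a plain "a".
# Every variant gets its own s command, so sed only has to match literal
# byte sequences and never needs a multibyte-aware bracket expression.
fold_a() {
    sed -e 's/ª/a/g' -e 's/á/a/g' -e 's/à/a/g' -e 's/â/a/g' \
        -e 's/ã/a/g' -e 's/ä/a/g' -e 's/å/a/g'
}

printf '%s\n' 'aªáàâãäå' | fold_a
```

It is more typing than a bracket expression, but it should not depend on how well the local sed handles multibyte characters.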
On Linux this fails for me:
$ echo "aªáàâãäå" | sed 's/[=a=]/x/g'
xªáàâãäå
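One thing worth checking in the command above: [=a=] has no outer pair of brackets, so a POSIX regex engine reads it as an ordinary bracket expression listing = and a, not as an equivalence class. That may be why only the plain a was replaced. A minimal sketch of the difference, using LC_ALL=C and an ASCII-only input (both are my own choices, so that the result does not depend on which locales are installed):

```shell
# Without the outer brackets, "[=a=]" is just the set { "=", "a" }.
printf 'a=b\n' | LC_ALL=C sed 's/[=a=]/x/g'     # prints: xxb

# With the outer brackets, "[[=a=]]" is the equivalence class of "a";
# in the C locale that class contains only "a" itself.
printf 'a=b\n' | LC_ALL=C sed 's/[[=a=]]/x/g'   # prints: x=b
```

Whether the second form also matches accented letters depends on the collation data of the locale in use.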
The tr man page I have:
Try sed and use full classes to get past GNU problems. For Solaris I have no good answers; my home version is Solaris 9, and it is not POSIX compliant.
On Linux I had the same experience, but tr also gave an error message. It appears that tr only handles single-byte characters and does not understand equivalence classes here, whereas sed worked:
$ printf "%s\n" Estrés Miraré http://ën.wikipedia.org | LC_CTYPE=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8 tr '[=ë=]' x
tr: \303\253: equivalence class operand must be a single character
$ printf "%s\n" Estrés Miraré http://ën.wikipedia.org | LC_CTYPE=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8 sed 's/[[=ë=]]/x/g'
xstrxs
Mirarx
http://xn.wikipxdia.org
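The \303\253 in the tr error message is the UTF-8 encoding of ë written as two octal bytes, which fits the idea that this tr sees two separate characters. A small sketch to check the bytes of any string (bytes_octal is a hypothetical helper name of mine):

```shell
# Hypothetical helper: print the bytes of the first argument in octal,
# separated by single spaces.
bytes_octal() {
    printf '%s' "$1" |
        od -An -v -to1 |            # one octal number per byte, no offsets
        tr -s ' \n' '  ' |          # join lines and squeeze runs of blanks
        sed 's/^ //; s/ $//'        # trim the leading and trailing blank
}
```

In a UTF-8 terminal, bytes_octal 'ë' should print 303 253, while a plain e gives the single byte 145.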
I see. I was aware GNU sed had issues with multi-byte characters, like I mentioned in my first post. I was just confused why it didn't work on Solaris either.
This works! I forgot that you can use globs with a replace script in sed.
Thank you all for your help!
Now, I assume that in this case the problem wasn't that equivalence classes didn't work, but that it had something to do with tr. But I don't understand why they don't work in globs either:
$ ls -1
bin
Descargas
Documentos
Escritorio
Imágenes
Música
Plantillas
Público
Vídeos
$ printf '%s\n' *[[=u=]]*
Documentos
Estudio
Shouldn't Música and Público have appeared in the output of printf?
It appears that it has been implemented in the system's regex engine, but that it does not work with globbing. On Linux, in bash 4, compare:
$ touch Miraré
$ for file in M*; do if [[ $file == M*[[=e=]]* ]]; then echo "$file"; fi; done
$ for file in M*; do if [[ $file =~ ^M.*[[=e=]] ]]; then echo "$file"; fi; done
Miraré
$
I do not know what the case is with Solaris 10. It may be that equivalence classes were not specified in POSIX.1-2001. Perhaps they are supported in Solaris 11; you would have to try that out...
I see. Well, on the one hand it's good they're available in regular expressions, but it would be convenient to have them available as globs as well, especially for characters that I can't easily type using my keyboard without a long key combination and memorizing numbers.
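Until globs handle them, one possible workaround is to let the glob expand everything and filter the names through the regex engine instead. A minimal sketch in plain sh: list_matching is a hypothetical helper name, it assumes file names contain no newlines, and whether [[=u=]] also matches ú still depends on the locale and on the system's grep.

```shell
# Hypothetical helper: print the names in directory $1 that match the
# basic regular expression $2.
list_matching() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue      # skip the literal pattern if the directory is empty
        name=${path##*/}                # strip the directory part
        if printf '%s\n' "$name" | grep -q -e "$2"; then
            printf '%s\n' "$name"
        fi
    done
}
```

For example, list_matching . '[[=u=]]' would take the place of printf '%s\n' *[[=u=]]* from the earlier post.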