Equivalence classes don't work

Hello:
I can't get equivalence classes to work in globs or when passing them to tr. If I understood correctly, [=e=] matches e, é, è, ê, etc. But when using them with utilities like tr they don't work. Here's an example found in the POSIX standard:

I decided to create the aforementioned files in order to show the results. Here are the contents of file1:

Estrés
Miraré

And these are the results on a GNU/Linux machine and on a Solaris machine:

$ uname -a
Linux sigma 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

$ locale
LANG=es_ES.UTF-8
LANGUAGE=
LC_CTYPE="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_PAPER="es_ES.UTF-8"
LC_NAME="es_ES.UTF-8"
LC_ADDRESS="es_ES.UTF-8"
LC_TELEPHONE="es_ES.UTF-8"
LC_MEASUREMENT="es_ES.UTF-8"
LC_IDENTIFICATION="es_ES.UTF-8"
LC_ALL=

$ tr "[=e=]" "[e*]" <file1 >file2

$ cat file2
Estrés
Miraré
  
$ uname -a
SunOS solaris 5.11 11.3 i86pc i386 i86pc

$ locale
LANG=es_ES.UTF-8
LC_CTYPE="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_ALL=

$ tr "[=e=]" "[e*]" <file1 >file2

$ cat file2
Estrés
Miraré
 

Why aren't the accented e's replaced?

GNU tr doesn't support multi-byte characters, but the Solaris implementation does:

$ printf 'Estrés\n' | tr '[:lower:]' '[:upper:]'
ESTRÉS

So I don't know why it's failing on Solaris. Am I using equivalence classes correctly?
Thanks in advance.

Try the utils from /usr/xpg4/bin on Solaris...

I've tried with /usr/xpg4/bin/tr, /usr/xpg6/bin/tr and even changing the shell to /usr/xpg4/bin/sh, but none of them worked.

Oddly, the example I provided appears in the examples section of the tr manpage in Solaris...

Could it be that whoever made the Spanish locale in these systems didn't define any equivalence class?

In UTF-8, é (U+00E9) is encoded as the two bytes 0xC3 0xA9.
There should be a command called localedef.
There should also be a Spanish UTF-8 locale, and you are selecting it correctly.

Please post the output of this, which lists the character classes:

for class in $(
    locale -v LC_CTYPE | 
    sed 's/combin.*//;s/;/\n/g;q'
) ; do 
    printf "\n\t%s\n\n" $class
 done

If you get sensible output, then the character classes are defined correctly in your locale. You may need to set the environment variable POSIXLY_CORRECT on Linux.

Here's the output shown in Debian:

$ for class in $(
>     locale -v LC_CTYPE | 
>     sed 's/combin.*//;s/;/\n/g;q'
> ) ; do 
>     printf "\n\t%s\n\n" $class
>  done

    upper


    lower


    alpha


    digit


    xdigit


    space


    print


    graph


    blank


    cntrl


    punct


    alnum

 

I don't see any equivalence classes, just character classes. So it means there are none defined in the locale, right?

I was not clear. You thought your locale was messed up somehow, so I started at the beginning to debug it.
It looks okay. Next: tr has problems with equivalence classes.

[aªáàâãäå]

This is the long form of an equivalence class. Try it (use whatever letter is handy):

echo "aªáàâãäå" | sed 's/[aªáàâãäå...]/a/g'

On Linux this fails for me:

$ echo "aªáàâãäå" | sed 's/[=a=]/x/g'
xªáàâãäå

The tr man page I have:

Try sed and spell out the full classes to get past the GNU problems. For Solaris I have no good answers; my home version is Solaris 9, and it is not POSIX compliant.
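
As a rough sketch of the "spell it out" approach: the helper below gives each accented form its own substitution, so sed only has to match literal byte sequences and does not depend on the locale defining an equivalence class. The name fold_e is hypothetical, and the list of variants is an assumption that covers only a few lowercase forms, not a complete class.

```shell
#!/bin/sh
# fold_e is a hypothetical helper: it replaces a few accented forms of "e"
# with a plain "e". Each variant is a literal string, so sed matches its
# bytes directly and never consults the locale's equivalence-class tables.
fold_e() {
    sed -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' -e 's/ë/e/g'
}

# Same sample words as file1 in the first post.
printf '%s\n' 'Estrés' 'Miraré' | fold_e
```

Unlike the tr example from the first post, this leaves the uppercase E alone; add more -e expressions if other variants are needed.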


On Solaris 10, I tried the following, using the POSIX compliant utilities which are in /usr/xpg[46]/bin:

$ export PATH=/usr/xpg6/bin:/usr/xpg4/bin:$PATH
$ printf "%s\n" Estrés Miraré http://ën.wikipedia.org | LC_CTYPE=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8 tr '[=ë=]' x
Estrés
Miraré
http://xn.wikipedia.org
$ printf "%s\n" Estrés Miraré http://ën.wikipedia.org | LC_CTYPE=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8 sed 's/[[=ë=]]/x/g'
xstrxs
Mirarx
http://xn.wikipxdia.org

So tr did not work, but sed did.

On Linux I had the same experience, except that tr also printed an error message. It appears that GNU tr only handles single-byte characters and so cannot take a multi-byte character as an equivalence class operand, but sed worked:

$ printf "%s\n" Estrés Miraré http://ën.wikipedia.org | LC_CTYPE=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8 tr '[=ë=]' x
tr: \303\253: equivalence class operand must be a single character
$ printf "%s\n" Estrés Miraré http://ën.wikipedia.org | LC_CTYPE=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8 sed 's/[[=ë=]]/x/g'
xstrxs
Mirarx
http://xn.wikipxdia.org
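
The \303\253 in the tr error message is simply the UTF-8 encoding of ë written as octal bytes. A small sketch to check this, assuming the script itself is saved as UTF-8:

```shell
#!/bin/sh
# "ë" typed in a UTF-8 terminal reaches tr as these two bytes (octal dump).
printf 'ë' | od -An -to1

# The reverse direction: emitting the two bytes from tr's message yields "ë".
printf '\303\253\n'

# Byte count of the operand that tr rejected.
printf 'ë' | wc -c
```

The dump should show 303 253 and the count should be 2, which is why a tool that works one byte at a time sees two operands inside [= =] instead of one.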

I see. I was aware GNU tr had issues with multi-byte characters, as I mentioned in my first post. I was just confused about why it didn't work on Solaris either.

This works! I forgot that you can use bracket expressions in a sed substitution.
Thank you all for your help!

Now, I assume that in this case the problem wasn't that equivalence classes didn't work, but something specific to tr. But I don't understand why they don't work in globs either:

$ ls -1
bin
Descargas
Documentos
Escritorio
Imágenes
Música
Plantillas
Público
Vídeos

$ printf '%s\n' *[[=u=]]*
Documentos
Estudio
 

Shouldn't Música and Público have appeared in the output of printf?

It appears that equivalence classes are implemented in the system's regex engine but do not work with globbing. On Linux, in bash 4, compare:

$ touch Miraré
$ for file in M*; do if [[ $file == M*[[=e=]]* ]]; then echo "$file"; fi; done
$ for file in M*; do if [[ $file =~ ^M.*[[=e=]] ]]; then echo "$file"; fi; done
Miraré
$

I found this for Linux Standard Base Core Specification 4.1: Pattern Matching Notation

I do not know what the case is with Solaris 10. It may be that equivalence classes were not specified in POSIX.1-2001. Perhaps they work in Solaris 11; you would have to try that out...
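
While globs ignore equivalence classes, one possible workaround in plain sh is to list each variant as its own alternative in a case pattern. This is only a sketch: has_u_variant is a hypothetical name, the set of accented forms is an assumption, and the sample names stand in for real directory entries.

```shell
#!/bin/sh
# has_u_variant is a hypothetical helper: it succeeds if the name contains
# "u" or one of a few accented forms of it. Each alternative is a literal
# pattern, so the shell needs no equivalence-class support.
has_u_variant() {
    case $1 in
        *u*|*ú*|*ù*|*û*|*ü*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample names taken from the directory listing earlier in the thread.
for name in bin Documentos Imágenes Música Público; do
    if has_u_variant "$name"; then
        printf '%s\n' "$name"
    fi
done
```

To run it against a real directory, replace the sample list with `for name in *`. Uppercase forms are not covered and would need their own alternatives.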


I see. Well, on the one hand it's good they're available in regular expressions but it would be convenient to have them available as globs as well, especially for characters that I can't easily type using my keyboard without a long key combination and memorizing numbers.

Thanks for your help, Scrutinize!