Problem with tr command

Hi friends,
Today I found one strange behaviour of the tr command.
I used the following command:

echo "NEE"|tr [A-Z] [a-z] 

Sometimes it gave "nee" as output, and sometimes it gave "NEE".
Finally I used the below code:

echo "NEE"|tr "[A-Z]" "[a-z]" 

... and it gave me the correct result every time. I just want to know the reason behind this strange behaviour. Is it a problem with my shell? I am working in ksh88. Please help.

I can't really speak to the inconsistent output using "[A-Z] [a-z]"; however, I can say that I have never seen it when using this method to go from upper to lower case...

echo "NEE"|tr [:upper:] [:lower:]

This should be a syntax error. You must have a funny O/S if it works at all.

The correct syntax is:

echo "NEE" | tr '[A-Z]' '[a-z]'

... or to paraphrase man tr

echo "NEE" | tr "[:upper:]" "[:lower:]"

... but I prefer single quotes to make it totally obvious that the parameters are for tr not Shell.

echo "NEE" | tr '[:upper:]' '[:lower:]'

Btw. I've seen some amazing anomalies if there are files with single character names in the current directory.

Hmm

man tr on my servers does not show the need for a single or double quote.

I am using RHEL 5.3 and have been using the "[:upper:] [:lower:]" notation for years without quoting them. In fact we have a handful of Solaris and even HP-UX servers and many of our common scripts use that notation.

What OS are you using?
What version?
What version of tr?

It would be good to remember that quoting is required on that OS/Version or more importantly that version of tr.

@neelmani
Please post what Operating System and version you are running and what Shell you use.
(Prediction: The O/S is going to be something unusual).

@ddreggors
Linux syntax can be wildly and inconsistently different from unix syntax and from other Linux systems syntax.
The unquoted syntax posted by the O/P is a syntax error on HP-UX.
There are far too many scripts running on live systems which contain faulty tr commands.
Please post your syntax and context.

Unlike character classes, tr ranges are not delimited by square brackets. tr '[A-Z]' '[a-z]' is converting [ and ] into themselves, a silent, harmless error.

The unquoted version could trigger pathname expansion in the shell, but I don't know why it would give you a syntax error. Could you please elaborate? Is it a tr syntax error? A shell error?

I think the original problem is as you alluded, the result of the shell matching single letter pathnames against the unquoted [a-z] and [A-Z] . This could alter the arguments to tr in such a way that none of the letters in NEE match the first range and so are never converted.

I would suggest trying tr A-Z a-z (valid only in the c/posix locale).
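A minimal sketch of that suggestion, assuming a POSIX-style tr (GNU or BSD, for example). LC_ALL=C is set only for the tr process so that A-Z means the 26 ASCII capitals; the variable names are purely illustrative.

```shell
# Bracket-free ranges, quoted so the shell never treats them as patterns.
lower=$(echo "NEE" | LC_ALL=C tr 'A-Z' 'a-z')

# With brackets, a POSIX tr also maps '[' to '[' and ']' to ']',
# which changes nothing in the output.
bracketed=$(echo "[NEE]" | LC_ALL=C tr '[A-Z]' '[a-z]')

printf '%s %s\n' "$lower" "$bracketed"
```

On such a tr this should print "nee [nee]": the bracketed form only works because the brackets are translated into themselves.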

Regards,
Alister

---------- Post updated at 06:02 PM ---------- Previous update was at 05:53 PM ----------

It's not tr that requires the quoting; it's the shell (any posix-like sh). [:upper:] and [:lower:] are valid pathname expansion operations in the shell, each matching a single character name. If there's a match, instead of tr seeing a character class it will see the single character name with which the shell replaced the original character class (which to the shell was a pattern for globbing). If there's no match (probably usually the case, which is why you haven't noticed a problem), the shell doesn't modify the pattern and tr is invoked as intended.
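To make that concrete, here is a small sketch, assuming a POSIX-like sh with default globbing options (no nullglob or failglob) and a mktemp that supports -d, which is common but not required by POSIX. The function name is hypothetical. To the shell, [:upper:] is just the set of characters : u p e r, and [:lower:] is the set : l o w e r, so a file named "e" matches both.

```shell
# Hypothetical demo: everything runs in a throwaway directory inside a
# subshell, so the caller's working directory is left alone.
class_glob_demo() (
  scratch=$(mktemp -d) || exit 1
  cd "$scratch" || exit 1

  # No single-character file names yet: the patterns match nothing and
  # are handed to tr unchanged.
  before=$(echo NEE | tr [:upper:] [:lower:])

  # A file named "e" matches both patterns, so the shell now runs: tr e e
  touch e
  after=$(echo NEE | tr [:upper:] [:lower:])

  # Quoting keeps the shell out of it.
  quoted=$(echo NEE | tr '[:upper:]' '[:lower:]')

  printf '%s %s %s\n' "$before" "$after" "$quoted"
  cd / && rm -rf "$scratch"
)

class_glob_demo
```

Under those assumptions it should print "nee NEE nee": the unquoted form silently stops converting as soon as the stray file appears.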

Not to put a fine point on it, but all of those scripts have a bug and should be quoted.

Regards,
Alister

Thanks alister, that is good to know. I have been using tr for years with that syntax and just never have seen any problems. It had always performed as expected. When you mention the shell globbing it makes a lot more sense. I guess as you said, I have just been lucky :eek:

Admittedly, I use tr less often than say grep, awk, or sed... but enough that this is valuable to know.

Here we go. The reason to use quotes round the range, and an example of tr misbehaving on HP-UX if there are single character filenames in the current working directory.

echo "NEE"|tr [A-Z] [a-z]
nee
cd ..
echo "NEE"|tr [A-Z] [a-z]
tr: The combination of options and String parameters is not legal.
Usage: tr [ -c | -cds | -cs | -ds | -s ] [-A] String1 String2
       tr [ -cd | -cs | -d | -s ] [-A] String1
ls -lad ?
-rw-r--r--   1 root       sys              0 Jun  7 11:58 a
-rw-r--r--   1 root       sys              0 Jun  7 11:58 b
-rw-r--r--   1 root       sys              0 Jun  7 11:58 c
echo "NEE"|tr '[A-Z]' '[a-z]'
nee
echo "NEE"|tr 'A-Z' 'a-z'
nee
# And finally - protecting the square brackets from the Shell (which is the same as using quotes).
echo "NEE"|tr \[A-Z\] \[a-z\]
nee

P.s. My post yesterday was bad because I just happened to try the command in a directory containing single character filenames! However the HP-UX man tr does show ranges complete with square brackets and surrounded by double quote characters.

I was looking at the opengroup's tr documentation and indeed it mentions that System V tr did use square-brackets to delimit ranges; BSD systems did not.

The standard chose the BSD syntax as the lesser of two evils (most SysV scripts would continue to work fine, as opposed to breaking every BSD script using ranges).

More detailed info in the RATIONALE @ http://pubs.opengroup.org/onlinepubs/009695399/utilities/tr.html

While the range syntax just discussed is specified clearly, the source of your syntax error is the result of undefined behavior. To do its job, tr requires the second of two character strings to be at least as long as the first. When the second character string is shorter (after expanding ranges, character classes, and repetition operations), BSD-ish tr pads the second string by repeating its final character until the string lengths are equal. This padding behavior does not allow the situation which triggered the syntax error to occur. SysV-ish tr does not pad and errors out.
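A short sketch of the difference, assuming a tr that supports the POSIX [c*] repetition expression. As I read the standard, the result of a shorter second string is left unspecified, so only the first command below is portable; the variable names are illustrative.

```shell
# POSIX-defined: [x*] repeats x as often as needed to match the first string.
explicit=$(echo abc | tr 'abc' '[x*]')

# Not portable: the second string is shorter than the first.
# A padding tr treats it like the line above; others may differ or complain.
implicit=$(echo abc | tr 'abc' 'x' 2>&1)

printf 'explicit: %s\nimplicit: %s\n' "$explicit" "$implicit"
```

Writing the repetition explicitly avoids depending on whether a given tr pads.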

Padding is discussed in the APPLICATION USAGE section of the opengroup tr manual page linked above.

I have no experience with HP-UX. While the results of your tr invocations seem to indicate that your tr is POSIX-compliant, the HP-UX tr manual I consulted states that ranges can be specified with and without brackets.

From http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02273397/c02273397.pdf

If that's accurate, then it cannot be compliant as those are not equivalent expressions per the standard.

Expected result:

$ echo '[abc]' | tr '[a-c]' '[.*]'
.....
$ echo '[abc]' | tr 'a-c' '[.*]'
[...]

I would be curious to know the result of those commands on your system.

Regards,
Alister

@alister

echo '[abc]' | tr '[a-c]' '[.*]'
[...]
echo '[abc]' | tr 'a-c' '[.*]'
[...]

I can get the square brackets to translate with:

echo '[abc]' | tr '[a-c]\[\]' '[.*]'
.....

I think that this proves that square brackets are special in the syntax of tr range specifications. It is also the syntax I have used for umpteen years on assorted versions of unix. The slight variation on a Linux or BSD system is no surprise.

Finally this tr -d illustrates what I mean:

echo "[abc]" | tr -d '[a-c]'
[]

I use tr -d frequently in numeric or alphabetic data validation.
What do you get on a Linux system with that command?
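For readers unfamiliar with that validation idiom, a minimal sketch follows. The function name is_numeric is hypothetical, and it uses the quoted [:digit:] class rather than a range, since the class syntax is not in dispute between the SysV and POSIX forms discussed in this thread.

```shell
# Succeeds only if $1 is non-empty and consists entirely of digits:
# delete every digit and check that nothing is left over.
is_numeric() {
  [ -n "$1" ] && [ -z "$(printf '%s' "$1" | tr -d '[:digit:]')" ]
}

for value in 12345 12a45; do
  if is_numeric "$value"; then
    echo "$value: numeric"
  else
    echo "$value: not numeric"
  fi
done
```

printf is used instead of echo so that no trailing newline or backslash handling gets in the way.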

Based on your results, HP-UX tr is not POSIX-compliant in ways that have been part of the standard for at least 15 years now (perhaps 20). I do not say this pejoratively; it's merely an observation.

I could not find IEEE Std 1003.2-1992 online, which I believe is the first standard to include the utilities (IEEE Std 1003.1-1988 only covered core system services).

The Single UNIX ® Specification, Version 2
Copyright © 1997 The Open Group
tr manual page
http://pubs.opengroup.org/onlinepubs/009695399/utilities/tr.html

As I said before, I have no experience with HP-UX nor do I know what it aspires to be.

Perhaps backwards compatibility is most important to HP and its userbase. If that's the case, then it was a mistake to add support for the POSIX/BSD range syntax. a-c in historical SysV tr means three characters, a , - , and c ; it's equivalent to ac- .

If, however, HP endeavours to be POSIX-compliant, then your results are unexpected and erroneous; scripts that are compliant and work as expected on compliant systems can fail on HP-UX.

That's the expected result for historical SysV behavior, but it's not POSIX-compliant. In a POSIX tr range expression, the brackets are not special at all; [a-c] is equivalent to ][a-c .

The POSIX-compliant result is .....

The \[ and \] escape sequences are undefined in POSIX. Their use is not portable.

That gives me nothing (except for the untranslated newline emitted by echo), which is the POSIX-compliant result.

Linux or BSD is irrelevant for the purposes of this discussion. I'm simply playing POSIX lawyer at the moment ;).

It appears that HP-UX tr added support for the BSD range expression syntax that POSIX long ago adopted, a-c , but it continues to accept historical SysV syntax, [a-c] , treating them identically even though according to POSIX they mean different things (the latter includes two brackets which the former does not).

It's understandable that you've been using this syntax for a very long time without any obvious problems. With a SysV tr, the range expression behaves as you intend. With a POSIX or BSD tr, in most instances, where both strings consist of a range expression, the brackets are silently translated into identical characters. While the brackets were not intended to be members of the translation set, since they are translated into themselves, the result is correct (which is why the POSIX standard chose to go with the BSD syntax, less collateral damage). However, in other cases, for example, when only the first string contains a range expression and the second is a repetition expression, tr '[a-z]' '[.*]' , there exists a potential for a silently erroneous result. And if the tr implementation does padding on the second string, then the repetition expression isn't required for a silent error to occur, tr '[a-z]' '.' .
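The three cases can be put side by side in a small sketch, assuming a POSIX-style tr such as GNU or BSD (a SysV-style tr like the HP-UX one in this thread will answer differently). The variable names are illustrative only.

```shell
# Both strings are bracketed ranges: '[' and ']' map to themselves, no harm done.
harmless=$(echo '[abc]' | LC_ALL=C tr '[a-c]' '[A-C]')

# Bracketed range against a repetition: the brackets are translated too.
clobbered=$(echo '[abc]' | LC_ALL=C tr '[a-c]' '[.*]')

# Bracket-free range: only a, b and c are touched.
intended=$(echo '[abc]' | LC_ALL=C tr 'a-c' '[.*]')

printf '%s\n' "$harmless" "$clobbered" "$intended"
```

On a POSIX-style tr the three lines should read [ABC], ..... and [...] respectively.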

methyl, I greatly appreciate your responses to my questions. I realize that these are rarely encountered corner cases, but they pique my curiosity. I often learn more than I intend as I dig into them.

Regards,
Alister

I can only explain the issue in post #1 if the O/P has a combination of file(s) in the current working directory which causes the shell to change the tr command to something wrong but syntactically correct. I have not found a way to achieve this yet.

@alister
The new(ish) thing is the Shell reacting to unquoted square brackets which is why we need to quote them in a tr .

ls ?
A  B  C

ls [A-C]
A  B  C

echo [A-C]
A B C

Edit: Noted that @alister has found a way of generating the anomaly. See subsequent posts.

It's trivial, in certain environments. I haven't used linux in a long while, but I still have a 2006 Debian install on an old laptop. It uses the en_US.UTF-8 locale by default. Here's one way to replicate OP's issue (lines starting with + are the trace output produced by set -x):


$ locale | grep COLLATE
LC_COLLATE="en_US.UTF-8"
$ mkdir test
$ cd test
$ set -x
$ echo NEE | tr [A-Z] [a-z]
+ tr '[A-Z]' '[a-z]'
+ echo NEE
nee
$ touch F
+ touch F
$ echo NEE | tr [A-Z] [a-z]
+ tr F F
+ echo NEE
NEE

Note that both [A-Z] and [a-z] matched F , because the locale's collation sequence interleaves uppercase and lowercase letters, including F in both sh glob patterns.

Since many implementations will simply ignore excess characters in the second string, they can be used to reproduce the error even when using a locale which does not interleave upper and lower case letters. Continuing where we left off, in the same working directory and environment:

$ LC_COLLATE=C
+ LC_COLLATE=C
$ echo NEE | tr [A-Z] [a-z]
+ tr F '[a-z]'
+ echo NEE
NEE

If your locale's collation isn't interleaved and if your tr implementation will not accept a longer second set of characters, then you will not be able to reproduce either result.
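If set -x is too noisy, the same check can be done with a small helper. This is only a sketch: show_tr_args is a hypothetical name, mktemp -d is assumed to be available, and LC_ALL=C is forced so that the result does not depend on the caller's collation (under an interleaving locale the second pattern may match F as well, depending on the shell).

```shell
# Hypothetical helper: print the tr command line as the shell would build it
# from unquoted patterns inside directory $1. The subshell body keeps the
# caller's working directory and locale untouched.
show_tr_args() (
  cd "$1" || exit 1
  LC_ALL=C                 # byte-order collation for this subshell only
  set -- [A-Z] [a-z]       # expanded by the shell exactly as it would be for tr
  printf 'tr'
  printf ' %s' "$@"
  printf '\n'
)

scratch=$(mktemp -d) || exit 1
touch "$scratch/F"
show_tr_args "$scratch"
rm -rf "$scratch"
```

With a single file named F and C collation, this should print "tr F [a-z]", matching the second trace above.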

Regards,
Alister

It has nothing to do with tr. It'd happen anywhere you use an unquoted [a-z].

$ touch a b c d e f g h i j k l m n o p q r s t u v w x y z
$ echo [a-z]
a b c d e f g h i j k l m n o p q r s t u v w x y z
$

@alister
Very good explanation and example.

If it can be explained by a simple change of environment, that's brilliant. I'd forgotten that certain Linux systems use UTF-8, and about some of the other consequences, like really slow sort, grep, sed, etc. There are many advantages of UTF-8/UTF-16/UTF-nn in the word processing world, but I can't see any advantage on the default command line.

As you will all know I'm a bit of a Posix cynic (though I support the Posix ideal) mainly because I come across too many non-compliant situations. I have lost track of the "current Posix standard" (despite @jlliagre's best efforts).

I'm not sure that the Original Post is 100% one of those situations because the O/P noticed that what I believed was the portable syntax worked. The Linux/BSD anomaly with the treatment of quoted square brackets in the tr command is in my opinion a coding error.

Not sure if that quote is in context, because this has nothing to do with Escape Sequences (like for Printer control). If it is, then IMHO that Posix standard needs urgent correction. It's universally expected in unix syntax that a character which is part of the syntax of a command must be escaped if it is to be considered as data.

Hehehe. Understandable. You may enjoy standards

I try to keep up with POSIX enough to know when I'm using an extension, but that's about it. Most of the time, when I dig into it, it's a result of investigating some oddity in one of this forum's threads.

As far as POSIX tr is concerned, the brackets are not special so they do not require a backslash escape sequence. Real world implementations probably drop the backslash and treat the following square-bracket literally, as your implementation is doing (although in your case it's by necessity since it's trying to accommodate the historical syntax), but doing so isn't required by POSIX. A compliant implementation of tr could choose to abort with an error when it encounters \[ or \] .

Regards,
Alister

Sorry, not going to click on standards because the URL looks dubious.

How a Posix version of tr is going to see \[ \] beats me. The Shell gets there first and eats the backslash.

Anyway, I think that this is yet another Posix mistake and we will have to code round it.

Please don't blame POSIX for your misunderstanding of single quotes:

$ echo '\[\]'
\[\]

$ echo '\\'
\\

$
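The same point can be checked with printf, which sidesteps the fact that some echo implementations interpret backslashes themselves (so the second echo above may print a single backslash on some systems). A minimal sketch with illustrative variable names:

```shell
# Unquoted: the shell removes the backslashes before the command sees them.
unquoted=$(printf '%s' \[a-c\])

# Single quotes: everything is passed through verbatim, backslashes included.
single=$(printf '%s' '\[a-c\]')

# Double quotes: a backslash before '[' or ']' is also kept.
double=$(printf '%s' "\[a-c\]")

printf '%s\n' "$unquoted" "$single" "$double"
```

So a quoted '\[a-c\]' does reach tr with its backslashes intact, which is why the question of how tr treats \[ arises at all.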