How Would You Like Your Loops Served Today?

Scrutinizer and i had a discussion about loops in shell scripts, and you might be interested in joining in and sharing your experiences:

i wrote an example script which basically employed the following logic:

cat /some/file | while read var ; do
     echo "var = $var"         # just do something with $var
done

Scrutinizer said this is a UUOC (Useless Use Of Cat). Well, in principle, he is of course right. We could write the same thing this way:

while read var ; do
     echo "var = $var"         # just do something with $var
done < /some/file

But still, i beg to differ. This is not a useless but a very sensible use of cat! Suppose the loop were not as short as the example here, but several screen pages long. To understand what goes into "$var" one would have to scroll down to its end; then, to find out what is done with "$var", scroll back up again.

Is it only me, or do you also hate having to scroll up and down repeatedly? i find it a lot easier to read if i "steer my loops from the top" instead of from the (maybe far-away) bottom.

Of course, there is this alluring GNU shellnik startup called bash. In bash every element of a pipeline runs in a subshell, which has some really weird side effects, like variables set inside the pipeline being local to it. The following works in both shells:

cat /some/list | while read entry ; do
     line="$line $entry"
done
echo "$line"         # what is in there?

But while in ksh "$line" would hold all the list entries after the loop, in bash the variable would be empty, because the loop ran in a subshell that is gone by the time the echo executes! Is it only me, or do you think this is counter-intuitive too?
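A minimal sketch of what one has to write in bash instead to keep "$line" alive (the shopt note assumes bash 4.2 or newer):

line=""
while read entry ; do
     line="$line $entry"
done < /some/list        # redirection instead of a pipe: no subshell
echo "$line"             # now holds all the entries in bash too

(bash 4.2 and newer also offer shopt -s lastpipe, which runs the last element of a pipeline in the current shell - in scripts, where job control is off - so even the cat variant would then preserve "$line".)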

So probably in bash one has to resort to this ugly style of meticulously telling the shell the recipe at length while staying totally silent about the ingredients you want to use - until the very end. Could you imagine cookbooks written that way? Three pages of instructions first, and only after the last step a terse note: "ingredients: flour, eggs, water".

But the question remains: do you think this - in strictest terms - UUOC should be avoided even if it has no negative side effects, or do you think the gain in clarity outweighs it?

Discuss!

bakunin

Pipes are not free, so from an efficiency standpoint I prefer to avoid the cat construct. Further, I prefer my shell loops this way:

#!/usr/bin/env ksh

# {rfd}<some-file makes ksh93 allocate a free file descriptor, store its
# number in rfd, and open some-file on it; the loop reads from that fd
while read -u $rfd buf
do
    echo "$buf"
done {rfd}<some-file

Having the shell open the input to the while loop on a file descriptor other than standard input prevents me, or some future maintainer, from adding an ssh command (or a similar stdin-gobbling binary) and forgetting to redirect its stdin from /dev/null, or to use an alternate mechanism (-n in the case of ssh), to keep the binary from causing odd problems with the loop. Maybe it's just me, but it seems there have been a fair few posts on this forum related to a while loop's stdin being 'eaten' by a process inside the loop.
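A sketch of the failure mode and its fix (hosts.txt and the uptime call are made up for the illustration):

while read host ; do
    ssh "$host" uptime        # ssh drains the rest of hosts.txt from stdin,
done < hosts.txt              # so the loop sees only the first host

while read host ; do
    ssh -n "$host" uptime     # -n gives ssh its stdin from /dev/null
done < hosts.txt              # and leaves the loop's input alone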

Letting bash try to run this results in several errors.

A slight twist on the code above allows for the input to be defined at the top of the loop without requiring the extra cat:

# open data on an automatically assigned descriptor before the loop starts
exec {rfd}<data
while read -u $rfd buf
do
    echo "$buf"
done

Again, it doesn't work in bash. The loop below does, but I don't like having to pick the file descriptor value myself, and rfd=3; exec ${rfd}<data, which would allow me to hard-code the constant only once, does not work: the descriptor number in a redirection must be a literal digit when the command is parsed, and ${rfd} is expanded too late, so the shell treats the 3 as an argument to exec rather than as a descriptor. IMHO, having the shell automatically assign an available file descriptor value just seems the right thing to do.

# same idea with a hand-picked descriptor number
exec 3<data
while read -u 3 buf
do
    echo "$buf"
done
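If hard-coding the number only once is the goal, an eval works around the parsing-order problem, because by the time the exec line is parsed the descriptor is a literal digit. A sketch (for ksh and bash, where read -u is available):

rfd=3
eval "exec $rfd<data"        # parsed only after $rfd has been expanded
while read -u $rfd buf
do
    echo "$buf"
done
eval "exec $rfd<&-"          # and close it again afterwards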

I have a preference for whatever is simplest and most intuitive to read.
I agree that specifying data at the end of the loop is a bit of an oddball, but:

  • To me it is the simplest and cleanest code: there is no need for a cat-and-pipe or an extra file descriptor
  • As noted above, in shells other than ksh it does not send the loop into a subshell, so not only is it more efficient, it also ensures that variables set inside the loop are available outside it. I tend to go with what works in all shells. The way it is done in ksh is great, but it is not specified in POSIX.
  • Whenever I have a loop with more than 20 lines of code, I tend to start thinking about splitting it into functions with mnemonic names.
  • With regard to redirects, I prefer to use them only in the context in which they are needed. Also, feeding them into the loop at the bottom is ideal, since the file descriptor gets closed when the loop ends. With the exec examples you would need an explicit close afterwards (see the sketch after this list).
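
A sketch of that explicit close, using the auto-assigned descriptor from the ksh93 examples above:

exec {rfd}<data
while read -u $rfd buf
do
    echo "$buf"
done
exec {rfd}<&-        # close by hand; the done {rfd}<data form does this
                     # automatically when the loop ends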

I prefer to redirect at the end of the loop too, because I think it's the most portable construct. If the loop is several pages long, I usually put a comment at the beginning that tells what is being read from.

I much prefer top-down flow in any programming language, and I adopt a modular approach that keeps the main program logic as the simplest possible control flow.

Each time this debate comes up, nobody offers proof that the shell's input redirection is faster than using cat. I can't see why the POSIX folks don't make cat a shell built-in rather than trying to retire the command.

Have you read the "Useful uses of cat" collection on the excellent Mascheck site:
Useful use of cat(1)
That list includes a contribution from a certain Chris F.A. Johnson!
An enhanced version of the "convert file contents into arguments" contribution came up on unix.com yesterday.
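
For those who don't follow the link, a sketch of the general shape of that trick (my own rendering, not necessarily the contribution referred to; filelist is a made-up name):

# hand every word in filelist to grep as an argument; $(<filelist)
# would do the same in ksh/bash, but the cat form is portable
grep -l "pattern" $(cat filelist)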

I too wish you could do <filename while read LINE ... the way you can put a redirection before a simple command, but you can't, and putting the redirection at the end is the most portable.
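
A quick illustration of the asymmetry (the file name data is arbitrary):

<data cat                # fine: a redirection may precede a simple command

# <data while read LINE ; do ... done     # syntax error in most shells

while read LINE ; do
    echo "$LINE"
done <data               # the portable place for the redirection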

If I had a shell loop 3 pages long, I'd try to reduce it with functions.

I do not have an opinion, except that Scrutinizer's test is somewhat flawed, IMO:

#1. The speed difference is because of disk controller caching. Try reversing the order in which the commands are executed.

Or making two identical copies of the file for the test.

#2. ksh93 mmaps files in that syntactic context, which usually means a single full-on 66 MB I/O request that a modern controller can perform in one go. 66 MB may well fit in the disk controller cache. So that is a valid result - faster.

Also, you need to remove the character-special I/O. My take, using one of our Compellent SANs:

# uzpplpl_ng02cprd.log.73 is 160MB 

appworx> time cat uzpplpl_ng02cprd.log.73 > /dev/null

real    0m0.02s
user    0m0.00s
sys     0m0.01s

That measures the overhead required to do a single cat: zilch, IMO. You would also have to show me that creating one extra child process matters for something like this.

OTOH, creating thousands of children by calling cat inside a loop is a very serious issue, which I think is the origin of the entire UUOC thing.

I did execute all tests several times, in different order, on two different platforms; it produced no significant difference (per platform). There was a dramatic difference with the repeated plain cat to /dev/null test due to caching, which of course I had also performed prior to these tests (to check its effect), but this thread was not about that. It was about two different methods of feeding a while read loop; that is what I tested, and those results are reproducible.
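
For the record, the shape of such a test, with made-up file names (copy_a and copy_b are identical copies of the same file, so controller caching cannot favour whichever method happens to run second):

cp /some/big/file copy_a
cp /some/big/file copy_b

time { while read v ; do : ; done < copy_a ; }
time { cat copy_b | while read v ; do : ; done ; }

# repeat with the copies and the order swapped to average out cache effects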