On how to select the right tool for a given task

ghostdog74 · October 20, 2008, 7:01am

Like I said, if time is a concern, I won't write shell scripts. Here's an awk version (which I hope is correct), if you consider awk an "external tool" as well.

BEGIN{
  i=1
  while ( i <= 1000 ){
    var="abcde"
    j=1
    while ( j <= length(var) ){
        if ( substr(var,j,1) == "a" ){
            printf "%s", "m"
        }else if (substr(var,j,1) == "b"){
            printf "%s", "n" 
        }else if(substr(var,j,1) == "c") {
            printf "%s", "o" 
        }else if(substr(var,j,1) == "d") {
            printf "%s", "p" 
        }else if(substr(var,j,1) == "e") {
            printf "%s", "q" 
        }else if(substr(var,j,1) == "f") {
            printf "%s", "r" 
        }else if(substr(var,j,1) == "g") {
            printf "%s", "s" 
        }else if(substr(var,j,1) == "h") {
            printf "%s", "t" 
        }else if(substr(var,j,1) == "i") {
            printf "%s", "u" 
        }else if(substr(var,j,1) == "j") {
            printf "%s", "v" 
        }else if(substr(var,j,1) == "k") {
            printf "%s", "w" 
        }else if(substr(var,j,1) == "l") {
            printf "%s", "x" 
        }else if(substr(var,j,1) == "m") {
            printf "%s", "y" 
        }else if(substr(var,j,1) == "n") {
            printf "%s", "z" 
        }else if(substr(var,j,1) == "o") {
            printf "%s", "a" 
        }else if(substr(var,j,1) == "p") {
            printf "%s", "b" 
        }else if(substr(var,j,1) == "q") {
            printf "%s", "c" 
        }else if(substr(var,j,1) == "r") {
            printf "%s", "d" 
        }else if(substr(var,j,1) == "s") {
            printf "%s", "e" 
        }else if( substr(var,j,1) == "t") {
            printf "%s", "f" 
        }else if(substr(var,j,1) == "u") {
            printf "%s", "g" 
        }else if(substr(var,j,1) == "v") {
            printf "%s", "h" 
        }else if(substr(var,j,1) == "w") {
            printf "%s", "i" 
        }else if(substr(var,j,1) == "x") {
            printf "%s", "j" 
        }else if(substr(var,j,1) == "y") {
            printf "%s", "k" 
        }else if(substr(var,j,1) == "z") {
            printf "%s", "l" 
        }
        j++
    }
    print ""
    i++
  }

}

output:

# time awk -f testawk > /dev/null

real    0m0.022s
user    0m0.020s
sys     0m0.000s

I don't have ksh, so I used bash:

iCnt=1
while [ $iCnt -le 1000 ] ; do
     var="abcde"
     j=0
     while [ $j -lt ${#var} ] ; do
          case ${var:$j:1} in
               a) printf "%s" "m" ;;
               b) printf "%s" "n" ;;
               c) printf "%s" "o" ;;
               d) printf "%s" "p" ;;
               e) printf "%s" "q" ;;
               f) printf "%s" "r" ;;
               g) printf "%s" "s" ;;
               h) printf "%s" "t" ;;
               i) printf "%s" "u" ;;
               j) printf "%s" "v" ;;
               k) printf "%s" "w" ;;
               l) printf "%s" "x" ;;
               m) printf "%s" "y" ;;
               n) printf "%s" "z" ;;
               o) printf "%s" "a" ;;
               p) printf "%s" "b" ;;
               q) printf "%s" "c" ;;
               r) printf "%s" "d" ;;
               s) printf "%s" "e" ;;
               t) printf "%s" "f" ;;
               u) printf "%s" "g" ;;
               v) printf "%s" "h" ;;
               w) printf "%s" "i" ;;
               x) printf "%s" "j" ;;
               y) printf "%s" "k" ;;
               z) printf "%s" "l" ;;
          esac
          (( j += 1 ))
     done
     echo
     (( iCnt += 1 ))
done

I think it resembles the ksh version so here's the test

# time ./test.sh >  testshell.txt

real    0m1.041s
user    0m0.956s
sys     0m0.080s

# time awk -f testawk > testawk.txt

real    0m0.022s
user    0m0.020s
sys     0m0.000s

# diff testawk.txt testshell.txt
#

I may be wrong
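
As a cross-check on the two scripts above: both map each lowercase letter 12 places forward in the alphabet (a becomes m, o wraps around to a). A minimal sketch of the same mapping with tr follows. The function name shift12 is made up for illustration and is not from the thread, and LC_ALL=C is set on the assumption that plain ASCII ranges are wanted.

```shell
# Hypothetical helper (not from the thread): shift each lowercase letter
# forward by 12 places, the same mapping as the if/else and case chains.
shift12() {
    # LC_ALL=C keeps the a-z ranges plain ASCII regardless of locale.
    printf '%s\n' "$1" | LC_ALL=C tr 'a-z' 'm-za-l'
}

result=$(shift12 "abcde")
printf '%s\n' "$result"
```

Note that calling tr once per string costs a fork and an exec each time, so for the 1000-line test it would make more sense to pipe the whole input through a single tr than to call shift12 in a loop.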

bakunin · October 20, 2008, 7:37am

Yes and no. Shell language has something compiled code might lack: architecture independence. A shell script (well: most shell scripts, there might be exceptions) won't care about being run in 32-bit or 64-bit environments, about being run on certain OS versions (try that with a program relying on some shared libraries), etc.

Of course you are not wrong. What was discussed was mixing external programs and shell constructs. Of course you can solve this example completely in awk, but whether you write in awk and call shell functions via system(), or write in shell and call awk as an external program, you pay dearly for it because of the fork() and exec() calls that are necessary.

As I understood cfajohnson and Perderabo, this was their concern, not which tool is faster if you never leave it.

bakunin

cfajohnson · October 20, 2008, 3:26pm

If real time is a concern, using shell constructs, instead of external commands, can make a great deal of difference.
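
A rough sketch of what that difference looks like in practice, assuming a POSIX shell: both functions below produce the same output, but the first starts a new basename process for every path while the second stays inside the shell. The paths and function names are invented for illustration.

```shell
# Hypothetical paths, used only for illustration.
paths="/usr/local/bin/tool /tmp/a/b.txt /var/log/messages"

# Variant 1: external command; each iteration forks and execs basename.
with_external() {
    for p in $paths; do
        basename "$p"
    done
}

# Variant 2: shell construct; the expansion runs inside the shell process.
with_builtin() {
    for p in $paths; do
        printf '%s\n' "${p##*/}"
    done
}

external_out=$(with_external)
builtin_out=$(with_builtin)
```

Wrapping each call in a loop of a few thousand iterations and running it under time should show the gap. The actual numbers depend on the system, so none are given here.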

ghostdog74 · October 20, 2008, 10:31pm

An example to keep things in perspective:

If you need to parse a huge file and you need the results fast, would you use awk or the shell's construct to do it?

cfajohnson · October 21, 2008, 2:54am

To parse a large file, I would generally use awk (or occasionally sed).

If I needed to do operations on every line that could not be done with awk or sed internals, and required calling other commands, I would probably use the shell.

neked · October 21, 2008, 3:24pm

My personal perspective:

1) I am not very good at memorizing stuff, nor do I think the human brain is optimized for that task. I'm much better at making inferences and deductions instead.
2) I know what parameter substitutions are -- in general. But I can't bring myself to remember the difference between

${parameter##pattern}
and
${parameter%%pattern}

Every time I see one of those, I have to open a terminal and make a small test to figure out which is which. And that's only one example; there are many other similar-looking constructs that do not give the slightest hint as to what they actually do. My memory is weak, and when I can't use my deductive and inferential powers, I end up wasting time figuring out what they do. A classic example: if you only knew minimal bash, how much time would you need to understand what this does:

path="/home/neked/testfile"
s=${path##*/}

versus:

path="/home/neked/testfile"
s=$(basename "$path")

The second version provides some semantics from which you can infer what the code does. The first one relies on your memory. To sum up my point briefly: parameter substitutions are NOT easily readable.

3) Even if you were a parameter substitution guru, and you used meaningful variable names and comments to make clearer what your parameter substitution tricks do, future maintainers of the code might not be gurus. This bit me a couple of days ago: I had to spend 20 minutes debugging a bash script riddled with those parameter substitution tricks. I estimate I would've spent closer to 5 minutes if the code had been written using more obvious external tools (sed, basename, awk). That is about 15 minutes of human time wasted in order to save at most a few milliseconds of CPU time, especially since the whole script runs only once a night and does not exceed 0.030 seconds of runtime on my modest 7-year-old computer.

4) The conclusion for me is that parameter substitution should be used only when the need arises. Anything else is premature optimization at the cost of more developer hours spent debugging and maintaining the code. If I have a script that takes 4 seconds to execute, which could be optimized into running in less than a second using parameter substitution, that still would not -- on its own -- make a convincing case for using parameter substitution. For a convincing case to be made, the need to reclaim the additional seconds of CPU time must be established and weighed against the loss in human seconds needed to maintain and develop the code.
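
For reference, the constructs from point 2 can be laid side by side. This is a small sketch assuming a POSIX shell; the sample path is hypothetical. The rule is that # trims from the front of the value and % trims from the back, a single symbol removes the shortest match, and a doubled symbol removes the longest.

```shell
path="/home/neked/archive.tar.gz"   # hypothetical sample value

# '#' removes from the front, '%' removes from the back.
# One symbol removes the shortest match, doubled removes the longest.
front_short=${path#*/}     # home/neked/archive.tar.gz
front_long=${path##*/}     # archive.tar.gz
back_short=${path%.*}      # /home/neked/archive.tar
back_long=${path%%.*}      # /home/neked/archive

printf '%s\n' "$front_short" "$front_long" "$back_short" "$back_long"
```

One mnemonic that may help on a US keyboard layout: # sits to the left of $ and % sits to the right, matching the end of the string each one trims.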

neked · October 21, 2008, 3:42pm

I've just replaced all occurrences of parameter substitution with equivalent sed and basename commands in the script I refer to above; the execution time went up from an average of 0.030 seconds to 0.061 seconds. That means parameter substitution shaved an average of 0.03 seconds per run. Considering that I had to spend roughly 15 more minutes understanding the code to debug it, the code has to run 15*60/0.03 times to pay off the extra time I invested in debugging it. That's 30,000 times. Considering that it runs once a night, this means we'll have to wait more than 80 years!
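
The break-even arithmetic above can be checked with plain shell integer math. This is a sketch that works in milliseconds to avoid floating point; the variable names are made up for illustration, and the 0.03-second saving is rounded to 30 ms as in the post.

```shell
debug_cost_ms=$(( 15 * 60 * 1000 ))   # 15 minutes of human time, in ms
saved_ms_per_run=30                   # about 0.03 s saved per run
runs_to_break_even=$(( debug_cost_ms / saved_ms_per_run ))
years=$(( runs_to_break_even / 365 )) # one run per night, integer division

printf 'runs: %d, years: %d\n' "$runs_to_break_even" "$years"
# prints "runs: 30000, years: 82"
```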

Even after 80 years, those 0.03 seconds shaved off CPU time per night are not worth as much to me as the 15 minutes of my time, which I could've spent [alternative activity redacted]

neked · October 21, 2008, 6:10pm

Ironically, this post sounds more interesting after the censorship.

ghostdog74 · October 21, 2008, 8:14pm

You shouldn't use shell then. Try Python.

Perderabo · October 21, 2008, 10:42pm

This thread is moving pretty far away from the forum rules, so I'll close it.