Make script faster

Hi all,

In bash scripting, I usually read files like this:

cat $file | while read line; do
...
done

However, it's a very slow way to read a file line by line.
For example, take a file that has 3 columns and fewer than 400 rows.

I run the following script:

cat $file | while read line; do  ## Reads each line
   grup=`echo "$line" | cut -d " " -f3`  ## Takes the third column
   if [ "$grup" == "27" ]; then  ## If column 3 == "27", print column 2
      exp=`echo "$line" | cut -d " " -f2`
      echo $exp
   fi
done

Using "time" command it lasts:

It's a huge waste of time to read only less than 400 rows. Is there any way to make it faster?
Occasionally I used awk to process a file line by line, and it is much faster. Why? Any hint to read a file in bash?

Thanks a lot

Albert.

Try this:

awk '$3==27{print $2}' file
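For instance, with a made-up space-separated file of the shape you described (three columns, third column sometimes 27):

$ cat file
a1 exp1 27
a2 exp2 13
a3 exp3 27
$ awk '$3==27{print $2}' file
exp1
exp3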

@albertGM how do you check the run time consumed by a script?

Use the time command:

$ time echo "hello"
hello
real    0m0.000s
user    0m0.000s
sys     0m0.000s

Suppose there is a script called myscript.sh. To find its execution time, do I just need to run it like this?

 
$ time ./myscript.sh

Also, of real, user and sys, which one is the actual time taken?

You spend a lot of time looping round and calling cut again and again. Each time, you start a new process, so the system spends effort there. The awk answer is probably the way to go if you are comfortable with it; however, you can also simplify your script by using the read statement better:-

cat $file | while read first second third rest; do  ## Reads each line into separate variables
   if [ "$third" == "27" ]; then  ## If column 3 == "27", print column 2
      echo $second
   fi
done

I did an "Ask Jeeves" search with +bash +read specified and got quite a few examples.

As for the time command, have a read of the man page. The main figure, though, is real, as this is the elapsed wall-clock time you will actually experience.
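For example (the figures here are only illustrative, not measured):

$ time ./myscript.sh

real    0m0.153s    <- elapsed wall-clock time: the one you actually wait for
user    0m0.090s    <- CPU time spent in user mode
sys     0m0.040s    <- CPU time spent in the kernel on the script's behalf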

I hope that this helps.

Robin
Liverpool/Blackburn
UK


That is a Useless Use of Cat; this suffices:

while read first second third rest; do  ## Reads each line into separate variables
   if [ "$third" == "27" ]; then  ## If column 3 == "27", print column 2
      echo $second
   fi
done < $file
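And if you want to be a little more defensive, the same loop with -r (so read does not mangle backslashes), = (the portable test operator) and a quoted file name would be:

while read -r first second third rest; do
   if [ "$third" = "27" ]; then
      echo "$second"
   fi
done < "$file"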

Hi,

vivek d r, I actually didn't make a script. I just ran the whole loop from the command line, preceded by time.

rbatte1, do I understand correctly that you recommend I learn awk for scripting?
I ran the command Franklin52 posted (using awk), and the results were awesome.

And using redirection (while ... done < $file), the results were also very good.

Why is there so much difference in performance between using redirection and using pipes as I did? Could it be because with redirection the whole script runs in one shell, while with pipes (cat $file | while ...) several shells are used?

And even more: why does awk (which is an external program) perform better than bash built-in commands?

Thank you very much for all the answers; they helped me a lot. And sorry for my English :b:

Albert.

The main reason is that your original had the following logic:-

  1. Start a process to read a line from the input
  2. Start a process to perform the cut *1
  3. Do a compare looking for value 27
  4. If we match, start a process for another cut *2
  5. Display the result
  6. Start from the top to read the next line

For a 400-line file, you are forcing 400 cut processes to be run for *1 and another set for the cut in *2.
Depending on your shell, you might also start 400 echo processes for *1, and more for *2 for each line matching value 27.

All of this generates vast amounts of work just in the overheads. I'm not very good with awk myself, but it all runs in a single process, so it is excellent if you can invest the time to get into the syntax. My variation removed many of these processes, but it could probably still be improved. Every process launch requires memory to be allocated, perhaps logs to be written, paging/swap space to be adjusted, etc., so before it actually does anything there is a significant processing overhead - and then there may be end-of-process overheads too.
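If you want to see the pipe's extra subshell for yourself, a small test makes it visible (data.txt is just a placeholder name):

count=0
cat data.txt | while read line; do count=$((count+1)); done
echo "after pipe: $count"       ## prints 0 in bash - the loop ran in a subshell
count=0
while read line; do count=$((count+1)); done < data.txt
echo "after redirect: $count"   ## prints the real line count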

The use of the cat at the front makes it more readable for some, although I'm sure purists may not agree. I suppose it depends how you describe your logic in your mind before writing code. I just tried to follow your logic with a few tweaks so it doesn't become too different and need documentation or lots of work on your part to decipher, but it's the difference between thinking:-

  1. Working on this file, I will do these things to it, versus
  2. Do these things on this input file

I hope that this clarifies and helps,

Robin
Liverpool/Blackburn
UK

I hope this will help:

Ksh built-in functions
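To illustrate the built-ins idea: the cut calls can be replaced entirely with shell parameter expansion, which stays in one process. A minimal sketch, assuming space-separated fields as in the original file:

while read -r line; do
   rest=${line#* }              ## drop field 1
   second=${rest%% *}           ## field 2
   third=${rest#* }
   third=${third%% *}           ## field 3 (unchanged if there are only 3 fields)
   [ "$third" = "27" ] && echo "$second"
done < "$file"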

Yes, indeed. Both really help!
Thanks.

By the way, I forgot to mention that I was running those commands and scripts under Cygwin, although that probably doesn't make any difference to anything you told me.

Albert.