Grep causing long delay (batching) whilst piping

spudtheimpaler · July 2, 2008, 4:14pm

Hi all.

I have a problem at work which I have managed to break down into a simple test scenario:

I have written a monitoring script that outputs the status of various processes every second, but for now, let's just print the date.

input.sh:

while true
do
  date
  sleep 1
done

This produces results that are fine for logging, but I have also started on a script that takes this input and processes it on the fly. For now, let's use...

output.sh:

while read -r in
do
   # $(date) is the time at which this line was actually processed
   echo "input = $in       output = $(date)"
done

Now, make them executable etc and run them, piping input to output:

No surprises so far.

Here is the but... say I only wanted to work with certain pieces of the input? Let's add an arbitrary grep to do the filtering:

spudtheimpaler@spudslaptop:~$ ./input.sh | grep 2008 | ./output.sh
## long pause ensues
input = Wed Jul 2 21:04:20 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:21 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:22 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:23 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:24 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:25 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:26 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:27 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:28 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:29 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:30 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:31 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:32 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:33 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:34 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:35 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:36 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:37 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:38 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:39 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:40 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:41 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:42 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:43 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:44 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:45 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:46 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:47 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:48 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:49 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:50 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:51 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:52 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:53 BST 2008 output = Wed Jul 2 21:04:55 BST 2008
input = Wed Jul 2 21:04:54 BST 2008 output = Wed Jul 2 21:04:55 BST 2008

So that is my problem. Unsurprisingly, if you just grep the results of input.sh they are displayed every second.

I was wondering if, assuming I've been clear, anyone knew a reason for this and how to avoid it? My output.sh is also supposed to update every second, and currently it displays the updates in batches after long pauses. Is there a better way than "while read" to get the input from a pipe? Maybe that's it? I'm not a complete novice, but this is my first time writing scripts that can be piped to.

I've tried taking the delay out; the pauses are much reduced, but they still exist. I've also tried searching here, other places, and Google, but haven't found anything (though the search terms are pretty ubiquitous).

Thanks for your time.

Regards,
Mitch.

Annihilannic · July 2, 2008, 8:55pm

Hi Mitch,

I had exactly the same problem with a script I wrote to put timestamps on some tail -f output - I kept getting bunches of lines with the same timestamp instead of them being processed in real time.

I ended up replacing grep with a small perl script:

./input.sh | perl -nwe '
        BEGIN {
                # From the open() section on the perlfunc man page.
                # Makes output unbuffered.
                select(STDOUT); $| = 1;
        }
        /2008/ && print;
' | ./output.sh

Hope that helps.

spudtheimpaler · July 3, 2008, 4:24am

Annihilannic,

I appreciate the response, and it's good to know I'm not alone with the problem, but I'm writing this for what could be many boxes, including production servers where perl won't be available. That is of course secondary to the fact that there is already a tool that, as far as I know, should work fine. Apart from your workaround, did you find any information on *why* it might be happening?

Thanks again for your time.

Regards,
Mitch.

ghostdog74 · July 3, 2008, 6:14am

Why do you need to do things in separate scripts? Do your processing in the while loop itself. It's just a program design problem.

spudtheimpaler · July 3, 2008, 8:33am

Because:

the actual 'input' script outputs something like

which is primarily for logging purposes. The 'output' script was going to take the processes as they were output and display running averages. As there are, in reality, many processes this will monitor, and you would only need to actively monitor one at a time, I used a simple grep on it, which is when I noticed the issue.

Can I work around it? Sure. I can log to a file and tail that, for example, but I'm more curious as to why this is happening. This is less of a 'I have a problem and need a workaround' and more of a 'this is unexpected behaviour, does anyone have an explanation?' kind of query.

Cheers,
Mitch.

ghostdog74 · July 3, 2008, 8:43am

What I meant is (if you are able to modify your input script):

while some condition
do
 # somewhere here produces  
 # the input script output
 # at the same time 
 # store running averages. 
done
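
As a rough sketch of that idea: the filtering and the running average can both live in one while/read loop, so no separate grep sits in the pipeline. The real input format isn't shown in this thread, so the "name value" line layout and the running_average function name below are assumptions made up for illustration.

```shell
# Hypothetical input: one "<process-name> <integer-value>" pair per line.
# Prints an integer running average for the one process named in $1.
running_average() {
    target=$1
    count=0
    total=0
    while read -r name value
    do
        case $name in
            "$target")
                # Only lines for the chosen process are counted;
                # everything else is silently skipped (the "grep" step).
                count=$((count + 1))
                total=$((total + value))
                echo "$name avg=$((total / count))"
                ;;
        esac
    done
}
```

It would be used as something like "./input.sh | running_average someprocess", where someprocess is whichever name you want to watch.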

spudtheimpaler · July 3, 2008, 9:44am

ghostdog74:

what i meant is ( if you are able to modify your input script)
while some condition
do
 # somewhere here produces  
 # the input script output
 # at the same time 
 # store running averages. 
done

I see what you mean, and you aren't wrong. Thanks.

To be honest, at this point I am just curious as to why it is behaving this way, rather than how to work around it. It's a learning exercise.

jim_mcnamara · July 3, 2008, 1:29pm

It may be that scheduling has the pipe writer running until the written data gets to PIPE_MAX, or until it finishes writing. Then grep reads the whole pipe contents.

That would account for grep producing a burst of output all with one date.
There is no guarantee that for every pipe write there is one pipe read.

Annihilannic · July 3, 2008, 7:26pm

I didn't, but I strongly suspect it is optimisation within grep (perhaps one of the most optimised pieces of software on the planet, given its long history and prevalence).

Another portable, if clunky and inefficient, workaround would be to use a shell while/read loop and run grep on each individual line:

./input.sh | while read line ; do echo "$line" | grep 2008 ; done | ./output.sh
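
If the grep on the box happens to be GNU grep (an assumption; other greps may not have this), it has a --line-buffered option that flushes output after every matching line. A small self-contained sketch, with a made-up demo function standing in for input.sh and output.sh:

```shell
# Assumes GNU grep. Two lines are written two seconds apart; each is
# stamped with the time (seconds since the epoch) at which the reader
# loop actually received it.
demo() {
    { echo "first 2008"; sleep 2; echo "second 2008"; } |
        grep --line-buffered 2008 |
        while read -r line
        do
            echo "$(date +%s) $line"
        done
}
```

Calling demo should show two different timestamps. Applied to the original pipeline it would be: ./input.sh | grep --line-buffered 2008 | ./output.sh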

spudtheimpaler · July 4, 2008, 8:59am

It seems that many agree it is buffering, either in the pipes or in grep. It also seems to happen only in particular combinations, e.g. if the ./output.sh is omitted, grep doesn't buffer. I suspect the shell is a lot more aware of itself than I had considered, especially in choosing where to buffer.
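
For what it's worth, the "only in particular combinations" pattern fits the usual C stdio behaviour rather than anything the shell decides: a program's stdout is typically line-buffered when it is a terminal and block-buffered when it is a pipe or file. Assuming grep follows that convention, it would explain why the delay vanishes when grep is the last command in the pipeline. A minimal sketch of the same kind of check in shell (the function name is made up for illustration):

```shell
# Reports what file descriptor 1 (stdout) is connected to.
# [ -t 1 ] is true only when stdout is a terminal.
where_is_stdout() {
    if [ -t 1 ]
    then
        echo "terminal"
    else
        echo "not a terminal"
    fi
}
```

Running where_is_stdout on its own at a prompt and then as "where_is_stdout | cat" shows the two cases a program can distinguish.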

Thank you all for your help. I appreciate the time and have learnt something.

Cheers again!

Mitch.