Explanation for interesting sed behaviour?

Gavster · September 24, 2009, 5:22am

This is my first post so hi to you all. I have browsed these forums in the past and what a great community and resource this is! Thanks to all the contributors ... I look forward to being able to give something back.

In the meantime, I have a little conundrum concerning sed. My very simple script is as follows ...

for file in `find . -type f -print | xargs file | grep ASCII | cut -d: -f1`
do
cat $file | sed 's/phrase/substitute/' > $file
done

I know this isn't the best way of doing things, and I now use the -i switch with sed to achieve the same end, which works fine.

When this version is run, it appears to randomly leave the odd file empty (i.e. zero bytes in size). I've run it on a directory containing literally only two files ... the first few runs go fine, then suddenly, one of the files becomes zero bytes. The other follows some random (small) number of runs later.

Problem is, I'm being hassled to provide an explanation as to why this happens. My guess is that it's got something to do with the interaction between the tool and the OS (Linux) and the way the files are streamed between cat, sed, and the redirection but I don't have any real evidence to back this up.

I was hoping somebody here would be able to provide a more concrete explanation of why I might be seeing this behaviour.

Many thanks in advance.

Gavin

pludi · September 24, 2009, 5:50am

I think you've got yourself a nice little race condition. In most cases, cat can read the whole file before the redirection on the sed side of the pipe opens the file (and thus truncates it). But every so often, be it because of the file size, scheduling, or cosmic rays, it's not fast enough. Then the file gets truncated before cat has had a chance to read it (in part or fully).
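
The truncation part is easy to see in isolation. The following is only a minimal sketch, assuming a POSIX shell with mktemp available; the scratch file name is made up for illustration. It shows that '> file' empties the file when the redirection is set up, before the command itself has read or written anything:

```shell
#!/bin/sh
# Sketch: '> file' truncates the file when the shell sets up the redirection.
workdir=$(mktemp -d)
demo="$workdir/demo.txt"        # hypothetical scratch file
printf 'a phrase here\n' > "$demo"

# ':' is the shell's no-op command. It writes nothing, yet the file ends up empty.
: > "$demo"
size_after_noop=$(wc -c < "$demo" | tr -d ' ')
echo "size after no-op redirect: $size_after_noop"

# Same effect without a pipe: the file is already empty when sed opens it for reading.
printf 'a phrase here\n' > "$demo"
sed 's/phrase/substitute/' "$demo" > "$demo"
size_after_sed=$(wc -c < "$demo" | tr -d ' ')
echo "size after sed on itself:  $size_after_sed"

rm -rf "$workdir"
```

With the pipe from cat in front, the same truncation still happens, but whether cat gets its read in first is down to timing.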

I'd suggest you rewrite it to this:

mv "${file}" "${file}.TMP"
sed 's/phrase/substitute/' "${file}.TMP" > "${file}"
rm "${file}.TMP"

The advantage over the '-i' switch is that this works with any version of sed; '-i' is an extension that not every sed implementation provides.

cabrao · September 24, 2009, 6:00am

for file in `find . -type f -print | xargs file | grep ASCII | cut -d: -f1`; do
     perl -pi -e 's/phrase/substitute/g' "$file"
done

thuldai2 · September 24, 2009, 6:05am

I like to do something like this

sed 's/phrase/substitute/' "${file}" > "${file}.TMP" && /bin/mv -f "${file}.TMP" "${file}"

The '&&' makes the second part (mv -f) only execute when the first part worked fine, thus preventing you from accidentally overwriting the original file.
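
A small sketch of that safeguard in action, assuming a POSIX shell with mktemp; the scratch file and the deliberately broken sed script are made up for illustration. The second sed call has an unterminated 's' command, so sed exits non-zero and the mv is skipped:

```shell
#!/bin/sh
# Sketch: write to a temp file, and only replace the original if sed succeeded.
workdir=$(mktemp -d)
file="$workdir/sample.txt"      # hypothetical scratch file
printf 'a phrase here\n' > "$file"

# Good run: sed succeeds, so mv replaces the original.
sed 's/phrase/substitute/' "$file" > "$file.TMP" && mv -f "$file.TMP" "$file"
after_good=$(cat "$file")

# Bad run: the 's' command is unterminated, sed fails, and mv never runs.
sed 's/substitute/oops' "$file" > "$file.TMP" 2>/dev/null && mv -f "$file.TMP" "$file"
after_bad=$(cat "$file")
rm -f "$file.TMP"               # the failed run leaves a temp file behind

echo "after good run: $after_good"
echo "after bad run:  $after_bad"
rm -rf "$workdir"
```

Note that a failed run leaves the .TMP file lying around, so it is worth cleaning up.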

Gavster · September 24, 2009, 7:07am

Thanks guys for the quick replies.

In response to pludi's explanation, I have been going through a mental experiment with pencil and paper to figure out the sequence of events with this line of code:

cat $file | sed 's/phrase/substitute/' > $file

I can understand how this might result in truncated files (which I have also seen). Essentially, if the redirection begins writing back to the file before the cat command has finished reading it, I can see how we could lose the end of the file.

But that still doesn't explain (in my mind at least) how I could end up with a file of zero bytes in size. Surely, for this to happen, the redirection would have written (opened) the file before cat had even started reading it?!?! Is this possible?

Unfortunately, my knowledge of process scheduling and file IO in Linux is extremely limited so I'm not entirely sure.

Gavin

pludi · September 24, 2009, 7:34am

I'm no expert, either, by any means, but here's my interpretation of it:

The shell fork()s off a new process, redirects stdout to a pipe, and then exec()s cat.
Meanwhile, since the forked process runs in parallel, a second process is fork()ed off, has stdin redirected to use the same pipe, stdout redirected to a file, and exec()s sed.
If the first exec is delayed for any reason, it's possible that the file redirection/truncation takes place before cat can even start to read the file. When it gets around to reading it, it sees an empty file.
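
Both orderings can be forced with an artificial delay. This is only a sketch, assuming a POSIX shell with mktemp and a file small enough to fit in the pipe buffer; the one-second sleeps and the scratch file name are made up for illustration and stand in for whatever the scheduler happens to do:

```shell
#!/bin/sh
# Sketch: force each side of the race in turn with an artificial delay.
workdir=$(mktemp -d)
file="$workdir/race.txt"        # hypothetical scratch file

# Ordering A: the writer side is delayed, so cat reads the file first.
# The redirection sits inside the braces so the truncation happens after the sleep.
printf 'a phrase here\n' > "$file"
cat "$file" | { sleep 1; sed 's/phrase/substitute/' > "$file"; }
result_reader_first=$(cat "$file")

# Ordering B: the reader side is delayed, so the file is truncated before cat opens it.
printf 'a phrase here\n' > "$file"
( sleep 1; cat "$file" ) | sed 's/phrase/substitute/' > "$file"
result_writer_first=$(cat "$file")

echo "cat first:        '$result_reader_first'"
echo "truncation first: '$result_writer_first'"
rm -rf "$workdir"
```

Ordering A keeps the content and ordering B leaves a zero-byte file, which matches the two outcomes seen with the original one-liner.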

Gavster · September 24, 2009, 10:20am

Many thanks pludi ... that makes more sense now. I had wondered about the parallelism of the statement but wasn't entirely sure how it would be treated.

I think I had assumed that the implied dependency of the output process on the input process would be understood by the scheduler but maybe it isn't that clever.

Gavin