sed internal working

guruprasadpr · February 4, 2011, 11:34am

Hi Experts
Say I have a huge text file and want to add a header line to it. This can be done in several ways: one uses a temporary file, another uses 'sed -i', which edits the file in place. sed is always recommended for better performance. My question is: internally, sed must also be shifting every line down by one and inserting the new line at the top. Am I right? If so, how can sed be faster?

Guru

Perderabo · February 4, 2011, 1:28pm

# ls -li data
3180921 -rw-r--r-- 1 root root 42 Feb  4 09:33 data
# sed -i 's/2/B/' data
# ls -li data
3180920 -rw-r--r-- 1 root root 42 Feb  4 13:24 data
#

The inode changed. It's a different file with the same name. sed copied the data to a new file to make the change. Then it deleted the old file and renamed the new file.
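
A rough sketch of the same sequence done by hand, to make the copy-and-rename visible. The scratch directory, the file names and the sample contents are made up for illustration, and a real sed -i takes care of more details (permissions, temp file naming) than this does.

```shell
# Hypothetical scratch directory and sample file, for illustration only.
workdir=$(mktemp -d)
printf 'line 1\nline 2\n' > "$workdir/data"
before=$(ls -i "$workdir/data")

# Write the edited copy to a new file, then rename it over the original.
sed 's/2/B/' "$workdir/data" > "$workdir/data.tmp" &&
    mv "$workdir/data.tmp" "$workdir/data"

after=$(ls -i "$workdir/data")
echo "before: $before"
echo "after:  $after"
cat "$workdir/data"
```

The cost is the same either way: every byte of the file is read once and written once, so the file size dominates, not the choice between sed -i and an explicit temporary file.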

guruprasadpr · February 4, 2011, 8:01pm

Hi Perderabo
I understand it now. To add a header line to a file, I can write the header to a new file and then append the contents of the old file. Performance-wise, sed should not be much better than this temporary-file method. Then why is sed considered better for performance? Is the gain only in avoiding 2 or 3 commands by using a single sed command?

Guru

ghostdog74 · February 4, 2011, 10:23pm

It depends. If you are comparing sed with the shell for manipulating files, then sed can be the faster of the two because it's designed for that. On the other hand, it also depends on how you use it. For example, chaining several sed commands together can be less efficient than doing everything in a single sed invocation.
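
A small sketch of the chaining point, assuming a made-up input string and two made-up substitutions. Both forms give the same result, but the first starts two sed processes and the second starts one.

```shell
input='foo bar baz'   # hypothetical sample text

# Chained: two sed processes connected by a pipe.
chained=$(printf '%s\n' "$input" | sed 's/foo/FOO/' | sed 's/bar/BAR/')

# Single invocation: both expressions handled by one sed process.
single=$(printf '%s\n' "$input" | sed -e 's/foo/FOO/' -e 's/bar/BAR/')

echo "$chained"
echo "$single"
```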

Perderabo · February 5, 2011, 9:49am

The most important factor in writing fast shell scripts is to minimize the number of external programs invoked. It is a big deal to create a new process, go find the file containing sed or awk, open it, read it in, transfer control to it, wait for it to exit, reclaim the process's resources and deliver the return code to the shell. When I need to manipulate a string I would rather code a dozen internal shell operations than invoke a single copy of sed. This is especially true of string manipulation in a loop.
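
To illustrate the kind of trade-off meant here, a minimal sketch using a made-up path. The first version starts a sed process; the second uses only parameter expansion built into the shell.

```shell
path=/var/log/app/server.log   # hypothetical example value

# External program: a new process is created for sed.
name_sed=$(printf '%s\n' "$path" | sed 's|.*/||')

# Internal shell operations: no new process at all.
name_sh=${path##*/}    # drop the longest prefix matching */
stem=${name_sh%.*}     # drop the shortest suffix matching .*

echo "$name_sed $name_sh $stem"
```

Inside a loop over thousands of strings, the second form avoids thousands of process creations.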

Once a decision has been made to invoke an external program, some are faster than others, but this is a relatively minor consideration. sed can tackle fewer jobs than awk. It's a much smaller program than awk. awk always tries to crack a line into fields whether or not this is useful. So sed can outperform awk on the tasks that sed can easily do. tr is smaller still and can outperform sed on the even smaller set of tasks that tr can handle. But I rarely spend a lot of time worrying about stuff like this. They are all external programs. I try to avoid as many as possible.

Of all of the shell scripts I have posted on this site, the most frequently used is datecalc. And datecalc invokes no external programs at all.
datecalc

guruprasadpr · February 5, 2011, 10:56am

I got your point. Thanks Perderabo.

Guru

fpmurphy · February 5, 2011, 7:14pm

echo "header" > newfile     # start the new file with the header line
cat file >> newfile         # append the original contents after it