Good sed Book?

MIA651 · November 12, 2012, 4:12pm

I am beginning to write many Korn shell scripts these days, and was wondering which book is good as far as sed goes. I know there is a book on both sed and awk from O'Reilly, but was wondering if there is a decent book on sed alone.

I have this for awk, which has been around for a while but still seems to be the favorite amongst many developers:

The AWK Programming Language: Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger: 9780201079814: Amazon.com: Books

Any advice will be greatly appreciated!

alister · November 12, 2012, 4:27pm

Standard sed doesn't support that many functions and all of its functionality is covered by a reasonably brief document: sed @ opengroup

If you feel that you need more examples, you can grep through your system's start-up scripts, search the web for sed one-liners and longer scripts, and peruse this very forum.

While I did nothing more than skim them briefly, these two links seem like a good source of examples:

http://sed.sourceforge.net/grabbag/scripts/

Regards,
Alister

DGPickett · November 12, 2012, 4:55pm

Well, sed is so simple, I taught myself. I find there are two flavors of script, loopers and filters. Loopers read most or all lines but the first using N, so they can see multiple lines in the buffer, and loop using branch features. Filters just do things on each line, as one mainly thinks of sed doing.
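To make the two flavors concrete, here is a minimal sketch of one filter and one looper. The function names strip_trailing and join_lines are made up for illustration and are not from any standard tool.

```shell
# Filter: each line is handled on its own (here, strip trailing white space).
strip_trailing() {
  sed 's/[[:space:]]*$//'
}

# Looper: N pulls every following line into the pattern space,
# so the final s can see (and replace) the embedded newlines.
join_lines() {
  sed '
    :loop
    $!{
      N
      b loop
    }
    s/\n/,/g
  '
}

printf 'a  \nb\nc\n' | strip_trailing | join_lines
```

The looper holds the whole input in the pattern space, so this particular one only suits inputs that fit comfortably in memory.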

Awk has variables, arrays, hash tables, and a sense of fields and records divorced from the line feed, all of which sed lacks, but it is complex enough that one might just as well learn Perl, which is more capable and orthogonal. Adding to the pitfalls of awk, there was an old awk and a new awk (sometimes called nawk), so examples may vary in dialect.

Most of sed is also usable in grep, vi, ex, and on the command line of ksh and bash when in 'set -o vi' mode, my favorite. I started on regex line editors with qedx in Multics, then used them in GCOS with University of Waterloo FRED the friendly editor, before I arrived in UNIX/vi land.

Regex are extended in several flavors, so what works in sed is more than basic grep but less than egrep/"grep -E", and that is different again from awk.
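As a rough illustration of the flavor difference, the sketch below writes a group in basic syntax for sed and in extended syntax for "grep -E". The function names swap_bre and count_ere are hypothetical.

```shell
# BRE (plain sed and grep): grouping needs backslashes, and there is no +,
# so "one or more" is spelled as x followed by x*.
swap_bre() {
  sed 's/\([a-z][a-z]*\) \([a-z][a-z]*\)/\2 \1/'
}

# ERE (grep -E): groups are written bare and + is available.
count_ere() {
  grep -E -c '^(ab)+$'
}

printf 'hello world\n' | swap_bre
printf 'abab\nabc\nab\n' | count_ere
```

The first call swaps the two words; the second counts the lines made up only of repeated "ab".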

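You can make executable pure sed or awk files if the first line is "#!/bin/sed -f" (or the awk equivalent). I usually write sed right on the command line, barefoot inside single quotes, like this remover of extra blank (all white space) lines, a classic looper (\t is a real tab, \f a real form feed). The 't clear' and ':clear' pair does nothing but reset the substitution flag, which the white space s always sets because its pattern can match the empty string:

sed '
  s/[ \t\f]*$//
  :loop
  $b
  N
  s/[ \t\f]*$//
  t clear
  :clear
  s/^\n$//
  t loop
  P
  s/.*\n//
  t loop
 '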
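For the executable-file form, a minimal sketch follows. It assumes input whose white-space-only lines are already empty, and it looks up the sed path with "command -v" rather than trusting /bin/sed, since the location varies between systems. The function name make_squeeze_script and the temporary file are hypothetical.

```shell
# Create an executable sed script at the path given as $1.
# The interpreter path is looked up instead of assuming /bin/sed.
make_squeeze_script() {
  {
    printf '#!%s -f\n' "$(command -v sed)"
    cat <<'EOF'
:loop
$b
N
s/^\n$//
t loop
P
s/.*\n//
t loop
EOF
  } >"$1" && chmod +x "$1"
}

script=$(mktemp) && make_squeeze_script "$script"
printf 'a\n\n\n\nb\n' | "$script"
rm -f "$script"
```

sed reads the "#!" line as an ordinary comment, so the same file also works as "sed -f file".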

The '$b' ensures that N is not run at EOF, which loses a line on some early, defective sed implementations. Early proprietary sed had a fixed buffer but was faster than GNU and later seds with realloc()-able indirect buffers. I often kept both around for script compatibility, renaming GNU sed 'gsed'. In one app, I turned whole pages into lines using 'tr' to swap line feeds and form feeds, so I could mark up the pages in sed to insert them into the database as big strings. I am not shy about running multiple seds in a row, as looper sed and filter sed are not friends, and I get some multiprocessing in the deal:

sed '
  s/[ \t\f]*$//
 ' | sed '
  :loop
  $b
  N
  s/^\n$//
  t loop
  P
  s/.*\n//
  t loop
 '
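The 'tr' page trick mentioned above might look something like this. It is only a sketch: it assumes every page, including the last, ends with a form feed, and the <page> markup and the name mark_pages are invented for the example.

```shell
# Swap LF and FF so each form-feed-terminated page becomes one sed "line",
# mark it up, then swap the two characters back.
mark_pages() {
  tr '\n\f' '\f\n' |
    sed '
      s/^/<page>/
      s/$/<\/page>/
    ' |
    tr '\f\n' '\n\f'
}

printf 'a\nb\n\fc\nd\n\f' | mark_pages | od -c
```

Between the two tr calls, the line feeds inside a page travel as form feeds, so sed sees one page per line and never has to loop.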

alister · November 12, 2012, 5:40pm

N not printing the pattern space when EOF is encountered is not a sign of a defective implementation. It is a sign of POSIX compliance.

To quote POSIX:

That is how nearly all sed implementations work. GNU sed is the defective exception (with regard to historical practice and the standard), not the rule.
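A small sketch of the usual way to sidestep the difference: guard N so it never runs on the last line, which is what '$b' does in the looper above. The function name pair_lines is hypothetical.

```shell
# Join lines in pairs. "$!N" runs N on every line except the last, so an
# odd final line is printed the same way whichever N-at-EOF rule applies.
pair_lines() {
  sed '
    $!N
    s/\n/ /
  '
}

printf '1\n2\n3\n' | pair_lines
```

With the guard in place the script never reaches the case where the implementations disagree.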

Years ago, intending to report this "bug", I found that apparently enough people had reported it that the Free Software Foundation felt it necessary to address the matter in their sed bug reporting page, Reporting Bugs - sed, a stream editor:

Regards,
Alister

DGPickett · November 15, 2012, 3:28pm

So, it is an area where behavior is not trustworthy, and thus I never go there! Files with no line feed right before EOF tend to have that last "line" ignored by sed. Maybe that's POSIX, too. I think EOF, new line and form feed should all be treated as end of line, but it is a bit late, never mind those Mac people with just carriage return and the DOS people with both. Both made sense for teletype: the carriage took more time to return 80 columns than the platen to rise one line, so it was sent on its way first.

Since sed is pretty easy about white space, I put sed on different lines than shell, indented meaningfully, and so have never needed one or more -e options! You, too, are worthy of well formatted code, reducing your errors, potential confusion and that of future maintainers.
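For comparison, here is the same two-command script written both ways. This is only a sketch, and the names with_e and multi_line are made up.

```shell
# One line, one -e per command.
with_e() {
  sed -e 's/foo/bar/' -e 's/[[:space:]]*$//'
}

# The same script spread over indented lines inside one quoted argument.
multi_line() {
  sed '
    s/foo/bar/
    s/[[:space:]]*$//
  '
}

printf 'foo  \n' | with_e
printf 'foo  \n' | multi_line
```

Both calls print the same thing; the second form simply leaves room to grow.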

I have never used 'G' and 'H' or space exchange but g and h are nice for parsing situations where something is missing, so you want to annotate the original line with an error prefix and write it to a reject log. You h it on the way in, in case of rejections, and upon rejection, g it before annotation on the way out. Similarly, usually I do not use 'D', but 's/.*\n//' so the second line is not released.
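A minimal sketch of that h/g reject pattern, under made-up assumptions: records look like "number,name", spaces are noise, and the reject log path is passed as the first argument. The function name parse_records is hypothetical.

```shell
# $1 = path of the reject log
parse_records() {
  sed -n '
    h
    s/ //g
    /^[0-9][0-9]*,[a-z][a-z]*$/{
      s/,/ /
      p
      b
    }
    g
    s/^/REJECT: /
    w '"$1"'
  '
}

rejects=$(mktemp)
printf ' 12 , bob\nbad line\n' | parse_records "$rejects"
cat "$rejects"
rm -f "$rejects"
```

The h keeps the untouched line in the hold space, so the rejected record is logged exactly as it arrived rather than in its half-parsed state.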

The 't' is a great time saver, as one s can both recognize (as a '/.../' address would) and modify what was there with a single regex search. Just make sure, especially in a looper, that the flag gets cleared before reuse, as it reflects every s since the last t or automatic read.
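As a small illustration, the s below both tests for a prefix and removes it, and t branches on the result. The "ERROR: " prefix, the labels and the name classify are invented for the sketch.

```shell
classify() {
  sed '
    s/^ERROR: //
    t failed
    s/^/ok: /
    b
    :failed
    s/^/failed: /
  '
}

printf 'ERROR: disk\nfine\n' | classify
```

Without t, the same job needs a '/^ERROR: /' address plus a separate s, which searches the line twice.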

I have been warming up to the -n and 's/.../.../p' lately, as it fits many situations (frequency not variety), but initially I ignored them as I was interested in the most versatile tactics.
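A typical use, sketched with a made-up "user=" input format and a hypothetical function name extract_users:

```shell
# -n suppresses the automatic print; the p flag prints only the lines
# where the substitution actually happened.
extract_users() {
  sed -n 's/^user=\(.*\)$/\1/p'
}

printf 'user=mia\nhost=box\nuser=al\n' | extract_users
```

This does the selecting and the rewriting in one command, where grep followed by sed would take two.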

I would note that many sed flavors do not tolerate comments (# whatever), which is a shame. Inline documentation can help maintenance. In C, C++, Java, shell and SQL I like the switch/case/when/then/else, as each case can be commented neatly!

In data warehouses and similar places with crushingly big data sets, sed's lack of temp files and near-C speed are very well respected. It has a very important role to play in a pipe-oriented shell programming paradigm, where there are no intermediate or temp files, or any temp files are managed by the tool like sort. This results in lower latency and pipeline parallel multiprocessing, as many steps run concurrently.

Using literal '|' and named pipes (/sbin/mknod p p -- one of those p's is a file name), especially the self-managed named pipes '<(...)' and '>(...)' in bash and luckier systems' ksh, you can build a tree of pipelines working one or many inputs to produce one or many outputs. (On unlucky systems, bash makes named pipes somewhere under /var/tmp that accumulate, a bug I reported.) Unfortunately for sed, the self-managed named pipes '<(...)' and '>(...)' are parsed as words in ksh (according to David Korn) and probably bash; they have virtual spaces around them that you cannot erase without passing them through a shell function call or the like. Life is sometimes excessively complicated! In the following example, the first '>(...)' after a 'w' command in $1 does not work: it might resolve to, essentially, ' /dev/fd/3 ' (the writable fd number from a pipe() call), so the pipe's file name '/dev/fd/3' becomes $2, an unrecognized sed command line argument, the next part of the sed script is $3, and the next named pipe, perhaps '/dev/fd/5', is $4:

$ sed '
  /xyz/w '>( sort -u >file_1 )'
  /abc/w '>( sort -u >file_2 )'
 .... '
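One way around the word-splitting question is to fall back on explicit named pipes. The sketch below swaps mkfifo in for '>(...)'; the function name split_sorted and the directory layout are made up, and it assumes the temporary directory path contains no white space.

```shell
# $1 = input file, $2 = directory that receives file_1 and file_2
split_sorted() {
  fifos=$(mktemp -d) || return 1
  mkfifo "$fifos/xyz" "$fifos/abc" || return 1
  # Start the readers first; each open blocks until sed opens its end.
  sort -u <"$fifos/xyz" >"$2/file_1" &
  sort -u <"$fifos/abc" >"$2/file_2" &
  sed -n "
    /xyz/w $fifos/xyz
    /abc/w $fifos/abc
  " "$1"
  wait
  rm -rf "$fifos"
}

work=$(mktemp -d)
printf 'xyz 2\nabc 1\nxyz 1\nxyz 2\nother\n' >"$work/input"
split_sorted "$work/input" "$work"
cat "$work/file_1" "$work/file_2"
rm -rf "$work"
```

The pipe names sit inside the sed script itself, so the shell never gets a chance to split them into separate arguments.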

alister · November 15, 2012, 5:20pm

If a non-empty character sequence doesn't end with a newline, POSIX sed considers it invalid input. The result of such a scenario is undefined and implementation dependent.

POSIX Definitions:

POSIX sed man page:

Regards,
Alister

DGPickett · November 16, 2012, 11:33am

Yes, it is just a bit of a contrast with vi, which generally shows that last line and just warns and adds the missing line feed on the save.

$ echo '1
2\c'>no-final-lf
$ vi no-final-lf
1
2
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
"no-final-lf" [Incomplete last line] 2 lines, 3 characters
:wq
1
2
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
"no-final-lf" 2 lines, 4 characters 
$ od -bc no-final-lf       
0000000   1  \n   2  \n
        061 012 062 012
0000004
$
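The same repair can be done without opening an editor. This is a sketch only; the function name ensure_final_newline is hypothetical.

```shell
# $1 = file to repair in place
ensure_final_newline() {
  # $(...) strips a trailing newline, so the result is non-empty only
  # when the last byte exists and is something other than a newline.
  if [ -n "$(tail -c 1 "$1")" ]; then
    printf '\n' >>"$1"
  fi
}

f=$(mktemp)
printf '1\n2' >"$f"
ensure_final_newline "$f"
od -c "$f"
rm -f "$f"
```

Running it over a file before handing the file to sed avoids the incomplete last line case altogether.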

hanson44 · March 6, 2013, 3:47am

You might try the sed book "Definitive Guide to sed": sed book - Definitive Guide to sed