The force flows strong in me, LOL!
Actually you came very close. What you didn't get was the part i left out in my little introduction, so here is part two:
Grouping
To combine several characters or metacharacters into a single expression which you can handle together there is grouping: it works like grouping in mathematical expressions:
(x+y+z) * 3 =
The "* 3" affects everything inside the brackets as a single entity. The same works for regular expressions, except that the brackets are "escaped" (you put a backslash in front of them, otherwise they would be simple characters), and you can do really cool things with it:
/\(aa\)*/
Because the asterisk now addresses what is inside the brackets this matches any even number of a's (zero, two, four, ...), but not an odd number. Try the following file:
xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay
and apply this sed-command to it. Watch the output:
sed -n '/x\(aa\)*y/p' /your/file
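If you do not want to create a file first, the same experiment can be sketched with a pipe. The variable names (input, even_only) are only made up for this illustration:

```shell
# The sample lines from above, held in a variable instead of a file.
input='xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay'

# Print only the lines with an even number of a's between x and y.
even_only=$(printf '%s\n' "$input" | sed -n '/x\(aa\)*y/p')
printf '%s\n' "$even_only"
# xy
# xaay
# xaaaay
# xaaaaaay
```

The "x" and "y" on both sides are what make this work: they force the group to account for every single "a" in between.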
Always keep in mind, btw.: i told you regexps are greedy in nature ("greedy" is really the term for it; the opposite is "non-greedy [matching]"). More often than not, if regexps do not do what you expect them to do, this is the problem: they are matching more than you expect them to match. This means, e.g., that /\(aa\)*/ on its own would also match a line with 3 a's - it would match the 2 a's and just ignore the third one -> false positives, i warned you!
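A quick sketch of that false positive, and one way around it. Anchoring the expression with ^ and $ is just one possible fix, and the variable names are invented for the example. Note that the unanchored form even matches lines without any "a", because zero repetitions count as a match too:

```shell
# Unanchored: the group may match just part of the line (or nothing at all).
unanchored=$(printf 'aaa\n' | sed -n '/\(aa\)*/p')

# Anchored: the pairs of a's must make up the whole line.
anchored=$(printf 'aaa\n' | sed -n '/^\(aa\)*$/p')

printf 'unanchored: [%s]\n' "$unanchored"   # unanchored: [aaa]
printf 'anchored:   [%s]\n' "$anchored"     # anchored:   []
```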
Grouping also has another use: you can use it for so-called backreferences. Backreferences are parts of the matched line which you can use in a substitution command to put the matched part back into the substituted portion.
The most basic backreference is the "&", but let us first examine the "s"-command of sed:
sed 's/<regexp1>/<regexp2>/'
This will scan the text (line by line) and try to match <regexp1>. Whenever it does, it substitutes <regexp2> for it, then the line is shipped to output. (Despite the name i gave it, <regexp2> is not a regular expression but plain replacement text.)
"&" can now be used in <regexp2> to put there everything <regexp1> has matched. Let's try something very simple: the regexp to match everything in a line is /^.*$/. We want to output all the input but put => and <= around every line. Here it is:
sed 's/^.*$/=> & <=/' /some/file
Cool, no?
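To see it without a file, here is a small sketch with two made-up input lines (the variable name wrapped is only for illustration):

```shell
# "&" stands for whatever the regexp matched, here the whole line.
wrapped=$(printf 'first line\nsecond line\n' | sed 's/^.*$/=> & <=/')
printf '%s\n' "$wrapped"
# => first line <=
# => second line <=
```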
Another form of backreference is "\n" where "n" is a single digit: 1, 2, 3, ... up to 9. It signifies the portion of <regexp1> which is surrounded by the first (second, third, ...) pair of brackets. Take the input file from above with the "xa*y"-lines and suppose we want to exchange the first and last characters (and suppose they weren't fixed "x"s and "y"s). Here it is:
sed 's/^\(.\)\(a*\)\(.\)$/\3\2\1/' /path/to/file
We use the grouping here only to fill our various backreferences: first, we split the input into three parts: ^\(.\) (beginning of line, followed by a single character), \(a*\) (any number of a's) and \(.\)$ (again a single character, followed by the line end). In the substitution part we put them together reversed: first the third part, then the second one (the a's), then the former first part.
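A small sketch of the swap in action. The third input line is invented to show that the first and last characters do not have to be "x" and "y", and the variable name is only illustrative:

```shell
# \1 = first character, \2 = the run of a's, \3 = last character.
swapped=$(printf 'xy\nxaay\n1aaa2\n' | sed 's/^\(.\)\(a*\)\(.\)$/\3\2\1/')
printf '%s\n' "$swapped"
# yx
# yaax
# 2aaa1
```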
Most of the original sed-script should be clear by now, but we need to establish a few more things for the last bits:
When you write a substitute-command like the ones above, it is implied that it should be applied to every line. In fact, sed works like this:
- read the first/next line of input and put it into the so-called "pattern space"
- apply the first command of the script to this pattern space, it might change it (or not)
- apply the next command of the script to the changed pattern space, changing it further (or not)
- and so on, until the last command. If sed was started without the "-n" option, the pattern space is now printed to output
- if this was not the last line of input, go to the start again and repeat
- if it was the last line, exit.
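The automatic printing at the end of each cycle can be made visible with a small sketch. It uses the "p" command (the same one as in the sed -n '/.../p' example earlier) on two made-up input lines; the variable names are only for illustration:

```shell
# Without -n: the explicit "p" prints the pattern space, and the
# end-of-cycle autoprint prints it again, so every line shows up twice.
with_autoprint=$(printf 'a\nb\n' | sed 'p')

# With -n: only the explicit "p" produces output.
without_autoprint=$(printf 'a\nb\n' | sed -n 'p')

printf '%s\n' "$with_autoprint"
# a
# a
# b
# b
printf '%s\n' "$without_autoprint"
# a
# b
```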
Ranges
Coming back to the substitute-commands: in their simplest form they are applied to every line. Here is some input file:
old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5
The following will change all the "old" strings to "NEW":
sed 's/old/NEW/' /path/to/file
NEW
= NEW1
== Start ==
= NEW2
NEW3
== End ==
NEW4
= NEW5
But we could limit this command to only take place on lines starting with a "=":
sed '/^=/ s/old/NEW/' /path/to/file
old
= NEW1
== Start ==
= NEW2
old3
== End ==
old4
= NEW5
The first regexp /^=/ works like an "if"-statement: if the line (or something in it) matches the expression, then the substitute-command is applied, otherwise not.
There is also another form, where you can define a range of lines where the following command(s) are applied:
sed '/^== Start.*$/,/^== End.*$/ s/old/NEW/' /path/to/file
old
= old1
== Start ==
= NEW2
NEW3
== End ==
old4
= old5
Instead of regexps you can also use line numbers. This will apply the substitute-command only to lines 1, 2 and 3:
sed '1,3 s/old/NEW/' /path/to/file
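Applied to the input file from above, this can be sketched as follows (the file content is held in a variable here, and the variable names are only for illustration):

```shell
# The sample input from above.
input='old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5'

# Only lines 1 through 3 are candidates for the substitution.
first_three=$(printf '%s\n' "$input" | sed '1,3 s/old/NEW/')
printf '%s\n' "$first_three"
# NEW
# = NEW1
# == Start ==
# = old2
# old3
# == End ==
# old4
# = old5
```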
Was that all? No! One last thing: modifiers. By default a substitute-command only changes the FIRST occurrence of a pattern in a line:
$ echo "old old old" | sed 's/old/NEW/'
NEW old old
If you add a number at the end, only that occurrence of the match (the second, the third, ...) will be changed. If you add a "g" (global), all occurrences will be changed:
$ echo "old old old" | sed 's/old/NEW/'
NEW old old
$ echo "old old old" | sed 's/old/NEW/2'
old NEW old
$ echo "old old old" | sed 's/old/NEW/g'
NEW NEW NEW
Finally, there is one more modifier: "p". It prints the result of the substitution to the output (only if a substitution actually took place). So far we have only had scripts consisting of a single command, so that hasn't affected us, but look above at how sed works: what a command gets is basically what the command before has produced:
echo "white white white" | sed 's/white/blue/g
s/blue/green/g
s/green/red/g'
red red red
The second command would do nothing if it got the input text without the first command having already processed it, and the same goes for the third command. But suppose you want to have the intermediate steps displayed: you can use the p-modifier for that (note that after the last command the printing is implied, so it needs no "p"):
echo "white white white" | sed 's/white/blue/gp
s/blue/green/gp
s/green/red/g'
blue blue blue
green green green
red red red
The p-modifier comes in especially handy when you switch off the automatically implied printing at the end with the "-n" switch for sed: this way you do not need to filter out lines you do not want, you just explicitly print the ones you are interested in - a technique we used to filter out all the uninteresting lines in your text.
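As a small sketch of that technique, using the sample input from the "Ranges" part (the variable names are only for illustration): with "-n" and the p-modifier only the lines that were actually changed make it to the output.

```shell
# The sample input from the "Ranges" part.
input='old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5'

# -n switches off the automatic printing; the "p" modifier prints a line
# only when the substitution on it succeeded.
only_changed=$(printf '%s\n' "$input" | sed -n '/^=/ s/old/NEW/p')
printf '%s\n' "$only_changed"
# = NEW1
# = NEW2
# = NEW5
```

The "== Start ==" and "== End ==" lines also begin with "=", but since they contain no "old" nothing is substituted and so nothing is printed for them.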
OK, was that all? No, not even close! sed is such a mighty tool i still am finding new ways to use it every day.
But - hey, in for a penny, in for a pound - here is a last one: you can use the ranges i talked about above and apply more than one command to them by using curly braces:
sed '/<regex1>/,/<regex2>/ {
s/<regex3>/<regex4>/
s/<regex5>/<regex6>/
s/<regex7>/<regex8>/
}' /path/to/file
Now, the three substitutions will only be applied to a range of lines starting with <regex1> and ending with <regex2>. You can also negate/invert that:
sed '/<regex1>/,/<regex2>/ ! {
s/<regex3>/<regex4>/
s/<regex5>/<regex6>/
s/<regex7>/<regex8>/
}' /path/to/file
This applies the three substitutions to all lines except for a range of lines starting ..... Of course the same goes for the other forms of range specifications i showed you above.
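To make both forms concrete, here is a sketch using the sample input from the "Ranges" part. The two substitutions (and the variable names) are just picked for illustration. The "!" is written directly in front of the brace here, which should behave the same as the spaced form above:

```shell
# The sample input from the "Ranges" part.
input='old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5'

# Both commands run only on lines from "== Start" through "== End".
inside=$(printf '%s\n' "$input" | sed '/^== Start/,/^== End/ {
s/old/NEW/
s/^= //
}')

# Negated: both commands run on every line outside that range.
outside=$(printf '%s\n' "$input" | sed '/^== Start/,/^== End/ !{
s/old/NEW/
s/^= //
}')

printf '%s\n' "$inside"
# old
# = old1
# == Start ==
# NEW2
# NEW3
# == End ==
# old4
# = old5
printf '%s\n' "$outside"
# NEW
# NEW1
# == Start ==
# = old2
# old3
# == End ==
# NEW4
# NEW5
```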
I hope this helps.
bakunin