Of course, but we would really appreciate it if you could obey the forum rules and post code (any code, data and output) in CODE-tags or, if it appears in running text, like commands, in ICODE-tags. For instance, write the command ls --full-time -rt | sed 's/\(.*\)\(\..*\+.*\) *\(.*\)$/\1 \3/g' like this.
Back to your question:
sed 's/\(.*\)\(\..*\+.*\) *\(.*\)$/\1 \3/g'
First, the basic command:
sed 's/<something>/\1 \3/g'
We replace something (in fact every instance of something, because of the "g" at the end) by \1 \3. \1 and \3 are so-called "back-references". They work like variables: you search for something in the search part (the "<something>") and whatever you have found is put into the variable. The first and the third such found things are put into the result, effectively deleting the second.
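To see back-references at work in isolation, here is a minimal sketch. The input line and the variable names are made up for illustration; the pattern is a simplified stand-in, not the one from the question.

```shell
# Three groups separated by single spaces; each \( ... \) fills one back-reference.
line='one two three'

# Keep only the first and the third group; the second one is dropped.
result=$(printf '%s\n' "$line" | sed 's/\(.*\) \(.*\) \(.*\)/\1 \3/')

printf '%s\n' "$result"   # prints: one three
```

The second group ("two") is matched but never mentioned in the replacement, which is why it disappears.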
Now, let's have a look at the "something" which the input line is broken up into:
\(.*\)\(\..*\+.*\) *\(.*\)$
Whatever is between "\(" and "\)" is put into such a back-reference, hence we see three such pairs and a few characters in between:
\(.*\)
\(\..*\+.*\)
*
\(.*\)
$
Let us first deal with the things outside the bracket pairs. The " *" part is meant as a space followed by zero or more further spaces. The asterisk means "zero or more of the character before" (strictly "of the regex before", but in this case the regex is only a single character). Hence "one or more of this character" is expressed by first writing the character once and then the same character with the asterisk (with only one space in front of the asterisk, the pattern would also match no space at all):
x* # zero or more x'es, hence even no x at all
xx* # one or more x'es, hence at least one x
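The difference between the two forms is easiest to see by running both against the same input. This is a small sketch with an invented input string and invented variable names:

```shell
input='abxxxc'

# 'x*' is satisfied by zero x'es, so it matches the empty string right at
# the start of the line and the replacement is inserted there.
zero_or_more=$(printf '%s\n' "$input" | sed 's/x*/-/')

# 'xx*' needs at least one x, so it matches the run "xxx".
one_or_more=$(printf '%s\n' "$input" | sed 's/xx*/-/')

printf '%s\n' "$zero_or_more"   # prints: -abxxxc
printf '%s\n' "$one_or_more"    # prints: ab-c
```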
The $ means "end of line" and is a way of "anchoring" a regular expression. If you search for a group of characters, they could appear anywhere in a line. If you want to specifically search for a word appearing at the beginning or the end of a line, these anchors (^ for "beginning of line" and $ for "end of line") are the means to express that.
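A quick sketch of both anchors, using a made-up line in which the same word appears at the beginning and at the end:

```shell
line='foo bar foo'

# ^ ties the match to the beginning of the line.
at_start=$(printf '%s\n' "$line" | sed 's/^foo/X/')

# $ ties the match to the end of the line.
at_end=$(printf '%s\n' "$line" | sed 's/foo$/X/')

printf '%s\n' "$at_start"   # prints: X bar foo
printf '%s\n' "$at_end"     # prints: foo bar X
```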
To sum up so far, the search expression means:
\(??\)\(??\)<one or more spaces>\(??\)<end-of-line>
For the "??" parts:
\(.*\)\(\..*\+.*\) *\(.*\)$
The dot (.) means "any character". In conjunction with the asterisk, which means "any number of what precedes me", it therefore means "any number of any character": the first bracket pair matches pretty much anything of any length.
If this were the whole regex, it would match the complete line. But because it isn't, the second bracket pair in fact limits it:
\(\..*\+.*\)
This matches a literal dot character. Because the dot has a special meaning to sed, you need to "escape" it (precede it with a backslash) if you want to match only a real literal dot: "." = "any character", "\." = "a literal dot character". The "\+" is read analogously here, as an escaped "+" character. Hence the meaning of the regexp inside the bracket pair is: a literal dot, followed by anything, followed by a literal "+", followed by anything. (Be aware that this depends on the sed in use: in GNU sed, "\+" is an extension meaning "one or more of what precedes", and a literal plus in a basic regexp is written as a plain "+".)
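The effect of escaping the dot can be checked with a tiny sketch (input string and variable names are invented for the example):

```shell
input='a.b'

# Unescaped dot: every character matches, so all three are replaced.
any_char=$(printf '%s\n' "$input" | sed 's/./X/g')

# Escaped dot: only the real dot matches.
literal_dot=$(printf '%s\n' "$input" | sed 's/\./X/g')

printf '%s\n' "$any_char"      # prints: XXX
printf '%s\n' "$literal_dot"   # prints: aXb
```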
You should now be able to decipher the rest and put together what it means in context. One thing you need to know, though: regexps are always "greedy", meaning that if there are several ways to match something, the longest possible match is always used. For example, here is some input and a regexp. The part that gets matched is "aBxyzBbla-foo-BsomethingB":
aBxyzBbla-foo-BsomethingBandsomemore
a.*B
Notice that "aB" would also have been a valid match for a.*B, but the longest possible one, running up to the last "B", is the one that is used. Therefore the first regexp part, \(.*\), will skip over the first literal dot (after the file mode field: drwxrwxrwx.) and only stop at the second one.
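Greediness can be made visible with the "&" in the replacement, which stands for the whole matched text. The sketch below reuses the input from above; the second line is an invented, shortened listing line with a simplified pattern, only meant to show a greedy .* skipping the first dot.

```shell
input='aBxyzBbla-foo-BsomethingBandsomemore'

# Put brackets around whatever a.*B matched.
greedy=$(printf '%s\n' "$input" | sed 's/a.*B/[&]/')
printf '%s\n' "$greedy"   # prints: [aBxyzBbla-foo-BsomethingB]andsomemore

# Hypothetical listing line with two literal dots.
listing='drwxrwxrwx. 2 user group 4096 file.txt'

# \(.*\) grabs as much as it can, so \. ends up matching the LAST dot.
before_last_dot=$(printf '%s\n' "$listing" | sed 's/\(.*\)\..*/\1/')
printf '%s\n' "$before_last_dot"   # prints: drwxrwxrwx. 2 user group 4096 file
```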
@drl: I think you could forego the "g" at the end, because you anchor the regexp at the end-of-line anyway.
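As a rough check of that remark, here is a sketch with a simplified anchored pattern (not the original one) and an invented input line. Because the pattern starts with a greedy .* and ends at $, one match already covers the whole line, so there is nothing left for "g" to repeat on:

```shell
line='alpha beta gamma'

with_g=$(printf '%s\n' "$line" | sed 's/\(.*\) \(.*\)$/\2/g')
without_g=$(printf '%s\n' "$line" | sed 's/\(.*\) \(.*\)$/\2/')

printf '%s\n' "$with_g"      # prints: gamma
printf '%s\n' "$without_g"   # prints: gamma
```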
I hope this helps.
bakunin