Understanding sed

panyam · March 1, 2013, 3:57am

Hi,

Can someone explain how "sed" manages to delete the second field here?

Any explanation of how the code below works would be appreciated.

sed 's/^\([^:]*\):[^:]:/\1::/'  /etc/passwd

sed 's/[^:]*:/:/2'  /etc/passwd

Scrutinizer · March 1, 2013, 4:06am

Hi, the first one deletes the second field only if it consists of one character... It replaces the first field, a colon, a single character and another colon with the first field (which was captured with \( .. \) and recalled by \1 ) and two colons.

The second replaces any number of non-colons and a colon with a colon; the number two means that it does this with the second occurrence on a line...
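
To make the difference between the two commands concrete, here is a minimal sketch that runs them on sample lines instead of /etc/passwd. The two records and the variable names are made up for illustration.

```shell
# Sample passwd-style records (made up for illustration).
short='abcd:x:panyam:panyam:Panyam:512'    # second field is one character
long='abcd:xy:panyam:panyam:Panyam:512'    # second field is two characters

# First command: only matches when the second field is exactly one character.
first_short=$(printf '%s\n' "$short" | sed 's/^\([^:]*\):[^:]:/\1::/')
first_long=$(printf '%s\n' "$long" | sed 's/^\([^:]*\):[^:]:/\1::/')

# Second command: replaces the 2nd occurrence of "non-colons plus a colon".
second_long=$(printf '%s\n' "$long" | sed 's/[^:]*:/:/2')

echo "$first_short"   # abcd::panyam:panyam:Panyam:512
echo "$first_long"    # abcd:xy:panyam:panyam:Panyam:512 (unchanged)
echo "$second_long"   # abcd::panyam:panyam:Panyam:512
```

The second command handles a second field of any length, because its pattern does not fix the number of characters before the colon.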

panyam · March 1, 2013, 4:16am

Hi Scrutinizer,

Thanks for the reply. I understood your second explanation.

Your first answer is still slightly confusing to me.

Let's say the sample input is:

abcd:x:panyam:panyam:Panyam:512

echo "abcd:x:panyam:panyam:Panyam:512" | sed 's/^\([^:]*\):[^:]:/\1::/' gives me:

abcd::panyam:panyam:Panyam:512

\([^:]*\) matches : abcd
[^:] matches :x

\1 : prints only "abcd"..

Now, how does the rest of the line end up in the output unchanged? I mean "panyam:panyam:Panyam:512".

Is it because sed prints the non-matching parts as they are?

Is my understanding correct?

Scrutinizer · March 1, 2013, 4:27am

Hi Panyam, yes the rest of the line remains unaltered. It is not part of the substitution, so it gets printed like it is.
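
A small sketch can make this visible. It uses the same pattern as the command discussed in this thread, but wraps the replacement in square brackets (an addition made purely for illustration) so the substituted part stands out from the untouched remainder.

```shell
line='abcd:x:panyam:panyam:Panyam:512'

# Same pattern as in the thread, but the replacement is wrapped in [ ] so
# the substituted part is visible; text after the match is copied as is.
marked=$(printf '%s\n' "$line" | sed 's/^\([^:]*\):[^:]:/[\1::]/')

echo "$marked"   # [abcd::]panyam:panyam:Panyam:512
```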

gary_w · March 1, 2013, 10:20pm

An attempt to further explain the regular expression, with an ulterior motive of setting up for a question.

Given:

sed 's/^\([^:]*\):[^:]:/\1::/'

search for a pattern in the string matching:

^    = start of the line
\(   = Start of first remembered pattern
[^:] = followed by any character that is not a :
*    = followed by any number of the previous character class
       (characters that are not colons)
\)   = end of first remembered pattern
:    = followed by a colon
[^:] = followed by any character that is not a colon
:    = followed by a colon

Replace with:

\1   = the first remembered pattern (the first field)
::   = followed by 2 literal colons

In other words, replace the first 2 colon separated fields
with the first field and 2 colons (deletes the 2nd field).

Question: if this sed command was in a script, could it be commented like I did above in the code? Can a sed regex be multi-line with comments?

One could also do:

s/:[^:]*:/::/

alister · March 1, 2013, 10:54pm

Nope. You can have comments in a sed script, but not within a regular expression. What you are asking is possible with perl if you use the /x regular expression modifier.
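
As a rough sketch of what is possible, the script below puts a comment on its own line ahead of the s command. The sample line and the variable names are assumptions for illustration; the comment describes the regular expression but is not part of it.

```shell
line='abcd:x:panyam:panyam:Panyam:512'

# A multi-line sed script: comments sit on their own lines between commands,
# never inside the regular expression itself.
result=$(printf '%s\n' "$line" | sed '
# Keep field 1 (captured as \1), drop a one-character field 2.
s/^\([^:]*\):[^:]:/\1::/
')

echo "$result"   # abcd::panyam:panyam:Panyam:512
```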

Regards,
Alister

s/:[^:]*/:/ would work just as well (unless it's necessary to prevent the last field in a line, which has no trailing colon, from matching), or even s/[^:]*//2 .
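
The caveat about the last field can be seen in a short sketch. The two sample lines and the variable names are made up for illustration.

```shell
with_more='abcd:x:panyam'   # second field is followed by a colon
last_field='abcd:x'         # second field is the last one, no trailing colon

# Pattern without the closing colon: also empties a final field.
loose_more=$(printf '%s\n' "$with_more" | sed 's/:[^:]*/:/')
loose_last=$(printf '%s\n' "$last_field" | sed 's/:[^:]*/:/')

# Pattern with the closing colon: leaves a final field alone.
strict_last=$(printf '%s\n' "$last_field" | sed 's/:[^:]*:/::/')

echo "$loose_more"    # abcd::panyam
echo "$loose_last"    # abcd:
echo "$strict_last"   # abcd:x
```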

Regards,
Alister

Scrutinizer · March 2, 2013, 4:30am

That is what I would think too, but this does not work like that everywhere. It works with GNU sed, with sed on AIX7 and with regular sed on Solaris, but not with /usr/xpg4/bin/sed on Solaris, nor with sed on HPUX and OSX and some other UNIX flavors.

In those cases where it does not work, the desired effect was obtained when s/[^:]*//3 was used instead (and for the 3rd field s/[^:]*//5 and so on).

How can this be? I think this may have to do with how the respective regex engines interpret a zero-length match after a previous match. The first match of

echo aaa:bbb:ccc:ddd:eee:fff | sed 's/[^:]*//'

renders

:bbb:ccc:ddd:eee:fff

On this every engine agrees. After the first match the engine is positioned after the previous match and before the first colon. But what then constitutes the next match? For GNU sed and the others mentioned above, this apparently means the next run of non-colon characters after the first colon. But the other engines apparently interpret zero repetitions of non-colons before the colon as the next match, which is an empty string and which I guess could be labeled a "strict" interpretation of [^:]* .

Anyway, it seems safest to include one colon in the match, as in the OP's second example, or to insist on a pattern of 1 or more non-colons, i.e. sed 's/[^:][^:]*//2' or sed 's/[^:]\{1,\}//2'
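
A minimal sketch of those two safer forms, run on the sample line used earlier in this post (the variable names are made up for illustration). Because neither pattern can match an empty string, counting occurrences should not depend on how an implementation treats zero-length matches.

```shell
line='aaa:bbb:ccc:ddd:eee:fff'

# Both patterns require at least one non-colon, so an empty match
# can never be counted as the "2nd occurrence".
plus_style=$(printf '%s\n' "$line" | sed 's/[^:][^:]*//2')
interval_style=$(printf '%s\n' "$line" | sed 's/[^:]\{1,\}//2')

echo "$plus_style"       # aaa::ccc:ddd:eee:fff
echo "$interval_style"   # aaa::ccc:ddd:eee:fff
```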

Regards,

S.

alister · March 2, 2013, 3:45pm

Discrepancies like these are why every now and then a space probe crashes into the surface of a planet instead of safely entering orbit.

Thank you for living up to your nick, Scrutinizer. Your observations make us all better.

From POSIX: 9.3.6 BREs Matching Multiple Characters:

GNU sed 4.1.5:

$ echo :BEFORE: | sed 's/[^:]*/AFTER/'
AFTER:BEFORE

I expect :AFTER: . It's behaving as if the match is anchored.

Do other implementations (including newer GNU sed) agree on this as well? I don't have access at the moment to my *BSD stuff, only a Windows machine and an old Linux laptop.
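
One way to read this result: [^:]* is allowed to match zero characters, and the leftmost possible match on this input is the empty string before the first colon. The sketch below (variable names made up for illustration) compares that pattern with one that requires at least one non-colon.

```shell
# [^:]* can match the empty string at the very start of the line,
# so the replacement lands before the first colon.
star=$(echo ':BEFORE:' | sed 's/[^:]*/AFTER/')

# Requiring at least one non-colon skips that empty match.
one_or_more=$(echo ':BEFORE:' | sed 's/[^:][^:]*/AFTER/')

echo "$star"          # AFTER:BEFORE:
echo "$one_or_more"   # :AFTER:
```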

Regards,
Alister

RudiC · March 3, 2013, 4:21am

FreeBSD 9.0-RELEASE
$ echo :BEFORE: | sed 's/[^:]*/AFTER/'
AFTER:BEFORE:

Linux 3.5.0-26-generic
GNU sed version 4.2.1
$ echo :BEFORE: | sed 's/[^:]*/AFTER/'
AFTER:BEFORE:

Scrutinizer · March 3, 2013, 5:25am

@alister:
There seems to be no difference for a single (first) substitution or a global substitution; the implementations only differ when the flag is a specific number. The difference is about a zero-length match after a previous match (was it already covered by the previous match or not?). Compare:

$ echo :BEFORE: | gsed 's/[^:]*/AFTER/'
AFTER:BEFORE:
$ echo :BEFORE: | sed 's/[^:]*/AFTER/'
AFTER:BEFORE:

$ echo :BEFORE: | gsed 's/[^:]*/AFTER/2'
:AFTER:
$ echo :BEFORE: | sed 's/[^:]*/AFTER/2'
:AFTER:

$ echo :BEFORE: | gsed 's/[^:]*/AFTER/3'
:BEFORE:AFTER
$ echo :BEFORE: | sed 's/[^:]*/AFTER/3'
:BEFOREAFTER:

$ echo :BEFORE: | gsed 's/[^:]*/AFTER/4'
:BEFORE:
$ echo :BEFORE: | sed 's/[^:]*/AFTER/4'
:BEFORE:AFTER

$ echo :BEFORE: | gsed 's/[^:]*/AFTER/5'
:BEFORE:
$ echo :BEFORE: | sed 's/[^:]*/AFTER/5'
:BEFORE:

$ echo :BEFORE: | gsed 's/[^:]*/AFTER/g'
AFTER:AFTER:AFTER
$ echo :BEFORE: | sed 's/[^:]*/AFTER/g'
AFTER:AFTER:AFTER

$ echo x:BEFORE: | gsed 's/[^:]*/AFTER/2'
x:AFTER:
$ echo x:BEFORE: | sed 's/[^:]*/AFTER/2'
xAFTER:BEFORE:

alister · March 3, 2013, 9:31am

That sed implementation isn't self-consistent. Using numeric s-command flags, there are 4 matches:

scrutinizer:

$ echo :BEFORE: | sed 's/[^:]*/AFTER/'
AFTER:BEFORE:

$ echo :BEFORE: | sed 's/[^:]*/AFTER/2'
:AFTER:

$ echo :BEFORE: | sed 's/[^:]*/AFTER/3'
:BEFOREAFTER:

$ echo :BEFORE: | sed 's/[^:]*/AFTER/4'
:BEFORE:AFTER

Using the s-command's g flag, there are only 3:

Regards,
Alister

Scrutinizer · March 3, 2013, 4:05pm

I agree this seems to be inconsistent, but it appears to be the case in several sed implementations...

alister · March 3, 2013, 6:35pm

So then we are all in agreement: every sed implementation sucks. This thread can now be closed.

Regards,
Alister

hanson44 · March 6, 2013, 4:46am

The correct output (GNU sed) is:

echo :BEFORE: | sed 's/[^:]*/AFTER/3'
:BEFORE:AFTER

How could the output possibly be:

:BEFOREAFTER:

What's the possible logic? What sed version are you using?

All the other outputs make sense. It's only the one with n=3 that seems off.

Scrutinizer · March 6, 2013, 5:03am

@hanson44: Any of the sed versions described in post #7