Sed regex problem

thiuda · November 11, 2010, 11:27am

Hi,

I tried to extract the time from `date` with sed.
(I know it works with `date +%H:%M:%S` as well)

I tried three variants, of which only one worked. I thought "+" repeats the previous expression one or more times and {n} repeats it exactly n times.

$ date
Thu Nov 11 17:19:03 CET 2010
$ date | sed 's/^.*\([0-9]+:[0-9]+:[0-9]+\).*$/\1/'
Thu Nov 11 17:19:05 CET 2010
$ date | sed 's/^.*\([0-9]{1}:[0-9]{1}:[0-9]{1}\).*$/\1/'
Thu Nov 11 17:19:11 CET 2010
$ date | sed 's/^.*\([0-9][0-9]:[0-9][0-9]:[0-9][0-9]\).*$/\1/' 
17:19:16

I just want to know what's wrong with the examples.

Linux 2.6.31-22-generic #68-Ubuntu SMP Tue Oct 26 16:38:35 UTC 2010 i686 GNU/Linux
bash shell

ctsgnb · November 11, 2010, 11:44am

[ctsgnb@shell ~]$ date | sed 's/^.*\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\).*$/\1/'
09:44:17

thiuda · November 11, 2010, 12:16pm

Why do I have to escape something like * and . to get the character and escape + and { to get the regex?

Isn't that counterintuitive?

Thanks for the help

ctsgnb · November 11, 2010, 12:37pm

you could more simply use

date | sed 's/^.*\(..:..:..\).*$/\1/'

thiuda · November 11, 2010, 12:44pm

sure, I could also use

date | awk '{print $4}'

anbu23 · November 11, 2010, 1:11pm

# date | cut -d" " -f4
12:10:52

ctsgnb · November 11, 2010, 1:57pm

but the best would still be date +%H:%M:%S

---------- Post updated at 07:57 PM ---------- Previous update was at 07:17 PM ----------

The + sign is part of the extended set of metacharacters used by egrep and awk. Some newer sed versions may support it, but it is not part of the old standard sed implementation.

[0-9]+

can instead be written as

[0-9][0-9]*
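To make the equivalence concrete, here is a small sketch. It assumes a fixed sample line (copied from the first post) instead of live `date` output so the result is reproducible, and it pulls out the year rather than the time to keep the pattern short. The variable names are made up for the illustration.

```shell
# Fixed sample line (assumption: same layout as the date output in the first post).
sample='Thu Nov 11 17:19:03 CET 2010'

# BRE: "+" is an ordinary character here, so the pattern does not match
# and sed prints the line unchanged.
plus=$(printf '%s\n' "$sample" | sed 's/^.* \([0-9]+\)$/\1/')

# BRE: one digit followed by zero or more digits, i.e. "one or more digits".
star=$(printf '%s\n' "$sample" | sed 's/^.* \([0-9][0-9]*\)$/\1/')

echo "$plus"   # Thu Nov 11 17:19:03 CET 2010
echo "$star"   # 2010
```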

But I ran into the following case, where the * seems to match the shortest string instead of the longest (on FreeBSD).
(I don't know whether this behaviour can be tweaked somewhere...):

[ctsgnb@shell ~]$ date | sed 's/^.*\([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\).*$/\1/'
1:55:42
[ctsgnb@shell ~]$ uname -a
FreeBSD <anonymized> 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Sun Jul 25 16:41:25 MDT 2010     <anonymized>:/usr/obj/usr/src/sys/CJB  amd64
[ctsgnb@shell ~]$

Oops, I just realized that ...
... the first * in fact matches the longest string, which forces the second one to match the shortest. This can lead to tricky, unexpected results.
So the space matters:

[ctsgnb@shell ~]$ date | sed 's/^.*\([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\).*$/\1/'
2:27:24
[ctsgnb@shell ~]$ date | sed 's/^.* \([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\).*$/\1/'
12:27:29
[ctsgnb@shell ~]$
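The effect can be reproduced without depending on the current time. The sketch below assumes a fixed sample line in the same layout as the `date` output above; the variable names are only for illustration.

```shell
# Fixed sample line (assumption: same layout as the date output in this thread).
sample='Thu Nov 11 17:19:03 CET 2010'

# The leading .* is greedy: it also swallows the first digit of the hour,
# because a single remaining digit is enough for [0-9][0-9]* to match.
greedy=$(printf '%s\n' "$sample" |
  sed 's/^.*\([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\).*$/\1/')

# Requiring a literal space before the group stops .* at the field boundary,
# so the whole hour stays inside the group.
anchored=$(printf '%s\n' "$sample" |
  sed 's/^.* \([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\).*$/\1/')

echo "$greedy"     # 7:19:03
echo "$anchored"   # 17:19:03
```

Any fixed text in front of the group would do the same job; the space is just what happens to precede the time field here.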

Scrutinizer · November 11, 2010, 4:47pm

You could even leave out the start/end markers:

sed 's/.*\(..:..:..\).*/\1/'

since the colons are sufficient as anchors...

thiuda · November 11, 2010, 9:23pm

Yeah, sure, there are plenty of ways to get the time from date. My problem was that I did not know that sed treats + and { primarily as normal characters and not as regex operators.

Maybe we should start an off-topic thread with all the ways to extract the time from `date` on *nix-like systems, or even on all kinds of OS.

Scrutinizer · November 12, 2010, 1:51am

First of all, welcome to this forum.

There was another, more important problem with your sed attempts and that was the greediness of sed, as ctsgnb pointed out. My post was related to that and made a point about anchoring the regex. It was perhaps primarily meant for ctsgnb, but it may be of use to you too, so I reckon it certainly is related to this thread and it is not off topic. Please also note that this is not "your" thread as the OP. A thread may be of use to you as the OP, to those who participate, to anyone who reads along and to those who use it as a reference or land there as the result of a query.

--

Further, it is not so much that sed treats + and {} as normal characters. It is that modern sed adheres to POSIX BRE (Basic Regular Expressions), in which curly brackets need to be escaped and + is not supported at all. GNU sed supports it as \+ through an extension.

If you wish to use ERE (Extended Regular Expressions) you can use GNU sed with the -r option and then + and {} can be used without the escape, since { and + are part of ERE.
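As a rough side-by-side, the sketch below runs the same extraction once in BRE and once in ERE. It assumes GNU sed for the -r option and uses a fixed sample line instead of live `date` output; the variable names are invented for the example. As far as I know, BSD sed and newer GNU sed also accept -E for the same purpose.

```shell
# Fixed sample line (assumption: same layout as the date output in this thread).
sample='Thu Nov 11 17:19:03 CET 2010'

# BRE (sed default): interval braces and grouping parentheses need backslashes.
bre=$(printf '%s\n' "$sample" |
  sed 's/^.*\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\).*$/\1/')

# ERE (GNU sed -r): +, {n} and ( ) are operators without backslashes.
# The space before the group keeps the greedy .* from eating the first hour digit.
ere=$(printf '%s\n' "$sample" |
  sed -r 's/^.* ([0-9]+:[0-9]+:[0-9]+).*$/\1/')

echo "$bre"   # 17:19:03
echo "$ere"   # 17:19:03
```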