hergp, alister, you are right. It is because of the *: zero occurrences still count as a match, so there is always a match (and a substitution) at the second column on every row....
I'm curious. Did you actually benchmark that solution versus a simpler one like mine?
What little might be gained from avoiding some substitutions could be lost to the more complicated regular expression, which now must evaluate a character class, capture, and, when substituting, refer to a backreference. Additionally, if most of the data does require substitution, your more complicated approach could be slowed further.
My instincts tell me there's not much to be gained, but, naturally, I could be wrong. Still, the improvement would have to be non-trivial for me to discard the simpler, more readable solution.
My curiosity got the better of me. I created two files with a million lines each. One file consists of lines that never require any substitution; the other consists of lines that always do. I then tested the solutions on each.
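For reference, test files along these lines can be generated with a quick awk one-liner (the exact field layout below is my guess; the thread does not show the actual data):

```shell
# Hypothetical pipe-delimited layout: text|number|text.
# data-with-0 has leading zeros in the second field; data-without-0 does not.
awk 'BEGIN { for (i = 0; i < 1000000; i++) print "abc|00042|xyz" }' > data-with-0
awk 'BEGIN { for (i = 0; i < 1000000; i++) print "abc|42|xyz" }' > data-without-0
```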
No substitution necessary:
$ time sed 's/|0*/|/' data-without-0 > /dev/null
real 0m2.006s
user 0m1.898s
sys 0m0.072s
$ time sed 's/^\([^|]*|\)00*/\1/' data-without-0 > /dev/null
real 0m0.942s
user 0m0.863s
sys 0m0.066s
Substitution necessary:
$ time sed 's/|0*/|/' data-with-0 > /dev/null
real 0m2.136s
user 0m2.031s
sys 0m0.077s
$ time sed 's/^\([^|]*|\)00*/\1/' data-with-0 > /dev/null
real 0m12.654s
user 0m12.320s
sys 0m0.137s
While the more complicated solution shows some improvement when no substitution is required at all, about 1 second per million lines, it exhibits a much larger degradation when every line requires substitution. Based on my brief testing (insert all the usual caveats about benchmarking here ;)), I would not choose your approach unless the data set is massive AND there are few lines within it requiring modification.
Your solution is very simple and ingenious, and you get the terse and simple award.
My alternative came with a note that, for big data sets, 0* may be slower because it always hits, and a hit seems to carry extra cost to copy or move parts of the buffer. Some regex flavors have the + metacharacter precisely to handle situations like 00*. If you add a second zero so it only triggers occasionally, then you have to anchor to ^, which also has a cost. The costs definitely vary with the data, and probably with different sed versions and systems.
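Where an extended-regex sed is available (-E is an assumption here; classic POSIX sed only guarantees BRE), the 00* idiom can indeed be written with +, and the match still only fires when there is something to remove:

```shell
# ERE equivalent of the BRE '00*': '0+' requires at least one zero.
echo 'abc|00042|xyz' | sed -E 's/^([^|]*\|)0+/\1/'
# With no zeros present, the pattern fails and the line passes through untouched.
echo 'abc|42|xyz' | sed -E 's/^([^|]*\|)0+/\1/'
```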
Your solution is not as general or as extensible, as a tutorial example, being good only for the second field.
It is good to be aware of the alternatives and tradeoffs.
You assume there is always a non-zero digit after the zeros. Preserving just the low-order digit for |00000| while clearing only the leading zeros for |01020|, and allowing variable field width as well, takes a more complex regex.
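One sketch of such a regex (my own construction, not from the thread) keeps a trailing digit group, so the greedy 00* is forced to give one digit back and an all-zero field keeps its last zero:

```shell
# '00*' greedily eats zeros, but the captured '[0-9]' forces it to leave
# one digit behind, so |00000| keeps a single 0 and |01020| keeps 1020.
echo 'abc|00000|x' | sed 's/^\([^|]*|\)00*\([0-9]\)/\1\2/'
echo 'abc|01020|x' | sed 's/^\([^|]*|\)00*\([0-9]\)/\1\2/'
```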
I am overwhelmingly thrilled that you did the benchmark. So few read the manual or, better yet, write something up and try it. The sed man page for UNIX SVR3 lied about how a greedy wildcard worked; I saw it immediately, and my office mate was amazed the prediction turned out correct!
In the internet age, you cannot say 'whatever', you have to Google and quote something.
A real developer writes something and tries it, because man pages lie, or are vague, or are talking about something else. And sometimes we have people hip-shooting their wiki-ignorance.
Data can make a big difference. Removing leading spaces with '| *' is faster than removing trailing spaces with ' *|', because a delimited file usually has many columns. Removing trailing spaces with ' *|' is slower because of the huge number of spaces, so I always do leading first (to remove the spaces in the empty fields), and sometimes pipe many seds together to lighten the load. Finally, I wrote a C utility, all state variables and getchar()/putchar(), to do the really big sets really fast, because I was way past 2 megs. Even at 2 megs, the cache and VM hits make a big difference. And to think the first H200 came with 2 or 4K ram -- how time flies.
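The leading-first ordering might be sketched like this (my reconstruction; the thread does not show the actual commands):

```shell
# Pass 1: strip spaces after each delimiter (leading spaces; empty fields
# collapse cheaply). Pass 2: strip spaces before each delimiter (trailing).
echo 'a | b | c' | sed 's/| */|/g' | sed 's/ *|/|/g'
```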
---------- Post updated at 05:07 PM ---------- Previous update was at 05:03 PM ----------
Yes, I was scratching around for that one, the 'save any first digit after the zeros' thing. It is the greedy wildcard and the left-to-right matching that make it work, so it feels too loose, but it goes!
---------- Post updated at 05:13 PM ---------- Previous update was at 05:07 PM ----------
Good point, I had neglected the trailing-number (rather than g) flag case, as it only gives you access to one column, and so I have yet to use it in the real world. If sed had better delimited-field handling, without becoming awk, it would be great! The limit of 99 in \{99\} is a pain, too.
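For what it's worth, the interval syntax does let the same trick reach an arbitrary column, at the cost of readability (a sketch, counting fields from 1):

```shell
# Strip leading zeros from field 3 by skipping the first two 'field|' groups
# with \{2\}; group 1 is the skipped prefix, group 3 the digit kept.
echo 'a|b|007|d' | sed 's/^\(\([^|]*|\)\{2\}\)00*\([0-9]\)/\1\3/'
```

Raising the repeat count reaches later columns, up to the implementation's interval limit.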
Speed in sed is so good, it is really a tractor trailer of a tool!