Remove leading zeroes in 2nd field using sed

Hi Forum.

I tried searching the forum but couldn't find a solution for my question.

I have the following data and would like to have a sed syntax to remove the leading zeroes from the 2nd field only:

Before:

2010-01-01|123|1|1000|2000|500|1500|600|700
2010-01-01|0456|1|1000|2000|500|1500|600|700

After:

2010-01-01|123|1|1000|2000|500|1500|600|700
2010-01-01|456|1|1000|2000|500|1500|600|700

Any help will be greatly appreciated.

Thank you.

sed 's/\([^|]*|\)0*/\1/' infile >outfile

Hi, you need to anchor it to the beginning of the line, otherwise it may alter other fields:

sed 's/^\([^|]*|\)0*/\1/' infile
$ ruby -F"\|" -ane '$F[1].gsub!(/^\s*0+/,"");print $F.join("|")' file

Thanks, that's right, Scrutinizer. Though I did not find any difference between the anchored and unanchored versions, no matter what I tried.

1 Like

It isn't necessary IMHO, this part:

\([^|]*|\)

selects the line until the first "|":

2010-01-01|

The following will handle any leading zeroes (if present) after the first "|" in the line. Other fields with leading zeroes are not modified.

sed 's/|0*/|/'
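To see this in action on a line like the ones in the original post (the extra zero-padded fourth field is made up, just to show it is left alone):

```shell
# Only the field after the first "|" is touched: without the g flag,
# sed replaces just the leftmost match of "|0*", so the zeroes in
# later fields survive.
printf '%s\n' '2010-01-01|0456|1|01000|2000' | sed 's/|0*/|/'
# -> 2010-01-01|456|1|01000|2000
```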

Regards,
Alister

3 Likes

thank you guys for all your proposed solutions. Will test them out.

hergp, alister, you are right. It is because of the *: zero occurrences is also a match, so there is always a match (and a substitution) in the second column on every row....
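A quick demonstration of that point, using one of the sample lines from the first post:

```shell
# Because 0* also matches the empty string, the leftmost match of
# '\([^|]*|\)0*' always starts at column 1 -- anchored or not, the
# result is identical.
echo '2010-01-01|0456|1|1000' | sed 's/\([^|]*|\)0*/\1/'
echo '2010-01-01|0456|1|1000' | sed 's/^\([^|]*|\)0*/\1/'
# -> both print: 2010-01-01|456|1|1000
```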

A minor trick for more speed -- no sub if no zero:

sed 's/^\([^|]*|\)00*/\1/' infile

Do you want the units zero preserved?

sed 's/^\([^|]*|\)0\{1,99\}\([0-9]\{1,99\}|\)/\1\2/' infile

The first interval expression '\{1,99\}' takes precedence, as sed matches
greedily left to right, but it has to stop short and leave at least one
digit behind for the second.
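For illustration, here is that behavior on an all-zero second field (a made-up variation of the sample data):

```shell
# The greedy 0\{1,99\} grabs as many zeroes as it can, but the
# following [0-9]\{1,99\} forces it to leave at least one digit,
# so an all-zero field keeps a single 0 instead of becoming empty.
echo '2010-01-01|0000|1|1000' | sed 's/^\([^|]*|\)0\{1,99\}\([0-9]\{1,99\}|\)/\1\2/'
# -> 2010-01-01|0|1|1000
```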

Good observation DGPickett, I guess to preserve the 0, this would work too:

sed 's/^\([^|]*|\)00*\([1-9]\)/\1\2/' infile

I'm curious. Did you actually benchmark that solution versus a simpler one like mine?

What little might be gained from avoiding some substitutions could be lost to the more complicated regular expression, which now must evaluate a character class, capture a group, and, when it does substitute, resolve a backreference. Additionally, if most of the data does require substitution, your more complicated approach could be further slowed.

My instincts tell me there's not much to be gained, but, naturally, I could be wrong. Still, the improvement would have to be non-trivial for me to discard the simpler, more readable solution.

Regards,
Alister

I also suspect there is little to be gained speedwise; however, I think the preservation of zero values is a good point.

Hi, DGPickett:

My curiosity got the better of me. I created two files with a million lines each. One file consists of lines that never require any substitution. The other of lines that always require substitution. I then tested the solutions on each.

$ jot -w '2010-01-01|123|1|1000|2000|500|1500|600|' 1000000 > data-without-0
$ jot -w '2010-01-01|0123|1|1000|2000|500|1500|600|' 1000000 > data-with-0
$ wc -l data*; ls -lh data*
 1000000 data-with-0
 1000000 data-without-0
 2000000 total
-rw-r--r--   1 xxxxxx  xxxxxx       45M Oct  4 16:18 data-with-0
-rw-r--r--   1 xxxxxx  xxxxxx       44M Oct  4 16:17 data-without-0

No substitution necessary:

$ time sed 's/|0*/|/' data-without-0 > /dev/null

real    0m2.006s
user    0m1.898s
sys     0m0.072s
$ time sed 's/^\([^|]*|\)00*/\1/' data-without-0 > /dev/null

real    0m0.942s
user    0m0.863s
sys     0m0.066s

Substitution necessary:

$ time sed 's/|0*/|/' data-with-0 > /dev/null

real    0m2.136s
user    0m2.031s
sys     0m0.077s
$ time sed 's/^\([^|]*|\)00*/\1/' data-with-0 > /dev/null

real    0m12.654s
user    0m12.320s
sys     0m0.137s

While the more complicated solution shows some improvement when no substitution is required at all, about 1 second per million lines, it exhibits a much larger degradation if substitution is required on every line. Based on my brief testing (insert all the usual caveats about benchmarking here ;)), I would not choose your approach unless the data set is massive AND there are few lines within it requiring modification.

Regards,
Alister

Your solution is very simple and ingenious, and you get the terse-and-simple award.

My alternative came with the caveat that, for big data sets, 0* may be slower because it always matches, and a match seems to carry the extra cost of copying or moving parts of the buffer. Some regex flavors have the metacharacter + precisely to ease situations like 00*. If you add a second zero so the pattern only triggers occasionally, then you have to anchor it to ^, which is also a cost. The costs definitely vary with the data, and probably vary across sed versions and systems.

As a tutorial example, your solution is not as easily generalized, being good only for the second field.

It is good to be aware of the alternatives and tradeoffs.

It is wise to benchmark when the run time rises.

I think alister's solution is extensible though. For example, this would take care of the 3rd column:

sed 's/|0*/|/2' infile

and this of the 4th

sed 's/|0*/|/3' infile

and so forth
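For instance, on a line where both the second and third fields carry leading zeroes (made-up values), the occurrence flag picks out just the one you ask for:

```shell
# The trailing number selects which match of "|0*" to replace:
# /1 would strip field 2, /2 strips field 3, and so on.
echo '2010-01-01|0456|007|1000' | sed 's/|0*/|/2'
# -> 2010-01-01|0456|7|1000
```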

Re: sed 's/^\([^|]*|\)00*\([1-9]\)/\1\2/' infile

You assume there is always a non-zero digit after the zeros. Preserving just the low-order digit of |00000|, while clearing only the leading zeroes of |01020| and allowing variable field widths as well, takes a more complex regex.

Putting in commas takes looping or repetition. :smiley:

Fair enough, I did not think of 0000 for instance, but then this would do, wouldn't it?:

sed 's/^\([^|]*|\)00*\([0-9]\)/\1\2/' infile
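Checking it against both problem cases mentioned above (field values taken from the |00000| and |01020| examples, wrapped in made-up neighbor fields):

```shell
# 00* eats the leading zeroes; the captured [0-9] keeps the first
# remaining digit, so an all-zero field collapses to a single 0
# while a mixed field like 01020 loses only its leading zero.
printf '%s\n' 'a|0000|b' 'a|01020|b' | sed 's/^\([^|]*|\)00*\([0-9]\)/\1\2/'
# -> a|0|b
# -> a|1020|b
```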

I am overwhelmingly thrilled that you did the benchmark. So few read the manual or, better yet, write something up and try it. The sed man page for UNIX SVR3 lied about how a greedy wild card worked, which I spotted immediately; my office mate was amazed that my prediction turned out correct!

In the internet age, you cannot say 'whatever', you have to Google and quote something.

A real developer writes something and tries it, because man pages lie, or are vague, or are talking about something else. And sometimes we have people hip-shooting their wiki-ignorance.

Data can make a big difference. Removing leading spaces with '| *' is faster than removing trailing spaces with ' *|', because a delimited file usually has many columns and the trailing pattern has to chew through a huge number of spaces. So I always do the leading pass first (it also empties the blank fields), and sometimes pipe several seds together to lighten the load. Finally, I wrote a C utility, all state variables and getchar()/putchar(), to do the really big sets really fast, because I was way past 2 megs. Even at 2 megs, the cache and VM hits make a big difference. And to think the first H200 came with 2 or 4K of RAM -- how time flies.
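A minimal sketch of that leading-then-trailing order (the sample line is made up; here the two passes are simply chained as two expressions in one sed rather than piped):

```shell
# Strip leading spaces after each "|" first, which empties the
# all-blank fields, then strip the (now fewer) trailing spaces.
printf '%s\n' 'a|  12 | 0 |b' | sed -e 's/| */|/g' -e 's/ *|/|/g'
# -> a|12|0|b
```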

---------- Post updated at 05:07 PM ---------- Previous update was at 05:03 PM ----------

Yes, I was scratching around for that one, the 'save any first digit after the zeros' trick. It is the greedy wildcard and the left-to-right matching that make it work, so it feels too loose, but it goes!

---------- Post updated at 05:13 PM ---------- Previous update was at 05:07 PM ----------

Good point, I had neglected the trailing-number (instead of g) case, as it only gives you access to one column, so I have yet to use it in the real world. If sed had better support for delimited fields, without becoming awk, it would be great! The limit of 99 in \{1,99\} is a pain, too.

Speed in sed is so good, it is really a tractor trailer of a tool!