sed or awk editing help

ygemici · November 1, 2010, 3:27pm

All right, then why do you try to add sed code without a loop?

Without a loop:

# echo '     ,          abc,             ,  sd   ,      ,   ,     ,      ' | sed -r 's/(  *,|,|^)  *(,|$)/\1\2/g;s/,  *,/,,/g;s/,  *$/,/' |od -bc
0000000 054 040 040 040 040 040 040 040 040 040 040 141 142 143 054 054
          ,                                           a   b   c   ,   ,
0000020 040 040 163 144 040 040 040 054 054 054 054 012
                  s   d               ,   ,   ,   ,  \n
0000034

With a loop:

# echo '     ,          abc,             ,  sd   ,      ,   ,     ,      ' | sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta' | od -bc
0000000 054 040 040 040 040 040 040 040 040 040 040 141 142 143 054 054
          ,                                           a   b   c   ,   ,
0000020 040 040 163 144 040 040 040 054 054 054 054 012
                  s   d               ,   ,   ,   ,  \n
0000034

Scrutinizer · November 1, 2010, 3:48pm

@ygemici: There is no fundamental difference between a loop that you go through twice and two or more consecutive search-and-replace commands, and now you come with one that has 3 search-and-replace commands. I have tried to explain that many times in this thread. One more example: I could write this:

sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta'

or this:

sed -r 's/(,|^)  *(,|$)/\1\2/g;s/(,|^)  *(,|$)/\1\2/g'

It does not matter: they are practically equivalent. The first one is shorter though.
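Why a second pass is needed at all can be seen by running a single pass on the sample line from earlier in the thread. The sketch below is only an illustration (the variable names are made up, and -r assumes GNU sed): with the g flag, sed resumes scanning after the comma that ended the previous match, so that comma cannot also start the next one.

```shell
# Sample line from earlier in the thread (variable names are illustrative).
line='     ,          abc,             ,  sd   ,      ,   ,     ,      '

# One pass: the comma that closes a match is consumed and cannot open the next.
one_pass=$(printf '%s\n' "$line" | sed -r 's/(,|^)  *(,|$)/\1\2/g')

# Two explicit passes versus a loop that repeats until nothing changes.
two_pass=$(printf '%s\n' "$line" | sed -r 's/(,|^)  *(,|$)/\1\2/g;s/(,|^)  *(,|$)/\1\2/g')
looped=$(printf '%s\n' "$line" | sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta')

# Brackets make leftover trailing spaces visible.
printf '[%s]\n' "$one_pass" "$two_pass" "$looped"
```

The first bracketed line still contains space-only fields; the second and third are identical.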

DGPickett · November 1, 2010, 3:55pm

A loop saves pasting the command in a second time, so it trades runtime for a very minor saving in coding effort. Each pass is relatively expensive, so if 1 is not enough and 2 is enough, looping an extra time to 3 tries is silly (and you always have a dry pass). Actually, sometimes 2 is silly:

$ sed -r '
  s/(,|^)  *(,|$)/,,/g
  t n
  b
  :n
  s/(,|^)  *(,|$)/,,/g
 ' <<! | cat -e
aaa,   bbb,ccc   ,    ,    ,dddd
    ,ee  ee,   
!
aaa,   bbb,ccc   ,,,dddd$
,,ee  ee,,$
 
$

Of course, I have embraced some very fat data sets!

Scrutinizer · November 1, 2010, 4:30pm

At the third time there are no further matches (dry pass as you call it) and so that is a relatively cheap pass. I found only a small performance difference between one or the other (<8%). Another consideration is that using the loop means short code that is easy to understand.

---------- Post updated at 21:22 ---------- Previous update was at 21:05 ----------

We could simplify the two statements since we can do the first field in the first run and the last field in the second run:

sed -r 's/(,|^)  *,/\1,/g;s/,  *(,|$)/,\1/g'

This runs 20% faster.

---------- Post updated at 21:30 ---------- Previous update was at 21:22 ----------

OK, this really is lots and lots faster; it only takes 1/5th of the time !!

sed -r 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'

but it is not because of the absence of the loop, since:

sed -r 's/^  *,/,/;:a;s/,  *,/,,/g;ta;s/,  *$/,/'

this is only 10% slower than this fast solution.

Obviously we can now drop the -r:

sed 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'

sed 's/^  *,/,/;:a;s/,  *,/,,/g;ta;s/,  *$/,/'

But that made no difference in performance.

==========
Apparently it is the lack of capturing groups, alternation and back references that makes all the difference!!
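Before comparing speeds it helps to confirm that the variants agree on the result. This is a rough check on the thread's one sample line only, not a proof of equivalence; the helper name check and the variable names are made up for the illustration.

```shell
line='     ,          abc,             ,  sd   ,      ,   ,     ,      '

# Reference result: the looping version from earlier in the thread.
reference=$(printf '%s\n' "$line" | sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta')

mismatches=0
check() {
  # "$@" is one of the sed invocations quoted from the posts above.
  result=$(printf '%s\n' "$line" | "$@")
  if [ "$result" != "$reference" ]; then
    mismatches=$((mismatches + 1))
    printf 'differs: %s\n' "$*"
  fi
}

check sed -r 's/(,|^)  *,/\1,/g;s/,  *(,|$)/,\1/g'
check sed 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'
check sed 's/^  *,/,/;:a;s/,  *,/,,/g;ta;s/,  *$/,/'

printf 'mismatches: %s\n' "$mismatches"
```

Note that the group-free variants leave a line made up only of spaces untouched, while the (,|^)  *(,|$) pattern empties it, so the agreement holds for lines that contain at least one comma.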

DGPickett · November 1, 2010, 4:53pm

I created some extra commas with that last one. Yes, lack of iteration speeds things up! Does the t n b :n improve things over just two passes?

$ sed '
  s/^  *,/,/
  s/,  *$/,/
  s/,  *,/,,/g
  t n
  b
  :n
  s/,  *,/,,/g
 ' <<! |cat -e
aaa,   bbb,ccc   ,    ,    ,dddd
    ,ee  ee,   
!
aaa,   bbb,ccc   ,,,dddd$
,ee  ee,$ 
$
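One caveat on the t n / b pattern, offered as a sketch rather than a correction: POSIX describes t as branching if any substitution has succeeded since the last input line was read or the last t was executed. In the script above the first two s commands count too, so on the second sample line the extra pass still runs even though the g substitution matched nothing. One way to make the test depend only on the g substitution is to clear the flag first with a t that jumps to the next line. The label names reset and again, and the marker text, are made up for the illustration.

```shell
out=$(printf '%s\n' 'aaa,   bbb,ccc   ,    ,    ,dddd' '    ,ee  ee,   ' |
sed '
  s/^  *,/,/
  s/,  *$/,/
  t reset
  :reset
  s/,  *,/,,/g
  t again
  b
  :again
  s/,  *,/,,/g
  s/$/ <second pass ran>/
')
printf '%s\n' "$out"
```

With the reset in place the marker is appended to the first line only; the second line skips the extra pass.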

PS: How big did you expand the data set for the timings?

Scrutinizer · November 1, 2010, 4:54pm

DG, what I am finding is that it is not the lack of iteration that speeds things up dramatically, but the lack of capturing groups, alternation and back references.

The "t n b :n" bit in your suggestion is on par with my last part with the ta bit, i.e. 10% slower than without the loop.

I used 128K lines

DGPickett · November 1, 2010, 5:04pm

Iteration in the alternation? Yes, the (|) looks like a time sponge. So, are back references implicitly slow?

Scrutinizer · November 1, 2010, 5:26pm

Not even iteration in the alternation. The real culprit appears to be the use of capturing groups! To test this I used this:

sed -r 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'

and changed it to this:

sed -r 's/(^)  *,/,/;s/(,)  *,/,,/g;s/(,)  *,/,,/g;s/,  *($)/,/'

So only capturing groups without alternation or back references.

The latter took 5x as long!!
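A rough way to repeat this comparison is sketched below. The original data set is not given in the thread, so the file here is a stand-in: 131072 copies of the sample line, reading "128K" as 128 x 1024. The function and variable names are made up for the illustration.

```shell
# Stand-in test file: 131072 copies of the thread's sample line.
line='     ,          abc,             ,  sd   ,      ,   ,     ,      '
data=$(mktemp)
awk -v line="$line" 'BEGIN { for (i = 0; i < 131072; i++) print line }' > "$data"

# The two commands from the post, differing only in the parentheses.
plain()   { sed -r 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'; }
grouped() { sed -r 's/(^)  *,/,/;s/(,)  *,/,,/g;s/(,)  *,/,,/g;s/,  *($)/,/'; }

# Confirm both produce the same result before comparing their speed.
if [ "$(plain < "$data" | cksum)" = "$(grouped < "$data" | cksum)" ]; then
  same=yes
else
  same=no
fi
echo "outputs identical: $same"
```

In a shell with a time keyword, such as bash, running time plain < "$data" > /dev/null and time grouped < "$data" > /dev/null gives the comparison; remove the file afterwards with rm -f "$data". The ratio may well differ with sed version and locale, so the 5x figure should be read as what was measured in this thread.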

ygemici · November 2, 2010, 3:27am

scrutinizer:

@ygemici: There is no fundamental difference between a loop that you go through twice and two or more consecutive search-and-replace commands, and now you come with one that has 3 search-and-replace commands. I have tried to explain that many times in this thread. One more example: I could write this:
sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta'
or this:
sed -r 's/(,|^)  *(,|$)/\1\2/g;s/(,|^)  *(,|$)/\1\2/g'
It does not matter: they are practically equivalent. The first one is shorter though.

I already know this. But for me the important issue is correct results and, at the same time, speed, not brevity.

Scrutinizer · November 2, 2010, 3:36am

And you did gather from this thread that there is not a real speed difference between the two?

ygemici · November 2, 2010, 4:33am

I tried this with a big file, about 5-6 MB in size.
As a result, both methods ran properly.

Thanks for your suggestions

Regards
ygemici - @sed lover

Scrutinizer · November 2, 2010, 4:41am

Good, additionally we found a method that is 5x as fast:

sed 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'

It turns out this is not at all due to the lack of loops, but apparently because no grouping with ( ) or \( \) is being used.

ygemici · November 2, 2010, 5:22am

scrutinizer:

Good, additionally we found a method that is 5x as fast:
sed 's/^  *,/,/;s/,  *,/,,/g;s/,  *,/,,/g;s/,  *$/,/'
It turns out this is not at all due to the lack of loops, but apparently because no grouping with ( ) or \( \) is being used.

Of course this will be faster, because no buffers are used here, but we have to compare "loop with group" against "non-loop with group".

I tested it again..

Loop, with groups:

real    0m4.140s
user    0m3.005s
sys     0m0.899s

No loop, with groups:

real    0m3.851s
user    0m2.368s
sys     0m0.918s

No loop, no groups:

real    0m2.184s
user    0m0.678s
sys     0m0.650s

regards
ygemici

Scrutinizer · November 2, 2010, 11:14am

Which is consistent with my findings. There is not much difference between using several replace statements and using a loop, but there is a significant gain when groups are not being used.

You state this is obvious ("of course"). Why then didn't you use it in the first place, especially since your main concern is speed?

I do not think it was obvious at all and I cannot think of a good reason why grouping would have to take up so much time. I am glad I found out that it does though.

ctsgnb · November 2, 2010, 3:55pm

Maybe it is the parsing step: when ambiguous grouping is specified, it goes through a kind of "auto-completion" step.
Maybe it also uses a memory copy instead of a memory mapping?

DGPickett · November 2, 2010, 4:08pm

Like many features, pretty but not fast, so you avoid them for heavy lifting if possible, but sometimes you need them even when they are slower.

Of course, if the user wanted to trim leading and trailing spaces, these 4 would run fast, but the ones searching for a space first should go last, as sed works its regexes left to right, so it is best to have a selective string on the left end (I guess an Arabic/Hebrew language sed would work right to left?):

sed '
  s/^  *//
  s/,  */,/
  s/  *,/,/
  s/  *$//
 '
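As written, the two middle commands have no g flag, so each changes only the first match on a line. The sketch below assumes the intent is to trim the spaces around every field and adds g to both; the function name trim_fields and the sample input are made up for the illustration.

```shell
trim_fields() {
  sed '
    s/^  *//
    s/,  */,/g
    s/  *,/,/g
    s/  *$//
  '
}

out=$(printf '%s\n' '   aaa ,  bb b  ,ccc   ' | trim_fields)
printf '[%s]\n' "$out"
# prints [aaa,bb b,ccc]
```

Spaces inside a field (as in "bb b") are kept; only those next to a comma or at either end of the line go away.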

---------- Post updated at 04:08 PM ---------- Previous update was at 04:00 PM ----------

Early sed had a limited line size, and was faster with less indirection, but gnu sed and later sed's seem to have very big or realloc()'d buffers. I don't think many programmers use a mmap()'d tmp file for the buffer. Sed is a pipe-oriented stream editor, so it would not be able to map the input file all the time, and even so, it could not write there, and of course it needs to scan intermediate product on multi-command scripts. So, I am not thinking memory map and sed at the same time. I suspect as it rewrites a line, it has an input pointer and an output pointer, and if the line is expanding, then when the pointers collide, there must be either mid-substitute moves in one buffer or copying between two buffers. I am not going to read the code, though!

ctsgnb · November 2, 2010, 4:14pm

Yeah! Thanks for these details, DGnitPick.

By the way, here is a thread in which I posted an example a few days ago of what I call "ambiguous grouping":
http://www.unix.com/shell-programming-scripting/143949-substring-using-cut-awk-sed-3.html#post302464388 (see post #21)

Consider how it behaves with ambiguous matching and how the \1 and & are auto-completed and \2 also if last appearing in the line (that was on SunOS 5.9, but i got the same results on a GNU linux machine) :

# echo MPMTR20100706043000.txt|sed -n -e 's/\([0-9][0-9]\).*\(3[0-9][0-9]\)/\1,\2/p'
MPMTR20,3000.txt
# echo MPMTR20100706043000.txt|sed -n -e 's/\([0-9][0-9]\).*\(3[0-9][0-9]\)/\1,\2,&/p'
MPMTR20,300,20100706043000.txt
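One reading of this output, offered as a sketch: sed replaces only the text the pattern actually matched and copies everything outside it through unchanged, which can look like auto-completion. Wrapping & in brackets shows exactly what was consumed; the variable names are made up for the illustration.

```shell
name='MPMTR20100706043000.txt'

# Wrap the whole match (&) in brackets to see what the pattern consumed.
matched=$(echo "$name" | sed -e 's/\([0-9][0-9]\).*\(3[0-9][0-9]\)/[&]/')
echo "$matched"
# prints MPMTR[2010070604300]0.txt
```

The match starts at the first pair of digits ("20") and ends with "300"; the "MPMTR" prefix and the "0.txt" tail lie outside it, so they reappear around whatever the replacement produces.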

DGPickett · November 2, 2010, 4:21pm

Hey, I love mmap(), and mmap64() more, a golden door into unlimited VM and RAM use and random access to application data. But sed is all about the (expandable) microcosm of two lines. The early small buffer sed activity fit in the small L1 caches of earlier days. I rarely nit pick below the bit level -- the pixel, maybe, but not the bit! It is about seeing all the choices, weighing all the choices, and making informed choices, investing in knowing the right technique for the next time, investing in yourself, investing in your new friends.

ctsgnb · November 2, 2010, 4:45pm

there must be either mid-substitute moves in one buffer or copying between two buffers

Since I am not a C coder, I trust your intuition about that, dude; we have no better answer so far.

DGPickett · November 2, 2010, 4:58pm

If you have Solaris, the truss -u'' option shows you more than you want to know about the libc and other calls a running proc is making. JAVA object creation does a lot of memcpy()! The truss or tusc commands are very educational, even if you do not have the code, do not read C/C++, even if the process is already running! It shows all the kernel calls even without the -u'' feature.
