Optimizing find with many replacements

Hello,

I'm looking for advice on how to optimize this bash script. Currently I use the shotgun approach to avoid file I/O and buffering problems caused by forked processes trying to write to the same file simultaneously. I'd like to keep this as a fairly portable bash script rather than writing a C routine.

In a nutshell, there are many conditions under which I'm looking to replace strings in a file. Any particular file may meet some, none, or all of the conditions for a replacement.

Currently:

Longstring='lots of stuff'
spushd $HOME/somepath

gfind . -depth -name "somefile" -type f -writable -exec gsed -i '1{/^#./! s/.*/'"$Longstring"'/}' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/ts=4/ts=2/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/sw=4/sw=2/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/tab-width: 4/tab-width: 2/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/mode: tcl/mode: _tcl/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/c-basic-offset: 4/c-basic-offset: 2/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/^\s*(size.*)$/\1/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/^\s*(md.*)$/\1/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/^\s*(rmd.*)$/\1/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/^\s*(sha.*)$/\1/g' {} \;
gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r 's/^(python.versions.*)$/python.versions 27 36/g' {} \;
spopd

As you can see, these operations run sequentially, which can take quite a while.

Should I modify the find to do a depth-first traversal?

Can I fork the find and avoid file I/O problems?

Spawn different processes?

Thanks

Just to be clear, is it true that you want the output of the sed from the 1st find to be written to standard output (and not be included in the changes made to updated files), while all of the other finds run seds that update the files and write nothing to standard output?

Why not run all of the sed commands from the last 10 invocations of sed in a single invocation of find -execing sed?

And, why not use two -execs in a single invocation of find instead of invoking find eleven times?
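
To illustrate, a hedged and untested sketch of the two -exec approach (abridged; the remaining substitutions would be added as further -e options on the second gsed):

# One traversal of the tree, two -execs: the first handles line 1,
# the second applies the global substitutions in a single gsed call.
gfind . -depth -name "somefile" -type f -writable \
    -exec gsed -i '1{/^#./! s/.*/'"$Longstring"'/}' {} \; \
    -exec gsed -i -r -e 's/ts=4/ts=2/g' -e 's/sw=4/sw=2/g' -e 's/tab-width: 4/tab-width: 2/g' {} \;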

Thanks for the help.

All of the sed conditions together form, in effect, one large boolean OR.

Any particular file could contain any of the sed conditions, from one up to the full set, so sed needs to check each condition in the file before moving on to the next file.

Basically, the script has expanded over time and now it's getting to the point where I'd like to refactor it.

That is sort of the question: is it more efficient to let find search a massive number of files and let sed chew on one condition at a time? That's what it does now, and it is basically the unrolled form of your suggestion about concatenating the seds into two -exec commands.

Or, as you suggest, should find pause its search while sed grinds on one file, checking all the conditions at once?

Say the number of files to search averages ~100,000, with average sizes of ~40k to ~100k.

Thanks for the thoughts.

I think what Don is trying to tell you is: this command

gfind . -depth -name "somefile" -type f -writable

will find some list of files. Since it is repeated eleven times, it will find (and hence process) the same list of files eleven times.

So you could put all the changes from the different sed scripts into one sed script and write something like:

gfind . -depth -name "somefile" -type f -writable -exec gsed -i -f /some/where/script {} \;

where /some/where/script would contain

1 {
     /^#./! s/.*/"$Longstring"/
   }
s/ts=4/ts=2/g
s/sw=4/sw=2/g
s/tab-width: 4/tab-width: 2/g
....

I have to admit you would have to work a bit to get the variable "$Longstring" passed properly, but this minor issue aside, you should be a lot faster: you traverse the filesystem only once (instead of eleven times) and you call sed only once per file instead of eleven times.
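
One way to get "$Longstring" into the script (a minimal sketch, assuming $Longstring contains no characters special to sed such as /, & or newline) is to generate the script file at run time with a here-document, so the shell expands the variable before sed ever sees it:

# The unquoted EOF delimiter lets the shell expand $Longstring
# inside the here-document before the script is written out.
cat > /some/where/script <<EOF
1 {
   /^#./! s/.*/$Longstring/
}
s/ts=4/ts=2/g
s/sw=4/sw=2/g
EOF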

I hope this helps.

bakunin

@bakunin yes, this is exactly what i was looking to do.

Thank you.

P.S. How would you expand the script if, instead of a single "somefile", there were an array "somefiles=()"? Would you spawn off multiple gfind invocations?

Using bash, and depending on how big somefiles[] is (you don't want to blow out the command line):

 gfind . \( -false ${somefiles[@]/#/-o -name } \) -type f ...
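
As an aside (a sketch of my own, not from the thread): that expansion relies on word splitting, so it would break if a name in somefiles[] contained whitespace. Building the predicates in an array keeps each word intact:

# Builds: \( -false -o -name file1 -o -name file2 ... \)
preds=()
for f in "${somefiles[@]}"; do
    preds+=(-o -name "$f")
done
gfind . \( -false "${preds[@]}" \) -type f ...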

You can still have an embedded sed script.
All shells except (t)csh support multiline strings:

echo 'two
lines'

So the following should work

gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r '
    1{/^#./! s/.*/'"$Longstring"'/}
    s/ts=4/ts=2/g
... 
    s/^(python.versions.*)$/python.versions 27 36/g
' {} \;

@Don, that's not true: -i writes the output to the file, and it is given in all of the sed invocations.

Hi MadeInGermany,
Yes, you're correct. Neither sed -i nor sed -r is included in the standards, so I seldom use either of them. I confused the meanings of -i and -r in gsed. (I mistakenly thought gsed's -i performed case-insensitive pattern matching and -r created backups. :mad: )

On the BSD-based sed that I use on macOS, there is no -r option and the commands:

Longstring='lots of stuff'
sed -i '1{/^#./! s/.*/'"$Longstring"'/}' filename

would be a request to use 1{/^#./! s/.*/lots of stuff/} as the extension for backup filenames, use filename as an editing command, and process text read from standard input, producing the diagnostic message:

sed: 1: "filename": invalid command code f

Most of the other commands would fail to perform as expected on macOS (after translating gsed to sed) because the -r option that you want would be interpreted as the extension to be used on the backup files created (not as a request to use EREs instead of BREs when interpreting substitution search patterns).
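
For comparison, a hedged sketch of one of those substitutions written for the BSD sed on macOS, where -i takes a mandatory (possibly empty) backup extension and -E selects EREs:

# BSD sed: the empty '' means edit in place without keeping a backup.
sed -i '' -E 's/ts=4/ts=2/g' filename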

Looking more closely at the gsed man page and the 1st post in this thread, it would seem that the code presented there could be made to run much more quickly if that script was replaced by:

Longstring='lots of stuff'
spushd $HOME/somepath

gfind . -depth -name "somefile" -type f -writable -exec gsed -i -r '1{/^#./! s/.*/'"$Longstring"'/}
    s/ts=4/ts=2/g
    s/sw=4/sw=2/g
    s/tab-width: 4/tab-width: 2/g
    s/mode: tcl/mode: _tcl/g
    s/c-basic-offset: 4/c-basic-offset: 2/g
    s/^\s*(size.*)$/\1/g
    s/^\s*(md.*)$/\1/g
    s/^\s*(rmd.*)$/\1/g
    s/^\s*(sha.*)$/\1/g
    s/^(python.versions.*)$/python.versions 27 36/g' {} +

spopd

which would invoke find once and gsed just a few times (maybe only once, depending on the number of pathnames to be processed and the lengths of those pathnames) instead of invoking find 11 times and gsed 11 times for each pathname processed.
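
To make the \; versus + difference concrete (a quick illustration, using a hypothetical script file):

gfind . -name "somefile" -exec gsed -i -f script {} \;   # one gsed per file found
gfind . -name "somefile" -exec gsed -i -f script {} +    # a few gseds, many files each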

But, of course, since I don't have gsed on my system, this suggestion is totally untested.

Some sed versions (BSD?) have introduced the option -E, which does the same as -r in GNU sed.
IMHO, -E is more intuitive because grep uses it in the same way: use EREs instead of BREs.
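
For example (a sketch; recent GNU sed versions also accept -E as a synonym for -r, but older ones may not):

# The same ERE, two spellings:
gsed -r 's/(ts|sw)=4/\1=2/g' somefile   # GNU spelling
sed  -E 's/(ts|sw)=4/\1=2/g' somefile   # BSD spelling, matching grep -E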