I'm looking for advice on how to optimize this bash script. Currently I use the shotgun approach to avoid the file I/O and buffering problems of forked processes trying to write to the same file simultaneously. I'd like to keep this as a fairly portable bash script rather than writing a C routine.
In a nutshell, there are many conditions under which I'm looking to replace strings in a file. Any particular file may meet some, none, or all of the conditions for replacing a string.
Just to be clear, is it true that you want the output of the sed from the 1st find to be written to standard output (and not be included in the changes made to the updated files), while all of the other find invocations run seds that update the files and write nothing to standard output?
Why not combine all of the sed commands from the last 10 invocations into a single sed run by a single find -exec?
And why not use two -execs in a single invocation of find instead of invoking find eleven times?
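To illustrate the two--exec idea, here is a minimal, self-contained sketch (the directory, filenames, and patterns are made up for the demo, not taken from the original script). The first -exec reports matches to standard output; the second edits the file in place. Note that each -exec also acts as a test, so the in-place sed only runs on files where the grep succeeded:

```shell
# Illustrative demo: one find walk, two -exec actions per file.
mkdir -p /tmp/demo_twoexec && cd /tmp/demo_twoexec
printf 'foo\n' > a.txt
printf 'foo\n' > b.txt

# 1st -exec: print names of matching files to stdout (report only).
# 2nd -exec: edit the file in place (GNU sed -i; BSD sed needs -i '').
find . -type f -name '*.txt' \
    -exec grep -l 'foo' {} \; \
    -exec sed -i 's/foo/bar/' {} \;
```

This walks the filesystem once instead of once per sed expression.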
Any particular file could match any sed condition, 1..#conditions, so sed needs to check every condition in a file before moving on to the next file.
Basically, the script has expanded over time and now it's getting to the point where I'd like to refactor it.
That is sort of the question: is it more efficient to let find search a massive number of files and let sed chew on one condition at a time? That's what it does now, which is basically the unrolled-loop version of your suggestion to concatenate the seds into two -exec commands.
Or, as you suggest, should find pause its search while sed grinds on one file, checking all the conditions at once?
Say the average search covers ~100,000 files, average size ~40k to ~100k.
I have to admit you would have to work a bit to get the variable "$Longstring" passed properly, but this minor issue aside you should be a lot faster: you recurse the filesystem only once (instead of eleven times) and you call sed only once for each file instead of eleven times.
Hi MadeInGermany,
Yes, you're correct. Neither sed -i nor sed -r is included in the standards, so I seldom use either of them. I confused the meanings of -i and -r in gsed. (I mistakenly thought gsed's -i performed case-insensitive pattern matching and -r created backups.)
On the BSD-based sed that I use on macOS, there is no -r option, and the commands:
Longstring='lots of stuff'
sed -i '1{/^#./! s/.*/'"$Longstring"'/}' filename
would be a request to use 1{/^#./! s/.*/lots of stuff/} as the extension for backup filenames, use filename as an editing command, and process text read from standard input, producing the diagnostic message:
sed: 1: "filename": invalid command code f
Most of the other commands would fail to perform as expected on macOS (after translating gsed to sed) because the -r option that you want would be interpreted as the extension to append to the backup files created by -i, not as a request to use EREs instead of BREs when interpreting substitution search patterns.
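One way to sidestep the GNU/BSD -i incompatibility entirely is to skip -i and rename a temporary file instead, which is portable to any POSIX sed. A minimal sketch using the thread's example edit (the filename and the contents of "$Longstring" here are just placeholders):

```shell
# Portable in-place edit without -i: write to a temp file, then rename.
mkdir -p /tmp/demo_portable && cd /tmp/demo_portable
Longstring='lots of stuff'
printf 'old first line\nsecond\n' > filename

tmp=$(mktemp)
# Replace line 1 unless it starts with '#' + any character,
# same as the 1{/^#./! s/.*/.../} script discussed above.
# (Closing } on its own line keeps older seds happy.)
sed '1{/^#./!s/.*/'"$Longstring"'/
}' filename > "$tmp" && mv "$tmp" filename
```

The rename loses hard links and some file attributes, so it is a trade-off rather than a drop-in replacement for -i.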
Looking more closely at the gsed man page and the 1st post in this thread, it seems the code presented there could be made to run much more quickly if that script were replaced by:
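(The replacement script itself didn't survive in this quote. A sketch of the kind of combined invocation being described, with illustrative filenames and patterns rather than the poster's real ones, and plain sed standing in for gsed, might look like:)

```shell
# Illustrative sketch: one find walk, all substitutions in one sed.
mkdir -p /tmp/demo_onepass && cd /tmp/demo_onepass
printf 'alpha\ngamma\n' > one.conf
printf 'beta\n' > two.conf

# All expressions are given as multiple -e scripts to a single sed -i;
# -exec ... {} + batches many pathnames into each sed invocation.
find . -type f -name '*.conf' -exec sed -i \
    -e 's/alpha/ALPHA/' \
    -e 's/beta/BETA/' \
    -e 's/gamma/GAMMA/' {} +
```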
which would invoke find once and gsed just a few times (maybe only once, depending on the number of pathnames to be processed and the lengths of those pathnames), instead of invoking find 11 times and invoking gsed 11 times for each pathname processed.
But, of course, since I don't have gsed on my system, this suggestion is totally untested.
Some sed versions (BSD?) have introduced the option -E, which does the same as -r in GNU sed.
IMHO -E is more intuitive because grep uses it in the same way: use EREs instead of BREs.
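For what it's worth, current GNU sed accepts -E as well as -r, so -E is the more portable spelling across GNU and BSD. A quick demo of the ERE/BRE difference it controls:

```shell
# ERE (-E): + repetition works unescaped.
echo 'ab123cd' | sed -E 's/[0-9]+/NUM/'     # → abNUMcd

# BRE (default): the equivalent needs \{1,\}.
echo 'ab123cd' | sed 's/[0-9]\{1,\}/NUM/'   # → abNUMcd
```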