Need help optimizing this piece of code (Shell script Busybox)

I am looking for suggestions on how I could possibly optimized that piece of code where most of the time is spend on this script. In a nutshell this is a script that creates an xml file(s) based on certain criteria that will be used by a movie jukebox.

Example of data:

$SORTEDTMP= it is a file created using find that contain all the movies (200+) on my drive. ex:
/USB/movies/science-fiction/alien/alien.avi
/USB/movies/comedy/funnymovie/funnymovie.avi
....etc

$MOVIESPATH= string = "/USB/movies/science-fiction/"

Movieinfo = Just an xml file containing data about the movie; I only care about extracting the movie title if the file exist vs using the filename as the movie title which may or may not be accurate.

All the other jpg, bmp may or may not exist and appropriate action are taken based on their existence.

Currently the created xml file is being appended for every movies that match the grep MOVIESPATH which could include all the movies. This part of the code is called up many times to create an xml files for all the different movie category including "All movies". I am not sure if I can store all the data in a temporary place in memory and before I leave this loop write the file would be faster ????? I don't know and do not have much experience.

So I'll ask again does anyone sees anything there that could be optimized that would significantly increase the running speed of this piece of code.

 
grep "$MOVIESPATH" $SORTEDTMP | while read plik
do
  MovieName=`basename "$plik" | sed 's/\(.*\)\..*/\1/'`
  DirectoryPath=`dirname "$plik"`
  if [ -e "$DirectoryPath/MovieInfo.nfo" ];
  then
     MOVIEINFO="$DirectoryPath/MovieInfo.nfo"
     MOVIETITLE=`grep "<title>.*<.title>" "$MOVIEINFO" | sed -e "s/^.*<title/<title/" | cut -f2 -d">"| cut -f1 -d"<"`
  else
     MOVIETITLE=$MovieName
  fi
  if [ -e "$DirectoryPath/folder.jpg" ];
  then
     MOVIEPOSTER=$DirectoryPath/folder.jpg
  elif [ -e "$DirectoryPath/${MovieName}.jpg" ];
  then
     MOVIEPOSTER=$DirectoryPath/${MovieName}.jpg
  else  
     MOVIEPOSTER=/usr/local/etc/srjg/nofolder.bmp
  fi
 
  if [ -e "$DirectoryPath/about.jpg" ];
  then
     MOVIESHEET=$DirectoryPath/about.jpg
  elif [ -e "$DirectoryPath/0001.jpg" ];
  then
     MOVIESHEET=$DirectoryPath/0001.jpg
  elif [ -e "$DirectoryPath/${MovieName}_sheet.jpg" ];
  then
     MOVIESHEET=$DirectoryPath/${MovieName}_sheet.jpg
  else  
     MOVIESHEET=/usr/local/etc/srjg/NoMovieinfo.bmp
  fi
  echo -e '<Movie>
  <title>'$MOVIETITLE'</title>
  <poster>'$MOVIEPOSTER'</poster>
  <info>'$MOVIESHEET'</info>
  <file>'$plik'</file>
  </Movie>' >> $RSS 
done
 

Any help appreciated. Thank you.

The performance problem is not due to your redirection -- that's not ideal, but doesn't really HURT you that much.

It's mostly due to running grep | sed | awk | cut | kitchen | sink repeatedly...

For instance I might do something like

grep "$STRING" | awk ... | while read ITEM1 ITEM2 ITEM3 ; do ... ; done

Running one awk for the whole batch is much faster than running 10,000 awks for 10,000 lines. And you may be able to use awk for more than one thing here, fed into several variables a loop, which would help you do more things yet with one program. You could probably even move the grep inside awk.

You can also move your redirection out of the loop completely:

grep "$STRING" | awk ... | while read ITEM1 ITEM2 ITEM3 ; do echo something ; done >> appended_file

For any further processing needed, it's much more efficient to use shell builtins than awk | cut | sed | kitchen | sink. This being busybox and not bash the builtins you have available will be rather few, though, but it's still not impossible.

Unfortunately, as you have not posted your input data, I have no idea what data needs to be processed into what. What those XML files look like, and exactly what the output from that grep looks like, would be good.

Unfortunately this script need to run on a media player and awk is not available on it. Of course I could always copy it there but I rather use the default tools available on the media player because it makes it easier to distribute.

The input file for this loop as I mentioned on the first post is the list of all my movies on my hard drive created using the find command:

 
find "$MOVIESPATH" | egrep -i '\.(asf|avi|dat|divx|flv|img|iso|m1v|m2p|m2t|m2ts|m2v|m4v|mkv|mov|mp4|mpg|mts|qt|rm|rmp4|rmvb|tp|trp|ts|vob|wmv)$' | egrep -iv "$FILTER" > $TMP

Then sorted and saves as $SORTEDTMP

hence input is:

/USB/movies/science-fiction/alien/alien.avi
/USB/movies/comedy/funnymovie/funnymovie.avi
/USB/movies/science-fiction/Ironman/IronMan.avi
etc ....

grep output is based on $MOVIESPATH so if it is equal to "/USB/movies/science-fiction/" then two output will be produce by the grep command:

/USB/movies/science-fiction/alien/alien.avi
/USB/movies/science-fiction/Ironman/IronMan.avi

Thank you for your inputs.

Snappy46

:confused: busybox almost certainly comes with awk.

If there's not an explicit link made, you can use it with busybox awk. Try it and see. awk will make everything easier if you have it.

Lastly, you forgot to post data from your XML files. I can't do anything without seeing the data I'm supposed to transform, with or without awk.

You would think that awk would be available but unfortunately they are using an old version of busybox 1.1.3 that they recompile themself to be as small as possible since it is flash into an embeded device (media player) that only has 128 MB flash nand. Trust me it is not available by issuing busybox by itself it displays all the command that are available and awk is not one of them or trust me I would definitely use it; it would make my life a lot easier. :wink:

Input to the script $SORTEDTMP is not an XML files but a list of all the movies on my movie directory as explained in previous post.

The output is XML and can be deducted from the code. For example for the input of the previous post and $MOVIEPATH = "/USB/movies/science-fiction/" the output created by the code would be:

<Movie>
<title>alien</title>
<poster>/USB/movies/science-fiction/alien/folder.jpg</poster>
<info>/USB/movies/science-fiction/alien/about.jpg</info>
<file>/USB/movies/science-fiction/alien/alien.avi</file>
</Movie>
<Movie>
<title>IronMan</title>
<poster>/USB/movies/science-fiction/IronMan/folder.jpg</poster>
<info>/USB/movies/science-fiction/IronMan/about.jpg</info>
<file>/USB/movies/science-fiction/IronMan/IronMan.avi</file>
</Movie>
etc.......

For all the movies in the /USB/movies/science-fiction/directory.

I left out the xml header that is created before the loop and the footer which is created after the loop; but in case anyone is interested here it is:

 
echo -e '<?xml version="1.0" encoding="UTF-8"?>
<Jukebox>' > $RSS
 
echo -e "</Jukebox>" >> $RSS

As you can see all the <movie> elements are appended to the file $RSS = jukebox.xml on every loop iteration; I am not sure how much time consuming this is ...... anyway if all the data was sent to some buffer/array/variable and then create the file $RSS=jukebox.xml after the loop is done be more efficient versus a write file operation on each loop iteration ???? Would that even make it faster ???? If so what would be the best method to do that.

Any ideas on this specific subject ....

Thank you.

If none of the input is XML, then what is

MOVIETITLE=`grep "<title>.*<.title>" "$MOVIEINFO" | sed -e "s/^.*<title/<title/" | cut -f2 -d">"| cut -f1 -d"<"`

for?

Ah!!! You got me there; this XML file contain description for the movies ex: tiltle, synopsis, rating etc.... The only information I am looking for in that XML if it is available is the movie title I do not care about the rest of the information. In a lot of cases that XML file does not even exist and the movie title is derived from the basename filename with the .avi, .mpg etc.... removed

see this line: MovieName=`basename "$plik" | sed 's/\(.*\)\..*/\1/'`

Well at least here's one thing for improvement I could just do this:

 
if [ -e "$DirectoryPath/MovieInfo.nfo" ];
  then
     MOVIEINFO="$DirectoryPath/MovieInfo.nfo"
     MOVIETITLE=`grep "<title>.*<.title>" "$MOVIEINFO" | sed -e "s/^.*<title/<title/" | cut -f2 -d">"| cut -f1 -d"<"`
  else
     MOVIETITLE=`basename "$plik" | sed 's/\(.*\)\..*/\1/'`

Instead of assigning "Moviename" when not even required.

Just saved a "basename" and "sed" call if MovieInfo.nfo file do exist.

Cheers !!!

I'm assuming the XML data for that looks like what you posted before?

Working on something.

Yes, but extracting the title from the MovieInfo.nfo file is pretty fast it's all those if / test statement and writing to the file that seem to take a long time. For example if my drive contain approximately 200 movies (not really that much considering some people have thousand) divided in 9 movies genres the loop would be run like 10 times using the same list but different MOVIESPATH. For example

first run = All the movies 200 index to create XML file
Second run = genre science-fiction maybe 30 movies indexes to create XML file
Third run = genre romance maybe 20 movies indexes to create XML file
etc.....

Obviously the total of movies for the Second run to the last run (10) will include a total of 200 movies.

I hope I am making sense and thanks again for your help/time.

Here's a version tested in busybox which uses almost pure shell builtins:

MOVIESPATH="./moviedir/"
SORTEDTMP="./movie"
OLDIFS="$IFS"
RSS="rssfile"

# I just do this to have any info at all...
find ./moviedir -iname '*.avi' > "$SORTEDTMP"

grep "$MOVIESPATH" "$SORTEDTMP" | while read LINE
do
        MOVIEPATH="${LINE%/*}"  # Shell builtins instead of basename
        MOVIEFILE="${LINE##*/}" # Shell builtins instead of basedir

        if ! [ "$MOVIEPATH/$MOVIEFILE" = "$LINE" ]
        then
                echo "Error processing line" >&2
                continue
        fi

        # Initialize defaults, replace later
        MOVIETITLE="${MOVIEFILE/.*}"  # Strip off .ext
        MOVIESHEET=/usr/local/etc/srjg/NoMovieinfo.bmp
        MOVIEPOSTER=/usr/local/etc/srjg/nofolder.bmp

        if [ -e "$MOVIEPATH/MovieInfo.nfo" ]
        then
                # Look for lines matching <title>
                while read LINE
                do
                        # Strip out <title> to make it shorter.
                        SHORT="${LINE/<title>}"
                        # If it's not shorter, it didn't have <title>
                        [ "${#SHORT}" = "${#LINE}" ] && continue

                        LINE="${LINE//<title>}"  # Strip out <title>
                        LINE="${LINE//<?title>}" # Strip out </title>

                        MOVIETITLE="$LINE"
                        break   # Found <title>, quit looking
                done <"$MOVIEPATH/MovieInfo.nfo"
        fi

        # Check for any files of known purpose inside the movie's folder.
        for FILE in "$MOVIEPATH"/*
        do
                [ -e "$FILE" ] || break # No files exist?

                case "${FILE##*/}" in
                "folder.jpg")           MOVIEPOSTER="$FILE"     ;;
                "${MOVIENAME}.jpg")     MOVIEPOSTER="$FILE"     ;;
                "about.jpg")            MOVIESHEET="$FILE"      ;;
                "0001.jpg")             MOVIESHEET="$FILE"      ;;
                "${MOVIENAME}_sheet.jpg")       MOVIESHEET="$FILE" ;;
                *)      ;;
                esac
        done

        # Print it all in one whack with a here-document.
        cat <<EOF
<Movie>
<title>$MOVIETITLE</title>
<poster>$MOVIEPOSTER</poster>
<info>$MOVIESHEET</info>
<file>$MOVIEFILE</file>
</Movie>

EOF
        # Note:  OVERWRITES $RSS
done > $RSS

---------- Post updated at 02:02 PM ---------- Previous update was at 01:57 PM ----------

What sort of disk are you writing to?

Shell builtins are hundreds of times faster than calling an external utility to operate on tiny amounts of data.

I think I've reduced the number of if/else's by using a case, too.

Also, you were reopening $RSS dozens of times, which probably didn't help.

1 Like

That is awesome I did not expect anyone to do all the dirty work for me. Hopefully this script will work fine on the old default busybox available on the media player.

I forgot to mentioned on my previous post that the whole indexing process All movies + all genres (200 movies) takes about 3to4 minutes. Hopefully your way of doing things will cut that down some.

The jukebox.xml file created by the loop is normally stored/written to the media player (internal drive) or externally connected USB drive; it all depends where the movies are located. The jukebox.xml is store in the genre root directory or the movie directory for the All movies.

I can wait to try this out. I will post the results once I did.

Again thank you. :slight_smile:

---------- Post updated at 11:27 PM ---------- Previous update was at 04:55 PM ----------

Hi Corona,

Here's your script with some minor changes to make it work for me.

#MOVIESPATH="./moviedir/"
#SORTEDTMP="./movie"
#OLDIFS="$IFS"
#RSS="rssfile"

# I just do this to have any info at all...
#find ./moviedir -iname '*.avi' > "$SORTEDTMP"

grep "$MOVIESPATH" "$SORTEDTMP" | while read LINE
do
        MOVIEPATH="${LINE%/*}"  # Shell builtins instead of dirname
        MOVIEFILE="${LINE##*/}" # Shell builtins instead of basename
        MOVIENAME="${MOVIEFILE%.*}"  # Strip off .ext       

        if ! [ "$MOVIEPATH/$MOVIEFILE" = "$LINE" ]
        then
                echo "Error processing line" >&2
                continue
        fi

        # Initialize defaults, replace later


        MOVIETITLE="$MOVIENAME"
        MOVIESHEET=/usr/local/etc/srjg/NoMovieinfo.bmp
        MOVIEPOSTER=/usr/local/etc/srjg/nofolder.bmp

  if [ -e "$MOVIEPATH/MovieInfo.nfo" ];
  then
     MOVIEINFO="$MOVIEPATH/MovieInfo.nfo"
     MOVIETITLE=`grep "<title>.*<.title>" "$MOVIEINFO" | sed -e "s/^.*<title/<title/" | cut -f2 -d">"| cut -f1 -d"<"`
  fi

#        if [ -e "$MOVIEPATH/MovieInfo.nfo" ]
#        then
                # Look for lines matching <title>
#                while read LINE
#                do
                        # Strip out <title> to make it shorter.
#                        SHORT="${LINE/<title>}"
                        # If it's not shorter, it didn't have <title>
#                        [ "${#SHORT}" = "${#LINE}" ] && continue

#                        LINE="${LINE//<title>}"  # Strip out <title>
#                        LINE="${LINE//<?title>}" # Strip out </title>

#                        MOVIETITLE="$LINE"
#                        break   # Found <title>, quit looking
#                done <"$MOVIEPATH/MovieInfo.nfo"
#        fi

        # Check for any files of known purpose inside the movie's folder.
        for FILE in "$MOVIEPATH"/*
        do
                [ -e "$FILE" ] || break # No files exist?

                case "${FILE##*/}" in
                "folder.jpg")           MOVIEPOSTER="$FILE"     ;;
                "${MOVIENAME}.jpg")     MOVIEPOSTER="$FILE"     ;;
                "about.jpg")            MOVIESHEET="$FILE"      ;;
                "0001.jpg")             MOVIESHEET="$FILE"      ;;
                "${MOVIENAME}_sheet.jpg")       MOVIESHEET="$FILE" ;;
                *)      ;;
                esac
        done

        # Print it all in one whack with a here-document.
        cat <<EOF
<Movie>
<title>$MOVIETITLE</title>
<poster>$MOVIEPOSTER</poster>
<info>$MOVIESHEET</info>
<file>$MOVIEPATH/$MOVIEFILE</file>
</Movie>

EOF
        # Note:  OVERWRITES $RSS
done >> $RSS

I could not get the function to extract the title from the MovieInfo.nfo file to work so I just inserted the one I already had in my script just to test the difference between the two script. The results were very surprising to me to say the least.

The original script as provided in my first post took 1 minutes and 49 second to process my 200 movies. The new source code you provided took 2 minutes and 46 seconds almost a whole minute longer ???? I could not believe the results so I tried it again with the same results. It would appear that using sed/cut/grep etc to get the work done is faster than using the built-in substitution command ????? I was quite shock, something in there seem to take a lot of time to accomplish.

I still think that I can use some of your code process to cut down further the process time ... well maybe. I would think that creating the file only once would be faster than appending the file for every movie element. I will try to introduce some of those step you have one at a time and see what makes thing go faster and what make things go slower.

I learned a lot from your inputs/code so again thank you.

Snappy46

Throwing away nearly all the speed gain in the process :wall:

If you could post the literal contents of one of those files, that'd be good. I built it to work with your test data. If it's actually any different, then I really need to see what it is.

The built-in substitution operator is hundreds of times faster than calling an external utility. At the very least the builtins instead of basename and basedir ought to be better. The rest of my code may have depended on certain assumptions about your data.

Are these folders full of irrelevant files? If so, the for FILE in "$MOVIEPATH"/* loop will waste a lot of time. Come to think of it, since we're only interested in .jpg, you can make it for FILE in "$MOVIEPATH"/*.jpg

---------- Post updated at 09:29 AM ---------- Previous update was at 09:20 AM ----------

Here's a version which works without trawling every file in the folder. You can replace the long if-else chain with two for's. It also helps make the lists longer without making your code longer (though testing for too many things will slow you down in any case).

for FILE in "folder.jpg" "${MOVIENAME}.jpg"
do
        [ ! -e "$MOVIEPATH/$FILE" ] && continue
        MOVIEPOSTER="$MOVIEPATH/$FILE"
        break
done

for FILE in "about.jpg" "0001.jpg" "${MOVIENAME}_sheet.jpg"
do
        [ ! -e "$MOVIEPATH/$FILE" ] && continue
        MOVIESHEET="$MOVIEPATH/$FILE"
        break
done

---------- Post updated at 09:35 AM ---------- Previous update was at 09:29 AM ----------

And you can strip out this completely:

        if ! [ "$MOVIEPATH/$MOVIEFILE" = "$LINE" ]
        then
                echo "Error processing line" >&2
                continue
        fi

I just put it there in case your input data was radically different from what I assumed it was.

I know this is temporary I plan on figuring out why this does not work; I am getting "bad substitution" on the busybox version that I am using on those lines:

SHORT="${LINE/<title>}"
LINE="${LINE//<title>}" # Strip out <title>
etc....

Can not remember which ones exactely but more than one of those substitution caused a problem. I had the same issue with that line:

 
="${MOVIEFILE/.*}"  # Strip off .ext
 

which I fixed by changing to this:

 
="${MOVIEFILE%.*}"  # Strip off .ext
 

I did not have much time to play around with that last nights and was curious about the change in processing time so I jump right in with the old code. Apparently my wife's needs are more important than working on this script. :eek:

I will provide sample of the input files (SORTEDTMP, MovieInfo.nfo) and output file (jukebox.xml) hopefully tonight time permitting; I am not home right now and do not have access to those file. Agree built-in substitution should be faster.

Yeah there is quite a few files on there some relevant some not so much. I think you just nailed it why it takes so much more time. I will use those new "for loop" you provided and I have a feeling that I will finally see some speed increase compare to the old script.

Thanks again from one fellow Canadian to another.

Cheers!!!

---------- Post updated 10-06-11 at 12:17 AM ---------- Previous update was 10-05-11 at 01:51 PM ----------

Ok when using the for loop and deleting the unnecessary if statement as indicated in your previous post I now can process my 200 movies in 1 minutes and 10 seconds .... wow!! That is an improvement of about 40 sec compare to the original script. That makes more sense; I guess the case statement + the number of files were definitely slowing things down.

Now trying to get the procedure to pull the title from the MovieInfo.nfo file to work but I am still stuck on one thing that does not want to work. In my version of busybox we must use "#" for deletion from the left to match; and "%" for deletion from the right to match. Once I figure that out you would think that thing would have when pretty smoothly but of course it did not. I can easily delete the <title> but I am unsuccessful in deleting the </title> ... I think that the "/" forward slash is creating a problem ???? Funny thing is that it works fine in the interactive mode but refuses to work in the script ????

This works fine in the shell:

export foo="<title>hello</title>
echo "${foo%</title>}"
or
echo "${foo%%</title>}"
or
echo "${foo%<?title>}"
or
echo "${foo%<*title>}"

all those returns "<title>hello" but none of those combination will remove the </title> when used in the script ???? This is driving me crazy any ideas??????

But this works fine in the script to remove the <title>

LINE="${LINE#<title>}"  # Strip out <title> 

By the way I also removed that line since it is not really needed and just used SHORT for the Strip out </title>

Here's the procedure that works fine except for the strip put </title>.

       if [ -e "$MOVIEPATH/MovieInfo.nfo" ]
        then
                # Look for lines matching <title>
                while read LINE
                do
                        # Strip out <title> to make it shorter.
                        SHORT="${LINE#<title>}"
                        # If it's not shorter, it didn't have <title>
                        [ "${#SHORT}" = "${#LINE}" ] && continue

#                        LINE="${LINE#<title>}"  # Strip out <title>
                        LINE="${SHORT%%</title>}" # Strip out </title>

                        MOVIETITLE="$LINE"
                        break   # Found <title>, quit looking
                done <"$MOVIEPATH/MovieInfo.nfo"
        fi

Thanks

# strips from the beginning of the string, % from the end. ## and %% are the same when you're removing a literal string, so might as well use % and #. See string operations. Not all shells have all of these. Some don't have any of them. Developers on embedded busybox are more fortunate than some people using 'real' shells, it has at least a few :smiley:

VAR="${VAR#<title>}" # Strip out <title>
VAR="${VAR%</title>}" # Then strip out </title>

---------- Post updated at 03:17 PM ---------- Previous update was at 03:08 PM ----------

YOu could also try putting quotes around "${VAR%"</title>"}"

---------- Post updated at 03:20 PM ---------- Previous update was at 03:17 PM ----------

globbing should also work inside it, so:

VAR="${VAR#<[^>]*>}" # Remove first <...>
VAR="${VAR%<[^>]*>}" # Remove last <...>

---------- Post updated at 03:21 PM ---------- Previous update was at 03:20 PM ----------

Also: If it works in the prompt and not in your script, then your data may not be what you really think it is. You feed it a nice "<title>stuf</title>" and it works but the data actually has stuff after <title> perhaps. PLEASE post a sample of your input data!!

Hi Corona,

None of the combination you provided works with that stupid busybox version that reside on the media player to remove the </title> .... what a pain this is.

Anyhow here's some sample of input/output files and the full source code in case you are interested. The code is pretty horrific I am sure but I am not a programmer this is a hobby and I am still learning. I plan on implementing some of the new tricks I learned from you to some other part of the code to improve things further.

Here's the inputs files:

SORTEDTMP="sortedrss.list"

sortedrss.list
MovieInfo.nfo

Here's the output file (created with old/new version without the extra </title>)

jukebox.xml

And finally here's the full source code as it stands at this particular moment:

jukeboxscript

Thanks for your help/time

Snappy46

There's extra whitespace all over the place in there. <title>'s not necessarily at the beginning, and </title>'s not necessarily at the end. That messes up all of those string replacements.

How about this? It uses the set -- trick, abusing the special IFS variable to split a string into $1, $2, ... based on <> tag characters. It won't matter what's at the beginning or end.

STR="  <title>The Adjustment Bureau</title>   "

OLDIFS="$IFS" # Save for later
# Because of IFS, this will split into "   ", "title", "the adjustment bureau", ...
# and we feed those into set -- to save those into $1, $2, ...
# If you were using $1, $2, ... for commandline arguments, better save them
# somewhere else first because this will wipe them out.
#
# $STR must not have quotes around it!  We're depending on the
# shell's own argument splitting here.
IFS="<>" ; set -- $STR ; IFS="$OLDIFS"
# Get rid of the "   "
while [ "$#" -gt 0 ] && ! [ "$1" = "title" ] ; do shift ; done
# The title is now $2.
MOVIETITLE="$2"

Post Deleted ..... Brain fart!