Optimizing the Shell Script [Expert Advice Needed]

I have prepared a shell script that finds duplicates based on part of the filename and retains the latest copy.

    #!/bin/bash
    if [ ! -d dup ]; then
        mkdir -p dup
    fi
    NOW=$(date +"%F-%H:%M:%S")
    LOGFILE="purge_duplicate_log-$NOW.log"
    LOGTIME=`date "+%Y-%m-%d %H:%M:%S"`
    echo "$LOGFILE"
    echo "Started at $LOGTIME " >> $LOGFILE
    echo "Before File Count " >> $LOGFILE
    cd /tmp/sathish/GB/
    ls -l | wc -l >> $LOGFILE
    for i in `find /tmp/sathish/GB/ -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg | sed -r 's/[^ ]* //' | awk 'BEGIN{FS="_"}{if (++dup[$1] >= 2) print}'`;
    do
        if [ -z "$i" ]; then
            echo "No Duplicates Identified" >> $LOGFILE
        fi
        echo "$i" >> $LOGFILE
        mv -v $i dup
    done
    echo "Ended at $LOGTIME " >> $LOGFILE
    echo "After File Count " >> $LOGFILE
    cd /tmp/sathish/GB/
    ls -l | wc -l >> $LOGFILE

I recently tested this script on a test server.

    Time Taken	22 Min
    Before File Count	227874
    After File Count	58137
    Duplicates Moved to Dup folder	169737

I am unable to run this on the production server because it consumes too much CPU. Is there any way to optimize this script?

I would truly appreciate your expert advice on minimizing CPU usage during this process.

Suggestions are most welcome.

Without making any other changes, you can probably remove the sort (which I imagine is quite expensive; run it under strace to see), since your awk reads all the lines anyway and decides whether a duplicate has been found. Keep in mind, though, that the sort is what puts the newest file first, so dropping it changes which copy of each duplicate gets kept. You could also consider running the script with nice, or putting in sleeps, to reduce the CPU usage.
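
For example (assuming a Linux box, since the OP has not said which OS or shell this is, and assuming the script is saved as purge_duplicates.sh, a made-up name), the whole run can be started at low CPU and I/O priority:

    # "purge_duplicates.sh" is a placeholder name for the OP's script.
    # nice lowers the CPU priority; ionice -c3 (idle I/O class) is Linux/util-linux specific.
    nice -n 19 ionice -c3 ./purge_duplicates.sh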

You might also want to eliminate costly process creation by moving several files with a single mv command.
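
A minimal sketch of that, assuming GNU findutils/coreutils (xargs -r and mv -t are GNU options) and that the filenames contain no spaces or newlines, which the original loop already assumes:

    # Run from /tmp/sathish/GB/ so that dup/ is the right target directory.
    # xargs batches the filenames, so mv runs once per batch instead of once per file.
    find /tmp/sathish/GB/ -type f -iname '*.xml' -printf '%T@ %p\n' |
        sort -rg | sed 's/[^ ]* //' | awk -F_ 'dup[$1]++' |
        xargs -r mv -t dup/ --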

This is probably only a tiny difference, but if you change:

    awk 'BEGIN{FS="_"}{if (++dup[$1] >= 2) print}'

to:

    awk -F_ 'dup[$1]++'

it might consume slightly fewer CPU cycles.
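
Both variants print every line whose key (the part before the first "_") has already been seen once. A quick check with made-up sample names (the real naming pattern is only known to the OP):

    # Made-up sample names; the real key is whatever precedes the first "_".
    printf '%s\n' GB/INV100_02.xml GB/INV100_01.xml GB/INV200_01.xml | awk -F_ 'dup[$1]++'
    # prints: GB/INV100_01.xml   (the second, older file with key GB/INV100)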

Given this long running time, and considering that the LOGFILE is opened and closed for each $i, I would try to put everything into a single process. This means doing the majority of the work not in the shell, but in some other language. Ruby, Perl and Python all have equivalents of find and sort, so I would expect a noticeable speedup.
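
If the rewrite stays in shell for now, the per-iteration open-append-close can already be avoided by redirecting the whole loop once; a minimal sketch against the original loop (LOGFILE as defined earlier in the script):

    # The log is opened once for the whole loop, not once per echo/mv.
    for i in $(find /tmp/sathish/GB/ -type f -iname '*.xml' -printf '%T@ %p\n' |
               sort -rg | sed 's/[^ ]* //' | awk -F_ 'dup[$1]++')
    do
        echo "$i"
        mv -v "$i" dup
    done >> "$LOGFILE"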

Don't underestimate the power of the dark side, rovf :smiley:

The problem here lies in too much nesting and too many pipes, not in the language:

    for i in $( find | grep | sed | awk )

When you see shell code that looks like a bar code, something is fishy :slight_smile:

Replace that with ls and awk magic, and stuff should happen much faster than it does now.
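
One possible shape of that consolidation, as a sketch only (Peasant may have something leaner in mind): keep find and sort, but let a single awk do the work of the sed and the shell loop, and hand the result to one mv per batch, as in the xargs example above. Same caveat: filenames must not contain spaces or newlines.

    # Run from /tmp/sathish/GB/ so that dup/ is the right target directory.
    find . -type f -iname '*.xml' -printf '%T@ %p\n' |
        sort -rg -k 1,1 |
        awk '{
            sub(/^[^ ]+ /, "")                # drop the leading mtime (replaces the sed)
            key = $0; sub(/_.*/, "", key)     # key = everything before the first "_"
            if (seen[key]++) print            # print all but the first (newest) file per key
        }' |
        xargs -r mv -t dup/ --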

The OP should provide a representative example of the input data and the desired output to help further.
Also answer the common questions: which operating system and which shell?

Regards
Peasant.


If the long runtime is caused by the many files that find needs to traverse, then there is hardly anything that can be done.
But maybe it is due to misbehavior on a special character in a filename.
The following is a bit safer, and contains some further optimizations, like using a file descriptor for logging rather than an open-append-close for every message, sorting on key field 1 only, ...

    #!/bin/bash
    PATH=/bin:/usr/bin:/usr/sbin:/sbin
    NOW=$(date +"%F-%H:%M:%S")
    LOGFILE="purge_duplicate_log-$NOW.log"
    LOGTIME=$(date "+%Y-%m-%d %H:%M:%S")
    cd /tmp/sathish/GB/ || exit
    mkdir -p dup || exit
    echo "$LOGFILE"
    exec 3>>"$LOGFILE" # open it once, the shell will close it at exit
    echo "Started at $LOGTIME " >&3
    echo "Before File Count " >&3
    ls | wc -l >&3
    dups=$(find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg -k 1,1 | sed 's/[^ ]* //' | awk -F"_" 'dup[$1]++')
    if [ -z "$dups" ]
    then
      echo "No Duplicates Identified" >&3
    else
      set -f # no wildcard globbing, only word splitting
      for i in $dups
      do
        mv -vf "$i" dup/
      done >&3 2>&1
      set +f
    fi
    echo "Ended at $(date "+%Y-%m-%d %H:%M:%S") " >&3
    echo "After File Count " >&3
    ls | wc -l >&3
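
If a dry run is wanted first (my addition, not part of the script above), the same pipeline can be reused to merely count what would be moved:

    # Run from /tmp/sathish/GB/; counts the files the script would move, without touching them.
    find . -type f -iname '*.xml' -printf '%T@ %p\n' |
        sort -rg -k 1,1 | sed 's/[^ ]* //' | awk -F"_" 'dup[$1]++' | wc -l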