Fine Tune - Huge files/directory - Purging

Hi Experts,
I need your assistance in tuning one script. I have a mount point holding almost 4848008 files and 864739 directories. The script searches for files matching specific patterns and a specific age, then deletes them to free up space. The script is designed to run daily, but it takes around 3 complete days to finish, so the task of tuning it came to me.
Initially the script had 43 find commands to delete files and 5 find commands to delete the empty directories.

find ${PD} -type f -name '*(WEEK)*' -mtime +14 -exec rm {} \;
find ${PD} -type d -name '[0-9][0-9][0-9][0-9]*' -exec $rmdir {} \; > /dev/null 2>&1

I took two approaches to tune this script.
1) Combine all the search patterns for the same mtime to reduce the number of find commands, like below:

find ${PD} -type f \( -name '*(WEEK)*' -o -name '*(MON)*' -o -name '*(TUE)*' \
-o -name '*(WED)*' -o -name '*(THU)*' -o -name '*(FRI)*' -o -name '*(SAT)*' \
-o -name '*(SUN)*' -o -name '*(WEEKLY)*' \) -mtime +14 -exec rm {} \;

So I ended up with only 7 find commands for files and one command for directories. I think (I have not tested this approach yet) this reduces the searching and therefore the time too.
2) Since the first approach still runs -exec along with find, which I suspect takes more time, my second approach is to find the files I need to delete and then delete them with the loop below:

find ${PD} -type f \( -name '*(WEEK)*' -o -name '*(MON)*' -o -name '*(TUE)*' \
-o -name '*(WED)*' -o -name '*(THU)*' -o -name '*(FRI)*' -o -name '*(SAT)*' \
-o -name '*(SUN)*' -o -name '*(WEEKLY)*' \) -mtime +14 -print > remove.log
while IFS= read -r ENTRY
do
    if [ -f "$ENTRY" ]; then
        rm -f "$ENTRY"
    elif [ -d "$ENTRY" ]; then
        rmdir "$ENTRY"
    fi
done < remove.log

So, what I request is: please let me know the pros and cons of approaches 1 and 2. Also, please let me know whether find -exec takes more time or not.
Thanks
Senthil

Combining the search patterns into one find command is a good idea.
Storing the filenames in a file and then looping through the contents of the file is slower than using -exec, so unless you want to keep a log of what was deleted, it's redundant.

Faster than doing -exec would be piping the output of find to xargs(1) like this:

find $PD <all options you need> | xargs rm

which would call rm only once for many files, as opposed to -exec, which will invoke rm for every file.

Calling find on a mount point is not ideal -- if at all possible, I'd recommend running the same find command on the machine that physically contains the filesystem.


I would still go with option 1. However, if the list of files to be removed is very large, you might get an error: that's because it might exceed the maximum argument length that can be passed to the rm command.

-exec won't create any problem. It's as good as running the rm command directly.
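As a side note, the argument-length limit mentioned above can be queried directly; here is a minimal sketch (assuming a POSIX system that provides getconf):

```shell
# ARG_MAX is the kernel's ceiling on the combined size of the argument
# list and environment passed to exec(). A glob or $(...) expansion that
# exceeds it fails with "Argument list too long"; xargs and -exec ... +
# avoid this by splitting the file list into appropriately sized batches.
limit=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $limit bytes"
```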


@mirni/vidhyadhar
Since I'm combining the find commands by mtime (it now comes down to only 7 mtime values), the removal list won't get big, and I am removing it each and every time. Also, is it fine if I use xargs at the end, like this?

 
find ${PD} -type f \( -name '*(WEEK)*' -o -name '*(MON)*' -o -name '*(TUE)*' \
    -o -name '*(WED)*' -o -name '*(THU)*' -o -name '*(FRI)*' -o -name '*(SAT)*' \
    -o -name '*(SUN)*' -o -name '*(WEEKLY)*' \) -mtime +14 -print | xargs rm

Also, I'm running the script on the machine where the mount is physically attached.

That looks good. No need for the -print switch, but it shouldn't influence performance.

You misunderstood. If the directory tree is on machine A's hard drive, and it's mounted on machine B's /mnt, running

find /mnt

on machine B is much slower than running

find /dirThatsExported

directly on machine A (e.g. through ssh).

Can you please replace this line: -o -name '*(MON)*' with the code below:

-o -name '*(MON|TUE|WED|THU|FRI|SAT|SUN)*'

Hope this works for you :o:o


I think having xargs in the command will add overhead to the tuning: it first collects the filenames in a buffer and then removes them, whereas the direct command keeps removing each file/dir as soon as find finds it.


@Mirni,
mann2719 states that xargs will delay the process, so shall I use the -exec flag instead?
@mann2719,
For mtime +14 I have 9 search patterns. Shall I combine them into one, like

-name '*(MON|TUE|WED|THU|FRI|SAT|SUN|WEEK|WEEKLY)*' 

..? The files will have names like

2011_(MON)(RERUN).CSV

Also, if find has more than one search pattern, will it scan the tree once per pattern, or scan once and check all the patterns?

Yes, just try all nine in a single row and let us know the output.

mann1279,

The output is not coming out as expected: it's using a lot of CPU, and the pattern contains the () characters.
Is *(MON)* , *(TUE)* equal to *(MON|TUE)* ?
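(Editor's note: find -name takes shell glob patterns, not regular expressions, so | alternation is not understood. A small sketch in a scratch directory makes this visible:)

```shell
# find -name matches shell glob patterns (fnmatch), not regular
# expressions, so '|' inside the pattern is just a literal character.
dir=$(mktemp -d)
touch "$dir/2011_(MON)(RERUN).CSV"
find "$dir" -name '*(MON)*' | wc -l       # 1 -- the glob matches
find "$dir" -name '*(MON|TUE)*' | wc -l   # 0 -- '|' is literal, no match
rm -rf "$dir"
```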

xargs will not burden anything. The difference between -exec and | xargs can be significant if there are a lot of arguments, with xargs being the winner. I already said this in the previous post, but let me reiterate in more detail:

 find . -exec rm {} \;

will fork() a process for each file. If find returns a million files, you will end up with a million rm commands. This is much more expensive than doing

 find . | xargs rm

because this construct will run rm only once for many files; how many it passes per invocation depends on your system -- the limit is defined in limits.h.
Try it for yourself if you don't believe me:

$ ls | wc
  37883   37883  367719
$ time find . -maxdepth 1 -type f -exec cp {} dump \;

real    1m16.008s
user    2m0.508s
sys    0m37.818s

$ time find . -maxdepth 1 -type f | xargs cp -t dump

real    0m1.197s
user    0m0.268s
sys    0m0.712s
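The batching itself is easy to observe; here is a small sketch using seq and an artificially small batch size (-n), so the effect shows even with only a few arguments:

```shell
# Ask xargs to pass at most 4 arguments per echo invocation.
# 10 inputs -> echo runs 3 times (4 + 4 + 2), printing 3 lines.
# Without -n, xargs batches automatically up to the ARG_MAX limit.
seq 1 10 | xargs -n 4 echo
```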

Newer versions of find also provide a '+' form of -exec, which basically does the same as xargs -- it feeds the command as many arguments as it can:

 $ time find . -maxdepth 1 -type f -exec cp -t dump {} +

real    0m1.050s
user    0m0.256s
sys    0m0.660s

No it won't, because find and xargs can run simultaneously as a pipeline -- whereas find with -exec cannot.

For find performance with -exec you can also use

-exec rm -f {} \+

instead of

-exec rm -f {} \;

With + you will achieve performance similar to xargs, with minimal code change.

Thanks to all. I've decided to go with xargs and am now testing the old and the modified scripts.

Old script took - 3 days 9 hours 4 minutes 0 seconds

I've just now triggered the new script and will post the findings soon.

The new script takes only 10 hours 35 minutes 23 seconds. But it constantly uses 15-21% of CPU, whereas earlier it was less than 10%, so any thoughts on this? I'm still testing with -exec + and will post the final result to all.

To all,

The script throws some errors because it is unable to delete files with spaces in their names, so I searched the site and found one solution for that.

Now I am using

 
xargs -I{} rm {}

Is there any performance degradation, or can files get skipped?

I usually use

find $DIR -print0 | xargs -0 command

construct, which uses NUL to separate the arguments, so whitespace is no longer special. You can look into that, if it's supported on your system. I recommend doing a benchmark yourself, since it may vary from system to system, but I wouldn't expect to see huge differences in performance.


-print0 is not working on my system:

 find: 0652-017 -print0 is not a valid option.

Thanks

You only get -print0 with GNU find/xargs.

The -I{} solution ought to work with the same performance unless you have a newline in a filename. If you do, only -print0 can handle that without a hitch.
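A quick sanity check that -I{} copes with spaces (a sketch; with -I, xargs treats each whole input line as a single argument):

```shell
# With -I, each whole line replaces {} -- embedded spaces are preserved,
# so rm receives one argument per file instead of fragments.
dir=$(mktemp -d)
touch "$dir/a b.txt"
find "$dir" -type f | xargs -I{} rm {}
ls "$dir"    # empty: the file was removed despite the space
rm -rf "$dir"
```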

Hi admin,
When I used the above xargs command ("xargs -I{} rm {}"), I got the error below:

xargs: Missing quote:

When I checked, some file names contain double quotes and some contain single quotes. How do I overcome this? Should I go back to the -exec rm command with the + symbol?
Thanks
Sample file names:

 
./DÉPENSES PAR INDUS.    THALES GROUP "EURO"-THALES (CYCLIQUE).HTML
./PRINCIPAUX FOURNI'S.    THALES (CYCLIQUE).HTML

 ... | xargs -d '\n' rm

should work as long as your filenames don't contain newlines!
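For instance (a sketch, assuming GNU xargs, since -d is a GNU extension):

```shell
# -d '\n' splits input only on newlines, so quotes, spaces and
# parentheses in filenames pass through to rm untouched.
dir=$(mktemp -d)
touch "$dir/THALES \"EURO\" (CYCLIQUE).HTML"
find "$dir" -type f | xargs -d '\n' rm
ls "$dir"    # empty: the quoted filename was handled fine
rm -rf "$dir"
```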