Fastest way to calculate directory size

Hi expert,

Is there any fast way to calculate the size of a directory recursively? I have a total of 600 directories holding about 100,000 files, plus 10 directories with approximately 9,000,000 - 10,000,000 files each. I am currently using the command "du -k --max-depth=0" to get the size, but it is very slow: it has been running for 24 hours and is still not done. Is there any workaround, something like taking a snapshot of the folder's size and then adding the sizes of any files that get uploaded to that directory afterwards? Anyway, my goal is to calculate the size in the fastest way possible.

Thanks more power

df -h

will run much faster.

When you have gigantic directories, many kinds of filesystems perform very poorly. du reads information on a per-file basis, which in your case means millions of file reads (calls to stat()). df gets information stored in the kernel about whole filesystems: one read per filesystem.
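
If you want to see the difference for yourself, strace can count the system calls each command makes. This is just an illustration: the exact syscall names vary with kernel and libc, and both paths below are stand-ins for your own mount point and a small test directory.

# df asks the kernel for whole-filesystem totals: roughly one statfs()-type
# call per mounted filesystem, no matter how many files it holds.
strace -c df -h /some/filesystem

# du walks the tree and stats every entry; even on a small test directory the
# lstat()/fstatat() calls dominate the counts, and on your directories that
# turns into millions of calls.
strace -c du -sk /some/small/testdir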

At some point you should attempt to reorganize your directories so that you don't have what appears to me to be an unmanageable mess.

Any access to such a filesystem will tend to be slow because of the many necessary calls to stat(), not just du's reading of the (basically) inode structure. To minimize not only your problem at hand but all similar problems with this filesystem, I suggest you move the vital FS information (that is, the inodes and the like: all the metainformation) to some high-bandwidth storage, like an SSD.

I saw a similar problem (backup/restore of a huge GPFS with ~500TB of data) solved by introducing a 150GB SSD holding just the metadata. It reduced the necessary time from ~6 hours to ~90 minutes using the same hardware.

I hope this helps.

bakunin

I don't think the df command can get the total size per folder? Is there a way to run df on a folder?

thanks

Short answer for df: no.
A not-so-great answer using du:

You can run du in parallel. It is still going to take a very long time.
Assume all of your directories live on two mountpoints (directories): dira and dirb.

cnt=0
> /tmp/summary_sizes.txt                             # set the file to zero length
while read -r dirname
do
     du -s "$dirname" >> /tmp/summary_sizes.txt &    # run du in the background
     cnt=$(( cnt + 1 ))                              # count background processes
     [ $(( cnt % 15 )) -eq 0 ] && wait               # when 15 are active -- wait
done < <(find /dira /dirb -type d)                   # no pipe, so the final wait sees the background jobs
wait                                                 # wait for any leftover processes

15 is arbitrary. There may be so much I/O on your filesystem(s) that you need to lower that number. If there is little impact (see the output of iostat -zxnm 1 10) you may want to bump it up. Also, since you did not post the directory hierarchy I am guessing here, but the result of find may cause du to read the same directories multiple times, which impedes performance. For example:

/dira
     foo
          dir1
             subdir1
             subdir2
          dir2
     foo1 

So if you know the correct full names of all of the directories you want to monitor, put them in a file (call it dir.txt) like this:

/dira/foo/big1
/dira/foo/big2
/dira/foo2/big1
/dirb/foo/big2

so that du does each "endpoint" directory just one time. This may or may not be feasible.
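
If writing dir.txt by hand is not practical, here is one possible way to generate a list of "leaf" directories automatically. This is only a sketch: the -links 2 test relies on classic ext2/3/4-style link counting (a directory with no subdirectories has exactly two links) and does not hold on every filesystem, and files sitting directly in non-leaf directories will not be covered by the resulting list.

# List only directories that have no subdirectories, so each du -s covers a
# distinct subtree.  Check the result before relying on it.
find /dira /dirb -type d -links 2 > /tmp/dir.txt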

Change the above code to this:

cnt=0
> /tmp/summary_sizes.txt                             # set the file to zero length
while read -r dirname
do
     [ "$dirname" = "/dira" ] && continue            # skip high-level dirs
     [ "$dirname" = "/dirb" ] && continue
     du -s "$dirname" >> /tmp/summary_sizes.txt &    # run du in the background
     cnt=$(( cnt + 1 ))                              # count background processes
     [ $(( cnt % 15 )) -eq 0 ] && wait               # when 15 are active -- wait
done < /path/to/dir.txt
wait                                                 # wait for any leftover processes

Thanks for the response, but it is not working here; it is still slow.

Define "not working". Your directory structure is beyond awful, performance wise, so you will never get an answer to du in a reasonable time using standard UNIX tools. du reads directories as we explained earlier.

You would have to develop a fairly complex daemon to constantly monitor each of the huge directories and then store the output on a file system separate from the big directories. Or simply wait a very long time to get an answer using UNIX tools.
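
For what it is worth, here is a very rough sketch of that "snapshot plus delta" idea using inotifywait from the inotify-tools package. Everything in it is an assumption for illustration: the package has to be installed, the paths are stand-ins, only newly written files are added to the total (deletions are ignored), and setting up recursive watches on a huge tree is itself slow.

#!/bin/bash
# Sketch only: keep a running size total for one big directory.
WATCH_DIR=/dira/foo/big1                # stand-in for one of the huge directories
TOTAL_FILE=/var/tmp/big1_total_kb       # keep this on a separate filesystem

# One slow, one-time baseline.
du -sk "$WATCH_DIR" | awk '{print $1}' > "$TOTAL_FILE"

# Add the size of every file that finishes being written under WATCH_DIR.
inotifywait -m -r -e close_write --format '%w%f' "$WATCH_DIR" |
while read -r path
do
     sz=$(du -k "$path" 2>/dev/null | awk '{print $1}')
     [ -n "$sz" ] && echo $(( $(cat "$TOTAL_FILE") + $sz )) > "$TOTAL_FILE"
done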

If you do something about where the directory data lives, as bakunin suggested, things would get better. Not perfect.

What OS are you on? Maybe you can tune ufs or whatever filesystem you have.

CentOS 6.4, and my big disk is mounted from EMC.

Here is my directory structure, 20TB mounted from EMC:

/data
     images/  - 10M files
     txt/     - 500K files

I want to get the total size of the images and txt folders separately.

cnt=0
> /tmp/summary_sizes.txt   # set the file to zero length
while read dirname
do
[ "$dirname" = "/data/images" ]  && continue                # skip highlevel dirs 
[ "$dirname" = "/data/txt" ]  && continue      
du -s $dirname >> /tmp/summary_sizes.txt &   # run du in the background      
cnt=$(( $cnt + 1 ))                                          # count background processes      
[  $(( $cnt % 15 )) -eq 0 ]  && wait                   # when 15 active -- wait
done  < /tmp/dir.txt
wait                                
 

I tried your script but got no result.

No wonder: you have skipped exactly the directories you were interested in. You should skip "/data", the higher-level directory, as the comments suggested. It might help to actually try to understand what the script is doing before modifying it.
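
For the structure you posted, a sketch of the corrected pieces: /tmp/dir.txt lists the directories you actually want measured, and the skip test matches only their parent.

# /tmp/dir.txt should contain the directories whose sizes you want, e.g.:
#     /data/images
#     /data/txt

# and inside the loop, skip only the high-level directory:
[ "$dirname" = "/data" ]  && continue                # skip highlevel dir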

I hope this helps.

bakunin

Yes, you are right. It is working now; I will update you.