Detailed disk usage versus age summary

Hi,

I'm posting my question here as I feel that what I am about to try to do must have been done already, and I don't want to re-invent the wheel.

I have recently become responsible for monitoring disk space usage for a large file system.

I would like to generate reports that summarise the amount of disk space used by directories at a certain level, grouped into date ranges.

e.g. results

Last modified : file path          : total
 0 - 1 months : /foo/foo_01/bar_01 : 101 GB
                /foo/foo_01/bar_02 :  98 GB
                /foo/foo_02/bar_03 : 202 GB
                /bar/bar_01/etc    : 203 GB
 1 - 6 months : /foo/foo_01/bar_04 : 405 GB
                /bar/bar_02/etc    : 203 GB
                /bar/bar_03/etc    : 203 GB
6 - 12 months : /bar/bar_03/tmp    :  20 GB
                /bar/bar_01/tmp    :  22 GB
12 months +   : /bar/bar_02/tmp    : 203 GB

I hope that gives some idea of what I am trying to achieve. Basically, I want to highlight large areas of the filesystem that can be archived off because they have not been accessed for some time.

If anyone can point me towards any scripts already written that would do this or something I can modify to do it I would appreciate it.

At the moment I am looking at starting from scratch, which I'd enjoy, but it will take some time.

I cannot install any software - it must be script based.

Thanks for any tips/advice! :o

So normally you can just use "du -sh $DIR" to get the summary information. The trick is figuring out which ones you want to sum. What does "Last modified" really mean? Does it refer to the directory itself (which means any change to any filename)? Or does it mean a file in that directory? If so, does it mean the oldest modified or newest modified?

Thanks for your reply. I've worked on this a fair bit yesterday and got a lot further than I thought I would.

I am using the find command to look through all files and directories, look at the modified time (-mtime) and report back all files that are modified between set time frames ... so last week, 1 to 4 weeks ago, 1 to 6 months ago etc ... all the way up to over a year ago.

find . -path './.snapshot' -prune -o  -type f -mtime -8 -ls

I have then piped the output to awk to sum the number of bytes and the number of files:

 | awk '{bytes += $7; count++} END {print bytes, count}'

I am running this as 2 loops - so that I get all the subdirectories in the top level as supplied at the command line.
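The two-loop approach described above might look roughly like the sketch below. The function names (sum_files, age_report) are made up for illustration, and the bucket edges in days (7, 28, 180, 365) are an assumption about where the column boundaries fall. It relies on the size being field 7 of the "find -ls" output, as in the awk command above.

```shell
#!/bin/sh
# Sketch only: sum_files and age_report are hypothetical names, and the
# bucket edges (7, 28, 180, 365 days) are assumptions.

# Print "bytes count" for regular files under $1 that match the extra
# find tests given in the remaining arguments.
sum_files() {
    dir=$1; shift
    find "$dir" -path "$dir/.snapshot" -prune -o -type f "$@" -ls |
        awk '{bytes += $7; count++} END {print bytes + 0, count + 0}'
}

# For each subdirectory of $1, print one line per age bucket.
age_report() {
    top=$1
    for sub in "$top"/*/; do
        [ -d "$sub" ] || continue
        sub=${sub%/}
        printf '%s\n' "$sub"
        printf '  %-14s %s\n' "0 - 7 days"    "$(sum_files "$sub" -mtime -8)"
        printf '  %-14s %s\n' "1 - 4 weeks"   "$(sum_files "$sub" -mtime +7 -mtime -29)"
        printf '  %-14s %s\n' "1 - 6 months"  "$(sum_files "$sub" -mtime +28 -mtime -181)"
        printf '  %-14s %s\n' "7 - 12 months" "$(sum_files "$sub" -mtime +180 -mtime -366)"
        printf '  %-14s %s\n' "1 year +"      "$(sum_files "$sub" -mtime +365)"
    done
}

# Usage (not run here): age_report /foo
```

Each bucket pairs a "+N" test with a "-M" test so that the ranges do not overlap: "-mtime +7" means more than 7 whole days old, and "-mtime -29" means fewer than 29. This walks each subdirectory five times, which is simple but slow on a large tree.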

I am passing all output through several echo commands to output in html format so I can put the output in a table.
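As a rough illustration of the echo step, one table row could be produced by a small helper like the one below. The name html_row is hypothetical, and filling empty cells with &nbsp; is an assumption, not necessarily what the actual script does.

```shell
#!/bin/sh
# Sketch only: html_row is a hypothetical helper.
# $1 is the directory label; every further argument becomes one cell.
# Empty arguments are written as &nbsp; (an assumption) so the cell
# still renders with a border.
html_row() {
    label=$1; shift
    printf '<tr align="right"><td align="right">%s</td>\n' "$label"
    for cell in "$@"; do
        printf '<td>%s</td>' "${cell:-&nbsp;}"
    done
    printf ' </tr>\n'
}

# Example: two empty cells followed by a byte total and a file count.
html_row /libra "" "" 30312 18
```

The label is written as-is, so a directory name containing "<" or "&" would need escaping first.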

Sample output so far:

<table width="960" border="1" bordercolor="gray" align="center" cellpadding="0" cellspacing="0">
<tr align="center"><td align="left" width="160">foobar</td>
<td colspan="2" width="160">0 - 7 days</td>
<td colspan="2" width="160">1 - 4 weeks</td>
<td colspan="2" width="160">1 - 6 months</td>
<td colspan="2" width="160">7 - 12 months</td>
<td colspan="2" width="135">1 year +</td></tr>
<tr align="right"><td align="right">/basecase</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>187307</td><td>2</td> <td>�</td><td>�</td> <td>160477762</td><td>132</td> </tr>
<tr align="right"><td align="right">/cbabble</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>120476297</td><td>82</td> <td>�</td><td>�</td> <td>�</td><td>�</td> </tr>
<tr align="right"><td align="right">/hi_STOIIP</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>5561429</td><td>15</td> </tr>
<tr align="right"><td align="right">/libra</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>30312</td><td>18</td> </tr>
<tr align="right"><td align="right">/lowestcase</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>17828811</td><td>26</td> </tr>
<tr align="right"><td align="right">/region</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>108363878</td><td>105</td> </tr>
<tr align="right"><td align="right">/with_XYZ</td>
<td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>�</td><td>�</td> <td>35384975</td><td>43</td> </tr>
</table>

I will have a table as above for each directory in the file path supplied at the command line - one table after another.

It allows me to see which directories have not been modified for a long time - so in the above example I could possibly archive off the last 5 directories as they have not been modified in the past year.

Does that make sense?

I wonder if my find command is good enough?
Is -mtime reliable?
Should I use -atime?

(Sorry for the wide page!)

Uh, that's what I was going to do, except I would use -printf "%s %p\n" instead of -ls. Use mtime. atime is for access time, which you don't really want, do you? Maybe you do... maybe you want when the file was last used, not just modified.
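For reference, a minimal sketch of the -printf variant, assuming GNU find (-printf is a GNU extension and is not in POSIX find). The function name total_by_age is made up for illustration.

```shell
#!/bin/sh
# Sketch only, assuming GNU find. total_by_age is a hypothetical name.
# $1 is the directory; remaining arguments are extra find tests such as
# "-mtime +365" or "-atime +365".
total_by_age() {
    dir=$1; shift
    # %s = size in bytes, %p = path. The size is always field 1, even
    # when the path contains spaces.
    find "$dir" -path "$dir/.snapshot" -prune -o -type f "$@" -printf '%s %p\n' |
        awk '{bytes += $1; count++}
             END {printf "%d bytes in %d files\n", bytes, count}'
}

# Usage (not run here):
#   total_by_age /foo -mtime +365    # not modified for over a year
#   total_by_age /foo -atime +365    # not read for over a year
```

Putting the size first means awk never has to count past user or group columns, which is where parsing the -ls output can go wrong. A file name containing a newline would still be counted twice.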

Hmm - I'm not sure. I'm looking at a large file system with many users. They use the system for generating files but also some users just access files that are used by different software packages to run large processing jobs.

So I imagine that there will be some files that are accessed but not modified - i.e. read only. I was worried about using the -atime as I think find itself changes the -atime by looking at it - doesn't it? I thought I saw that on a man page, but can't see it just now.

No, find with -atime doesn't change the access time of the files, because all it does is a "stat" on each one. (But find will change the atime of any directory it reads.)

Thanks for your help on this - much appreciated! Maybe I'll post my final script here (is that the done thing/possible?)

So as long as I restrict my find command to files only, then maybe I should be using -atime to pick up when files were last accessed rather than last modified.

I assume files will never have a 'younger' modified time than accessed time?
Logic would tell me no, but my logic and unix logic are not always compatible ;o)

Also, I'm running this as root but still getting "Permission denied" errors. I've had this before - something to do with my root access only being a semi-root access, via LRAM. I think it is down to the group permissions of my user.

I'll need to catch these errors somehow.

It's possible to change mtime without touching atime. The "touch" command can do this. You're encouraged to post your script here. You might want to license it via LGPL or CC or something.
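A quick way to see this for yourself is sketched below. It assumes GNU coreutils, since "stat -c" is GNU-specific; the function name and the two dates are arbitrary choices for the demonstration.

```shell
#!/bin/sh
# Sketch only, assuming GNU coreutils (stat -c is not portable).
# demo_mtime_newer is a hypothetical name; the dates are arbitrary.
demo_mtime_newer() {
    f=$(mktemp)
    touch -a -t 202001010000 "$f"   # set only the access time (1 Jan 2020)
    touch -m -t 202101010000 "$f"   # set only the modification time (1 Jan 2021)
    atime=$(stat -c %X "$f")        # access time, seconds since the epoch
    mtime=$(stat -c %Y "$f")        # modification time, seconds since the epoch
    rm -f "$f"
    if [ "$mtime" -gt "$atime" ]; then
        echo "mtime is newer than atime"
    else
        echo "mtime is not newer than atime"
    fi
}

demo_mtime_newer
```

So a report based on -atime alone could class a file as old even though it was modified recently, which is a reason to check both times before archiving anything.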

About the Permission denied errors: Redirect stderr to an error file. It could be that you have an NFS-mounted directory that is exported with root squashing (root_squash) enabled. It's also possible you have a corrupted filesystem.
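Redirecting stderr could be done along these lines. This is a sketch only: the function name scan_logged and the log file path are made up, and the awk part just repeats the byte and file count from earlier in the thread.

```shell
#!/bin/sh
# Sketch only: scan_logged is a hypothetical name.
# $1 is the directory to scan, $2 is the file that collects find's errors.
scan_logged() {
    dir=$1
    errlog=$2
    # 2>"$errlog" sends "Permission denied" and similar messages to the
    # log file, so only the file listing reaches awk.
    find "$dir" -type f -ls 2>"$errlog" |
        awk '{bytes += $7; count++} END {print bytes + 0, count + 0}'
    # -s is true when the log exists and is not empty.
    if [ -s "$errlog" ]; then
        echo "find reported problems; see $errlog" >&2
    fi
}

# Usage (not run here): scan_logged /foo /tmp/find_errors.log
```

Keep the log file outside the directory being scanned, otherwise it gets counted as one of the files. The totals then cover only what find could actually read, so a non-empty log means the numbers are an underestimate.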

Hmm - ok. I'll look into how to do that (re LGPL / CC) and then post what I have so far.

But right now it's time to go home and open a bottle of red ... thanks for all your help.