awk arrays can do this better - but how?

Hi,

I have spent the afternoon trawling Google, Unix.com and Unix in a Nutshell for information on how awk arrays work, and I'm not really getting too far.

I have a batch of code that I am pretty sure can be better managed using awk, but I'm not sure how to use awk arrays to do what I'm trying to do efficiently.

I have used 'find' to produce a list of last access times (in seconds since the epoch) and file sizes (in bytes):

find . -path './.snapshot' -prune -o -type f -exec stat -c "%X %s" {} \;

1227265466 5108
1224230970 1685
1225974287 6502
1224237225 105532
1224239208 4125
1225974287 6552
1224240021 1066
1225974287 8399

I now want to sum the value in bytes for different date ranges:

now=$(date +%s)

while read asecs bytes
do 
	# convert the age in seconds to 30-day months (2592000 s = 30 days)
	monthsago=$(( (now - asecs) / 2592000 ))
	
	if [ $monthsago -gt 36 ] ; then 
		
		col_4_bytes=$(( col_4_bytes + bytes ))
		col_4_count=$(( col_4_count + 1 ))
	
	elif [ $monthsago -gt 12 ] && [ $monthsago -le 36 ] ; then 
		
		col_3_bytes=$(( col_3_bytes + bytes ))
		col_3_count=$(( col_3_count + 1 ))
	
	elif [ $monthsago -gt 1 ] && [ $monthsago -le 12 ] ; then
	
		col_2_bytes=$(( col_2_bytes + bytes ))
		col_2_count=$(( col_2_count + 1 ))

	elif [ $monthsago -eq 0 ] || [ $monthsago -eq 1 ] ; then
	
		col_1_bytes=$(( col_1_bytes + bytes ))
		col_1_count=$(( col_1_count + 1 ))
	else
		# should not be possible (access time in the future)
		col_x_bytes=$(( col_x_bytes + bytes ))
		col_x_count=$(( col_x_count + 1 ))
	fi	

# redirect instead of "cat file | while" so the totals survive the loop
done < asecs_bytes.txt

I'm pretty sure I could use awk to read my file and output the information in a much better manner. Any suggestions?

I have thought about reading the file in, then evaluating each asecs value to determine whether it is in a certain range - but I think awk arrays can be much smarter than that.

I'll also need to convert the bytes into a more logical value (GB/MB/KB) at some point, but can do this in a separate step if necessary.

For completeness, my output is going to be emailed to users of a file system to inform them of how much data they have on different disk areas in different date ranges, like this:

<div class="tabletopBLUE" style="width:100%;">/full/toplevel/file/path/here/ - [username]</div> 
<div style="background-color:White; color:White; line-height:4px; width:100%;">_</div>
<div class="outline" >
<table class="cboxTXT1" width="100%" border="0" align="center" cellpadding="0px" cellspacing="0px">
 <tr style="font-weight: bold;">
  <td width="200" id="Rowheader" colspan="1" align=left>sub-directory</td>
  <td width="150" id="Rowheader" colspan="2" align=center>0 - 1 month</td>
  <td width="150" id="Rowheader" colspan="2" align=center>1 - 12 months</td>
  <td width="150" id="Rowheader" colspan="2" align=center>1 - 3 years</td>
  <td width="150" id="Rowheader" colspan="2" align=center>3 years +</td>
 </tr>
 <tr> 
 <td align=right>/</td>
  <td align=right></td><td align=left></td> 
  <td align=right></td><td align=left></td> 
  <td align=right></td><td align=left></td> 
  <td align=right></td><td align=left></td> 
 </tr>
 <tr> 
 <td align=right>/example_one - [username]</td>
  <td align=right>1602760</td><td align=left>1</td> 
  <td align=right></td><td align=left></td> 
  <td align=right>19141123</td><td align=left>72</td> 
  <td align=right></td><td align=left></td> 
 </tr>
 <tr> 
 <td align=right>/example_two - [username]</td>
  <td align=right></td><td align=left></td> 
  <td align=right>666854</td><td align=left>3</td> 
  <td align=right>27799028</td><td align=left>67</td> 
  <td align=right></td><td align=left></td> 
 </tr>
 <tr> 
 <td align=right>/example_three - [username]</td>
  <td align=right></td><td align=left></td> 
  <td align=right>485</td><td align=left>1</td> 
  <td align=right>249226085</td><td align=left>438</td> 
  <td align=right></td><td align=left></td> 
 </tr>
 <tr> 
 <td align=right>/example_four - [username]</td>
  <td align=right></td><td align=left></td> 
  <td align=right>130095309</td><td align=left>1</td> 
  <td align=right>74821761</td><td align=left>18</td> 
  <td align=right></td><td align=left></td> 
 </tr>
 <tr> 
 <td align=right>/example_five - [username]</td>
  <td align=right></td><td align=left></td> 
  <td align=right></td><td align=left></td> 
  <td align=right>2572753103</td><td align=left>73</td> 
  <td align=right></td><td align=left></td> 
 </tr>
   </table>
  </div>
 <div>

Cheers,
littleIdiot

Sorry for the late reply, but you should understand that Thanksgiving time (end of November) is a good time for lots of us to not be on the computer....

Yeah, though only slightly better:

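#!/usr/bin/awk -f
# Assumes "now" (current epoch seconds) is passed in, e.g.:
#   awk -v now="$(date +%s)" -f this_script asecs_bytes.txt
{
   # $1 = access time in seconds, $2 = size in bytes;
   # same 30-day month arithmetic as the shell loop
   monthsago = int((now - $1) / 2592000);
   if (monthsago <= 0)
     bucket=0;
   else if (monthsago <= 1)
     bucket=1;
   else if (monthsago <= 2)
     bucket=2;
   else if (monthsago <= 12)
     bucket=12;
   else if (monthsago <= 36)
     bucket=36;
   else 
     bucket="inf";

   sum[bucket]+=$2;
   count[bucket]++;
}
END {
   # for-in visits the buckets in no particular order, so print the label too
   for (bucket in sum) {
      print bucket,sum[bucket],count[bucket],sum[bucket]/count[bucket];
   }
}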
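
If you want exactly the four columns from your table, in a fixed order, a rough sketch could look like the following. The wrapper name bucket_summary and the col_N labels are just placeholders I made up, and the boundaries are copied from your shell loop (anything at or below 1 month lands in the first column).

```sh
# bucket_summary NOW [FILE...]  (hypothetical helper name)
# Reads "asecs bytes" lines and prints one "label bytes count" line per column.
bucket_summary() {
	now=$1; shift
	awk -v now="$now" '
	{
		# age in 30-day months, same arithmetic as the shell loop
		monthsago = int((now - $1) / 2592000)
		if      (monthsago <= 1)  b = 1   # 0 - 1 month
		else if (monthsago <= 12) b = 2   # 1 - 12 months
		else if (monthsago <= 36) b = 3   # 1 - 3 years
		else                      b = 4   # 3 years +
		sum[b] += $2
		count[b]++
	}
	END {
		# index the arrays with a counter so the order is fixed;
		# %.0f prints large byte totals in full even where %d is 32-bit
		for (b = 1; b <= 4; b++)
			printf "col_%d %.0f %d\n", b, sum[b], count[b]
	}' "$@"
}
```

You would call it as: bucket_summary "$(date +%s)" asecs_bytes.txt. Empty buckets come out as 0 0, which you can blank out when you build the HTML.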

Doing kb,mb,gb, etc is similar:

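BEGIN {
  # units[1] is empty (plain bytes), units[2] is "kb", and so on
  split(",kb,mb,gb,tb,pb,eb",units,","); 
}

END {
  # val is assumed to hold the byte total to convert, e.g. sum[bucket]
  magnitude=1;
  while (val >= 1024) { 
       val/=1024;
       magnitude++;
   }
   print val,units[magnitude];
}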
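
To try the unit conversion on its own, here is a minimal sketch wrapped in a shell function. The name human_size and the one-decimal output format are my own assumptions, and the loop stops at the last unit in the list so it cannot run off the end of the array.

```sh
# human_size BYTES  (hypothetical helper name)
# Prints BYTES scaled to the largest unit that keeps the value at or above 1.
human_size() {
	awk -v val="$1" '
	BEGIN {
		# units[1] is empty (plain bytes), units[2] is "kb", and so on
		n = split(",kb,mb,gb,tb,pb", units, ",")
		magnitude = 1
		# divide by 1024 until the value fits, or the units run out
		while (val >= 1024 && magnitude < n) {
			val /= 1024
			magnitude++
		}
		printf "%.1f%s\n", val, units[magnitude]
	}'
}
```

For example, human_size 1536 should give 1.5kb. Inside the bucket script you would do the same loop on sum[bucket] in the END block instead of calling out to the shell.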

Hope that helps.