Need Help with awk and arrays

fusionX · February 10, 2008, 3:36am

Now it's working - thanks for the help, all.

vgersh99 · February 10, 2008, 7:03am

What exactly have you tried so far?

otheus · February 10, 2008, 7:48am

Arrays in awk (and PHP) are purely associative, so you can say:

ip[$1]++;
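
As a minimal sketch of that idiom, the snippet below counts hits per client address. The three log lines are invented for illustration, and the assumption that the client IP is field 1 is taken from the `ip[$1]++` usage above, not from the real log file.

```shell
# Sample access-log lines (made up for illustration); field 1 is the client IP.
log='10.0.0.1 - - [10/Feb/2008:03:36:00 -0500] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [10/Feb/2008:03:37:10 -0500] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [10/Feb/2008:04:01:45 -0500] "GET /b HTTP/1.1" 404 0'

# ip[$1]++ creates the element on first use and increments it on every later record.
ip_counts=$(printf '%s\n' "$log" | awk '
    { ip[$1]++ }
    END { for (addr in ip) print addr, ip[addr] }
' | sort)

printf '%s\n' "$ip_counts"
```

The `for (addr in ip)` loop visits keys in no particular order, which is why the output is piped through `sort`.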

Getting the busiest date and hour pretty much requires Perl, but it's possible in gawk and maybe nawk. You'll need the mktime function at the least, which is a gawk extension rather than part of POSIX awk. It's ugly: you need to populate the full months table and do some additional parsing.

BEGIN { m["Jan"]="01"; m["Feb"]="02"; } # and so on for all months
{ 

# split the time field into runs of digits and letters:
# lt[2]=day, lt[3]=month name, lt[4]=year, lt[5], lt[6], lt[7]=hour, minute, second
split($4,lt,"[^0-9a-zA-Z-]+");
# convert to a Unix timestamp (mktime is a gawk extension)
ts=mktime( lt[4] " " m[lt[3]] " " lt[2]  " " lt[5]  " " lt[6]  " " lt[7]  " " lt[8]);
# You don't care about minutes and seconds, so just replace lt[6] and lt[7] with 0's.
# You could make *two* timestamps: one for just the days (hours 0'd out) and
# another for just the hours (always Jan-1-1970, but with the hour filled in).

# bump count for this ip address
ip[$1]++; 
# bump count for this timestamp
day[ts]++;
}

END { 
  # find busiest day
  frequency=-1; busiest=-1;
  for (d in day) {
   if (day[d] > frequency) {
      frequency=day[d];
      busiest=d;
   }
  }
  print "busiest day: " busiest " hit " frequency " times";
  
}
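
If only the busiest calendar day and the busiest hour of the day are needed, plain string keys may be enough. This is a different technique from the mktime approach above, not a version of it. The sketch assumes the timestamp field looks like `[10/Feb/2008:03:36:00` (inferred from the indices used in this thread), and the sample lines are invented.

```shell
# Sample lines (made up); field 4 is assumed to look like [DD/Mon/YYYY:HH:MM:SS
log='10.0.0.1 - - [10/Feb/2008:03:36:00 -0500] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [10/Feb/2008:03:37:10 -0500] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [10/Feb/2008:04:01:45 -0500] "GET /b HTTP/1.1" 404 0
10.0.0.3 - - [11/Feb/2008:03:15:00 -0500] "GET / HTTP/1.1" 200 512'

busiest=$(printf '%s\n' "$log" | awk '
    {
        day[substr($4, 2, 11)]++     # key such as 10/Feb/2008
        hour[substr($4, 14, 2)]++    # key such as 03, summed over all days
    }
    END {
        best = -1
        for (d in day) if (day[d] > best) { best = day[d]; top = d }
        print "busiest day: " top " hit " best " times"
        best = -1
        for (h in hour) if (hour[h] > best) { best = hour[h]; top = h }
        print "busiest hour: " top " hit " best " times"
    }
')

printf '%s\n' "$busiest"
```

This runs in any POSIX awk, but it only works while the field layout is fixed. When two days or hours tie, which one is reported depends on the array traversal order.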

fusionX · February 10, 2008, 4:44pm

Done - thanks for the help!

fusionX · February 10, 2008, 4:45pm

check out what I had tried....

fusionX · February 10, 2008, 5:11pm

La la la la la la - it's working now.

otheus · February 11, 2008, 5:31pm

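First, finish filling out the months table in the BEGIN block. Second, the time zone doesn't get registered correctly: the offset is in $5, so lt[8] is empty. Replace lt[8] with $5, but be aware that gawk's mktime reads an optional seventh value as a daylight-saving flag, not a UTC offset, and interprets the date in the local time zone, so this will not shift the timestamp by the logged offset. Third, feel free to insert "print" statements to get some debugging output. I saw only a few lines of the log file, and I tested my code only against those lines.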

otheus · February 11, 2008, 5:41pm

This doesn't work as you expect. $1 stays the same. It's the same as doing:

 numIP[$1] = NR;

So each element in the array would be filled with the latest record number for that IP. I'd expect the last IP address to have the highest count.
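
To make the difference concrete, here is a small comparison of assigning NR against incrementing. The array names `last_seen` and `hits` and the three input lines are hypothetical, chosen only for this sketch.

```shell
# Three made-up records; the first and third come from the same address.
result=$(printf '%s\n' '10.0.0.1 first' '10.0.0.2 second' '10.0.0.1 third' | awk '
    {
        last_seen[$1] = NR   # overwritten on every record from this IP
        hits[$1]++           # accumulates across records
    }
    END {
        for (addr in hits)
            print addr, "last_seen=" last_seen[addr], "hits=" hits[addr]
    }
' | sort)

printf '%s\n' "$result"
```

For 10.0.0.1 the assignment form ends up holding 3 (the last record number it appeared on), while the increment form holds 2 (the number of records).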

With all respect, I think you are confused about how AWK works. Your "program" is executed once for each line of input. It's as if there were a big while loop around your code: in each iteration, $0 is the input line, and $1, $2, etc. are the fields, split on the regular expression in FS (whitespace by default). NR just indicates the current record (line) number. If you want to know whether the current record's day differs from the previous record's, you could use something like:


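tmp=substr($4,2,2);
if (day != tmp) {
   # New day code here
   # ie, print how many hits the previous day received (nothing to print on the first record)
   if (NR > 1) print hits_on_this_day;
   # change day to match current record and restart the count
   day = tmp;
   hits_on_this_day = 0;
}
# count the current record, including the first one of a new day
hits_on_this_day++;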

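No need to initialize "day": an unset awk variable starts out as the empty string, so the first record always registers as a new day.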
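
A self-contained sketch of this day-change pattern follows, with the counter reset on each new day and an END block to flush the final day. It assumes the log is in chronological order and that the two characters after the opening bracket of field 4 are the day of the month. The sample lines are invented.

```shell
# Made-up log lines in chronological order: 2 hits on the 10th, 3 on the 11th, 1 on the 12th.
log='10.0.0.1 - - [10/Feb/2008:03:36:00 -0500] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [10/Feb/2008:09:12:30 -0500] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [11/Feb/2008:00:05:00 -0500] "GET /b HTTP/1.1" 200 64
10.0.0.3 - - [11/Feb/2008:14:20:11 -0500] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [11/Feb/2008:23:59:59 -0500] "GET /c HTTP/1.1" 404 0
10.0.0.1 - - [12/Feb/2008:01:00:00 -0500] "GET / HTTP/1.1" 200 512'

per_day=$(printf '%s\n' "$log" | awk '
    {
        tmp = substr($4, 2, 2)
        if (day != tmp) {
            # report the day that just ended, except before the first record
            if (NR > 1) print day, hits_on_this_day
            day = tmp
            hits_on_this_day = 0
        }
        hits_on_this_day++
    }
    # the last day never sees a following record, so report it here
    END { if (NR > 0) print day, hits_on_this_day }
')

printf '%s\n' "$per_day"
```

Because only the day of the month is compared, this would merge two adjacent records that fall on the same day number in different months. Keying on `substr($4, 2, 11)` instead would avoid that.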