I have only tested this with the given data, but I'm surprised that you're showing sample output that includes the data marked in red above when there was no sample input for that date. When I tried running the above code with the given sample input, I didn't get the last line of the output shown above.
It also looks like there is a missing semicolon. Did you perhaps intend the following?:
awk -F'|' 'NR>2 {hour=" " substr($2,1,2); array[$1 hour] += $3; count[$1 hour]++ } END { for (a in array) {print a, array[a]/count[a] } } ' file | sort
which produces the output:
03/02/2015 00 26.24
03/02/2015 01 26.36
from the given sample input. And this output seems consistent with the given sample input.
Don,
Yes, I added another test record with a 3/3/2015 date to test the break logic and forgot to remove it from the posted output (as the OP did not include it in his file). I missed the semicolon, although I'm not clear at the moment why it impacted the output (the extra decimal places). Thanks for pointing this out.
Unlike many languages where two adjacent strings representing numbers separated by a space would be a syntax error, awk is happy to concatenate them and treat them as a single numeric string.
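A quick illustration of this (not from the thread, just made-up values matching the data above): placing two expressions side by side with no operator is awk's concatenation operator, so a date string and an hour string simply become one subscript string.

```shell
# awk concatenates adjacent expressions with no operator between them,
# so "03/02/2015" and " 00" become the single subscript "03/02/2015 00".
awk 'BEGIN { d = "03/02/2015"; h = " " "00"; s = d h; print s }'
```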
So if we look at the first few lines of the input file:
| | | User |Memory| User |
Date | Time |CPU %|CPU % | % |Mem % |
03/02/2015|00:00:00| 24.56| 20.66| 89.75| 63.48|
03/02/2015|00:05:00| 24.40| 20.72| 89.88| 63.47|
03/02/2015|00:10:00| 23.23| 19.98| 90.12| 63.48|
On lines 3 through 5, hour will be set to " 00" (note the leading space) and (since $1 is also a constant on these three lines) the subscript used in array[] and count[] on these three lines will be 03/02/2015 00. And for lines 3 through 5 we end up with:
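As a hedged sketch of that accumulation (using the three CPU% samples above, and assuming the semicolon is missing so the statement parses as `array[$1 hour] += $3 count[$1 hour]++`), each sample gets the old count value concatenated onto it before the numeric addition:

```shell
awk 'BEGIN {
    split("24.56 24.40 23.23", v)    # the three CPU% samples above
    for (i = 1; i <= 3; i++) {
        # count++ returns its OLD value, which concatenates onto the
        # sample before the addition: "24.560", "24.401", "23.232"
        sum += v[i] count++
        printf "after line %d: sum = %.3f, count = %d\n", i + 2, sum, count
    }
}'
```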
I understand the concatenation, but what I'm not getting is why the extra digits being added are off by one for each line. For example, for line 3, I would expect the extra digit in red to be 1 instead of 0, as count is 1. So why is it that count[$1 hour]++ is 0 when it's concatenated for line 3?
It is because the ++ in count[subscript]++ is a post-increment operation. That expression returns the value it had before the value is incremented and then increments the variable to the new value. To have it return the new value (instead of the old value), you would use the pre-increment operation ++count[subscript] .
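The difference is easy to see with a throwaway one-liner (not from the thread):

```shell
# Post-increment returns the old value; pre-increment returns the new one.
awk 'BEGIN {
    print c++    # prints 0: the value BEFORE incrementing; c is now 1
    print c      # prints 1: the increment did happen
    print ++d    # prints 1: d is incremented first, then returned
}'
```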
As mjf already said, why gather each hour's data and calculate a per-hour average if you're trying to calculate a daily average?
But, in addition to that, there aren't any | characters in your input. So, with -F'|' in your code, there is nothing in $2 or $3 for any of your input lines. And, there is no header in this data, so unless you want to skip the midnight-hour and 1am-hour data in your daily averages, you don't want the NR>2. And, as has been said before, if you're going to use a single-line scrunched-up awk script, you MUST separate statements from each other with semicolons. And, since your data appears to be in month/day/year format instead of year/month/day, you need to modify your sort if the goal is to print the output in increasing date order when you run this code with data from December in one year and January in the next year...
Perhaps you wanted something more like:
awk '{array[$1]+=$3;count[$1]++}END{for(a in array){print a,array[a]/count[a]}}' file | sort -k1.7,1.10 -k1.1,1.2 -k1.4,1.5
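The sort keys here pick character positions out of the mm/dd/yyyy field: characters 7-10 (the year) first, then 1-2 (the month), then 4-5 (the day). A small demonstration with made-up dates spanning a year boundary:

```shell
# Sort mm/dd/yyyy dates chronologically: year, then month, then day.
printf '%s\n' '12/31/2014 10' '01/01/2015 20' '03/02/2015 30' |
    sort -k1.7,1.10 -k1.1,1.2 -k1.4,1.5
```

A plain `sort` would put 01/01/2015 before 12/31/2014; the character-position keys put the December 2014 line first.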
Don,
I interpreted the data Saravanan_0074 included in his/her last post to be the output, not the input, of running the awk command (in which case you would need -F'|' and NR>2). The average of 26.2439 (which should be 26.24) for hour 00 appears to match the input in the original post, although there appears not to be enough input data to confirm the hours beyond 00 in the output.
You may be correct. If that is the case, the only thing wrong with Saravanan_0074's script was the missing semicolon (like the problem you had in the script in post #2). But, it clearly is not going to provide a daily average; only hourly averages.
But, the stated goal in post #10 is to get the daily average, and the script in that post does NOT do that. And, it isn't clear which average is desired. Is the desire to get the arithmetic mean of the hourly arithmetic means? (That is what the script I suggested would provide if the data shown in post #10 was fed into it as input.) Is the desire to get the arithmetic mean of the individual data points for each day? Or, is some other average desired?
Is there ever input for more than one day in the input file?
Are there always the same number of sample points for each day (and the same number of sample points each day for each hour)? What happens on days when there is a shift to or from daylight savings time?
Is the sort in the pipeline intended to sort average day values into date order? Or, is it intended to sort average hourly values for a single date into hour order? If daily averages and hourly averages are both supposed to be in the output, what is the sort order supposed to be? What output format is wanted for the daily average values?
Sometimes I get tired of trying to guess what requirements we are trying to meet.