Want to get the average value for each hour

I want to get the CPU average value only (not the user CPU or memory values) for each hour on each individual date. The sample data is below:

          |        |      | User |Memory| User |
   Date   |  Time  | CPU %| CPU %|   %  | Mem %|
03/02/2015|00:00:00| 24.56| 20.66| 89.75| 63.48|
03/02/2015|00:05:00| 24.40| 20.72| 89.88| 63.47|
03/02/2015|00:10:00| 23.23| 19.98| 90.12| 63.48|
03/02/2015|00:15:00| 26.88| 23.25| 90.30| 63.56|
03/02/2015|00:20:00| 27.57| 23.39| 89.94| 63.49|
03/02/2015|00:25:00| 24.87| 20.46| 89.55| 63.41|
03/02/2015|00:30:00| 33.23| 29.12| 89.48| 63.35|
03/02/2015|00:35:00| 30.05| 26.04| 89.49| 63.36|
03/02/2015|00:40:00| 35.62| 31.81| 89.38| 63.25|
03/02/2015|00:45:00| 20.50| 17.03| 89.83| 63.71|
03/02/2015|00:50:00| 20.49| 17.81| 89.61| 63.50|
03/02/2015|00:55:00| 23.48| 19.37| 89.59| 63.56|
03/02/2015|01:00:00| 56.81| 49.93| 89.97| 63.88|
03/02/2015|01:05:00| 44.84| 39.46| 90.05| 63.98|
03/02/2015|01:10:00| 22.45| 19.12| 89.64| 63.57|
03/02/2015|01:15:00| 17.34| 14.53| 89.66| 63.59|
03/02/2015|01:20:00| 17.29| 14.75| 89.65| 63.58|
03/02/2015|01:25:00| 14.08| 11.54| 89.49| 63.44|
03/02/2015|01:30:00| 19.08| 16.77| 89.33| 63.28|
03/02/2015|01:35:00| 18.99| 16.80| 89.31| 63.26|

The total line count is 8919.

Each day contains 288 entries, and the output covers 31 days in total.

Could you please help me simplify this with shell programming?

Try this awk solution, if I understand what you are looking for (untested, as more sample data would be needed).

awk -F'|' 'NR>2 {hour=" " substr($2,1,2); array[$1 hour] += $3 count[$1 hour]++ } END { for (a in array) {print a, array[a]/count[a] } } ' file | sort
03/02/2015 00 26.2439
03/02/2015 01 26.3635
03/03/2015 01 18.99

I have only tested this with the given data, but I'm surprised that you're showing sample output that includes the 03/03/2015 line above when there was no sample input for that date. When I ran the above code with the given sample input, I didn't get the last line of the output shown above.

It also looks like there is a missing semicolon. Did you perhaps intend the following?

awk -F'|' 'NR>2 {hour=" " substr($2,1,2); array[$1 hour] += $3; count[$1 hour]++ } END { for (a in array) {print a, array[a]/count[a] } } ' file | sort

which produces the output:

03/02/2015 00 26.24
03/02/2015 01 26.36

from the given sample input, and that matches the data in the original post.


Don,
Yes, I added another test record with a 03/03/2015 date to test the break logic and forgot to remove it from the posted output (the OP did not include it in his file). I missed the semicolon, although I'm not clear at the moment why it impacted the output (the extra decimal places). Thanks for pointing this out.


Thanks for providing such a wonderful script. Could you also help me to get the following items?

Daily Average
Hourly Average.

Updated output samples are below.

Daily Average

03/11/15	35.52%	21.62%
03/12/15	40.20%	22.23%
03/13/15	34.92%	20.53%
03/14/15	51.13%	22.42%
03/15/15	35.94%	14.47%
03/16/15	33.22%	18.70%
03/17/15	35.30%	19.89%

Hourly Average for the month

12AM	40.53%	29.66%
1AM	36.41%	23.77%
2AM	34.87%	23.10%
3AM	40.30%	23.49%
4AM	39.05%	19.84%
5AM	41.52%	18.91%
6AM	42.05%	19.29%

Original Output File 
          |        |      | User |Memory| User |
   Date   |  Time  | CPU %| CPU %|   %  | Mem %|
03/02/2015|00:00:00| 24.56| 20.66| 89.75| 63.48|
03/02/2015|00:05:00| 24.40| 20.72| 89.88| 63.47|
03/02/2015|00:10:00| 23.23| 19.98| 90.12| 63.48|
03/02/2015|00:15:00| 26.88| 23.25| 90.30| 63.56|
03/02/2015|00:20:00| 27.57| 23.39| 89.94| 63.49|
03/02/2015|00:25:00| 24.87| 20.46| 89.55| 63.41|
03/02/2015|00:30:00| 33.23| 29.12| 89.48| 63.35|
03/02/2015|00:35:00| 30.05| 26.04| 89.49| 63.36|
03/02/2015|00:40:00| 35.62| 31.81| 89.38| 63.25|
03/02/2015|00:45:00| 20.50| 17.03| 89.83| 63.71|
03/02/2015|00:50:00| 20.49| 17.81| 89.61| 63.50|
03/02/2015|00:55:00| 23.48| 19.37| 89.59| 63.56|
03/02/2015|01:00:00| 56.81| 49.93| 89.97| 63.88|
03/02/2015|01:05:00| 44.84| 39.46| 90.05| 63.98|
03/02/2015|01:10:00| 22.45| 19.12| 89.64| 63.57|
03/02/2015|01:15:00| 17.34| 14.53| 89.66| 63.59|
03/02/2015|01:20:00| 17.29| 14.75| 89.65| 63.58|
03/02/2015|01:25:00| 14.08| 11.54| 89.49| 63.44|
03/02/2015|01:30:00| 19.08| 16.77| 89.33| 63.28|
03/02/2015|01:35:00| 18.99| 16.80| 89.31| 63.26|
03/02/2015|01:40:00| 15.59| 13.32| 89.28| 63.23|
03/02/2015|01:45:00| 20.91| 18.27| 89.28| 63.23|
03/02/2015|01:50:00| 27.69| 24.68| 89.32| 63.27|
03/02/2015|01:55:00| 29.00| 25.71| 89.32| 63.27|
03/02/2015|02:00:00| 30.21| 25.44| 89.17| 63.21|
03/02/2015|02:05:00| 27.62| 22.90| 89.16| 63.22|
03/02/2015|02:10:00| 23.68| 19.24| 88.84| 63.19|
03/02/2015|02:15:00| 24.77| 20.00| 88.86| 63.21|
03/02/2015|02:20:00| 29.70| 25.90| 88.92| 63.27|
03/02/2015|02:25:00| 28.41| 24.24| 88.97| 63.32|
03/02/2015|02:30:00| 25.70| 22.29| 88.97| 63.29|
03/02/2015|02:35:00| 17.59| 14.77| 88.92| 63.25|
03/02/2015|02:40:00| 21.38| 18.68| 88.91| 63.23|
03/02/2015|02:45:00| 18.21| 15.20| 88.98| 63.33|
03/02/2015|02:50:00| 26.29| 23.21| 88.91| 63.26|
03/02/2015|02:55:00| 23.37| 20.37| 88.91| 63.26|
03/02/2015|03:00:00| 30.18| 25.81| 88.72| 63.05|
03/02/2015|03:05:00| 27.22| 23.42| 88.78| 63.15|
03/02/2015|03:10:00| 30.25| 26.39| 88.93| 63.30|
03/02/2015|03:15:00| 43.21| 38.60| 88.99| 63.38|
03/02/2015|03:20:00| 28.94| 25.25| 88.94| 63.33|
03/02/2015|03:25:00| 24.68| 21.22| 88.96| 63.23|
03/02/2015|03:30:00| 30.62| 25.71| 88.73| 63.13|
03/02/2015|03:35:00| 29.18| 26.12| 88.58| 62.99|
03/02/2015|03:40:00| 23.66| 21.63| 88.49| 62.90|
03/02/2015|03:45:00| 24.87| 22.66| 88.49| 62.91|
03/02/2015|03:50:00| 14.66| 12.61| 88.55| 62.97|
03/02/2015|03:55:00| 12.54| 10.53| 88.51| 62.96|
03/02/2015|04:00:00| 16.39| 13.42| 88.66| 63.09|
03/02/2015|04:05:00| 10.22|  8.06| 88.74| 63.17|

Unlike many languages where two adjacent strings representing numbers separated by a space would be a syntax error, awk is happy to concatenate them and treat them as a single numeric string.

So if we look at the first few lines of the input file:

          |        |      | User |Memory| User |
   Date   |  Time  | CPU %| CPU %|   %  | Mem %|
03/02/2015|00:00:00| 24.56| 20.66| 89.75| 63.48|
03/02/2015|00:05:00| 24.40| 20.72| 89.88| 63.47|
03/02/2015|00:10:00| 23.23| 19.98| 90.12| 63.48|

and the first part of your script:

awk -F'|' 'NR>2 {hour=" " substr($2,1,2); array[$1 hour] += $3 count[$1 hour]++ }

On lines 3 through 5, hour will be set to " 00" (with the leading space) and, since $1 is also constant on these three lines, the subscript used in array[] and count[] will be 03/02/2015 00 . And for lines 3 through 5 we end up with:

line	array["03/02/2015 00"]		count["03/02/2015 00"]
  3	24.560					1
  4	24.560 + 24.401 (48.961)		2
  5	48.961 + 23.232 (72.193)		3

So, the higher than expected results come from the extra digits being added to the values that are being summed.
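The effect is easy to reproduce in isolation. A minimal sketch with hypothetical values (not taken from the file above):

```shell
# With c set to 1, "24.40" c++ concatenates the post-increment's
# return value (1) onto the string, yielding the number 24.401,
# and only afterwards does c become 2.
awk 'BEGIN { c = 1; v = "24.40" c++; print v, c }'
# prints: 24.401 2
```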


I understand the concatenation, but what I'm not getting is why the extra digits being added are off by one for each line. For example, for line 3, I would expect the extra digit (the trailing 0) to be 1 instead of 0, as count is 1. So why is it that count[$1 hour]++ yields 0 when it's concatenated for line 3?

 
line    array["03/02/2015 00"]        count["03/02/2015 00"]  
  3     24.560                                 1    
  4     24.560 + 24.401 (48.961)               2 
  5     48.961 + 23.232 (72.193)               3

It is because the ++ in count[subscript]++ is a post-increment operation. The expression returns the value the variable had before it was incremented, and only then does the variable take on the new value. To have it return the new value instead of the old one, you would use the pre-increment operation ++count[subscript] .
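A quick illustration of the difference, assuming nothing beyond standard awk:

```shell
awk 'BEGIN {
    c = 0
    print c++   # post-increment: prints the old value 0, then c becomes 1
    print ++c   # pre-increment: c becomes 2 first, then prints 2
}'
# prints: 0
#         2
```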


Got it! I need to dust off my old college C books.

I have received the hourly average using the script below, but I need the daily average value.

awk -F'|' 'NR>2 {hour=" " substr($2,1,2); array[$1 hour] += $3 count[$1 hour]++ } END { for (a in array) {print a, array[a]/count[a] } } ' filec | sort 
03/02/2015 00 26.2439
03/02/2015 01 25.3431
03/02/2015 02 24.7481
03/02/2015 03 26.6714
03/02/2015 04 10.4339
03/02/2015 05 17.3723
03/02/2015 06 17.5681
03/02/2015 07 23.8506
03/02/2015 08 14.3764
03/02/2015 09 15.7681
03/02/2015 10 18.7348
03/02/2015 11 20.1648
03/02/2015 12 22.8531
03/02/2015 13 22.2181
03/02/2015 14 16.4239
03/02/2015 15 25.3914
03/02/2015 16 20.2731
03/02/2015 17 12.3039
03/02/2015 18 17.9764
03/02/2015 19 22.6573
03/02/2015 20 15.7089
03/02/2015 21 32.6081
03/02/2015 22 17.2506
03/02/2015 23 21.6848

How about removing the hour?

 awk -F'|' 'NR>2 {array[$1] += $3 count[$1]++ } END { for (a in array) {print a, array[a]/count[a] } } '

As mjf already said, why gather each hour's data and calculate an hourly average if you're trying to calculate a daily average?

But, in addition to that, there aren't any | characters in your input, so with -F'|' in your code there is nothing in $2 or $3. There is also no header in this data, so unless you want to skip the midnight-hour and 1am-hour data in your daily averages, you don't want the NR>2 .

And, as has been said before, if you're going to use a single-line, scrunched-up awk script, you MUST separate statements from each other with semicolons. Finally, since your data appears to be in month/day/year format instead of year/month/day, you need to modify your sort if the goal is to print the output in increasing date order when you run this code with data from December of one year and January of the next...

Perhaps you wanted something more like:

awk '{array[$1]+=$3;count[$1]++}END{for(a in array){print a,array[a]/count[a]}}' filec|sort -k1.7,1.10 -k1.1,1.2 -k1.4,1.5 
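The -k options here sort on character positions within the date field: characters 7-10 (year) first, then 1-2 (month), then 4-5 (day). A small sketch with made-up dates:

```shell
# The year is compared first, so 2014 sorts before 2015 even though
# its "12/31" month/day string would otherwise sort last.
printf '03/02/2015\n12/31/2014\n01/15/2015\n' |
    sort -k1.7,1.10 -k1.1,1.2 -k1.4,1.5
# prints: 12/31/2014
#         01/15/2015
#         03/02/2015
```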

Don,
I interpreted the data Saravanan_0074 included in his/her last post to be the output, not the input, of running the awk command (in which case you would need -F'|' and NR>2 ). The average of 26.2439 (it should be 26.24) for hour 00 appears to match the input in the original post, although there does not appear to be enough input data to confirm the hours beyond 00 in the output.

You are correct re: the sort of the date.

You may be correct. If that is the case, the only thing wrong with Saravanan_0074's script was the missing semicolon (like the problem you had in post #2 in this thread). But, it clearly is not going to provide a daily average; only hourly averages.

But, the stated goal in post #10 is to get the daily average, and the script in that post does NOT do that. And, it isn't clear which average is desired. Is the desire to get the arithmetic mean of the hourly arithmetic means? (That is what the script I suggested would provide if the data shown in post #10 were fed into it as input.) Is the desire to get the arithmetic mean of the individual data points for each day? Or, is some other average desired?
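The distinction matters because the two averages differ whenever hours have unequal sample counts. A made-up illustration (hypothetical data, not from the OP's file), with hour A holding two samples that average 10 and hour B holding four samples that average 20:

```shell
# Mean of hourly means weights each hour equally; mean of all
# samples weights each data point equally.
printf 'A 5\nA 15\nB 18\nB 19\nB 21\nB 22\n' |
    awk '{ sum[$1] += $2; cnt[$1]++; total += $2; n++ }
         END {
             for (h in sum) { means += sum[h]/cnt[h]; hours++ }
             printf "mean of hourly means: %.2f\n", means/hours
             printf "mean of all samples:  %.4f\n", total/n
         }'
# prints: mean of hourly means: 15.00
#         mean of all samples:  16.6667
```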

Is there ever input for more than one day in the input file?
Are there always the same number of sample points for each day (and the same number of sample points each day for each hour)? What happens on days when there is a shift to or from daylight savings time?

Is the sort in the pipeline intended to sort average day values into date order? Or, is it intended to sort average hourly values for a single date into hour order? If daily averages and hourly averages are both supposed to be in the output, what is the sort order supposed to be? What output format is wanted for the daily average values?

Sometimes I get tired of trying to guess what requirements we are trying to meet.