Get the average of lines with the same first 4 letters

kenshinhimura · August 14, 2018, 4:08pm

How to sum up and print into the next line the total SUM.

]$ cat hhhh
aaa1a 1
aaa1g 2
aaa1f 3
baa4f 3
baa4d 4
baa4s 4
cddg1 3
cddg3 4
cddfg 1

$ cat hhhh|awk ' {sum+=$2} END {print sum}'
25

Desire output:

aaa1a 1
aaa1g 2
aaa1f 3

Total Sum of aaa1: 6

=======================
baa4f 3
baa4d 4
baa4s 4
 
Total Sum of baa4: 11

=======================
cddg1 3
cddg3 4
cddfg 1
Total Sum of cddg: 8

Don_Cragun · August 14, 2018, 7:05pm

You might try something like:

awk '
function print_total() {
	printf("\nTotal Sum of %s: %d\n", last, total)
}
last != substr($1, 1, 4) {
	if(NR > 1) {
		print_total()
		printf("\n=======================\n")
	}
	last = substr($1, 1, 4)
	total = 0
}
{	print
	total += $2
}
END {	print_total()
}' hhhh

but, with the sample data you provided, I get the output:

aaa1a 1
aaa1g 2
aaa1f 3

Total Sum of aaa1: 6

=======================
baa4f 3
baa4d 4
baa4s 4

Total Sum of baa4: 11

=======================
cddg1 3
cddg3 4

Total Sum of cddg: 7

=======================
cddfg 1

Total Sum of cddf: 1

instead of what you said you wanted. The output above seems to more correctly match the title of this thread. If this isn't what you really wanted, please explain your requirements more clearly.

You should always tell us what operating system and shell you're using when you start a new thread. Otherwise, suggestions you receive might not work in your environment. In this case, if you're using a Solaris/SunOS operating system, change awk in the above suggestion to /usr/xpg4/bin/awk or nawk .

Note that using cat as you did in your sample code, eats up system resources and makes your code slower than letting awk read the file directly (as I did in my suggestion above).

kenshinhimura · August 15, 2018, 8:39am

Thank you Don, my question also.. Is there a way to 1 liner that? In one liner. ii understand it more easily and i can able to repeat it someday.
Thank you

------ Post updated at 07:39 AM ------

Also if you can one liner please do, if not.. can you please explain each line? put comment on them please?

Don_Cragun · August 15, 2018, 9:49am

Of course my suggested awk script can be turned into a single line of code. But, if you can't do that on your own, there is absolutely no possible way that the 1-liner version of that code will be easier to understand than the version I provided before. Code is always easier to understand if you can see the structure that shows, by indentation, how commands are connected. And, you can't include comments in the middle of an awk 1-liner, so I can't provide you with a commented 1-liner.

Hoping that comments will help you understand what the code is doing, I provide the following:

awk '	# Invoke awk and start the script specifying actions to be performed.
function print_total() {
	# Define function to print the total acccumulated for the previous set
	# of lines with the same four characters at the start of the 1st field.
	printf("\nTotal Sum of %s: %d\n", last, total)
}
last != substr($1, 1, 4) {
	# Perform the actions in this group when the 1st four characters in the
	# 1st field on this line is not the same as the 1st four characters in
	# the 1st field that we saw on the previous line.  Since there is no
	# previous line when we are reading the 1st line from our input file,
	# we also perform the actions in this group when processing the 1st
	# line in the input file.
	if(NR > 1) {
		# Perform these actions when we are not processing line #1.
		print_total()				# Print the total for
							# for the previous set.
		printf("\n=======================\n")	# Print a set separator.
	}
	# Save the 1st four characters from this line to compare against
	# subsequent input lines.
	last = substr($1, 1, 4)
	# Clear the total for this new set.
	total = 0
}
{	# Perform the actions in this group for every line in the input file.
	print		# Print the current input line.
	total += $2	# Add the contents of the 2nd field on this line to the
			# total for this set.
}
END {	# Perform the actions in this group after we have read the last line of
	# input from the input file.
	print_total()	# Print the total for the last set.
}' hhhh	# End the awk script and name the file to be processed.

kenshinhimura · August 15, 2018, 10:16am

Don,
How can we get the average? based on the sum?

------ Post updated at 09:16 AM ------

Will get the total and average..

aaa1a 1
aaa1g 2
aaa1f 3

Total Sum of aaa1: 6

Total Average of aaa1: 2 #because 6/3

Don_Cragun · August 15, 2018, 11:09am

Why don't you try adding processing for averages to the code I suggested.

If you can't get it to work, show us what you have tried and we'll help you fix it.

kenshinhimura · August 15, 2018, 1:00pm

awk '
function print_total() {
        printf("\nTotal Sum of %s site is: %d Mbps\n ", last, total)
}
function print_average() {
        printf("\nTotal Sum of %s site is: %d Mbps\n ", last, average)
}

last != substr($1, 1, 7) {
        if(NR > 1) {
                print_total()
                printf("\n=======================\n")
        }
        last = substr($1, 1, 7)
        total = 0
        average = 0
}
{       print
        total += $2
        average total/NR
}
END {   print_total()
        print_average()
}' hhhh > datarate.log
cat datarate.log

trying different combination but to no avail. Thank you

Don_Cragun · August 15, 2018, 4:21pm

kenshinhimura:

awk '
function print_total() {
   printf("\nTotal Sum of %s site is: %d Mbps\n ", last, total)
}
function print_average() {
   printf("\nTotal Sum of %s site is: %d Mbps\n ", last, average)
}

last != substr($1, 1, 7) {
   if(NR > 1) {
   print_total()
   printf("\n=======================\n")
   }
   last = substr($1, 1, 7)
   total = 0
   average = 0
}
{       print
   total += $2
   average total/NR
}
END {   print_total()
   print_average()
}' hhhh > datarate.log
cat datarate.log

trying different combination but to no avail. Thank you

Since you want to print an average every time you print a total for a set, you don't need two functions to print the total and the average -- just add another printf statement in the print_total function. If you add a new function, you need to call that function when you want the code that it defines to be run. The statement in your code:

        average total/NR

doesn't assign a value to any variable and just performs two calculations throwing away the results. the second calculation divides the current value of the total variable by the current input file line number ( NR ). It would seem that you need to be able to determine how many input lines you have in each set to get the average value of the value of the values in that set.

Why are you using substr($1, 1, 7) if you want to match the 1st 4 characters in the 1st field. (Have you changed your requirement to match the 1st 7 characters of the 1st field?)

Do you understand the difference between what you did and what you see in the following revision of my (commented) suggestion:

awk '	# Invoke awk and start the script specifying actions to be performed.
function print_total() {
	# Define function to print the total and average acccumulated for the
	# previous set of lines with the same four characters at the start of
	# the 1st field.
	printf("\n    Total Sum of %s: %d\n", last, total)
	printf("\nTotal Average of %s: %.2f\n", last, total / count)
}
last != substr($1, 1, 4) {
	# Perform the actions in this group when the 1st four characters in the
	# 1st field on this line is not the same as the 1st four characters in
	# the 1st field that we saw on the previous line.  Since there is no
	# previous line when we are reading the 1st line from our input file,
	# we also perform the actions in this group when processing the 1st
	# line in the input file.
	if(NR > 1) {
		# Perform these actions when we are not processing line #1.
		print_total()				# Print the total for
							# for the previous set.
		printf("\n=======================\n")	# Print a set separator.
	}
	# Save the 1st four characters from this line to compare against
	# subsequent input lines.
	last = substr($1, 1, 4)
	# Clear the count and total for this new set.
	count = total = 0
}
{	# Perform the actions in this group for every line in the input file.
	print		# Print the current input line.
	count++		# Increment the number of lines seen in this set.
	total += $2	# Add the contents of the 2nd field on this line to the
			# total for this set.
}
END {	# Perform the actions in this group after we have read the last line of
	# input from the input file.
	print_total()	# Print the total and average for the last set.
}' hhhh	# End the awk script and name the file to be processed.

Note that this just calculates the average when it is being printed instead of calculating averages every time a line is read and discarding that value except on the last line of a set.

kenshinhimura · August 15, 2018, 8:27pm

$ cat hhhh
aaabbb1a 4000
aaabbb1g 2000
aaabbb1f 8000
baabbb4f 5000
baabbb4d 1000
baabbb4s 9000
cddbbbg1 10000
cddbbbg3 11000
cddbbbge 3000

#

#sh ave.sh

aaabbb1a 4000
aaabbb1g 2000
aaabbb1f 8000

Total Sum of aaabbb1 site is: 8000 Mbps

Total Average of aaabbb1 site is: 2666 Mbps

=======================
baabbb4f 5000
baabbb4d 1000
baabbb4s 9000

Total Sum of baabbb4 site is: 9000 Mbps

Total Average of baabbb4 site is: 3000 Mbps

=======================
cddbbbg1 10000
cddbbbg3 11000
cddbbbge 3000

Total Sum of cddbbbg site is: 3000 Mbps

Total Average of cddbbbg site is: 1000 Mbps
$

[/CODE]

ill try different combination...not working right

------ Post updated at 07:27 PM ------

Never mind..thank you sir. I figured it out.. THank you for the comments..

kenshinhimura · August 17, 2018, 3:02pm

Don, my challenge now is putting date on total and averate. Putting a timestamp.. Trying every combination and googling..

------ Post updated at 11:34 AM ------

I modified it to

        printf("\n%s,%s,%d,total",  last, last, total)
        printf("\n%s,%s,%.1f,average\n",  last, last, total / count)

For parsing, but im hoping to have value like this

"date"baabbb4,baabbb4,101.95,total
"date"baabbb4,baabbb4,101.95,average

------ Post updated at 11:45 AM ------

I tried to add the line below

awk '
function print_total() {
        printf("\n%s,%s,%d,total", last, last, total)
        printf("\n%s,%s,%.1f,average\n", last, last, total / count)
        printf("%s", cmd)
}

#{ cmd = system("date") }
{ cmd = system("date  ") }

But when i put it in printf,,its throwing a lot of error..

------ Post updated at 12:05 PM ------

awk '
function print_total() {
        printf("\n%s,%s,%d,total", last, last, total)
        printf("\n%s,%s,%.1f,average\n", last, last, total / count)
 #       printf("%s", cmd)
}

#{ cmd = system("date") }

last != substr($1, 1, 7) {
        if(NR > 1) {
                print_total()
                printf("\n\n")
        }
        last = substr($1, 1, 7)
        count = total = 0
}
{
        print
        count++
        total += $2
}
{  cmd = system("date") }

END {   print_total()
}'

output:

remotese01 120.653 Mbps
Fri Aug 17 17:02:26 UTC 2018
remotese02 91.7216 Mbps
Fri Aug 17 17:02:26 UTC 2018

remotese,remotese,212,total
remotese,remotese,106.2,average

thats what i got..
im hoping of the output below

Fri Aug 17 17:02:26 UTC 2018, remotese,remotese,210,total
Fri Aug 17 17:02:26 UTC 2018, remotese,remotese,105.1,average

------ Post updated at 01:28 PM ------

I found a way.. but ugly..
revise code:

awk '
function print_total() {
        printf("\n,%s,%s,%d,total", last, last, total)
        printf("\n,%s,%s,%.1f,average\n", last, last, total / count)
 #       printf("%s", cmd)
}

#{ cmd = system("date") }

last != substr($1, 1, 7) {
        if(NR > 1) {
                print_total()
                printf("\n\n")
        }
        last = substr($1, 1, 7)
        count = total = 0
}
{
        print
        count++
        total += $2
}

END {   print_total()

}' rawdata.log |sed -e "s/.*average.*/`date`&/" -e "s/.*total.*/`date`&/" > /datarate/datarate.log

cat /datarate/datarate.log

output:


remotese01 130.326 Mbps
remotese02 99.1562 Mbps

Fri Aug 17 18:25:30 UTC 2018,remotese,remotese,229,total
Fri Aug 17 18:25:30 UTC 2018,remotese,remotese,114.7,average


server1101 219.804 Mbps
server1102 348.427 Mbps

Fri Aug 17 18:25:30 UTC 2018,server11,server11,568,total
Fri Aug 17 18:25:30 UTC 2018,server11,server11,284.1,average

------ Post updated at 01:50 PM ------

The reason that is ugly, because, i have 100 entries in my for loop..they have the same time stamp, unlike if i will embed the date within the awk,,each line will have diffrent time by seconds or minutes..

------ Post updated at 02:02 PM ------

New challenge.. New desired Output.. FInal..need help on this..

Fri Aug 17 18:25:30 UTC 2018,remotese01 130.326 Mbps
Fri Aug 17 18:25:30 UTC 2018,remotese02 99.1562 Mbps
Fri Aug 17 18:25:30 UTC 2018,remotese01,remotese,229,total
Fri Aug 17 18:25:30 UTC 2018,remotese02,remotese,114.7,average


Fri Aug 17 18:25:30 UTC 2018,server1101 219.804 Mbps
Fri Aug 17 18:25:30 UTC 2018,server1102 348.427 Mbps
Fri Aug 17 18:25:30 UTC 2018,server1101,server11,568,total
Fri Aug 17 18:25:30 UTC 2018,server1102,server11,284.1,average

Don_Cragun · August 17, 2018, 4:37pm

You have a few problems. The date utility always terminates its output with a <newline>. The awk system() function sends its output to awk 's standard output and returns the exit status of the command that you invoke. (It does not return the output produced by the command you invoke as a string that you can assign to a variable in awk as you tried to do in your code.)

To fix that, you can use sprintf() (in this case a couple of times) to create a date command string to give to system() that will invoke the date utility with a format string that will print the two lines of output you want each with the leading date and time field. Note that the following invokes date each time you call print_total() as in your earlier awk code:

function print_total() {
	dc = sprintf("date \"+%%a %%b %%d %%H:%%M%%S %%Z %%Y,%s-%s,%d,total%%n",
		last, last, total)
	dc = sprintf("%s%%a %%b %%d %%H:%%M%%S %%Z %%Y,%s-%s,%.1f,average\"",
		dc, last, last, total / count)
	system(dc)
}

If you just want to invoke date once at the start of your script and print that timestamp in all of the output produced by a single run of your awk script (which is closer to what you're doing now with awk and sed , but you're invoking date twice and may end up with one date and time on the total lines and a later date and time on the average lines), it would be much simpler to use:

awk -v dt="`date`" '
function print_total() {
	printf("%s,%s-%s,%d,total\n", dt, last, last, total)
	printf("%s,%s-%s,%.1f,average\n", dt, last, last, total / count)
}

PS I didn't see your last two updates before posting this. But, it should give you enough information to make changes to get what you want.

kenshinhimura · August 20, 2018, 9:51am

I did different code on the first block.. Nasty Ugly. Etc. But it worked. Hoping to clean that one. More cleane. I got the output right... But with an ugly code from me. Thank you for your help. But if by chance.. Can you make my code cleaner? Especially the for loop.

for i in `cat DATASERVER`;do  echo -n $i",";ssh $i "echo -n $i,|cut -c1-7 -n; f=\`/sbin/ip a|egrep '127.127|128.128' |head -1| awk '{print \$NF }' \` ; sar -n DEV|grep \$f|grep Average|awk '{print \$5 * 8 / 1000 \" Mbps\" }'"  ; done  |sed 'N;s/\n/, /' > rawdata.log

awk '
function print_total() {
        printf("\n%s,%s,%d,total", full, last, total)
        printf("\n%s,%s,%.1f,average\n", full, last, total / count)
 #       printf("%s", cmd)
}

#{ cmd = system("date") }

last != substr($1, 1, 7) {
        if(NR > 1) {
                print_total()
                printf("\n\n")
        }
        last = substr($1, 1, 7)
        full = substr($1, 1, 9) #added
        count = total = 0
}
{
        print
        count++
        total += $2
        full # added
}

END {   print_total()

}' rawdata.log |sed -e "s/.*nc.*/`date`,&/" > /datarate/datarate.log

cat /datarate/datarate.log

this was the output:

Mon Aug 20 13:22:53 UTC 2018,datasdr01,datasdr, 91.2404 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr02,datasdr, 64.7573 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr03,datasdr, 67.3033 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr04,datasdr, 64.3564 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr05,datasdr, 64.533 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr06,datasdr, 123.74 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr07,datasdr, 0.43832 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr08,datasdr, 34.9874 Mbps
Mon Aug 20 13:22:53 UTC 2018,datasdr09,datasdr, 34.2882 Mbps

Mon Aug 20 13:22:53 UTC 2018,datasdr01,datasdr,545,total
Mon Aug 20 13:22:53 UTC 2018,datasdr01,datasdr,60.6,average


Mon Aug 20 13:22:53 UTC 2018,comondr01,comondr, 292.405 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr02,comondr, 265.401 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr03,comondr, 290.693 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr04,comondr, 277.023 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr05,comondr, 263.343 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr06,comondr, 384.669 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr08,comondr, 378.82 Mbps

Mon Aug 20 13:22:53 UTC 2018,comondr01,comondr,2152,total
Mon Aug 20 13:22:53 UTC 2018,comondr01,comondr,307.5,average


Mon Aug 20 13:22:53 UTC 2018,servedr01,servedr, 54.6014 Mbps
Mon Aug 20 13:22:53 UTC 2018,servedr02,servedr, 48.7523 Mbps
Mon Aug 20 13:22:53 UTC 2018,servedr03,servedr, 48.4779 Mbps

Mon Aug 20 13:22:53 UTC 2018,servedr01,servedr,151,total
Mon Aug 20 13:22:53 UTC 2018,servedr01,servedr,50.6,average

------ Post updated at 08:51 AM ------

Anyway, your code, doesnt work on mine..

1st code, it is throwing double date..

2nd code from yours

awk: cmd. line:6: }
awk: cmd. line:6: ^ syntax error

Don_Cragun · August 20, 2018, 10:26am

Your desired output changes every time you add a new post to this thread.

I suggested code in post #11 that gets rid of the need to need for the for loop you showed us in post #12 and produces the results that loop would produce inside the print_total() function.

The code that I suggested there worked perfectly with the earlier sample data you provided to produce the output you said you wanted in an earlier post in this thread. If you modified that code to produce a different output format and introduced a syntax error in the code as you did it, I can't help you without seeing how you modified the code. Since your requirements are constantly changing and since you haven't provided us with any sample input that could possibly produce the most recent output you're showing us, I'm not motivated to try to rewrite my suggestion again hoping to guess at what your next output format requirements might be.

kenshinhimura · August 20, 2018, 2:51pm

Output is final. Made it work. but with ugly code.

Mon Aug 20 13:22:53 UTC 2018,comondr01,comondr, 292.405 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr02,comondr, 265.401 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr03,comondr, 290.693 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr04,comondr, 277.023 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr05,comondr, 263.343 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr06,comondr, 384.669 Mbps
Mon Aug 20 13:22:53 UTC 2018,comondr08,comondr, 378.82 Mbps

Mon Aug 20 13:22:53 UTC 2018,comondr01,comondr,2152,total
Mon Aug 20 13:22:53 UTC 2018,comondr01,comondr,307.5,average


Mon Aug 20 13:22:53 UTC 2018,servedr01,servedr, 54.6014 Mbps
Mon Aug 20 13:22:53 UTC 2018,servedr02,servedr, 48.7523 Mbps
Mon Aug 20 13:22:53 UTC 2018,servedr03,servedr, 48.4779 Mbps

Mon Aug 20 13:22:53 UTC 2018,servedr01,servedr,151,total
Mon Aug 20 13:22:53 UTC 2018,servedr01,servedr,50.6,average

Because i will run the command below to all 100 servers. Then the ethernet card will vary.

# sar -n DEV|grep Average
Average:         eth0     10.15      5.18      5.87      6.21      0.00      0.00      0.00