Filter on one column and then perform conditional calculations on another column with a Linux script

Hi,
I have a file (stats.txt) with columns like in the example below. Destination IP address, timestamp, TCP packet sequence number and packet length.

destIP   time  seqNo  packetLength
1.2.3.4  0.01   123       500
1.2.3.5  0.03    44       1500
1.3.2.5  0.08    44       1500
1.2.3.4  0.44   123       500
1.2.3.4  0.48   123       500
1.2.3.4  0.52   124       800
1.2.3.4  0.72   124       800
1.2.3.5  0.83    45       80
...

I'm trying to come up with a way to derive some statistics from this file. Ideally, my Linux script would take the input from stats.txt (which could consist of 10 000's of rows) and tell per destination address (example for address 1.2.3.4 above used to illustrate):

  • For destination IP 1.2.3.4, there have been two retransmissions for sequence number 123 and one retransmission for sequence number 124. This means three packet errors in total.
  • The time between the first and last packet with the same sequence number is 0:48-0:01=0:47 seconds and 0:72-0:52=0.2 seconds respectively.
  • Number of successful packets to 1.2.3.4 is two (sequence number 123 and 124, assuming that 124 is ok since it's not retransmitted).
  • The total number of successfully transmitted Bytes to 1.2.3.4 is 500+800=1300B.

And of course the same kind of stats for any other IP address.

My current approach is to first sort the file like this:

sort -u -k1,1 -k3,3 -k2,2 stats.txt > statsSorted.txt

Then I get this:

1.2.3.4  0.01   123       500
1.2.3.4  0.44   123       500
1.2.3.4  0.48   123       500
1.2.3.4  0.52   124       800
1.2.3.4  0.72   124       800
1.2.3.5  0.03    44       1500
1.3.2.5  0.08    44       1500
1.2.3.5  0.83    45       80
...

Then I want to use awk to extract the stats. I have used the approach below to get started, but I get syntax errors on pretty much everything. It probably looks quite bad with the nested loops as well. I wonder if someone could give some advice on how to improve the syntax, or hints on how to make it work?

awk '
{	# Do-while criteria: as long as the IP address is the same
	do
		address[$1] = $1
		# Loop as long as sequence number is the same
		do
		
			# Is this the first time we see this sequence number?
			if (!($3 in c))
				# Set temporary min and max time and set retransmission counter to zero.
				tempMin=tempMax=$2
				retransmissions=0
			# If not the first time this sequence number occurs, increment retransmission and add time
			else
3			tempMax=$2
				retransmissions6+
		while ($3 in c)	
		averageTime[$1]=tempMax-tempMin
		retransmissions[$1]=retransmissions
	
	while ($1 in c)
END {	
	for(i in c)
		printf("%-17s %3d %5.1f \n", address, averageTime, retransmissions)
}' statsSorted.txt

Any hints welcome, even on how to form the basic syntax. Then I can try to pull it together myself.

Thanks!
/Z

A few comments on your code:

  1. There is no do ... while loop in awk.
  2. I have no idea what you are trying to accomplish with the statement 3 tempMax=$2.
  3. You can't have an array and a scalar variable with the same name: retransmissions[$1]=retransmissions.
  4. If you have multiple statements to be processed in a loop, in an if, or in an else, you need to use braces ({ and }) to group those statements.
  5. The expression ($3 in c) is meaningless when you haven't created any elements in an array named c[].
  6. You don't calculate an average of n items by subtracting the lowest value from the highest value.
  7. Using sort -u deletes duplicate entries. Deleting duplicate entries makes it impossible to calculate an average of all values for any given IP address, or for an IP address and sequence # pair.
  8. A for(i in c) loop produces output in random (not necessarily sorted) order.
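
As a rough illustration of points 3, 4, 5 and 8 above, here is a minimal sketch (not the original poster's code). The function name count_retrans and the array names first and retrans are made up for this example, and it assumes the input has no header line. It only counts retransmissions per IP and sequence number:

```shell
#!/bin/sh
# count_retrans is a hypothetical helper, not code from this thread.
# It reads "destIP time seqNo packetLength" rows on stdin and prints
# "destIP seqNo retransmissions", one line per IP/seqNo pair.
count_retrans() {
	awk '
	{
		key = $1 " " $3               # array index: IP plus sequence number
		if (!(key in first)) {        # braces group the two statements
			first[key] = $2           # creating elements makes "in" meaningful
			retrans[key] = 0
		} else {
			retrans[key]++            # retrans is only ever used as an array
		}
	}
	END {
		for (key in retrans)          # unspecified order, so sort afterwards
			print key, retrans[key]
	}' | sort
}

printf '%s\n' '1.2.3.4 0.01 123 500' '1.2.3.4 0.44 123 500' \
	'1.2.3.4 0.48 123 500' '1.2.3.4 0.52 124 800' | count_retrans
```

With those four sample rows this should print "1.2.3.4 123 2" followed by "1.2.3.4 124 0". Because the arrays remember every key, this variant does not need sorted input; the trailing sort only makes the output order predictable.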

You didn't show what output you hope to produce from your sample input.

You talked about reporting the number of bytes transmitted, but there is nothing in your code that seems to try to capture or print that data. (And, the following script doesn't either.)

You seem to be trying to print the average time as a decimal integer and the number of retransmissions as a floating point value printed with one decimal place. (Neither of these makes any sense to me.)

So, making lots of wild guesses (ignoring the output your script seemed to be trying to produce), the following might help as a starting point for a script that will do what you want:

#!/bin/ksh
sort -k1,1 -k3,3n stats.txt | awk '
BEGIN {	printf("%17s %7s %s %s\n",
		"destIP", "seqNO", "AverageTime", "retransmissions")
	printf("----------------- ------- ----------- ---------------\n")
}
$1 != lIP || $3 != lSeqNo {
	if(NR != 1)
		printf("%17s %7d %11.3f %15d\n",
			lIP, lSeqNo, tTime / cnt, cnt - 1)
	if($1 == "destIP")
		exit
	lIP = $1
	tTime = $2
	lSeqNo = $3
	cnt = 1
	if(debug) printf("input: %s\nlIP=%s, lSeqNo=%d, tTime=%f, cnt=%d\n",
		$0, lIP, lSeqNo, tTime, cnt)
	next
}
{	tTime += $2
	cnt++
	if(debug)printf("input: %s\nlIP=%s, lSeqNo=%d, tTime=%f, cnt=%d\n",
		$0, lIP, lSeqNo, tTime, cnt)
}'

which, with the sample input you provided, produces the output:

           destIP   seqNO AverageTime retransmissions
----------------- ------- ----------- ---------------
          1.2.3.4     123       0.310               2
          1.2.3.4     124       0.620               1
          1.2.3.5      44       0.030               0
          1.2.3.5      45       0.830               0
          1.3.2.5      44       0.080               0

Actually there is:

awk 'BEGIN{ do print "Hello" ++i; while(i<10) }' 

Although it cannot be used like the OP uses it..

Thanks guys. Appreciate a lot. You've got some good questions Don and I realize I was fuzzy with the output and the description. So for the record I will describe a bit better here. The desired output would look like this:

destIP    avgRetransTime   maxRetransTime  noRetrans  noSuccPack  transBytes
-------- ----------------  --------------  ---------  ----------  ----------
1.2.3.4         0.335          0:47            3          2          1300
1.2.3.5         0.05           0:05            1          2          1580

So only one resulting line per destination IP with the following info:

  1. IP address.

  2. avgRetransTime: derived by finding the time difference between the first and last packet with same IP and sequence number and then divide that with number of seq numbers that are subject to retransmission. Example: For 1.2.3.4, there have been two seq numbers with retransmissions. For 123, the time between the last and first packet is 0:47 seconds (0:48-0:01). For 124 it's 0.2 seconds (0:72-0:52). So the average time is (0:47+0:2)/2=0.335.

  3. maxRetransTime: The sequence number that took longest time to retransmit. For 1.2.3.4 it's 123 which took 0:47 seconds.

  4. noRetrans: All retransmissions counted. For 1.2.3.4, packet 123 has been sent 3 times (2 retransmissions) and packet 124 has been sent 2 times (1 retransmission). So a total of 3.

  5. noSuccPack: The number of packets (per IP) that are considered delivered. For 1.2.3.4, both 123 and 124 are considered delivered unless the number of retransmissions for a single sequence number exceeds 5. Then the packet is considered "not delivered".

  6. transBytes: Each time a packet delivery is successful (counted as the last time the sequence number is seen if the sequence number is not repeated more than 5 times), this parameter is incremented with the number of Bytes.

Your code is very straightforward and useful. I think I will be able to adjust it to get what I need. Almost :-). What's missing is this:

The "loop" is repeated as long as the IP address and the seq No are the same. Given my desired output I want to sum up a few things, make some divisions and so on. Feels like I need a loop that knows if it's the last lap inside of the brackets so to say. "If this is the last time I see this combination of IP and seq No I should sum up things and divide etc.". That's why I tried to go for the do-while loop. Could you recommend how to approach this one?

Thanks!
/Z

Hi,
Have created some code now that I think would do the trick if I didn't get syntax errors. Really appreciate any help.

Cheers!
/Z

#!/bin/ksh
sort -k1,1 -k3,3n -k2,2 stats.txt | awk 
BEGIN {	printf("%17s %7d %d %d %d %d\n",
		"destIP", "avgRetransTime", "maxRetransTime", "noRetrans", "noSuccPack", "transBytes")
	printf("---------- --------------- ---------------- -------------- -------------- ---------------\n")

}

# If the IP address found is not in the list
$1 != lIP {
	maxOverallTime = 0
	tempIp = $1
	noSuccPackPerIp=0
	transBytesPerIp=0
	
	while (tempIp == $1){
			transBytesPerIp=0
					
		$3 != lSeqNo{
			minTime = maxTime = $2
			cnt = 0
			
			# check this
			transBytesForSeqNo = $4
		
			while($3 == lSeqNo) {
				maxTime = $2
				cnt++
				next
			}
			
			if ((maxTime-minTime)>maxOverallTime){
				maxOverallTime=(maxTime-minTime)
			}
			
			if (count<10){
				noSuccPackPerIp++
				transBytesForSeqNo=0
			}
			transBytesPerIp += transBytesForSeqNo
			lSeqNo = $3
		}
	}
	printf("%17s %7d %11.3d %d %d %15d\n", lIP, (maxTime-minTime)/cnt, maxOverallTime, cnt, noSuccPackPerIp, transBytesForSeqNo)
}'

This will become lengthy. Did you get any error message that would point you in some direction?

OK, let's start:

  • The first single quote after awk is missing; it should introduce the 'program text'
  • In the BEGIN section, you print 6 strings using 1 string but 5 integer format specifiers.
  • The 17s don't match the underlining dashes.
  • You don't modify/assign lIP, so $1 will rarely match.
  • $3 != lSeqNo: you can't use pattern syntax within an action block. Use if (...).
  • Not sure if it is wise to leave a while loop with a next statement, but there's no other way, either. (BTW, it will never be entered, as the entire block will be run only if $3 != lSeqNo.)

... and then I'm lost, even though it looks like the counts of opening and closing brackets match.
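
To make the if (...) and lIP points concrete, here is a small sketch of the usual "act when the key changes" structure. It is an illustration only, not the poster's script: the function name group_spans and the variables ip, seq, tFirst, tLast and seen are invented for the example, and it assumes the input is already sorted by IP, then sequence number, then time, with no header line:

```shell
#!/bin/sh
# group_spans is a hypothetical helper, not code from this thread.
# Input on stdin must already be sorted by IP, then seqNo, then time.
# Output: "destIP seqNo firstTime lastTime", one line per group.
group_spans() {
	awk '
	function emit() {                 # report the group that just ended
		if (seen)
			print ip, seq, tFirst, tLast
	}
	{
		if ($1 != ip || $3 != seq) {  # an if statement, not a nested pattern
			emit()
			ip = $1; seq = $3         # assign the "last seen" key
			tFirst = $2
			seen = 1
		}
		tLast = $2
	}
	END { emit() }                    # the final group ends with the input
	'
}

printf '%s\n' '1.2.3.4 0.01 123 500' '1.2.3.4 0.44 123 500' \
	'1.2.3.4 0.48 123 500' '1.2.3.4 0.52 124 800' | group_spans
```

For the four sample rows this should print "1.2.3.4 123 0.01 0.48" and "1.2.3.4 124 0.52 0.52". The "last lap" of a group is never detected directly; the group is wrapped up when the first row of the next group arrives, or in END for the last one.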

Hi RudiC,
Thanks a lot for your comments. I also find the overall syntax (missing starting '{' etc) a bit strange, but if you look at Don's example code further up it's also missing the initial '{' and worked excellent anyway.

The errors I get are of the same type:
awk: line X: syntax error at or near {

Have counted all brackets and it should be ok. Weird.. Maybe I try C instead.

Trust the error messages. In an editor, go to the respective line and analyse the code. The error may still originate in another line, but it's a good starting point.

Why don't you post the error messages?

Ok, so here is the input file I'm using now (sorting it beforehand):

1.2.3.4 0.01 123 500
1.2.3.4 0.44 123 500
1.2.3.4 0.48 123 500
1.2.3.4 0.52 124 800
1.2.3.5 0.03 44  1500
1.2.3.5 0.08 44  1500
1.2.3.5 0.83 45  80

And the code:

#!/bin/bash

awk '{ 

$1 != tempIp {
	maxOverallTime = 0
	tempIp = $1
	noSuccPackPerIp=0
	transBytesPerIp=0
	
	while (tempIp == $1){
			transBytesPerIp=0
					
		if($3 != lSeqNo)
		{
			minTime = maxTime = $2
			cnt = 0
			
		
			transBytesForSeqNo = $4

			
			while($3 == lSeqNo) {
				maxTime = $2
				cnt++
				next
			}
			
			if ((maxTime-minTime)>maxOverallTime){
				maxOverallTime=(maxTime-minTime)
			}
			
			if (count<10){
				noSuccPackPerIp++
				transBytesForSeqNo=0
			}
			transBytesPerIp += transBytesForSeqNo
			lSeqNo = $3
			
	
		}
	}
	#printf("%17s %7d %11.3f %f %f %15d\n", tempIp, (maxTime-minTime)/cnt, maxOverallTime, cnt, noSuccPackPerIp, transBytesForSeqNo)
}}' statsSortedX.txt

Have commented out all printouts to just see if I can get the core code to work. These are the error messages:

awk: line 3: syntax error at or near {
awk: line 42: syntax error at or near }

Thanks for looking at this.

/Z

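If you remove the outer pair of braces (the { right after awk ' and the last } before the closing quote), you'll have a syntactically correct awk script that will run. But it also has an infinite loop while processing the 1st line in your input file: nothing inside the while (tempIp == $1) loop reads a new record, so its condition never becomes false.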
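
A minimal sketch of that structure, for illustration only: the pattern { action } rules sit at the top level of the awk program, and awk itself moves on to the next record after each pass, so no explicit while loop over the input is needed. The function name rows_per_ip and the counter n are invented for this example, and sorted input without a header line is assumed:

```shell
#!/bin/sh
# rows_per_ip is a hypothetical helper, not code from this thread.
# Input on stdin must already be sorted by IP. Output: "destIP rowCount".
rows_per_ip() {
	awk '
	$1 != tempIp {                    # top-level rule: no enclosing braces
		if (NR > 1)
			print tempIp, n           # the previous IP is complete
		tempIp = $1
		n = 0
	}
	{ n++ }                           # runs once per record; awk then reads
	                                  # the next record by itself
	END { if (NR) print tempIp, n }
	'
}

printf '%s\n' '1.2.3.4 0.01 123 500' '1.2.3.4 0.44 123 500' \
	'1.2.3.4 0.48 123 500' '1.2.3.4 0.52 124 800' \
	'1.2.3.5 0.03 44  1500' '1.2.3.5 0.08 44  1500' \
	'1.2.3.5 0.83 45  80' | rows_per_ip
```

With the seven rows shown above this should print "1.2.3.4 4" and "1.2.3.5 3".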

I'm trying to get through your requirements in post #4, and am working on a script to meet those requirements, but I have some other things on my plate right now (so it may be a while before I can post something that works).

It would help if you can post a little more data (showing the results you're trying to get when you have an IP address with unsuccessful retransmissions).

And, please explain what the units are on the timestamps in the 2nd field in your input file. I was assuming that an entry like 0.87 was 87 one hundredths of a second, but you then put a colon in the output and talk about it being minutes and seconds. (But, if that was the case shouldn't the input have been shown as 1:27 instead of as 0.87 ???)

Hi Don,
Sounds fantastic, thanks! I removed the brackets and, as you said, it now runs and gets stuck in an infinite loop. I added an additional counter to stop the loop and I get some sort of printout, even though it looks quite messy. Will look at that tomorrow.

Here is a bigger input file (sorted) as an example. Note that I have sorted by IP, then sequence number and then time (one extra sort key compared to the example code I got from you). The time stamps are in hundredths of a second as you assumed, sorry for messing up with the colon.

1.2.3.4 0.01 123 500
1.2.3.4 0.44 123 500
1.2.3.4 0.48 123 500
1.2.3.4 0.52 124 800
1.2.3.4 1.00 125 200
1.2.3.4 1.02 125 200
1.2.3.4 1.08 125 200
1.2.3.4 1.11 125 200
1.2.3.4 1.22 125 200
1.2.3.4 1.40 125 200
1.2.3.4 1.55 126 550
1.2.3.4 1.60 127 400
1.2.3.4 1.70 127 400
1.2.3.4 1.75 128 355
1.2.3.5 0.03 44  1500
1.2.3.5 0.08 44  1500
1.2.3.5 0.83 45  80
1.2.3.5 0.88 45  80
1.2.3.5 0.92 45  80
1.2.3.5 0.96 45  80
1.2.3.5 0.97 45  80
1.2.3.5 0.99 45  80
1.2.3.5 1.03 45  80
1.2.3.5 1.14 46  200
1.2.3.5 1.19 47  480
1.2.3.5 1.20 48  800
1.2.3.5 1.30 48  800

This would result in the following output:

destIP    avgRetransTime   maxRetransTime  noRetrans  noSuccPack  transBytes
-------- ----------------  --------------  ---------  ----------  ----------
1.2.3.4         0.16           0.47            8          5          2605
1.2.3.5         0.07           0.20            8          4          2980

And here is how I derive the numbers per IP:

avgRetransTime:
1.2.3.4: ((0.48-0.01)+(0.52-0.52)+(1.40-1.00)+(1.55-1.55)+(1.70-1.60)+(1.75-1.75))/6 = 0.16
1.2.3.5: ((0.08-0.03)+(1.03-0.83)+(1.14-1.14)+(1.19-1.19)+(1.30-1.20))/5 = 0.07
maxRetransTime:
1.2.3.4: 0.47 vs 0.40 vs 0.10 => 0.47
1.2.3.5: 0.05 vs 0.20 vs 0.10 => 0.20
noRetrans:
1.2.3.4: 8 (seqNo 123 two times, seqNo 125 five times, seqNo 127 once)
1.2.3.5: 8 (seqNo 44 once, seqNo 45 six times, seqNo 48 once)
noSuccPack:
1.2.3.4: 5 (seqNo 125 retransmitted more than 5 times => unsuccessful)
1.2.3.5: 4 (seqNo 45 retransmitted more than 5 times => unsuccessful)
transBytes:
1.2.3.4: 500+800+550+400+355 = 2605 (seqNo 125 counted as 'not delivered')
1.2.3.5: 1500+200+480+800 = 2980 (seqNo 45 counted as 'not delivered')
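
The avgRetransTime arithmetic above can be double-checked with a throwaway awk one-off. This is just a sanity check of the two sums as written in this post; the function name avg_check is made up:

```shell
#!/bin/sh
# avg_check is a hypothetical name; the numbers are copied from the post.
avg_check() {
	awk 'BEGIN {
		# Per-seqNo time spans for 1.2.3.4, divided by its 6 sequence numbers.
		a = ((0.48-0.01) + (0.52-0.52) + (1.40-1.00) + (1.55-1.55) + (1.70-1.60) + (1.75-1.75)) / 6
		# Per-seqNo time spans for 1.2.3.5, divided by its 5 sequence numbers.
		b = ((0.08-0.03) + (1.03-0.83) + (1.14-1.14) + (1.19-1.19) + (1.30-1.20)) / 5
		printf("%.3f %.3f\n", a, b)
	}'
}

avg_check
```

This prints "0.162 0.070", so the 0.16 shown for 1.2.3.4 is 0.97/6 rounded to two decimals.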

Thanks!
/Z

This seems to do what you want, although it uses a slightly different output format:

#!/bin/ksh
sort -k1,1 -k3,3n -k2,2n stats.txt | awk '
BEGIN { # Perform script initialization steps here...
        # Print output file headers.
	printf("%16s %s %s\n", "", "---Retransmissions---", "Successful")
	printf("%16s %s %s %5s %-10s %s\n",
		"", "Average", "Maximum", "", "  Packet", "Transferred")
	printf("%-16s %-7s %-7s %s %-10s %s\n", " Destination IP", " Time",
	" Time", "Count", "  Count", "   Bytes")
	print "================ ======= ======= ===== ========== ==========="

	# Initialize any variables that should have initial values other than
	# zero or an empty string.
	# No variables need to be set here in this script.
}
function BeginSeqNo() {
	# Initialize values from the 1st record for a new sequence number.
	lastSeqNo = $3
	SeqNoCount = 0
	SeqNoPacketSize = $4
	SeqNoTimeStart = $2
}
function EndSeqNo() {
	# Perform calculations to save results from prior sequence number data
	# lines for this IP address.
	IPCount++
	RetranCount += SeqNoCount - 1
	RetranTime += (SeqNoTime = SeqNoTimeEnd - SeqNoTimeStart)
	if(SeqNoTime > RetranMaxTime)
		RetranMaxTime = SeqNoTime
	if(SeqNoCount <= 5) {
		SuccByteCount += SeqNoPacketSize
		SuccPacketCount++
	}
}
function PrintIP() {
	# Perform calculations and print results for the previous IP address.
	EndSeqNo()
	printf("%-16s %7.3f %7.2f %5d %10d %11d\n", lastIP,
		RetranTime / IPCount, RetranMaxTime, RetranCount,
		SuccPacketCount, SuccByteCount)
}
$1 != lastIP {
	# If there was a header in the input file, it will sort to the end.  If
	# we find the header, we are done reading data and exit.  With or
	# without a header, the END clause will print the results for the
	# final IP address in the input file.
	if($1 == "destIP")
		exit
	if(NR != 1) {
		# Wrap up calculations for last Sequence number in previous
		# IP and print results for previous IP.
		PrintIP()
	}
	# If we get to this point, this is not the header line, so it must be the
	# 1st record for a new IP address.  Gather data from this record to
	# initialize processing for a new IP address.
	lastIP = $1
	IPCount = RetranCount = RetranTime = RetranMaxTime = SuccByteCount = \
		SuccPacketCount = 0
	# And, initialize data for the 1st sequence number in this new IP
	# address...
	BeginSeqNo()
}
$3 != lastSeqNo {
	# This is the 1st packet in a new sequence number for the current IP
	# address; perform wrap up calculations for the previous sequence number
	# and initialize for the new sequence number.
	EndSeqNo()
	BeginSeqNo()
}
{	# Gather data from this line into the data for the current sequence number.
	SeqNoCount++
	SeqNoTimeEnd = $2
}
END {	if(NR) {# If we did not have an empty input file, wrap up calculations
		# and print results for the last IP address in the file.
		PrintIP()
	}
}'

With the sample input from post #11 in this thread, the above script produces the output:

                 ---Retransmissions--- Successful
                 Average Maximum         Packet   Transferred
 Destination IP   Time    Time   Count   Count       Bytes
================ ======= ======= ===== ========== ===========
1.2.3.4            0.162    0.47     8          5        2605
1.2.3.5            0.070    0.20     8          4        2980

which seems to match the results you requested in post #11.

Although written and tested using the Korn shell, I don't think there is anything in this script that is shell specific.

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

Thanks a lot Don. Works excellently (on a Raspberry Pi). Also very good comments, which will be helpful in my future attempts to write similar scripts. All help is much appreciated.

/Z