awk - find average interarrival times for each unique page

All,

I have a test file as shown below. The 1st column is <arrival time> and the 2nd column is <Page #>. I want to find the inter-arrival time of requests for each page # (I've done this part already). Once I have this, I want to calculate the average inter-arrival time. Note that I want the average inter-arrival time of the requests that arrive for each unique page. In other words, I don't want the average inter-arrival time over all of the requests in the trace regardless of page, because that would be trivial to do.

I know how to do the calculation but my problem is I'm not sure what the best way to store these would be. Before I calculate it, I probably need to store all of the inter-arrival times for each unique page first, then I can calculate the average. Or maybe someone knows of an easier way to do this. Here is my example.

My testfile.txt (the file is sorted by Page # (2nd col))

0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600

For the average inter-arrival time, I would just add all the interarrival times up for that page and then divide by [the number of requests for that page - 1]. It is minus one because it is the inter-arrival time between 2 requests.
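
A minimal sketch of one way to avoid storing every inter-arrival time: keep a running sum and a gap count per page, then divide in the END block. It assumes the input is already sorted by page and, within each page, by arrival time. The function name avg_interarrival and the array names are made up for illustration.

```shell
# Hypothetical helper: reads "<time> <page>" lines on stdin,
# prints "<page> <average inter-arrival time>" for each page.
avg_interarrival() {
  awk '
    NR > 1 && $2 == prevPage { sum[$2] += $1 - prevTS; gaps[$2]++ }  # same page: one more gap
    !($2 in seen)            { seen[$2] = 1; order[++n] = $2 }       # remember first-seen order
    { prevTS = $1; prevPage = $2 }
    END {
      for (k = 1; k <= n; k++) {
        p = order[k]
        # A page with a single request has no gaps; report 0 for it.
        print p, (gaps[p] ? sum[p] / gaps[p] : 0)
      }
    }'
}

printf '%s\n' '0.000 55588' '0.000 55592' '3.2320 55592' \
  '117.964 55596' '530.841 55596' '928.232 55596' | avg_interarrival
```

Only two numbers per page are kept, so memory grows with the number of distinct pages rather than with the number of requests.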

My desired output should be something like this:

<Page #> <Average inter-arrival time for each Page #>
55588 0
55592 3.232
55596 405.134
55600 194.089

Here is the code I have so far.

#!/bin/bash

cat testfile.txt | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }
print "prevTS="prevTS" prevPage="prevPage" currTS="currTS" currPage="currPage" intArriv="intArriv

prevTS=currTS
prevPage=currPage
} '

Thank you in advance for your help!
Jonathan

Not sure if this is what you are looking for...

#!/bin/bash
cat testfile.txt | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }
print "prevTS="prevTS" prevPage="prevPage" currTS="currTS" currPage="currPage" intArriv="intArriv

prevTS=currTS
prevPage=currPage

a[$2]=a[$2]+intArriv;
b[$2]++;
}
END{
for(i in a){div=b[i]-1;print "Average Inter-Arrival Time for "i"\t:\t"a[i]/(div?div:1)}
}'

regards,
Ahamed

Ahamed,

That definitely worked for the small sample file I posted! Thanks. However, I am running this on a very large file and for some reason I am getting negative numbers. I'm guessing it's because of very large values? Do I need to cast some of the variables to float or otherwise account for very large numbers?

Thanks again for your help!
Jonathan

Here is the complete testfile.txt that I am using. I have put it in my dropbox since it is about 18MB.
http://dl.dropbox.com/u/9867823/testfile.txt

I've also modified the script slightly. The updated script is below:

#!/bin/bash

FILE=$1

cat $FILE | sort -n -k2 | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }

prevTS=currTS
prevPage=currPage

a[$2]=a[$2]+intArriv;
b[$2]++
} 
END{
for(i in a){div=b[i]-1;print i"\t"a[i]/(div?div:1)
}
}
' > ${FILE}_interArrivalTimes
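
One possible cause of the negative values in the script above, offered as a guess rather than something the thread confirms: sort -n -k2 only orders by page, and rows with the same page then fall back to a plain string comparison of the whole line, where for example 117.964 sorts before 3.2320. A later row can then carry a smaller timestamp than the previous one. A sketch that sorts numerically by page and then by time before the awk stage (the function name is hypothetical):

```shell
# Hypothetical helper: sort "<time> <page>" lines numerically by page (field 2),
# then numerically by arrival time (field 1) within each page.
sort_by_page_then_time() {
  sort -k2,2n -k1,1n
}

printf '%s\n' '117.964 55596' '3.2320 55596' '0.000 55592' | sort_by_page_then_time
```

Limiting each key to a single field (-k2,2 and -k1,1) keeps the two comparisons independent of whatever follows on the line.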

For fixed-point output, use printf with %f in your END block instead of print. This slight modification should display the values as you wish, e.g.

END{
for(i in a){div=b[i]-1;printf "%s %f\n",i,a[i]/(div?div:1)}
}
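
As a small illustration of the difference: print converts non-integer numbers using OFMT, which defaults to %.6g, so large values are cut to six significant digits or shown in exponent form, while printf with %f keeps fixed-point notation. The value and the function name below are made up.

```shell
# Made-up value, only to compare the two output paths.
show_formats() {
  awk 'BEGIN {
    x = 1234567.891
    print x             # converted with the default OFMT (%.6g)
    printf "%f\n", x    # fixed-point notation with six decimals
  }'
}

show_formats
```

This only affects how values are displayed; it does not change the arithmetic itself.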

Try this,

sort -nk2 -nk1 testfile.txt | awk '{if($2 in a){diff=diff+$1-a[$2];a[$2]=$1;i++;b=$2;next}
else {
if(i>0) {--i;if(i==0){printf "%s %f\n",b,0}else{printf "%s %f\n",b,diff/i;}}
i=0;diff=0;a[$2]=$1;i++;b=$2
}
}
END {
 if(i>0) { --i;if(i==0){printf "%s %f\n",b,0}else{printf "%s %f\n",b,diff/i;}}
}'
echo '0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600' |awk '{v1=$2==v2?v1:$1;a[$2]=$1-v1;v2=$2;b[$2]++}END{for(i in a) print i,a[i]/(b[i]==1?1:b[i]-1)}'
55592 3.232
55600 194.089
55596 405.134
55588 0

Peasant,

I tried this but I'm still getting negative timestamps. Is the inter-arrival calculation happening correctly? It should be interArrivTime=currTime-prevTime (unless currTime is 0...in which case the ArrivTime for that line should just be 0).

---------- Post updated at 03:22 PM ---------- Previous update was at 03:19 PM ----------

Pravin27,

This looks like it's working perfectly! Thank you!

Jonathan

---------- Post updated at 03:32 PM ---------- Previous update was at 03:22 PM ----------

Thanks everybody for all your help on this...how much harder would it be to also add a 3rd column that gives me the standard deviation for the average inter arrival time for each page?

The formula for standard deviation is:
stand dev = square_root{ Summation[ (x - aveIntArrivTime)^2] / (N-1) }

where
x = the intArrivalTime for each page
aveIntArrivTime = the average InterArrivalTime for each page (which we now have)
N = the number of requests for each page

The formula is also shown here:
Simple Example of Calculating Standard Deviation

function sd(N,average,array,line2)
 {for(i in array) {split(i,x,":") 
  if(x[2]==line2) sum+=(x[1]-average)^2
  }
 s=N==1?"0":sqrt(sum/(N-1))
 return s
 sum=0}
{v1=$2==v2?v1:$1
a[$2]=$1-v1
v2=$2
b[$2]++
c[$1":"$2]
d[$2]+=$1}
END{
 for(i in b) {
  print i,a[i]/(b[i]==1?1:b[i]-1),sd(b[i],d[i]/b[i],c,i)|"sort"}
  }

yinyuemi,

How can I test this? Do I just run this on the same file I have? It looks like I have to pass it 4 parameters?

save the code as awk.script

cat file
0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600

awk -f awk.script file
55588 0 0
55592 3.232 2.28537
55596 405.134 515.903
55600 194.089 260.771

yinyuemi,

Maybe I didn't explain it clearly...

For this file:

cat file
0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600

The results should be like this:

awk -f awk.script file
55588 0 0
55592 3.232 0
55596 405.134 7.743
55600 194.089 155.207

For page 55588, there is no intArriTime since there is only 1 request for that page. So the aveIntArrTime and the stdDev are both 0.

For page 55592, there is only 1 intArriTime (3.232 - 0 = 3.232). So the aveIntArriTime is also 3.232. The stdDev is 0 since there is only 1 intArriTime.

For page 55596, there are 3 ArriTimes, which means there are 2 intArriTimes, which are 412.877 and 397.391, respectively.

So the aveIntArrTime is (412.877 + 397.391)/2 = 405.134.

And the stdDev for page 55596 is:
stdDev = square_root { [ (412.877-405.134)^2 + (397.391-405.134)^2 ] / 2 }
stdDev = 7.743

Similar logic follows for page 55600.

Hopefully this clears things up. Thanks for your time.
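
The worked example above can be sketched in a single pass. This is only an illustration under stated assumptions: the input is sorted by page and then by time, the squared deviations are divided by the number of inter-arrival times (requests minus one) as in the arithmetic above, and Welford's running update is swapped in so the individual gaps do not have to be stored. The function and variable names are hypothetical.

```shell
# Hypothetical helper: prints "<page> <mean gap> <std dev of gaps>" for each page.
interarrival_stats() {
  awk '
    function emit() {
      # m gaps were seen for this page; divide by m (= requests - 1).
      if (started) printf "%s %.3f %.3f\n", page, mean, (m ? sqrt(M2 / m) : 0)
    }
    !started || $2 != page {          # first row, or a new page begins
      emit()
      page = $2; m = 0; mean = 0; M2 = 0; started = 1
      prevTS = $1
      next
    }
    {
      x = $1 - prevTS                 # inter-arrival time for this row
      m++
      delta = x - mean
      mean += delta / m               # running mean (Welford update)
      M2 += delta * (x - mean)        # running sum of squared deviations
      prevTS = $1
    }
    END { emit() }'
}

printf '%s\n' '0.000 55588' '0.000 55592' '3.2320 55592' \
  '117.964 55596' '530.841 55596' '928.232 55596' \
  '117.964 55600' '530.841 55600' '630.789 55600' '700.232 55600' | interarrival_stats
```

Because each page is printed as soon as the next one starts, the output follows the input order and no trailing sort is needed.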

function sd(N,average,array,line2)
 {for(i in array) {split(i,x,":") 
  if(x[2]==line2) sum[line2]+=(x[1]-average)^2
  }
 s=N<=2?"0":sqrt(sum[line2]/(N-1))
 return s
 }
{v1=$2==v2?v1:$1
a[$2]=$1-v1
b[$2]++
intr=$1-v3;
$2==v2?c[intr":"$2":"++p]:intr=0
v2=$2
v3=$1
}
END{for(i in b) {
  avr=a[i]/(b[i]<=1?1:b[i]-1);print i,avr,sd(b[i],avr,c,i)|"sort"}
  }