Generate Regex numeric range with specific sub-ranges

varu0612 · March 17, 2013, 5:37am

Hi all,

Say I have a range like 0 - 1000 and I need to split the lines that fall within specific fixed sub-ranges into different files. I can do this manually, but that does not scale if the range increases.

E.g

cat file1.txt
Response time 2 ms
Response time 15 ms
Response time 101 ms
Response time 279 ms
etc

What I currently do is create an array of patterns and then grep for each one in a loop:

bucketLimits=( 
 # 100 <> 150, 150 <> 200, 200 <> 250, 250 <> 300, 300 <> 350, 350 <> 400, 400 <> 450, 450 <> 500 
 '[1][0-4][0-9]' '[1][5-9][0-9]' '[2][0-4][0-9]' '[2][5-9][0-9]' '[3][0-4][0-9]' '[3][5-9][0-9]' '[4][0-4][0-9]' '[4][5-9][0-9]'
 # 500 <> 550, 550 <> 600, 600 <> 650, 650 <> 700, 700 <> 750, 750 <> 800, 800 <> 850, 850 <> 900, 900 <> 950, 950 <> 1000
        '[5][0-4][0-9]' '[5][5-9][0-9]' '[6][0-4][0-9]' '[6][5-9][0-9]' '[7][0-4][0-9]' '[7][5-9][0-9]' '[8][0-4][0-9]' '[8][5-9][0-9]' '[9][0-4][0-9]' '[9][5-9][0-9]' 
 )
 
  finalResult=""
  for limit in "${bucketLimits[@]}"
  do
    # count the lines whose response time matches this bucket's pattern
    result=$(grep "Response" file1.txt | grep -cE "time ${limit} ms")
    finalResult="${finalResult},${result}"
  done
  echo "$finalResult" >> ./stats_results.csv

Any idea how I can auto-generate the bucketLimits array from a given sub-range size? It could be a sub-range of 10 or, as it is now, 50.

Thx!

balajesuri · March 17, 2013, 7:58am

varu0612:

What i currently do is create an array and then grep for it in a loop

for bucketLimit in ${bucketLimits[@]}
  do
   limit=${bucketLimits[$index]}
   result=`grep "Response" | grep -P "CMDC=${limit} ms" | wc -l` 
   finalResult=$finalResult","$result
   index=$(( $index + 1 ))
   fi
  done

Not sure how this works for you.
What input does the grep statement search?
There is a "fi" without an "if". Are we missing a few lines of code here?

varu0612 · March 17, 2013, 8:30am

I made the correction in my sample code; see the initial post above.

Thx

balajesuri · March 17, 2013, 10:11am

Here's a solution using a bit of elementary mathematics instead of regular expressions.

And I've assumed that file.txt contains only lines such as "Response time <time> ms"

#! /bin/bash

i=0 # initial
r=50 # range
f=1000 # final

while [ $i -le $f ]
do
   final[$(( $i / $r ))]=0
   i=$(( $i + $r ))
done

while read a b time unit
do
    index=$(( $time / $r ))
    final[$index]=$(( ${final[$index]} + 1 ))
done < file.txt

x=${final[@]}
echo ${x// /,} >> stats_results.csv

alister · March 17, 2013, 2:04pm

Here's an AWK approach which uses an array of buckets, b, with n buckets of size s.

awk '/^Response time/ {++b[int($3/s)]} END {for(i=0; i<n; i++) print b[i]+0}' n=10 s=100 file

It assumes that the first bucket spans 0 to s-1. It could easily be modified to accept an initial starting point other than 0, but I'll leave that as an exercise. Also, values beyond the valid bucket ranges are ignored, though this too can be easily changed.

The output is one line per bucket, but paste -sd, - trivially converts it to the OP's comma-delimited format.
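
As a rough sketch of how those two pieces fit together, the snippet below counts fixed-size buckets in awk and joins the per-bucket lines with paste. The sample data, the temporary file, the variable name out, and the choice of n=3 and s=100 are assumptions made for illustration only.

```shell
# Sample input (assumed; same line format as the OP's file1.txt)
tmp=$(mktemp)
printf 'Response time %s ms\n' 2 15 101 279 > "$tmp"

# n buckets of size s; bucket i covers i*s .. (i+1)*s-1.
# Values at or beyond n*s are counted in b[] but never printed.
out=$(awk '/^Response time/ { ++b[int($3/s)] }
           END { for (i = 0; i < n; i++) print b[i]+0 }' n=3 s=100 "$tmp" |
      paste -sd, -)

echo "$out"
rm -f "$tmp"
```

With this sample, 2 and 15 land in bucket 0, 101 in bucket 1, and 279 in bucket 2.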

Regards,
Alister

varu0612 · March 17, 2013, 3:41pm

Alister,

Your method is a very tidy/ nice one (balajesuri yours works ok as well, so thank you!).

Two more questions:

a) How can I add a header like this one, which should take into account the n buckets of size s?

Buckets,0-5ms,5-10ms,10-20ms,20-30ms,30-40ms,40-50ms,50-60ms,60-70ms,70-80ms,80-90ms,90-100ms,100-150ms,150-200ms,> 200ms

b) If the values are beyond the valid range, how can I count them under >200ms, for example?

Many thanks,

Chubler_XL · March 17, 2013, 5:40pm

How about this: we specify the upper limit for each bucket, with an automatically implied bucket for everything greater:

$ awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
 BEGIN {n=split(buckets,B,",");B[n]=x};
 /^Response time/{for(i=1;B[i]&&($3>B[i]);i++);v[i]++}
 END{for(i=0;i<=n;i++) $i=v[i]+0; print}' OFS=, file1.txt >> stats_results.csv

RudiC · March 17, 2013, 5:52pm

alister's proposal assumes a fixed bucket size (in this case 100 ms per bucket) and a fixed number of buckets, 10. Your header does not (5 ms, 5 ms, 10 ms, 8 x 10 ms, 50 ms, 50 ms, infinity) and is thus incompatible with that nice, simple, linear solution. You would need to pass the buckets to awk explicitly; then it would also be easy both to print the header and to check for "out of range".

EDIT: Chubler_XL just beat me to it; his proposal comes close to what I had in mind. It just doesn't put the 279 ms from the sample file into the right bin.

EDIT 2: massaging Chubler_XL's proposal slightly, this might be acceptable to the requestor:

awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
         BEGIN                  {n=split(buckets,B,",");B[n+1]=">"B[n]};
         /^Response time/       {for(i=1;B[i]&&($3>B[i]);i++);v[i]++}
         END                    {for (i=1; i<=n+1; i++) printf "%3sms,", B[i]
                                 printf "\n"
                                 for (i=1; i<=n+1; i++) printf "%3d  ,", v[i]
                                 printf "\n"
                                }
        ' OFS=, file
  5ms, 10ms, 20ms, 30ms, 40ms, 50ms, 60ms, 70ms, 80ms, 90ms,100ms,150ms,200ms,>200ms,
  1  ,  0  ,  1  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  1  ,  0  ,  1  ,

varu0612 · March 17, 2013, 7:05pm

rudic:

The problem with your/Chubler_XL's suggestion is that I'll have to define the upper limit of every bucket, and that is the main reason I'm moving away from my current solution: for a range of 0 - 1000 with a bucket size of 10 ms, it would take me ages to define them all.

Alister's solution is very simple: I only have to define 2 values.

With regard to the header, I only gave an example, but as I said, to keep the solution nice and tidy, the header should be generated based on the n/s values.

Cheers

alister · March 17, 2013, 8:04pm

If $3>B[i] works with your awk implementation (I know it works with at least some mawk versions, if not all) then it's because it's violating POSIX. That should be performing a string comparison for all iterations of the loop, even when both B[i] and $3 are numeric strings. A compliant implementation can yield an incorrect result (such as when "200" is treated as greater than "10").

From http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html:

Except for the final value in B, every member of B that results from split() is a numeric string. Every "number" assigned from the input data to a field variable (such as $3) is also a numeric string. Note that the case of comparing a numeric string with a numeric string should be handled as a string comparison; at least one operand should be numeric for a numeric comparison to occur (which means "casting" with +0, or using the result of a function that returns a number, or using a numeric literal).

Another issue is that the terminating condition is locale dependent. The only reason the loop terminates is because a string comparison is used to compare the value of $3 against ">200" (in this instance). If a locale-aware implementation were run under a locale that did not place the ">" after all of the digits, an infinite loop would result upon encountering a value that should land in the last bucket.
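
One way to sidestep both concerns, sketched here under assumptions rather than offered as the definitive fix: cast both operands with +0 so the comparison is numeric in any awk, and end the loop on the index instead of on a ">200" sentinel string. The sample data, the temporary file, and the names t and out are made up for this illustration.

```shell
# Sample input (assumed; the four values from the OP's first post)
tmp=$(mktemp)
printf 'Response time %s ms\n' 2 15 101 279 > "$tmp"

out=$(awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
BEGIN { n = split(buckets, B, ",") }
/^Response time/ {
    t = $3 + 0                         # +0 forces a number
    i = 1
    # The index bound ends the loop, so no string comparison is ever needed
    while (i <= n && t > B[i] + 0)
        i++
    v[i]++                             # i == n+1 means "above the last limit"
}
END {
    for (i = 1; i <= n + 1; i++)
        printf "%s%s", v[i] + 0, (i <= n ? "," : "\n")
}' "$tmp")

echo "$out"
rm -f "$tmp"
```

Because the overflow bucket is simply index n+1, no locale-sensitive comparison against a non-numeric string is involved.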

Regards,
Alister

---------- Post updated at 08:04 PM ---------- Previous update was at 07:18 PM ----------

awk -v n=10 -v s=100 '
/^Response time/ { ++b[((i=int($3/s)) > n) ? n : i] }
END {
    for (i=0; i<n; i++) printf "%s-%s,", (i*s), ((i+1)*s-1)
    print ">" (n*s)
    for (i=0; i<n; i++) printf "%s,", (b[i]+0)
    print b[i]+0
}' file

Regards,
Alister

varu0612 · March 18, 2013, 7:53am

Alister,

What does the code in bold mean?

In case this helps others, the code and output are below.

Input file

cat file1.txt
Response time 2 ms
Response time 15 ms
Response time 17 ms
Response time 50 ms
Response time 45 ms
Response time 80 ms
Response time 89 ms
Response time 50 ms
Response time 53 ms
Response time 58 ms
Response time 57 ms
Response time 56 ms
Response time 98 ms
Response time 99 ms
Response time 100 ms
Response time 102 ms
Response time 110 ms

awk -v n=10 -v s=10 '
/^Response time/ { ++b[((i=int($3/s)) > n) ? n : i] }
END { printf "Buckets, "
for (i=0; i<n; i++) printf "%s-%s,", (i*s), ((i+1)*s-1)
print ">" (n*s)
for (i=0; i<n; i++) printf "%s,", (b[i]+0)
print b[i]+0
}' file1.txt

Output/ result

Buckets, 0-9,10-19,20-29,30-39,40-49,50-59,60-69,70-79,80-89,90-99,>100
1,2,0,0,1,6,0,0,2,2,3

Just for my own knowledge: should I understand that this is very hard to implement using regular expressions? Has anyone done it?

Cheers

alister · March 18, 2013, 12:08pm

It's part of the ternary operator, e1 ? e2 : e3 , which involves three expressions, e1, e2, and e3. If the first expression, e1, evaluates to true, then the result is e2. If e1 is instead false, return e3.

In the quoted code fragment:
e1: (i=int($3/s)) > n
e2: n
e3: i

e1 calculates the bucket index to which $3 belongs, stores that value in i, and then compares the value of the assignment (which is the value stored in i) to n. If i is greater than n, which would indicate a bucket beyond the final bucket, then e1 is true and the result is e2, which is n. This is the logic which folds all values that would fall into a bucket beyond the final bucket into that final bucket. If, however, i is not greater than n, then e1 is false, i is a valid bucket index, and the ternary operator returns e3 (i).

I don't recommend this type of coding, as it's difficult to decipher. Even an expert programmer has to give it a close look to be certain of what's going on. My only defense is that it makes it more fun for me to contribute here, as I attempt to be as concise as possible. A possible beneficial side effect is that it may help others learn more about the language in question.

A much more readable, maintainable, and professional version:

i = int($3/s)
if (i > n)
    i = n
b[i] = b[i] + 1
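
For anyone who wants to try the explicit if form end to end, here is a small self-contained sketch. The sample values (including 1500, chosen to exercise the overflow case), the temporary file, the variable name out, and n=3, s=100 are assumptions for illustration.

```shell
# Sample input (assumed); 1500 lies well beyond the last regular bucket
tmp=$(mktemp)
printf 'Response time %s ms\n' 2 15 101 279 1500 > "$tmp"

out=$(awk -v n=3 -v s=100 '
/^Response time/ {
    i = int($3 / s)        # bucket index for this value
    if (i > n)
        i = n              # fold anything past the end into the last bucket
    b[i] = b[i] + 1
}
END {
    for (i = 0; i < n; i++) printf "%s,", b[i] + 0
    print b[n] + 0         # overflow bucket
}' "$tmp")

echo "$out"
rm -f "$tmp"
```

Here 2 and 15 fall in bucket 0, 101 in bucket 1, 279 in bucket 2, and 1500 is folded into bucket n.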

Regards,
Alister

varu0612 · March 18, 2013, 2:07pm

The fact that you took the time to explain in detail how it works, so that even a five-year-old could understand it, is very much appreciated.

I've seen many smart users reply with solutions but fail to explain the logic ... in my view that is a useless answer, since it doesn't help the requester understand or learn how it works.

All the best!!