Sum value from selected lines script (awk,perl)

paolfili · October 10, 2009, 8:02am

Hello.
I face this (2 side) problem.

Some lines with this structure.

...........
12345678 4
12345989 13
12346356 205
12346644 74
12346819 22
.........

The first field (timestamp) is growing (or at least equal).

1)Sum the second fields if the first_field/500 are equals.
2)Sum the second fields if the difference between first fields is less than 500.
(sliding window)

In the example presented.

1) Becouse 12345678/500 and 12345989/500 both result 24691 sum=4+13
We cannot group the 3rd line so sum=205
And we group the 4th and 5th line so sum=74+22

2) We group the 1st and 2nd line becouse 12345989 - 12345678 < 500
For analogy we group the 2nd and 3th, the 3rd and 4th,
and the 3rd,4th and 5th becouse 12346819 (of the 5th line) - 12346356 (of the 3th line) < 500

Is there any (perl,awk,etc...) way to do it?

Thanks

Paolo

thegeek · October 10, 2009, 9:04am

In Perl it is very simple to do.

But before i work for you, i would want to know what you have tried so far ?!

You should try, and ask for clarifications/advices if you have some difficulties -- which is always good to learn.

paolfili · October 10, 2009, 10:45am

I know little awk and some elements of perl.

awk '{if ($1/500 > last_time_frame) { sum = $2 } else { sum+=$2;print sum };last_time_frame=$1/500;print sum}' AAAA.txt

No way

durden_tyler · October 10, 2009, 4:32pm

Sorry, but the problem is not clear enough.

What's the length of the sliding window ?

Is it just 2 (1st & 2nd, 2nd & 3rd, 3rd & 4th, ...) ?
Or is it 3 (1st, 2nd & 3rd; 2nd, 3rd & 4th; ...) ?

Hopefully, it's not a cartesian product, i.e.

1st vs. (2nd, 3rd, 4th, ... , last_row)
2nd vs. (1st, 3rd, 4th, ... , last_row)
3rd vs. (1st, 2nd, 4th, ... , last_row)
...
last_row vs. (1st, 2nd, 3rd, ..., last-1_row)

Ok, and what do you want to do with the sum ?
Do you want to display it ? Or do nothing with it (highly unlikely) ?
If you want to display it, then how ? The total against each row ? Or the total against the first row only ? Or against the second row only ?

This begs the first counter-question. Why compare the 3rd, 4th and 5th (considering that you have been comparing two-at-a-time all this while) ?
So again, what's the length of the sliding window ?

I guess a very simple example of your input file should help here. So, let's say your input file is as follows:

What do you want your output to look like ?

tyler_durden

paolfili · October 10, 2009, 4:51pm

???

durden_tyler:

Sorry, but the problem is not clear enough.

What's the length of the sliding window ?

Is it just 2 (1st & 2nd, 2nd & 3rd, 3rd & 4th, ...) ?

Or is it 3 (1st, 2nd & 3rd; 2nd, 3rd & 4th; ...) ?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
The lenght is whatever the data requires 1,1000,1000000 of data samples
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Hopefully, it's not a cartesian product, i.e.

1st vs. (2nd, 3rd, 4th, ... , last_row)
2nd vs. (1st, 3rd, 4th, ... , last_row)
3rd vs. (1st, 2nd, 4th, ... , last_row)
...
last_row vs. (1st, 2nd, 3rd, ..., last-1_row)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
No cartesian product
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Ok, and what do you want to do with the sum ?

Do you want to display it ? Or do nothing with it (highly unlikely) ?

If you want to display it, then how ? The total against each row ? Or the total against the first row only ? Or against the second row only ?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Only print the value.
For the other work I' m on my own.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

This begs the first counter-question. Why compare the 3rd, 4th and 5th (considering that you have been comparing two-at-a-time all this while) ?
So again, what's the length of the sliding window ?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
TimeFrame concept.
Events in a Time Frame (ex.500 microseconds)
I need to sum events in a :
1)STATIC time frame evironment.
2)DYNAMIC time frame environment.(what in Digital Signal Processing area is defined as Sliding Windows)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I guess a very simple example of your input file should help here. So, let's say your input file is as follows:
$
$ cat f1
100 1
200 2
300 3
400 4
500 5
600 6
700 7
$
What do you want your output to look like ?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1 case)
sum=1+2+3+4
sum=5+6+7

2 case)
sum=1+2+3+4
sum=2+3+4+5+6
sum=3+4+5+6+7
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

tyler_durden

>>>>>>>>>>>>>>

Paolo

durden_tyler · October 11, 2009, 12:09am

Thanks for the clarification.

$ 
$ cat f1
100 1   
200 2   
300 3   
400 4   
500 5   
600 6   
700 7   
$       
$ # Case 1
$ ##
$ perl -lane 'chomp;
>             if (int($F[0]/500) != $prev){print "Sum=$s"; $s = $F[1]}
>             else {$s += $F[1]}                                      
>             $prev = int($F[0]/500);                                 
>             END {print "Sum=$s"}' f1                                
Sum=10                                                                
Sum=18                                                                
$                                                                     
$                                                                     
$ # Case 2                                                            
$ ##
$ perl -lne 'chomp; push @x,$_;
>            END {
>              for($i=0; $i<=$#x; $i++){
>                ($x1,$x2) = split/ /,$x[$i];
>                $s = $x2;
>                for ($j=$i+1; $j<=$#x; $j++) {
>                  ($y1,$y2) = split/ /,$x[$j];
>                  if ($y1 - $x1 < 500) {$s += $y2}
>                  else {last}
>                }
>                print "Sum=$s";
>              }
>            }' f1
Sum=15
Sum=20
Sum=25
Sum=22
Sum=18
Sum=13
Sum=7
$
$

Scrutinizer · October 11, 2009, 6:21am

Using awk:

Case1:

awk '{ sum1[int($1/500)]+=$2 } END { for (i in sum1) print "Sum1 "sum1 } ' infile

Case2:

awk 'BEGIN{
       min=1
     }
     { time[NR]=$1
       val[NR]=sum2[NR]=$2
       i=min
       while (time[NR]-time>=500)
         i++
       min=i
       for (i=min;i<NR;i++)
         sum2[NR]+=val
     }
     END {
       for (i in sum2)
         print "Sum2: "sum2
     }' infile

Case1+2 combined:

awk 'BEGIN{
       min=1
     }
     { sum1[int($1/500)]+=$2
       time[NR]=$1
       val[NR]=sum2[NR]=$2
       i=min
       while (time[NR]-time>=500)
         i++
       min=i
       for (i=min;i<NR;i++)
         sum2[NR]+=val
     }
     END {
       for (i in sum1)
         print "Sum1 "sum1
       print ""
       for (i in sum2)
         print "Sum2: "sum2
     }' infile

Original testset:

Sum1 17
Sum1 205
Sum1 96

Sum2: 4
Sum2: 17
Sum2: 218
Sum2: 279
Sum2: 301

Additional testset:

Sum1 10
Sum1 18

Sum2: 1
Sum2: 3
Sum2: 6
Sum2: 10
Sum2: 15
Sum2: 20
Sum2: 25

paolfili · October 12, 2009, 2:04pm

First of all THANK YOU VERY MUCH to durden_tyler and to Scrutinizer.
I need some time to elaborate(and study) your examples, and in some
days (if you want) I' ll came back with some other questions (but I realise
I need to REALLY study some manuals on awk/Perl) to obtain the
data filtering objects on which I'm working now.
Thanks again.

scrutinizer:

Using awk:

Case1:

awk '{ sum1[int($1/500)]+=$2 } END { for (i in sum1) print "Sum1 "sum1 } ' infile

Case2:

awk 'BEGIN{
   min=1
   }
   { time[NR]=$1
   val[NR]=sum2[NR]=$2
   i=min
   while (time[NR]-time>=500)
   i++
   min=i
   for (i=min;i<NR;i++)
   sum2[NR]+=val
   }
   END {
   for (i in sum2)
   print "Sum2: "sum2
   }' infile

Case1+2 combined:

awk 'BEGIN{
   min=1
   }
   { sum1[int($1/500)]+=$2
   time[NR]=$1
   val[NR]=sum2[NR]=$2
   i=min
   while (time[NR]-time>=500)
   i++
   min=i
   for (i=min;i<NR;i++)
   sum2[NR]+=val
   }
   END {
   for (i in sum1)
   print "Sum1 "sum1
   print ""
   for (i in sum2)
   print "Sum2: "sum2
   }' infile

Original testset:

Sum1 17
Sum1 205
Sum1 96

Sum2: 4
Sum2: 17
Sum2: 218
Sum2: 279
Sum2: 301

Additional testset:

Sum1 10
Sum1 18

Sum2: 1
Sum2: 3
Sum2: 6
Sum2: 10
Sum2: 15
Sum2: 20
Sum2: 25

---------- Post updated 10-12-09 at 04:41 PM ---------- Previous update was 10-11-09 at 09:01 PM ----------

---------- Post updated at 08:04 PM ---------- Previous update was at 08:02 PM ----------

$ gunzip -c ES_05_10_Oct2009.gz |sed -e 's/BEST /BEST/g' -e 's/[:.]//g'|awk '{ sum1[int($1/500)]+=$4 } END { for (i in sum1) print "i="i"Su
m1 "sum1 [i]}'|tee outfile.txt
awk: (FILENAME=- FNR=4094608) fatal: format_tree: obuf: can't allocate 512 bytes of memory (Cannot allocate memory)

Is there a way to do the sum step by step avoiding an array structure?
(problem with very big data structures)
Thanks

Paolo

[COLOR="\#738fbf"]

---------- Post updated at 08:01 PM ---------- Previous update was at 04:41 PM ----------

Scrutinizer · October 12, 2009, 5:31pm

Hi paolfili,

Sure:

awk '{ if ( int($1/500) > curtime ) {
         if ( NR>1 ) {  print "Sum1: "cursum }
         curtime=int($1/500)
         cursum=0
       }
       cursum+=$2
     }
     END { print "Sum1: "cursum
     }' infile

one line:

awk '{t=int($1/500); if(t>rt) {if (NR>1) {print "Sum: "s}; rt=t; s=0} s+=$2} END {print "Sum: "s}' infile