Perl- Finding average "frequency" of occurrence of duplicate lines

Hello,

I am working on a Perl script that tries to find the average "frequency" with which lines are duplicated. So far I've only managed to count how many times each line is repeated; the code is as follows:

perl -e'
my $filename = $ENV{i};
open (my $fh, "<", $filename) or die "$filename: $!";

my %seen;

while (my $line = <$fh>) {
  my @fields  = split /\s+/, $line;
  my @fields2 = @fields[3..16];
  my $niin    = join "\t", @fields2;
  $seen{$niin}++;
}
close $fh;

foreach my $key (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
  print "$key = $seen{$key}\n";
}
'

Which produces this type of output:

225    1    225    2    225    3    225    4    225    5    225    6    225    7 = 31789
225    10    225    11    225    12    225    13    225    14    225    15    225    0 = 31772
225    8    225    9    225    10    225    11    225    12    225    13    225    14 = 31714
225    3    225    4    225    5    225    6    225    7    225    8    225    9 = 31686

Now, what I want to do is work out, on average, "every how many lines a certain line is repeated". So I was wondering if it's possible to keep some sort of record and then, at the end, just calculate the average?

I actually have another way to calculate this frequency. In the original file being read, the first field is a Unix timestamp (which I "cut out" for the counting of the duplicate lines). So I thought it might also be possible to keep a record of the "time between repetitions" and then take an average at the end. Of course, this would mean keeping a record for each duplicate line, which seems like a rather intricate operation. An example of the lines is:

1301892853.870    1316    efc0696e        225    1    225    2    225    3    225    4    225    5    225    6    225    7

The first field is the Unix timestamp. The first, second and third fields are ignored for the comparison of duplicate lines.

Any help is deeply appreciated.

---------- Post updated 08-09-11 at 07:49 AM ---------- Previous update was 08-08-11 at 08:40 AM ----------

Is this really not achievable the way I asked for in Perl? Is there any other way to do it? Any ideas please? :frowning:

Thanks again...

Do you want something like this?

% cat INPUTFILE
a
b
a
c
b
a
a
d
d
% perl -lne '
  $seen{$_}++;
  END {
    for $key (sort keys %seen) {
      printf "%s %.2f%%\n", $key, $seen{$key}/$. * 100;
    }
}' INPUTFILE
a 44.44%
b 22.22%
c 11.11%
d 22.22%

I'm sorry I think I wasn't clear enough.

I'd like the average of "every how many lines a certain line is repeated". So say that the line

a b c d e 

is repeated first every 2 lines, then the next time it appears after 10 lines, then 2 again, then 4, etc etc.

Is it possible to keep a record of this and make an average? For each duplicate line, of course.

Anyway, if I'm still not being clear enough, please do ask.

Thanks!

I had proposed using the first field in my file to keep a record of time (since it's a Unix timestamp), i.e. to find the "inter-occurrence" time instead of the "every how many lines" record, but I don't know if this would be more complicated.

I believe it is possible. But I'm not sure I understand the task (sorry, English is not my native language). Please give examples of your input and the desired output. Maybe it would be enough if you gave the desired output for my INPUTFILE:
All lines: 9
Lines between a: 1, 2, 0 (or maybe you need to remember line numbers - 1, 3, 6, 7?) so what output?
b: 2 - ?
c: ? (only one occurrence) - ?
d: 0 - ?

Thanks for your reply.
Yeah what I want is something like what you said. So, for your example input file, the output would be:

a- 4 2 
b- 2 3
c- 1 0
d- 2 1

The first field is the contents of the repeated line, the second field is the number of times it is found in the file, and the third field is the average of "every how many lines it is repeated". So, for example, 'a' first reappears after 2 lines, then after 3 lines, then after 1 line; the average of these is 2 lines. For 'b' and 'd', since they are only duplicated once, there is no need to average. And since 'c' is never repeated, the average is just '0' (or could be blank, it doesn't matter).

On the other hand, how about keeping track of the timestamp and subtracting to get the "time between repetitions", then averaging? That was my original idea, but I don't know how to keep track of this time for each repeated line. The output in this case would be something like:

a- 4 0.05
b- 2 0.89
c- 1 0
d- 2 0.06

the last field being the seconds.

Thanks!

OK. Is this algorithm right (there is a 1-second difference between lines)?

cat INPUTFILE 
1301892853.870 a
1301892854.870 b
1301892855.870 a
1301892856.870 c
1301892857.870 b
1301892858.870 a
1301892859.870 a
1301892860.870 d
1301892861.870 d
 
perl -ane '
  push @{$seen{$F[1]}}, $F[0];
  END {
    for $key (sort keys %seen) {
      @ts = @{$seen{$key}};
      $n = @ts;      
      $prev = $ts[0];
      $nt = 0;
      print "$key $n ";
      for $time (@ts) {
        $nt += $time - $prev;
      }
      print $nt/$n, "\n";
    }
  }
' INPUTFILE
a 4 3.25
b 2 1.5
c 1 0
d 2 0.5

I just tried the algorithm and it works for the example input file but for my actual file, there are a couple of problems.

The input lines in my original file are of the form:

1301892853.870    1316    efc0696e        225    1    225    2    225    3    225    4    225    5    225    6    225    7

So for the comparison of duplicates I want to ignore the fields 0, 1 and 2.
How can I adjust your code to this?

I tried changing this part, which only uses a single field as the key:

 push @{$seen{$F[1]}}, $F[0];

then I changed it to

push @{$seen{$F[3..16]}}, $F[0];

but it doesn't seem to work and, well, I don't think I quite get what the code does. Could you please explain? :o Thanks!

Just change to

push @{$seen{"@F[3..16]"}}, $F[0];

It collapses the whitespace between those fields to single spaces, because the slice is interpolated with $" (the list separator), which is a space by default. If you want to keep the original spacing, you need to use substr() on $_ instead. If you want another output separator, change it in the END block before any print, like this: $\ = "\t"

The "-a" switch splits every input line into the @F array. Then we push the first field (the timestamp) onto an anonymous array, and that array is stored in the hash %seen, where the key is the joined array slice. In the END block, for every unique key, we count how many timestamps there are, total the seconds between them, and compute the average.

---

Sorry, I was wrong about the output separator. If you want to change it, you need to set the $" variable in a BEGIN block.

perl -ane '
  BEGIN {
    $"="\t";
  }
  push @{$seen{"@F[3..16]"}}, $F[0];
  END {
    for $key (sort keys %seen) {
        @ts = @{$seen{$key}};
        $n = @ts;
        $prev = $ts[0];
        $nt = 0;
        print "$key $n ";
      for $time (@ts) {
        $nt += $time - $prev;
      }
      print $nt/$n, "\n";
    }
}' INPUTFILE

There is another small problem I found. The record it keeps is static: it should count the seconds since the LAST appearance, but what it's doing right now is counting the seconds since the FIRST appearance every time. In your example, this makes the seconds since the first 'a' be 2, then 5, then 6, which gives an average of 3.25; the real average should be taken over 2, 3 and 1 (which would give a 1.5 avg).

Change to:

      for $time (@ts) {
        $nt += $time - $prev;
        $prev = $time;
      }

Thanks for all your help yazu!! :b:

Is it possible to do it the other way (keep track of the number of lines between repetitions and then make an avg)?

-----------------
I guess I only have to replace the timestamps with the current input line number in the code, in order to get the average in lines :slight_smile:

So then it becomes:

push @{$seen{"@F[3..16]"}}, $.;

Or so I think! :o