Counting the differences based on a specific rule

labrazil · September 7, 2008, 5:46am

Hi,
I've been trying to write a Perl script to do something very specific, but I'm not having any success. I'm not very good with hashes.

I have a tab-separated file with two columns, already sorted:

99890 +
100281 +
104919 -
109672 +
113428 -
114501 +
115357 +
115598 -
116100 +
118192 +
119470 +

What I am trying to do is determine the difference between two numbers only when a + is followed by a -. Then, based on the difference, I want to count how many are less than 100, 100-200, 201-500, 501-750, 751-1000, or greater than 1001, and also determine how many didn't follow the rule (+ followed by -).

Based on the file above, I would assume the output would be:
<100 - 0
100-200 - 0
201-500 - 1
501-750 - 0
751-1000 - 0
>1001 - 2

no match - 5

Can anyone please help me?
Thanks.

era · September 7, 2008, 7:28am

Why do you need a hash for that? Just keep a list of differences and at the end sort through the list and loop over it, spitting out a report and resetting the count at the limits you have specified. While reading and calculating differences, increment a separate counter when you see a plus line, and decrement it when you see a minus line; this will be the count of unmatched plus lines.
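
As a rough sketch of the pairing pass described above, run against the sample rows from the question. The helper name pair_differences is made up for illustration, and the guard against a minus line with no pending plus is an added assumption, not something the thread requires.

```perl
use strict;
use warnings;

# Sample rows from the question, as [position, sign] pairs.
my @rows = (
  [ 99890, '+'], [100281, '+'], [104919, '-'], [109672, '+'],
  [113428, '-'], [114501, '+'], [115357, '+'], [115598, '-'],
  [116100, '+'], [118192, '+'], [119470, '+'],
);

# pair_differences is a hypothetical helper name, not from the thread.
# Returns the count of unmatched plus lines followed by the differences.
sub pair_differences {
  my @rows = @_;
  my ($previous, $unmatched, @differences) = (undef, 0);
  for my $row (@rows) {
    my ($number, $sign) = @$row;
    if ($sign eq '+') {
      ++$unmatched;            # counts as unmatched until a minus follows
      $previous = $number;
      next;
    }
    # Assumption: a minus with no pending plus is simply skipped.
    next unless defined $previous;
    --$unmatched;
    push @differences, $number - $previous;
    undef $previous;           # this plus has now been paired
  }
  return ($unmatched, @differences);
}

my ($unmatched, @differences) = pair_differences(@rows);
print "differences: @differences\n";   # differences: 4638 3756 241
print "no match: $unmatched\n";        # no match: 5
```

The three differences fall into the 201-500 bucket (241) and the top bucket (3756 and 4638), which agrees with the expected output in the question.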

labrazil · September 7, 2008, 6:14pm

I see, well, that makes sense.

So I could have it search for a + followed by a - and then take the difference between the two numbers.

How do I store the information so it knows to subtract out the two numbers when the expression is true? I guess this is why I was having a hard time with this.

Thanks for your help :).

era · September 8, 2008, 2:34am

You keep the previous value in a variable. As per your example, we don't need to worry about two consecutive lines with minuses on them.

my ($plusses, $previous, @differences) = (0);
while (<>) {
  # \s+ rather than a literal space, because the columns are tab separated
  die "Invalid input" unless m/^(\d+)\s+([-+])$/;
  my ($number, $sign) = ($1, $2);  # as captured by the previous regex match
  if ($sign eq '+') {
    ++$plusses;
    $previous = $number;
    next;
  }
  # else, must be a minus
  --$plusses;
  push @differences, $previous - $number;
}

labrazil · September 8, 2008, 3:25am

Thank you era! It makes sense now. What I did was this:

while (<>) {
  chomp;
  #die "Invalid input" unless m/^(\d+)\s+([-+])$/;
  my ($number, $sign) = split("\t");  # split the two tab-separated columns
  if ($sign eq '+') {
    ++$plusses;
    $previous = $number;
    next;
  }
  # else, must be a minus
  --$plusses;
  push @differences, $number - $previous;
  print $number - $previous, "\n";
}

I wasn't sure if 'push @differences' was needed, so I just printed the values (I reversed the subtraction because I was getting negative numbers). I guess I can then use some if expressions to count each category.

labrazil · September 8, 2008, 3:36am

Don't know what I was thinking. $plusses gives me the no-match count, and I can use the difference in another expression to print it out. Great, exactly what I need :).

labrazil · September 8, 2008, 4:38am

Hi, so here is my attempt at this. Thank you era for your insight. Much appreciated!!

#!/usr/bin/perl

use strict;
use warnings;

my ($plusses, $previous, @differences) = (0);
my ($value_100, $value_200, $value_500, $value_750, $value_1000, $value_1001) = (0) x 6;

while (<>) {
  chomp;
  my ($number, $sign) = split("\t");  # split the two tab-separated columns
  if ($sign eq '+') {
    ++$plusses;
    $previous = $number;
    next;
  }
  # else, must be a minus
  --$plusses;
  my $diff_value = $number - $previous;
  push @differences, $diff_value;
  # Each test only needs an upper bound; earlier branches already
  # excluded the smaller values.
  if    ($diff_value <  100)  { $value_100++;  }
  elsif ($diff_value <= 200)  { $value_200++;  }
  elsif ($diff_value <= 500)  { $value_500++;  }
  elsif ($diff_value <= 750)  { $value_750++;  }
  elsif ($diff_value <= 1000) { $value_1000++; }
  else                        { $value_1001++; }
}

my $total_value = $value_100 + $value_200 + $value_500 + $value_750 + $value_1000 + $value_1001;
print "\nDistribution:\n";
print "<100 \t\t-\t$value_100\n";
print "100 - 200\t-\t$value_200\n";
print "201 - 500\t-\t$value_500\n";
print "501 - 750\t-\t$value_750\n";
print "751 - 1000\t-\t$value_1000\n";
print ">1001\t\t-\t$value_1001\n";
print "\t\t\t======\n";
print "TOTAL:\t\t\t$total_value\n";
print "\nNo matches: ",$plusses, "\n\n";

What do you think?

era · September 8, 2008, 6:24am

Welp, my original proposal was to just collect the @differences, then at the end sort it and loop over it.

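# Keys are exclusive upper bounds; the last one is a catch-all.
my %limits = (100 => "< 100",
  201 => "100 - 200",
  501 => "201 - 500",
  751 => "501 - 750",
  1001 => "751 - 1000",
  1_000_000_000 => "> 1001");
my @l = sort { $a <=> $b } keys %limits;  # numeric sort; the default is string order
my $count = 0;
print "\nDistribution:\n";
for my $d (sort { $a <=> $b } @differences) {
  # Report and reset the count each time $d passes the current limit
  while (@l > 1 && $d >= $l[0]) {
    print $limits{$l[0]}, "\t-\t", $count, "\n";
    $count = 0;
    shift @l;
  }
  $count++;
}
# Report the bucket in progress, then any that stayed empty
for my $limit (@l) {
  print $limits{$limit}, "\t-\t", $count, "\n";
  $count = 0;
}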

(Not tested.)
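
For comparison, here is a minimal sketch of the same bucketing done with an ordered list of bounds instead of a hash, so nothing depends on how the keys sort. The @buckets table and the distribution helper are names invented for this example. The last label is written as >1000 on the assumption that a difference of exactly 1001 should be counted somewhere.

```perl
use strict;
use warnings;

# Upper bounds are exclusive; the last bucket has no bound and catches the rest.
my @buckets = (
  [ 100,  '<100'],
  [ 201,  '100-200'],
  [ 501,  '201-500'],
  [ 751,  '501-750'],
  [1001,  '751-1000'],
  [undef, '>1000'],
);

# distribution is a hypothetical helper: it returns [label, count] pairs
# in the same order as @buckets.
sub distribution {
  my @differences = @_;
  my @counts = (0) x @buckets;
  for my $d (@differences) {
    for my $i (0 .. $#buckets) {
      my $limit = $buckets[$i][0];
      # The first bucket whose bound exceeds $d takes it.
      if (!defined $limit || $d < $limit) {
        $counts[$i]++;
        last;
      }
    }
  }
  return map { [ $buckets[$_][1], $counts[$_] ] } 0 .. $#buckets;
}

# The three differences from the sample file in the question.
printf "%-10s - %d\n", @$_ for distribution(241, 3756, 4638);
```

The input does not need to be sorted here, at the cost of scanning the bucket list once per difference, which is negligible with six buckets.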