Compare 2 arrays in Perl

Hi, I'm supposed to compare lines in a file:
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB0005 1045 S KQLESEGRS 0.669 PKC YES
KB0005 1045 S KQLESEGRS 0.880 unsp YES
KB204320 1019 T IFVATVPVI 0.699 PKC YES
KB204320 1036 T YFLQTSQQL 0.789 PKC YES
KB204320 1037 S FLQTSQQLK 0.589 DNAPK YES
KB204320 1045 S KQLESEGRS 0.880 unsp YES

and print the lines that differ or are not repeated. I managed to do this by first putting the lines into 2 arrays (the lines differ in the names KB0005 and KB204320) and then writing a Perl script:

foreach $item (@a1, @a2) { $count{$item}++;}

foreach $item (keys %count) {
    if ($count{$item} == 2) {
        next;
    } else {
        push @diff, $item;
    }
}

my @sorted = sort @diff;
#print "\nIntersect Array = @isect\n";
foreach my $el (@sorted) {
    print "$el\n";
}

OUTPUT:
1019 T IFVATVPVI 0.691 PKC
1019 T IFVATVPVI 0.699 PKC
1036 T YFLQTSQQL 0.785 PKC
1036 T YFLQTSQQL 0.789 PKC
1037 S FLQTSQQLK 0.585 DNAPK
1037 S FLQTSQQLK 0.589 DNAPK
1045 S KQLESEGRS 0.669 PKC
This works well; I just want to print which name (KB0005 or KB204320) a given line comes from.
Is anybody willing to help? :)
Thx


open( my $lfh, '<', 'file' ) 
or die "Unable to open file - 'file' <$!>\n";

my %uniqHash;
while ( my $data = <$lfh> ) {
    chomp($data);
    $uniqHash{$data}++;
}

close($lfh);
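
The snippet above counts whole lines, so the name in the first column is still part of the hash key and no two lines from different names will ever match. A minimal sketch of one way to extend it, assuming the name is always the first whitespace-separated column: key the hash on the rest of the line and remember which names it was seen under. The sub name `lines_unique_to_one_name` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the lines whose content
# after the first column was seen under exactly one name, tagged with that name.
sub lines_unique_to_one_name {
    my @lines = @_;
    my %names_for;    # rest of line => { name => 1, ... }
    for my $line (@lines) {
        chomp $line;
        my ($name, $rest) = split ' ', $line, 2;   # name, then everything else
        next unless defined $rest;                 # skip blank or one-column lines
        $names_for{$rest}{$name} = 1;
    }
    my @out;
    for my $rest (sort keys %names_for) {
        my @names = sort keys %{ $names_for{$rest} };
        next if @names > 1;                        # seen under more than one name
        push @out, sprintf("%-10s => %s", $names[0], $rest);
    }
    return @out;
}

# Only read input when file names are given, e.g.: perl sketch.pl data.txt
if (@ARGV) {
    print "$_\n" for lines_unique_to_one_name(<>);
}
```

Unlike the two-array approach, this does not care how many distinct names there are or in which order the lines arrive.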
$
$ cat data.txt
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB0005 1045 S KQLESEGRS 0.669 PKC YES
KB0005 1045 S KQLESEGRS 0.880 unsp YES
KB204320 1019 T IFVATVPVI 0.699 PKC YES
KB204320 1036 T YFLQTSQQL 0.789 PKC YES
KB204320 1037 S FLQTSQQLK 0.589 DNAPK YES
KB204320 1045 S KQLESEGRS 0.880 unsp YES
$
$ cat testscr.pl
#!/usr/bin/perl -w
$prevname = "";
open (F,"data.txt") or die "Can't open data.txt: $!";
while (<F>) {
  chomp;
  ($name = $_) =~ s/(\w+) .*/$1/;
  ($elem = $_) =~ s/\w+ (.*)/$1/;
  if ($prevname eq "") {
    $arrnum = 1;
  } elsif ($name ne $prevname) {
    $arrnum = 2;
  }
  # assuming there are only 2 distinct names, hence 2 arrays
  if ($arrnum == 1) {
    push @a1, "$name:$elem";
  } elsif ($arrnum == 2) {
    push @a2, "$name:$elem";
  }
  $prevname = $name;
}
close (F) or die "Can't close data.txt: $!";
# the value is a tilde-delimited string of names for which the
# line occurs
foreach $item (@a1, @a2) {
  ($key = $item) =~ s/\w+:(.*)/$1/;
  ($val = $item) =~ s/(\w+):.*/$1/;
  $count{$key} .= "~".$val;
}
foreach $item (keys %count) {
  ($x = $count{$item}) =~ s/[^~]//g;
  if ( $x eq "~~") {  # line occurred for both names
    next;
  } else {            # array element has the line followed by "~name"
    push @diff, "$item$count{$item}";
  }
}
@sorted = sort @diff;
foreach $el (@sorted) {
  ($name = $el) =~ s/.*~(\w+)/$1/;
  ($elem = $el) =~ s/(.*)~\w+/$1/;
  printf("%-10s => %s\n",$name,$elem);
}
$
$ perl testscr.pl
KB0005     => 1019 T IFVATVPVI 0.691 PKC YES
KB204320   => 1019 T IFVATVPVI 0.699 PKC YES
KB0005     => 1036 T YFLQTSQQL 0.785 PKC YES
KB204320   => 1036 T YFLQTSQQL 0.789 PKC YES
KB0005     => 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320   => 1037 S FLQTSQQLK 0.589 DNAPK YES
KB0005     => 1045 S KQLESEGRS 0.669 PKC YES
$
$

HTH,
tyler_durden

One more thing: how can I access the value in the 5th column and compare it to the corresponding line, printing the difference, or warning that this line exists only in one file?

(1) For two similar lines of data, is the 5th column the only one that can be different?

(2) What exactly is that difference?

  (a) KB0005_value minus KB204320_value, or
  (b) KB204320_value minus KB0005_value, or
  (c) absolute difference?

tyler_durden

Except for the 1st column, yes, only the 5th is different, unless the whole line doesn't exist. And I'm looking for the KB0005 - KB204320 diff.

Ok, back to the drawing board.

Instead of creating arrays/hashes etc. and comparing them, this Perl program relies on the input being sorted so that matching lines are adjacent. It simply runs through the data stream, printing each pair whose 5th-column values differ (with the difference) and each singleton.
$cmp and $key/$prevkey are the main variables upon which the logic is built.

$
$ cat data.txt
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB0005 1045 S KQLESEGRS 0.669 PKC YES
KB0005 1045 S KQLESEGRS 0.880 unsp YES
KB204320 1019 T IFVATVPVI 0.699 PKC YES
KB204320 1036 T YFLQTSQQL 0.789 PKC YES
KB204320 1037 S FLQTSQQLK 0.589 DNAPK YES
KB204320 1045 S KQLESEGRS 0.880 unsp YES
$
$ cat testscr1.pl
#!/usr/bin/perl -w
$prevkey = "";
while (<>) {
  chomp;
  @x = split;
  $key = "$x[1]:$x[2]:$x[3]:$x[5]:$x[6]";
  $num = $x[4];
  $line = sprintf("%-10s [MESG] => %s %s %s %s %s %s\n",$x[0],$x[1],$x[2],$x[3],$x[4],$x[5],$x[6]);
  if ($prevkey eq "") {  # we are on line 1; just set $cmp to 1 and move on
    # A value of 1 means "start of comparison" - this line should be compared
    # with the next line for potential pairing. A value of 0 means
    # "end of comparison" - the comparison is over; we either found a pair or
    # found a non-repeating line.
    $cmp = 1;
  } elsif ($key eq $prevkey) {  # we found a pair
    $cmp = 0;
    # find diff
    $diff = sprintf("%6.3f",$prevnum - $num);
    # print prev and current lines if diff != 0
    if ($prevnum != $num) {
      $prevline =~ s/MESG/DIFF = $diff/;
      $line =~ s/MESG/DIFF = $diff/;
      print $prevline,$line;
    }
  } elsif ($key ne $prevkey) {  # we did not find a pair; either prev line is
                                # non repeating or we found and printed a pair
    # if $cmp equals 1 then print previous line else set $cmp to 1
    if ($cmp == 1) {
      $prevline =~ s/MESG/NO_REPETITION/;
      print $prevline;
    } else {
      $cmp = 1;
    }
  }
  $prevkey = $key;
  $prevline = $line;
  $prevnum = $num;
}
# if $cmp equals 1 then print previous line
if ($cmp == 1) {
  $prevline =~ s/MESG/NO_REPETITION/;
  print $prevline;
}
$
$ # Sorted input is absolutely essential for this perl program
$ # In the data below, all lines except line # 7 occur in pairs
$
$ sort -k2,2 -k3,3 -k4,4 -k6,6 -k7,7 data.txt
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB204320 1019 T IFVATVPVI 0.699 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB204320 1036 T YFLQTSQQL 0.789 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320 1037 S FLQTSQQLK 0.589 DNAPK YES
KB0005 1045 S KQLESEGRS 0.669 PKC YES
KB0005 1045 S KQLESEGRS 0.880 unsp YES
KB204320 1045 S KQLESEGRS 0.880 unsp YES
$
$ sort -k2,2 -k3,3 -k4,4 -k6,6 -k7,7 data.txt | perl testscr1.pl
KB0005     [DIFF = -0.008] => 1019 T IFVATVPVI 0.691 PKC YES
KB204320   [DIFF = -0.008] => 1019 T IFVATVPVI 0.699 PKC YES
KB0005     [DIFF = -0.004] => 1036 T YFLQTSQQL 0.785 PKC YES
KB204320   [DIFF = -0.004] => 1036 T YFLQTSQQL 0.789 PKC YES
KB0005     [DIFF = -0.004] => 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320   [DIFF = -0.004] => 1037 S FLQTSQQLK 0.589 DNAPK YES
KB0005     [NO_REPETITION] => 1045 S KQLESEGRS 0.669 PKC YES
$
$ # All lines except the last two occur in pairs
$
$ sort -k2,2 -k3,3 -k4,4 -k6,6 -k7,7 data.txt | sed -n 1,8p
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB204320 1019 T IFVATVPVI 0.699 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB204320 1036 T YFLQTSQQL 0.789 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320 1037 S FLQTSQQLK 0.589 DNAPK YES
KB0005 1045 S KQLESEGRS 0.669 PKC YES
KB0005 1045 S KQLESEGRS 0.880 unsp YES
$
$ sort -k2,2 -k3,3 -k4,4 -k6,6 -k7,7 data.txt | sed -n 1,8p | perl testscr1.pl
KB0005     [DIFF = -0.008] => 1019 T IFVATVPVI 0.691 PKC YES
KB204320   [DIFF = -0.008] => 1019 T IFVATVPVI 0.699 PKC YES
KB0005     [DIFF = -0.004] => 1036 T YFLQTSQQL 0.785 PKC YES
KB204320   [DIFF = -0.004] => 1036 T YFLQTSQQL 0.789 PKC YES
KB0005     [DIFF = -0.004] => 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320   [DIFF = -0.004] => 1037 S FLQTSQQLK 0.589 DNAPK YES
KB0005     [NO_REPETITION] => 1045 S KQLESEGRS 0.669 PKC YES
KB0005     [NO_REPETITION] => 1045 S KQLESEGRS 0.880 unsp YES
$
$ # No line is repeated
$
$ sed -n 1,5p data.txt
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB0005 1045 S KQLESEGRS 0.669 PKC YES
KB0005 1045 S KQLESEGRS 0.880 unsp YES
$
$ sed -n 1,5p data.txt | perl testscr1.pl
KB0005     [NO_REPETITION] => 1019 T IFVATVPVI 0.691 PKC YES
KB0005     [NO_REPETITION] => 1036 T YFLQTSQQL 0.785 PKC YES
KB0005     [NO_REPETITION] => 1037 S FLQTSQQLK 0.585 DNAPK YES
KB0005     [NO_REPETITION] => 1045 S KQLESEGRS 0.669 PKC YES
KB0005     [NO_REPETITION] => 1045 S KQLESEGRS 0.880 unsp YES
$
$ # Three pairs of lines; no single-occurring line
$
$ sort -k2,2 -k3,3 -k4,4 -k6,6 -k7,7 data.txt | sed -n 1,6p
KB0005 1019 T IFVATVPVI 0.691 PKC YES
KB204320 1019 T IFVATVPVI 0.699 PKC YES
KB0005 1036 T YFLQTSQQL 0.785 PKC YES
KB204320 1036 T YFLQTSQQL 0.789 PKC YES
KB0005 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320 1037 S FLQTSQQLK 0.589 DNAPK YES
$
$ sort -k2,2 -k3,3 -k4,4 -k6,6 -k7,7 data.txt | sed -n 1,6p | perl testscr1.pl
KB0005     [DIFF = -0.008] => 1019 T IFVATVPVI 0.691 PKC YES
KB204320   [DIFF = -0.008] => 1019 T IFVATVPVI 0.699 PKC YES
KB0005     [DIFF = -0.004] => 1036 T YFLQTSQQL 0.785 PKC YES
KB204320   [DIFF = -0.004] => 1036 T YFLQTSQQL 0.789 PKC YES
KB0005     [DIFF = -0.004] => 1037 S FLQTSQQLK 0.585 DNAPK YES
KB204320   [DIFF = -0.004] => 1037 S FLQTSQQLK 0.589 DNAPK YES
$
$ # Only one line
$
$ head -1 data.txt
KB0005 1019 T IFVATVPVI 0.691 PKC YES
$
$ head -1 data.txt | perl testscr1.pl
KB0005     [NO_REPETITION] => 1019 T IFVATVPVI 0.691 PKC YES
$
$

HTH,
tyler_durden
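
If sorting the input first is inconvenient, a hash keyed on the columns that must match is another option. This is only a rough sketch and not the script above: it assumes exactly 7 whitespace-separated columns, exactly two names, and at most one line per name for any given key (a later duplicate silently overwrites an earlier one). The sub name `compare_by_key` and its output format are made up for illustration. The difference is first name minus second name, i.e. KB0005 - KB204320 as requested.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). $first and $second are the two
# names expected in column 1; the reported difference is $first - $second.
sub compare_by_key {
    my ($first, $second, @lines) = @_;
    my %val;    # "col2 col3 col4 col6 col7" => { name => col5 }
    for my $line (@lines) {
        my @f = split ' ', $line;
        next unless @f == 7;                       # assumption: 7 columns per line
        my $key = join ' ', @f[1, 2, 3, 5, 6];     # everything except name and value
        $val{$key}{ $f[0] } = $f[4];
    }
    my @out;
    for my $key (sort keys %val) {
        my ($v1, $v2) = @{ $val{$key} }{ $first, $second };
        if (defined $v1 && defined $v2) {
            next if $v1 == $v2;                    # same value under both names
            push @out, sprintf("%s: %s - %s = %.3f",
                               $key, $first, $second, $v1 - $v2);
        } else {
            # assumption: any name other than $first is treated as $second
            my $only = defined $v1 ? $first : $second;
            push @out, "$key: only in $only";
        }
    }
    return @out;
}

# Only read input when file names are given, e.g.: perl sketch.pl data.txt
if (@ARGV) {
    print "$_\n" for compare_by_key('KB0005', 'KB204320', <>);
}
```

The trade-off is memory: the whole input is held in the hash, whereas the sorted-stream version only ever remembers the previous line.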