Grab unique record from different files on a condition

Hi,

I think this is the toughest prob :wall: I have ever come across and I thankfully owe all of u for helping me cross this.

cat 1.txt

cat 2.txt

K now. This is what I am looking for.

Output.txt

Here is how my output has been generated.

First, the column one of each file has to be matched to column one of other files, like chr1 to chr1, chr2 to chr2 and chr3 to chr3 only. No different column values has to be matched.

Second, if a particular range of column 2 and 3 intersects/comes in between the range of column 2 and 3 of the other file, they have to be eliminated.

Examples from the given input:

chr1 100 200(1.txt) intersects with chr1 156 199(2.txt), chr1 165 230(2.txt). So, they are eliminated.
chr1 450 700(1.txt) intersects with chr1 525 600(2.txt). So, these two are eliminated from the output.

Similarly,

chr2 500 600(1.txt) intersects with chr2 534 676(2.txt). So, it is eliminated.
chr2 345 765(1.txt) intersects with chr2 200 400(2.txt). So, it is eliminated from the output file.

Same is the case for chr3 too. My files have different number of records in each of them which are not sorted. The last column in the output file indicates the file from which the record originates. If you have any questions or suggestions, please write in the reply and I shall reply ASAP to clarify your doubt that might give me a chance to kick this problem out. All your time, patience and attention are highly appreciated.

Thanks in advance.

Try:

#!/usr/bin/perl
$file1=shift;
$file2=shift;
open A,"$file1";
open B,"$file2";
while (chomp($line=<A>)){
  @temp=split / /,$line;
  $lowA{$temp[0]}{$.}=$temp[1];
  $highA{$temp[0]}{$.}=$temp[2];
}
while (chomp($line=<B>)){
  @temp=split / /,$line;
  $lowB{$temp[0]}{$.}=$temp[1];
  $highB{$temp[0]}{$.}=$temp[2];
}
for $iA (keys %lowA){
  for $jA (keys %{$lowA{$iA}}){
    for $iB (keys %{$lowB{$iA}}){
      if ($lowA{$iA}{$jA}>=$lowB{$iA}{$iB} && $lowA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
      if ($highA{$iA}{$jA}>=$lowB{$iA}{$iB} && $highA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
    }
  }
}
for $iB (keys %lowB){
  for $jB (keys %{$lowB{$iB}}){
    for $iA (keys %{$lowA{$iB}}){
      if ($lowB{$iB}{$jB}>=$lowA{$iB}{$iA} && $lowB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
      if ($highB{$iB}{$jB}>=$lowA{$iB}{$iA} && $highB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
    }
  }
}
for $i (keys %lowA){
  for $j (keys %{$lowA{$i}}){
    print "$i $lowA{$i}{$j} $highA{$i}{$j} $file1\n" if $elimA{$i}{$j}!=1;
  }
}
for $i (keys %lowB){
  for $j (keys %{$lowB{$i}}){
    print "$i $lowB{$i}{$j} $highB{$i}{$j} $file2\n" if $elimB{$i}{$j}!=1;
  }
}

Run it as: ./script.pl 1.txt 2.txt

1 Like

Great and thanks!

Works as I wished.

Also, what if there are multiple files, like say I have more than two files?

But, the condition is still the same, if a record intersects with another record in any of the other files, even one, then it should be eliminated.

Thanks for all your time.

-- deleted --

Hi jacobs.smith,

Other solution using perl. I think it should work with any number of input files. Give it a try. It could be more efficient, but I struggled a little to get it, so if it works, I will be happy for that:

$ cat 1.txt 
chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456
$ cat 2.txt
chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700
$ cat script.pl
use warnings;
use strict;

die qq[Usage: perl $0 <input-files-1> <input-file-2> ...\n] unless @ARGV > 0;

my (@data);

while ( <> ) {
        my @f = split;
        next unless @f == 3;
        push @data, [ $ARGV, @f ]; 
}

for my $d ( @data ) {
        if ( grep {
                        $d->[0] ne $_->[0] &&
                        $d->[1] eq $_->[1] &&
                        ($d->[2] < $_->[2] &&
                        $d->[3] > $_->[2]
                                        ||
                        $d->[2] < $_->[3] &&
                        $d->[3] > $_->[3]
                                        ||
                        $d->[2] > $_->[2] &&
                        $d->[3] < $_->[3])

                } @data
        ) { 
                next;
        }

        printf qq[%s\n], join qq[ ], @$d[1..3], $d->[0];
}
$ perl script.pl 1.txt 2.txt
chr1 300 400 1.txt
chr1 350 467 1.txt
chr1 201 299 2.txt
chr2 800 1000 2.txt
chr2 100 200 2.txt
chr3 500 600 2.txt

Regards,
Birei

1 Like

Thanks a lot Birei. The script works for the two files I have mentioned earlier before. And I even tried using it with 3 files. The 3 files and their output has been given below just for your confirmation and my satisfaction :smiley:

Thanks once again

cat 1.txt

cat 2.txt

cat3.txt

perl newscript.pl 1.txt 2.txt 3.txt

I find everything to be smooth. Let me know if you see anything.

Thanks

Here is a solution using shell scripts:

#!/bin/ksh
typeset -i mFromA mToA mFromB mToB
mF1='1.txt'
mF2='2.txt'
mPrevTag=''
#### sort is used to reduce the number of "grep"
sort ${mF1} | while read mTagA mFromA mToA; do
  if [[ "${mTagA}" != "${mPrevTag}" ]]; then
    grep "${mTagA}" ${mF2} > ${mF2}.tmp
  fi
  mFound="N"
  while read mTagB mFromB mToB; do
    if [[ ${mToA} -ge ${mFromB} && ${mFromA} -le ${mToB} ]]; then
      mFound="Y"
      break
    fi
  done < ${mF2}.tmp
  if [[ "${mFound}" = "N" ]]; then
    echo ${mTagA} ${mFromA} ${mToA} ${mF1}
  fi
  mPrevTag=${mTagA}
done