perl: comparison of fields line by line in two files

Thelost · May 7, 2012, 7:25am

Hi everybody,
First, I apologize if my question seems too silly, but I have really spent 4 days struggling with this. I looked at books and forums, and I also asked a friend who is a software developer for help; he told me that it is a bad idea to do it in Perl... but this is my problem.
I moved to another lab for a couple of months, where they use Perl as a tool to analyse DNA data (at my own lab I only ever use existing software, command lines to modify files so they can be used correctly, and some tools that people in my lab wrote previously). In the weeks I have been working here I have really seen the power of writing your own scripts to solve problems.
The problem is that I must compare two files and select the lines of one of them whose fields meet a few requirements, which are comparisons with the fields of the other file.

My files are (of course, these are only a few lines):
File 1

Start	End	Origin	HomeCluster	BAPSIndex	Strain
1	58292	5	5	1	TW20.dna
87840	87883	5	5	1	TW20.dna
247298	253176	5	5	1	TW20.dna
395979	400031	5	5	1	TW20.dna
404314	404824	5	5	1	TW20.dna

File 2

Coordinate	type	RefAllele	Strain	SNPAllele
358909	Int	<T>	5083_6_1	>A<
2074234	syn	<G>	5083_6_1	>A<
31160	non	<G>	5083_6_12	>A<

I must find the lines of file 2 whose Coordinate is within the range defined by Start and End, and whose strain also matches, i.e. I must compare each line of file 2 with each line of file 1.
I have started the script many times and the variables are defined, but I cannot get results. I have tried arrays and hashes, but I cannot make it work.
I include the script (the part that works) and the conditions that must be met.

#!/usr/bin/perl -w
# insideRecombinantSNP.pl
# Script to analyze the SNPs inside the recombinant regions
# if the file is not in your working directory, you have to write the complete path 
use warnings;

print "Coordinate	Type	Reference Allele	Strain		Strain Allele\n";

 
open IN, "resultsnplinev2.out" or die;     # file 1 and file 2: the files being compared
open INN, "turkish_segments_tabularv2.txt" or die;

while(<IN>){
    if(m/^line\s+(\d+\s+\S+\s+\S+\s+\S+\s+\S+)/){
        $lineSNP=$1;
        $lineSNP =~m/^(\d+)\s+\S+\s+\S+\s+\S+\s+\S+/;
        $SNPcoor=$1;
        $lineSNP =~m/^\d+\s+\S+\s+\S+\s+(\S+)\s+\S+/;
        $SNPstrain=$1;
    }
    while(<INN>){
        if(m/^(\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+.*)/){
            $recline=$1;
            $recline =~m/^\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(.*)/;
            $recstrain=$1;
            $recline =~m/^(\d+)\s+\d+\s+\S+\s+\S+\s+\S+\s+.*/;
            $leftcoor=$1;
            $recline =~m/^\d+\s+(\d+)\s+\S+\s+\S+\s+\S+\s+.*/;
            $rightcoor=$1;
        }
    }
    if (($leftcoor<=$SNPcoor) && ($SNPcoor<=$rightcoor)){
        print "$lineSNP\n";
    }elsif ($recstrain eq $SNPstrain){
        print "$lineSNP\n";
    }
}

Any idea, any hint or suggestion ...

Klashxx · May 7, 2012, 11:57am

Hello, check this:

#cat file1
Start   End     Origin  HomeCluster     BAPSIndex       Strain
1       58292   5       5       1       TW20.dna
87840   87883   5       5       1       TW20.dna
247298  253176  5       5       1       TW20.dna
395979  400031  5       5       1       TW20.dna
404314  404824  5       5       1       TW20.dna

#cat file2
Coordinate      type    RefAllele       Strain  SNPAllele
358909  Int     <T>     5083_6_1        >A<
2074234 syn     <G>     5083_6_1        >A<
31160   non     <G>     5083_6_12       >A<

#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2) {print a[i]" match --->"$0;next}}}' file2 file1
31160   non     <G>     5083_6_12       >A< match --->1 58292   5       5       1       TW20.dna

Thelost · May 7, 2012, 1:16pm

Thanks so much for giving me a hand, but ...
I know, I know, I'm the worst, but I don't know how to use your help... Do I put it into my script, or is it a new script?
Thanks again, and I'm sorry for having so little skill, but I have only been working with Perl for 10 days...

Klashxx · May 7, 2012, 1:21pm

No worries, give us an example of your expected output.

Thelost · May 7, 2012, 3:11pm

Thank you again,
I have to extract the lines of file 2 whose coordinates are between the Start and End positions of a line in file 1 and which also belong to the same strain. In other words, two conditions must be fulfilled: going line by line, check whether the strain matches and whether the coordinate is within the range formed by Start and End. If both hold, the expected output is the line of file 2.
for example

Int <T> 5083_6_1 358909> A <3_6_1 358909> A <

I know that there are lines that fulfil these conditions; I checked by hand and found several.

---------- Post updated at 09:11 PM ---------- Previous update was at 07:39 PM ----------

I really don't know what happened, but the expected output is not that.
It is:
Coordinate type RefAllele Strain SNPAllele
240450 non <G> 6949_5_23 >A<

Sorry

Klashxx · May 7, 2012, 3:28pm

You mean something like this:

#cat file1
Start   End     Origin  HomeCluster     BAPSIndex       Strain
1       58292   5       5       1       TW20.dna
87840   87883   5       5       1       TW20.dna
247298  253176  5       5       1       TW20.dna
395979  400031  5       5       1       TW20.dna
404314  404824  5       5       1       TW20.dna    >A<

#cat file2
Coordinate      type    RefAllele       Strain  SNPAllele
358909  Int     <T>     5083_6_1        >A<
2074234 syn     <G>     5083_6_1        >A<
31160   non     <G>     5083_6_12       >A<
404820 non     <G>     5083_6_12       >A<

#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {e=split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2 && b[e] == $NF) {print a[i];next}}}' file2 file1
404820 non     <G>     5083_6_12       >A<

Thelost · May 7, 2012, 4:29pm

Yes, something like this, but it also has to match the strain...

something like

line x of file 2: 136 non <T> 5083_6_1 >A<
this line matches line y of file 1: 12 52000 1 1 1 5083_6_1.
12<=136<=52000 & 5083_6_1=5083_6_1

Then my output file will contain
136 non <T> 5083_6_1 >A<

I know that it's a bit difficult (at least for me), but I'm really grateful for your help.
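
Written as Perl, the test in this example could look like the sketch below. The sub name `snp_matches_region` is hypothetical, and it assumes whitespace-separated data lines (no header line) with the column order of the two files above.

```perl
use strict;
use warnings;

# True (1) when a file 2 line (Coordinate type RefAllele Strain SNPAllele)
# lies inside the Start..End range of a file 1 line
# (Start End Origin HomeCluster BAPSIndex Strain) AND both name the same strain.
sub snp_matches_region {
    my ($snp_line, $region_line) = @_;

    # Pick the needed columns by position (0-based) after splitting on whitespace.
    my ($coord, $snp_strain)        = (split ' ', $snp_line)[0, 3];
    my ($start, $end, $reg_strain)  = (split ' ', $region_line)[0, 1, 5];

    # Lines with too few columns cannot match.
    return 0 unless defined $snp_strain && defined $reg_strain;

    return ($start <= $coord && $coord <= $end && $snp_strain eq $reg_strain) ? 1 : 0;
}

# The two lines from the example above.
my $snp    = '136 non <T> 5083_6_1 >A<';
my $region = '12 52000 1 1 1 5083_6_1';
print "$snp\n" if snp_matches_region($snp, $region);
```

Both conditions are joined with `&&`, so a line is printed only when the range test and the strain test hold together.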

Klashxx · May 8, 2012, 2:18am

Ok, this will solve your problem:

#cat file1
Start   End     Origin  HomeCluster     BAPSIndex       Strain
1       58292   5       5       1       TW20.dna
87840   87883   5       5       1       TW20.dna
247298  253176  5       5       1       TW20.dna
395979  400031  5       5       1       TW20.dna
12      52000   1       1       1       5083_6_1

#cat file2
Coordinate      type    RefAllele       Strain  SNPAllele
358909  Int     <T>     5083_6_1        >A<
2074234 syn     <G>     5083_6_1        >A<
31160   non     <G>     5083_6_12       >A<
404820 non     <G>     5083_6_12       >A<
136 non <T> 5083_6_1 >A<

#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2 && b[4] == $NF) {print a[i];next}}}' file2 file1
136 non <T> 5083_6_1 >A<
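
Since the original question asked for Perl, the same test can be sketched in Perl as well. This is only a minimal sketch under assumptions: the sub names `load_regions` and `select_snps` and the script name `snp_in_regions.pl` are made up for illustration, and it assumes whitespace-separated columns in the order shown above, with Strain in column 6 of file1 and column 4 of file2. It groups the regions by strain in a hash, so each SNP line is only checked against regions of its own strain, and it reports each matching SNP line once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the regions file (Start End Origin HomeCluster BAPSIndex Strain)
# and group the [start, end] pairs by strain name.
sub load_regions {
    my ($fh) = @_;
    my %regions;
    while (my $line = <$fh>) {
        my @f = split ' ', $line;
        # Skip the header and any line that does not start with two numbers.
        next unless @f >= 6 && $f[0] =~ /^\d+$/ && $f[1] =~ /^\d+$/;
        push @{ $regions{ $f[5] } }, [ $f[0], $f[1] ];
    }
    return \%regions;
}

# Return the SNP lines (Coordinate type RefAllele Strain SNPAllele) whose
# coordinate lies inside at least one region of the same strain.
sub select_snps {
    my ($fh, $regions) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        my ($coord, undef, undef, $strain) = split ' ', $line;
        # Skip the header, blank lines and lines with too few columns.
        next unless defined $strain && $coord =~ /^\d+$/;
        for my $range (@{ $regions->{$strain} || [] }) {
            if ($range->[0] <= $coord && $coord <= $range->[1]) {
                push @hits, $line;
                last;    # one matching region is enough for this SNP line
            }
        }
    }
    return @hits;
}

# Hypothetical command-line use: perl snp_in_regions.pl file1 file2
if (@ARGV == 2) {
    open my $reg_fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $snp_fh, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    my $regions = load_regions($reg_fh);
    print "$_\n" for select_snps($snp_fh, $regions);
}
```

Reading file1 completely before looping over file2 avoids re-reading a filehandle that has already reached end of file, which is what happens when one `while(<FH>)` loop is nested inside another.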

Thelost · May 11, 2012, 10:03am

I'm sure this will solve the problem...
Sorry, but I was away for a couple of days.