Hi everybody,
First, I apologize if my question seems too silly, but I have really spent 4 days struggling with this. I have looked at books and forums, and I also asked a friend who is a software developer for help; he told me that doing it in Perl is a bad idea... but this is my problem.
I moved to another lab for a couple of months, where they use Perl as a tool to analyse DNA data (at my own lab I only ever used existing software, command lines to modify files so they could be used correctly, and some tools that people in my lab wrote previously). In the weeks I have been working here I have really seen the power of writing your own scripts to solve problems.
The problem is that I must compare two files and select the lines of one of them whose fields meet a few requirements, which are comparisons with fields of the other file.
My files are (of course, these are only a few lines):
File 1
Start End Origin HomeCluster BAPSIndex Strain
1 58292 5 5 1 TW20.dna
87840 87883 5 5 1 TW20.dna
247298 253176 5 5 1 TW20.dna
395979 400031 5 5 1 TW20.dna
404314 404824 5 5 1 TW20.dna
File 2
Coordinate type RefAllele Strain SNPAllele
358909 Int <T> 5083_6_1 >A<
2074234 syn <G> 5083_6_1 >A<
31160 non <G> 5083_6_12 >A<
I must find the lines of file 2 whose Coordinate is within the range given by Start and End, and whose strain also matches; i.e. I must compare each line of file 2 with each line of file 1.
I have started the script many times, the variables are defined... but I cannot get results. I have tried arrays, hashes... I can't.
I include the script (the part that works) and the conditions that must be met.
#!/usr/bin/perl -w
# insideRecombinantSNP.pl
# Script to analyze the SNPs inside the recombinant regions
# if the file is not in your working directory, you have to write the complete path
use warnings;
print "Coordinate Type Reference Allele Strain Strain Allele\n";
open IN, "resultsnplinev2.out" or die; # file 1 and file 2: the compared files
open INN, "turkish_segments_tabularv2.txt" or die;
while(<IN>){
    if(m/^line\s+(\d+\s+\S+\s+\S+\s+\S+\s+\S+)/){
        $lineSNP=$1;
        $lineSNP =~m/^(\d+)\s+\S+\s+\S+\s+\S+\s+\S+/;
        $SNPcoor=$1;
        $lineSNP =~m/^\d+\s+\S+\s+\S+\s+(\S+)\s+\S+/;
        $SNPstrain=$1;
    }
    while(<INN>){
        if(m/^(\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+.*)/){
            $recline=$1;
            $recline =~m/^\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(.*)/;
            $recstrain=$1;
            $recline =~m/^(\d+)\s+\d+\s+\S+\s+\S+\s+\S+\s+.*/;
            $leftcoor=$1;
            $recline =~m/^\d+\s+(\d+)\s+\S+\s+\S+\s+\S+\s+.*/;
            $rightcoor=$1;
        }
    }
    if (($leftcoor<=$SNPcoor) && ($SNPcoor<=$rightcoor)){
        print "$lineSNP\n";
    }elsif ($recstrain eq $SNPstrain){
        print "$lineSNP\n";
    }
}
Any idea, hint or suggestion...?
Hello, check this:
#cat file1
Start End Origin HomeCluster BAPSIndex Strain
1 58292 5 5 1 TW20.dna
87840 87883 5 5 1 TW20.dna
247298 253176 5 5 1 TW20.dna
395979 400031 5 5 1 TW20.dna
404314 404824 5 5 1 TW20.dna
#cat file2
Coordinate type RefAllele Strain SNPAllele
358909 Int <T> 5083_6_1 >A<
2074234 syn <G> 5083_6_1 >A<
31160 non <G> 5083_6_12 >A<
#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2) {print a[i]" match --->"$0;next}}}' file2 file1
31160 non <G> 5083_6_12 >A< match --->1 58292 5 5 1 TW20.dna
Thanks so much for giving me a hand, but...
I know, I know, I'm the worst, but I don't know how to use your help... Should I put it into my script, or make a new script?
Thanks again, and I'm sorry for having so little skill, but I have only been working with Perl for 10 days...
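One possible way to use the one-liner without touching the Perl script is to keep it as its own small shell script. The sketch below assumes the sample data from this thread; the names file1, file2, match_range.awk and snps_in_range.txt are only placeholders, and the awk program is the range test from above written one statement per line (it skips the header of both files, which the one-liner does only for file2).

```shell
#!/bin/sh
# Hypothetical file names throughout; replace them with your own paths.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample segments (file 1) and SNPs (file 2), copied from this thread.
cat > file1 <<'EOF'
Start End Origin HomeCluster BAPSIndex Strain
1 58292 5 5 1 TW20.dna
87840 87883 5 5 1 TW20.dna
247298 253176 5 5 1 TW20.dna
395979 400031 5 5 1 TW20.dna
404314 404824 5 5 1 TW20.dna
EOF

cat > file2 <<'EOF'
Coordinate type RefAllele Strain SNPAllele
358909 Int <T> 5083_6_1 >A<
2074234 syn <G> 5083_6_1 >A<
31160 non <G> 5083_6_12 >A<
EOF

# The awk program saved in its own file, so it can be edited and reused.
cat > match_range.awk <<'EOF'
# Skip the header line of each input file.
FNR == 1 { next }
# First file on the command line (file2): remember every SNP line.
NR == FNR { snp[FNR] = $0; next }
# Second file (file1): $1 is Start, $2 is End of one segment.
{
    for (i in snp) {
        split(snp[i], f, " ")
        if (f[1] >= $1 && f[1] <= $2)
            print snp[i]
    }
}
EOF

# Run it and keep the matching SNP lines in an output file.
result=$(awk -f match_range.awk file2 file1)
printf '%s\n' "$result" > snps_in_range.txt
cat snps_in_range.txt
# prints: 31160 non <G> 5083_6_12 >A<
```

Note that this version only tests the coordinate range, and a SNP would be printed once per segment that contains it if segments overlap.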
No worries, give us an example of your expected output.
Thank you again,
I have to extract the lines of file 2 whose coordinates are between the Start and End positions of a line in file 1 that also belongs to the same strain. In other words, two conditions must be fulfilled: going line by line, check whether the strain matches and whether the coordinate is within the range formed by Start and End. If both are true, the expected output is the line of file 2.
For example:
Int <T> 5083_6_1 358909> A <3_6_1 358909> A <
I know that there are lines that fulfil these conditions; I checked by hand and found several.
---------- Post updated at 09:11 PM ---------- Previous update was at 07:39 PM ----------
I really don't know what happened, but the expected output is not that.
It is...
Coordinate type RefAllele Strain SNPAllele
240450 non <G> 6949_5_23 >A<
Sorry
You mean something like this:
#cat file1
Start End Origin HomeCluster BAPSIndex Strain
1 58292 5 5 1 TW20.dna
87840 87883 5 5 1 TW20.dna
247298 253176 5 5 1 TW20.dna
395979 400031 5 5 1 TW20.dna
404314 404824 5 5 1 TW20.dna >A<
#cat file2
Coordinate type RefAllele Strain SNPAllele
358909 Int <T> 5083_6_1 >A<
2074234 syn <G> 5083_6_1 >A<
31160 non <G> 5083_6_12 >A<
404820 non <G> 5083_6_12 >A<
#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {e=split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2 && b[e] == $NF) {print a[i];next}}}' file2 file1
404820 non <G> 5083_6_12 >A<
Yes, something like this, but the strain also has to match...
Something like:
line x of file 2: 136 non <T> 5083_6_1 >A<
This line matches line y of file 1: 12 52000 1 1 1 5083_6_1
12 <= 136 <= 52000 and 5083_6_1 = 5083_6_1
Then in my output file will appear:
136 non <T> 5083_6_1 >A<
I know that it's a bit difficult (at least for me), but I'm really grateful for your help.
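The two conditions in that example can be checked for a single pair of lines before worrying about whole files. This is only a sketch: the variable names lo, hi and strain are made up for illustration, and the values are the ones from the example (Start 12, End 52000, strain 5083_6_1).

```shell
# One SNP line from file 2, tested against one segment from file 1.
snp_line='136 non <T> 5083_6_1 >A<'

# $1 is the Coordinate and $4 is the Strain of the SNP line.
# awk prints the line only when both conditions hold.
hit=$(printf '%s\n' "$snp_line" |
  awk -v lo=12 -v hi=52000 -v strain=5083_6_1 \
      '$1 >= lo && $1 <= hi && $4 == strain')

echo "$hit"
# prints: 136 non <T> 5083_6_1 >A<
```

With a different strain or a coordinate outside the range, nothing is printed.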
Ok, this will solve your problem:
#cat file1
Start End Origin HomeCluster BAPSIndex Strain
1 58292 5 5 1 TW20.dna
87840 87883 5 5 1 TW20.dna
247298 253176 5 5 1 TW20.dna
395979 400031 5 5 1 TW20.dna
12 52000 1 1 1 5083_6_1
#cat file2
Coordinate type RefAllele Strain SNPAllele
358909 Int <T> 5083_6_1 >A<
2074234 syn <G> 5083_6_1 >A<
31160 non <G> 5083_6_12 >A<
404820 non <G> 5083_6_12 >A<
136 non <T> 5083_6_1 >A<
#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2 && b[4] == $NF) {print a[i];next}}}' file2 file1
136 non <T> 5083_6_1 >A<
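If the real files are large, an alternative worth trying is to read the segments first and group them by strain, so each SNP is only compared with segments of its own strain. This is a sketch of a different arrangement, not the command above: it takes file1 before file2, assumes Strain is the 6th column of file 1 and the 4th column of file 2, keeps the SNP header line, and prints each matching SNP once in the order of file 2. The array names n, lo and hi are made up for illustration.

```shell
workdir=$(mktemp -d)

# Sample data from this thread (hypothetical file names).
cat > "$workdir/file1" <<'EOF'
Start End Origin HomeCluster BAPSIndex Strain
1 58292 5 5 1 TW20.dna
87840 87883 5 5 1 TW20.dna
247298 253176 5 5 1 TW20.dna
395979 400031 5 5 1 TW20.dna
12 52000 1 1 1 5083_6_1
EOF

cat > "$workdir/file2" <<'EOF'
Coordinate type RefAllele Strain SNPAllele
358909 Int <T> 5083_6_1 >A<
2074234 syn <G> 5083_6_1 >A<
31160 non <G> 5083_6_12 >A<
404820 non <G> 5083_6_12 >A<
136 non <T> 5083_6_1 >A<
EOF

matches=$(awk '
# First file (segments): store Start and End of each segment per strain.
NR == FNR {
    if (FNR > 1) {
        n[$6]++
        lo[$6, n[$6]] = $1 + 0
        hi[$6, n[$6]] = $2 + 0
    }
    next
}
# Second file (SNPs): keep the header line.
FNR == 1 { print; next }
# Print the SNP line once if any segment of the same strain contains it.
{
    for (k = 1; k <= n[$4]; k++)
        if ($1 + 0 >= lo[$4, k] && $1 + 0 <= hi[$4, k]) { print; break }
}
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$matches"
# prints:
# Coordinate type RefAllele Strain SNPAllele
# 136 non <T> 5083_6_1 >A<
```

Redirecting the awk output to a file (for example with > output.txt) gives the output file asked for.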
I'm sure this will solve the problem...
Sorry, but I was out for a couple of days.