Find common numbers and print yes or no

manigrover · September 20, 2012, 11:44pm

Hi

I have two files with the following data.

First file:

 sp|Q676U5|A16L1_HUMAN, 
    Autophagy-related protein 16-1 OS=Homo sapiens GN=ATG16L1 PE=1 SV=2,
  Maximum coiled-coil residue probability: 0.657 in position 163.
  Maximum dimeric residue probability:     0.288 in position 163.
  Maximum trimeric residue probability:    0.369 in position 163.
Coil    0.63@  91- 118:c,3    0.60@ 154- 190:c,3

Second file:

AC   Q676U5; A3EXK9; A3EXL0; B6ZDH0; Q6IPN1; Q6UXW4; Q6ZVZ5; Q8NCY2;
AC   Q96JV5; Q9H619;
FT   COILED       78    230       Potential.
FT   VARIANT     300    300       T -> A (associated with susceptibility to
FT   VARIANT     307    307       E -> K (in dbSNP:rs1866878).

If the accession after "sp|" in the first file ("Q676U5") matches the first accession after "AC" in the second file ("Q676U5"), the script should look at the "VARIANT" lines of that entry in the second file and check whether the number after "VARIANT" lies within one of the numeric ranges given after "@" in the first file:

@ 91- 118 @ 154- 190

The expected output for this entry is

Q676U5 : No

because the numbers after VARIANT in the second file, 300 and 307, do not lie in either range (91-118 or 154-190).

In the same way, the other entries should be matched by accession, printing Yes or No depending on whether the number after VARIANT in the second file lies within a range after "@" in the first file.
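
As a rough illustration of the check described above, here is a minimal Python sketch for the single sample entry. It is not a definitive solution: the function names (parse_ranges, variant_positions, verdict) are made up for this example, it assumes ranges always look like "@ start- end" on the Coil line, that only the first number after VARIANT matters, and that one variant falling inside any range is enough for "Yes".

```python
import re

# Assumption: every range follows an '@' and looks like "@ 91- 118".
RANGE_RE = re.compile(r"@\s*(\d+)\s*-\s*(\d+)")

def parse_ranges(coil_line):
    """Return a list of (start, end) tuples found after each '@'."""
    return [(int(a), int(b)) for a, b in RANGE_RE.findall(coil_line)]

def variant_positions(ft_lines):
    """Return the first number of every 'FT VARIANT' line."""
    positions = []
    for line in ft_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "FT" and fields[1] == "VARIANT":
            positions.append(int(fields[2]))
    return positions

def verdict(ranges, positions):
    """'Yes' if any position lies inside any range (inclusive), else 'No'."""
    hit = any(lo <= p <= hi for p in positions for lo, hi in ranges)
    return "Yes" if hit else "No"

# Sample lines copied from the two files in this post.
coil = "Coil    0.63@  91- 118:c,3    0.60@ 154- 190:c,3"
ft = [
    "FT   COILED       78    230       Potential.",
    "FT   VARIANT     300    300       T -> A (associated with susceptibility to",
    "FT   VARIANT     307    307       E -> K (in dbSNP:rs1866878).",
]

print("Q676U5 :", verdict(parse_ranges(coil), variant_positions(ft)))
# prints "Q676U5 : No"
```

With the sample data, the ranges parse to (91, 118) and (154, 190), and neither 300 nor 307 falls inside them, so the result is No.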

Don_Cragun · September 21, 2012, 11:23am

With over 150 posts, I would expect that you have some idea of how to do this. What have you tried so far?

Your specification of how to determine whether an entry in file1 matches an entry in file2 is very weak. Please clarify the requirements by answering all of the following questions:

1. Does the first line of an entry in file1 always start with <space>sp?
   a. If not, what else can come before the sp besides a <space> character?
2. Does the first line in an entry in file1 always use | as the field separator?
3. Do any other lines in an entry in file1 contain a | character?
4. Is sp in file1 always in lowercase letters?
5. How do we find the ranges to be checked?
   a. How do we recognize that a range is present?
   b. Are lines with ranges the only lines in file1 that contain the @ character?
   c. Does the line that contains the ranges always start with Coil in column 1 (uppercase C and lowercase oil)?
   d. Are there always exactly two ranges to be matched against?
   e. Do ranges always immediately follow an @ character?
6. What constitutes a successful match on the ranges?
   a. Does just one variant have to match any of the given ranges, or does each variant have to match one of the ranges?
   b. Does the 1st variant have to fall within the 1st range, the 2nd variant within the 2nd range, etc.?
   c. Will there always be the same number of variants as there are ranges in matched records in file1 and file2?
7. In file2, is a 2nd contiguous line starting with AC a continuation of the previous line, or is it a separate AC instance? (I.e., if the 1st line in file1 had been: sp|Q96JV5|A16L1_HUMAN, instead of: sp|Q676U5|A16L1_HUMAN, should it still have matched the same entry in file2?)
8. In file2, is VARIANT case sensitive?
9. In file2, will VARIANT only appear on a line starting with FT?
10. In file1, is there any separator between entries?
11. In file2, is there any separator between entries?
12. Approximately how large are file1 and file2?
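
To show how much the answers to these questions shape a solution, here is a minimal Python sketch built on one possible set of answers. Everything below is an assumption, not something the original poster has confirmed: an entry in file1 starts on a line whose text (after leading spaces) begins with "sp|"; ranges appear only on a line starting with "Coil"; in file2 contiguous AC lines are continuations and only the first accession is used for matching; an AC line after any non-AC line opens a new entry; VARIANT is uppercase on FT lines; and a single variant inside any range yields "Yes". The function names read_file1, read_file2 and report are hypothetical.

```python
import re

RANGE_RE = re.compile(r"@\s*(\d+)\s*-\s*(\d+)")

def read_file1(lines):
    """Map accession -> list of (start, end) ranges (assumed layout)."""
    ranges_by_acc = {}
    acc = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("sp|"):
            acc = stripped.split("|")[1]
            ranges_by_acc[acc] = []
        elif line.startswith("Coil") and acc is not None:
            ranges_by_acc[acc] += [
                (int(a), int(b)) for a, b in RANGE_RE.findall(line)
            ]
    return ranges_by_acc

def read_file2(lines):
    """Map first accession of each entry -> list of VARIANT positions."""
    variants_by_acc = {}
    acc = None
    prev_was_ac = False
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "AC":
            if not prev_was_ac:  # first AC line of a new entry
                acc = fields[1].rstrip(";")
                variants_by_acc[acc] = []
            prev_was_ac = True
            continue
        prev_was_ac = False
        if acc is not None and len(fields) >= 3 and fields[:2] == ["FT", "VARIANT"]:
            variants_by_acc[acc].append(int(fields[2]))
    return variants_by_acc

def report(file1_lines, file2_lines):
    """Return 'ACC : Yes/No' for every accession present in both files."""
    ranges_by_acc = read_file1(file1_lines)
    variants_by_acc = read_file2(file2_lines)
    out = []
    for acc, ranges in ranges_by_acc.items():
        if acc not in variants_by_acc:
            continue  # no matching entry in file2
        hit = any(lo <= p <= hi
                  for p in variants_by_acc[acc] for lo, hi in ranges)
        out.append(f"{acc} : {'Yes' if hit else 'No'}")
    return out

# Sample data taken from the first post.
file1 = [
    " sp|Q676U5|A16L1_HUMAN, ",
    "Coil    0.63@  91- 118:c,3    0.60@ 154- 190:c,3",
]
file2 = [
    "AC   Q676U5; A3EXK9; A3EXL0; B6ZDH0; Q6IPN1; Q6UXW4; Q6ZVZ5; Q8NCY2;",
    "AC   Q96JV5; Q9H619;",
    "FT   COILED       78    230       Potential.",
    "FT   VARIANT     300    300       T -> A (associated with susceptibility to",
    "FT   VARIANT     307    307       E -> K (in dbSNP:rs1866878).",
]

for line in report(file1, file2):
    print(line)
```

Different answers would change the code: for example, if secondary accessions such as Q96JV5 should also match, read_file2 would have to index every accession on the AC lines rather than only the first.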