Hi there,
I have a question about how to set this up. Here is the situation.
I have two files. The first is ~31,000 lines long, and each line has the following information (7 fields):
file1
1 + 100208127 100261594 6 100208127,100231680,100237404,100245177,100249508,100260529, 100208306,100231885,100237559,100245300,100249677,100261594,
1 + 100217082 100217185 1 100217082, 100217185,
1 + 100276376 100321515 12 100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146, 100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,
The 5th field is important: it gives the number of segments represented in fields 6 and 7. For example, the first line shows 6, so the first number in field 6 is the start of the first segment and the first number in field 7 is the end of the first segment, and so on until you have all 6 segments. The second line shows only 1 in field 5, so there is only one segment, starting at 100217082 and ending at 100217185.
The second file varies in length and can be anywhere from 3,000,000 to 10,000,000 lines. Each line has 4 fields:
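To make the field layout concrete, here is a minimal Python sketch of how one file1 line could be turned into (start, end) segment pairs. The function name `parse_segments` is hypothetical, and the sketch assumes the fields are whitespace-separated and that fields 6 and 7 are comma-separated lists with a trailing comma, as in the sample lines above.

```python
def parse_segments(line):
    """Return (field1, field2, [(start, end), ...]) for one file1 line.

    Assumes 7 whitespace-separated fields; fields 6 and 7 are
    comma-separated lists that end with a trailing comma.
    """
    fields = line.split()
    count = int(fields[4])  # field 5: number of segments
    # Drop the empty string left behind by the trailing comma.
    starts = [int(x) for x in fields[5].split(",") if x]
    ends = [int(x) for x in fields[6].split(",") if x]
    if len(starts) != count or len(ends) != count:
        raise ValueError("field 5 does not match the lengths of fields 6 and 7")
    # Pair the i-th start with the i-th end.
    return fields[0], fields[1], list(zip(starts, ends))


print(parse_segments("1 + 100217082 100217185 1 100217082, 100217185,"))
# ('1', '+', [(100217082, 100217185)])
```

The check against field 5 is only a sanity test; it is not needed for the pairing itself.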
file2
1 100208130 100208166 +
1 100208310 100208346 +
1 100217090 100217126 +
1 100231689 100231725 +
As you can see, fields 2 and 3 always differ by 36. I want to know how many lines in file2 are contained within file1, specifically within its segments (remember that each line in file1 has a different number of segments, e.g. 6, 1, and 12 above, as given in field 5).
So if I use these two files to generate my output, the output would tell me:
There are 3 lines from file2 that match or overlap segments in file1 and 1 line from file2 that DOES NOT match or overlap any segment in file1.
YES 1 100208130 100208166 +
NO 1 100208310 100208346 +
YES 1 100217090 100217126 +
YES 1 100231689 100231725 +
To do this kind of computation, do you think it's important to use hashes for the first file or the second file, and if so, how would I set this up? Can someone help? Thanks!
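To make the question concrete, here is a minimal Python sketch of one possible setup, not a definitive answer. It stores the smaller file (file1) in a dict (hash) keyed by field 1, holding sorted and merged segments, and then checks each file2 line with a binary search. The names `build_index`, `overlaps`, and `classify` are hypothetical. The sketch assumes that field 1 must match between the two files, that the +/- field is ignored, that coordinates are inclusive, and that any overlap counts as YES; each of these would need adjusting if the real rules differ.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(file1_lines):
    """Map field 1 -> (starts, ends) of sorted, merged file1 segments."""
    by_key = defaultdict(list)
    for line in file1_lines:
        f = line.split()
        starts = [int(x) for x in f[5].split(",") if x]
        ends = [int(x) for x in f[6].split(",") if x]
        by_key[f[0]].extend(zip(starts, ends))
    index = {}
    for key, segs in by_key.items():
        segs.sort()
        merged = []
        for s, e in segs:
            if merged and s <= merged[-1][1]:
                # Overlaps the previous segment: extend it.
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        index[key] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, key, start, end):
    """True if [start, end] overlaps any segment stored under key."""
    if key not in index:
        return False
    starts, ends = index[key]
    # Last merged segment that starts at or before `end`.
    i = bisect_right(starts, end) - 1
    return i >= 0 and ends[i] >= start


def classify(file1_lines, file2_lines):
    """Prefix each file2 line with YES or NO."""
    index = build_index(file1_lines)
    out = []
    for line in file2_lines:
        key, start, end, _sign = line.split()
        hit = overlaps(index, key, int(start), int(end))
        out.append(("YES " if hit else "NO ") + line)
    return out


file1 = [
    "1 + 100208127 100261594 6 100208127,100231680,100237404,100245177,100249508,100260529, 100208306,100231885,100237559,100245300,100249677,100261594,",
    "1 + 100217082 100217185 1 100217082, 100217185,",
    "1 + 100276376 100321515 12 100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146, 100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,",
]
file2 = [
    "1 100208130 100208166 +",
    "1 100208310 100208346 +",
    "1 100217090 100217126 +",
    "1 100231689 100231725 +",
]
result = classify(file1, file2)
print("\n".join(result))
```

On the sample data this prints the same four YES/NO lines shown above. In a real run, file2 could be read line by line from disk rather than loaded into a list, so that only the file1 index needs to be held in memory.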