Help with processing coordinates in a file.

I have a variation table (variation.txt) which is a very big file. The first column in the chromosome number and the second column is the position of the variation. I have a second file annotation.txt which has a list of 37,000 genes (1st column), their chromosome number(2nd column), their start and end coordinates (3rd and 4th column), followed by some details

I have to assign the variations (based on chromosome number and its position) to the genes. First, it should look for the matching chromosome number in both files, and if that matches, the coordinate of the variation should be within (including) start and end position of the gene. I have made an attempt to print the entries of variation table. I want to append the gene IDs also from annotation table in the first column of the output. How do I do that ?

variation.txt

SL3.0ch02     702679    C     A     -     -     -     -     -     -     -     -    
SL3.0ch01     711131    A     G     -     -     -     -     -     -     -     -
SL3.0ch00     715124    G     A     -     -     -     -     -     -     -     -
SL3.0ch00     719289    C     T     -     -     -     -     -     -     -     -
SL3.0ch00     720926    A     C     -     -     -     -     -     -     -     -
SL3.0ch00     723860    A     C     Solyc00g005060.1     CDS     NONSYNONYMOUS     W/G     52    0    novel     DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00     723867    A     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     49    1    novel     TOLERATED
SL3.0ch00     723903    T     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     37    1    novel     TOLERATED

annotation.txt

Solyc00g005000.3.1    SL3.0ch02    702600    702900    +    Eukaryotic aspartyl protease family protein
Solyc00g005040.3.1    SL3.0ch01    715100    715200    +    Potassium channel
Solyc00g005050.3.1    SL3.0ch00    715150    715300    -    UPF0664 stress-induced protein C29B12.11c
Solyc00g005060.1.1    SL3.0ch00    723741    724013    -    LOW QUALITY:Cyclin/Brf1-like TBP-binding protein
Solyc00g005080.2.1    SL3.0ch00    723800    723900    -    LOW QUALITY:Protein Ycf2
Solyc00g005084.1.1    SL3.0ch05    809593    813633    +    UDP-Glycosyltransferase superfamily protein
Solyc00g005090.1.1    SL3.0ch07    1061632    1061916    -    LOW QUALITYYNAMIN-like 1B
Solyc00g005092.1.1    SL3.0ch01    1127794    1144385    +    Serine/threonine phosphatase-like protein
Solyc00g005094.1.1    SL3.0ch00    1144958    1146952    -    Glucose-6-phosphate 1-dehydrogenase 3, chloroplastic
Solyc00g005096.1.1    SL3.0ch00    1734562    1736567    +    RWP-RK domain-containing protein

awk script:

awk '
NR==FNR {
    a[$2][$3 " " $4]=$0
    next
}
($1 in a){
    for(i in a[$1])
        if(split(i,t)&&$2>=t[1]&&$2<=t[2])
            print
}' annotation.txt variation.txt

Desired output:

Solyc00g005060.1.1    SL3.0ch02    702679    C    A    -    -    -    -    -    -    -    -
Solyc00g005060.1.1    SL3.0ch00    723860    A    C    Solyc00g005060.1    CDS    NONSYNONYMOUS    W/G    52    0    novel    DELETERIOUS (*WARNING! Lowconfidence)
Solyc00g005080.2.1    SL3.0ch00    723860    A    C    Solyc00g005060.1    CDS    NONSYNONYMOUS    W/G    52    0    novel    DELETERIOUS (*WARNING! Lowconfidence)
Solyc00g005060.1.1    SL3.0ch00    723867    A    C    Solyc00g005060.1    CDS    SYNONYMOUS    G/G    49    1    novel    TOLERATED
Solyc00g005080.2.1    SL3.0ch00    723867    A    C    Solyc00g005060.1    CDS    SYNONYMOUS    G/G    49    1    novel    TOLERATED
Solyc00g005060.1.1    SL3.0ch00    723903    T    C    Solyc00g005060.1    CDS    SYNONYMOUS    G/G    37    1    novel    TOLERATED 

Current output:

SL3.0ch02   702679  C   A   -   -   -   -   -   -   -   -
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS     NONSYNONYMOUS   W/G     52  0   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS     NONSYNONYMOUS   W/G     52  0   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     49  1   novel   TOLERATED
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     49  1   novel   TOLERATED
SL3.0ch00   723903  T   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     37  1   novel   TOLERATED

Please explain how this line

Solyc00g005060.1.1    SL3.0ch02    702679    C    A    -    -    -    -    -    -    -    - 

made it into the desired ouput - there is no connection between Solyc00g005060.1.1 and SL3.0ch02 in your annotation file, but it exists for the Solyc00g005000.3.1 and SL3.0ch02 pair. And, why isn't Solyc00g005050.3.1 / SL3.0ch00 mentioned? Shouldn't that match the third variation line?

Howsoever, pls try and report back

awk '
NR==FNR         {a[$2 FS $3 FS $4] = $1
                 next
                }
                {for (i in a) if (split (i,t) && $2>=t[2] && $2<=t[3])  print a, $0
                }
' annotation.txt variation.txt
Solyc00g005000.3.1 SL3.0ch02     702679    C     A     -     -     -     -     -     -     -     -    
Solyc00g005040.3.1 SL3.0ch00     715124    G     A     -     -     -     -     -     -     -     -
Solyc00g005060.1.1 SL3.0ch00     723860    A     C     Solyc00g005060.1     CDS     NONSYNONYMOUS     W/G     52    0    novel     DELETERIOUS (*WARNING! Low confidence)
Solyc00g005080.2.1 SL3.0ch00     723860    A     C     Solyc00g005060.1     CDS     NONSYNONYMOUS     W/G     52    0    novel     DELETERIOUS (*WARNING! Low confidence)
Solyc00g005060.1.1 SL3.0ch00     723867    A     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     49    1    novel     TOLERATED
Solyc00g005080.2.1 SL3.0ch00     723867    A     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     49    1    novel     TOLERATED
Solyc00g005060.1.1 SL3.0ch00     723903    T     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     37    1    novel     TOLERATED

Its results match the desired output except for the two mentioned lines plus, mayhap, the field separators (which weren't specified, btw).