I have a variation table (variation.txt) which is a very big file. The first column in the chromosome number and the second column is the position of the variation. I have a second file annotation.txt which has a list of 37,000 genes (1st column), their chromosome number(2nd column), their start and end coordinates (3rd and 4th column), followed by some details
I have to assign the variations (based on chromosome number and its position) to the genes. First, it should look for the matching chromosome number in both files, and if that matches, the coordinate of the variation should be within (including) start and end position of the gene. I have made an attempt to print the entries of variation table. I want to append the gene IDs also from annotation table in the first column of the output. How do I do that ?
variation.txt
SL3.0ch02 702679 C A - - - - - - - -
SL3.0ch01 711131 A G - - - - - - - -
SL3.0ch00 715124 G A - - - - - - - -
SL3.0ch00 719289 C T - - - - - - - -
SL3.0ch00 720926 A C - - - - - - - -
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED
annotation.txt
Solyc00g005000.3.1 SL3.0ch02 702600 702900 + Eukaryotic aspartyl protease family protein
Solyc00g005040.3.1 SL3.0ch01 715100 715200 + Potassium channel
Solyc00g005050.3.1 SL3.0ch00 715150 715300 - UPF0664 stress-induced protein C29B12.11c
Solyc00g005060.1.1 SL3.0ch00 723741 724013 - LOW QUALITY:Cyclin/Brf1-like TBP-binding protein
Solyc00g005080.2.1 SL3.0ch00 723800 723900 - LOW QUALITY:Protein Ycf2
Solyc00g005084.1.1 SL3.0ch05 809593 813633 + UDP-Glycosyltransferase superfamily protein
Solyc00g005090.1.1 SL3.0ch07 1061632 1061916 - LOW QUALITYYNAMIN-like 1B
Solyc00g005092.1.1 SL3.0ch01 1127794 1144385 + Serine/threonine phosphatase-like protein
Solyc00g005094.1.1 SL3.0ch00 1144958 1146952 - Glucose-6-phosphate 1-dehydrogenase 3, chloroplastic
Solyc00g005096.1.1 SL3.0ch00 1734562 1736567 + RWP-RK domain-containing protein
awk script:
awk '
NR==FNR {
a[$2][$3 " " $4]=$0
next
}
($1 in a){
for(i in a[$1])
if(split(i,t)&&$2>=t[1]&&$2<=t[2])
print
}' annotation.txt variation.txt
Desired output:
Solyc00g005060.1.1 SL3.0ch02 702679 C A - - - - - - - -
Solyc00g005060.1.1 SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Lowconfidence)
Solyc00g005080.2.1 SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Lowconfidence)
Solyc00g005060.1.1 SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
Solyc00g005080.2.1 SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
Solyc00g005060.1.1 SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED
Current output:
SL3.0ch02 702679 C A - - - - - - - -
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED