Map SNPs to a reference gene file

marwah · January 19, 2017, 7:27pm

I have the following data set of SNP IDs in a txt file:

   POS ID	
    	78599583	rs987435
    	33395779	rs345783
    	189807684	rs955894
    	33907909	rs6088791
    	75664046	rs11180435
    	218890658	rs17571465
    	127630276	rs17011450
    	90919465	rs6919430

and a gene reference file, also a txt file:

 genename	name	chrom	strand	txstart	txend
    CDK1	NM_001786	chr10	+	62208217	62224616
    CALB2	NM_001740	chr16	+	69950116	69981843
    STK38	NM_007271	chr6	-	36569637	36623271
    YWHAE	NM_006761	chr17	-	1194583	1250306
    SYT1	NM_005639	chr12	+	77782579	78369919
    ARHGAP22	NM_001347736	chr10	-	49452323	49534316
    PRMT2	NM_001535	chr21	+	46879934	46909464
    CELSR3	NM_001407	chr3	-	48648899	48675352

I'm trying to match the genes with the SNPs using the SNP location, so I want to include the SNPs that have

POS >= txstart and POS <= txend

For example, I want a data set that has the following columns:

genename   SNPID   chrom   position   txstart   txend

Don_Cragun · January 19, 2017, 8:47pm

And what output are you trying to get from the two sample input files you provided?

What happens if there is no ID in the 1st file that appears in a range specified by the 2nd file?

What happens if there is more than one ID in the 1st file that fits in a range specified by a single line in the 2nd file?

What happens if there is no range in the 2nd file for a position specified in the 1st file?

What have you tried to solve this problem on your own?

marwah · January 19, 2017, 9:06pm

I'm expecting that a gene might have more than one SNP ID,
there might be genes that don't have any SNPs, in which case it will be NA,
and there might be exactly one SNP ID per gene.

awk 'FNR==1 {next} FILENAME=="pre_snpinfo_tumor.txt" {k++; POS[k]=$2; ID[k]=$2;} \  
                   FILENAME=="refFlat.txt" {i++; \
                                     if(POS>=$5 && POS<=$6) \
                                          print $1, ID, $3, POS, $5, $6} \
    ' pre_snpinfo_tumor.txt  refFlat.txt

but there is an error. Can you help, please?

Don_Cragun · January 19, 2017, 9:44pm

One might think that something more like:

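awk '
FNR==1 {next}                   # skip the header line of each file
FNR == NR {                     # first file: store each SNP position and ID
        POS[++k]=$1
        ID[k]=$2
        next
}
{       for(i = 1; i <= k; i++) # second file: test every SNP against this gene
                if(POS[i]>=$5 && POS[i]<=$6)
                        print $1, ID[i], $3, POS[i], $5, $6
}' pre_snpinfo_tumor.txt  refFlat.txt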

would work, but since absolutely none of the positions specified in your 1st sample input file are in any of the ranges specified by your 2nd sample input file, no output is produced. I guess that is to be expected because I asked you what output you wanted your script to produce from your sample input files and you didn't give an answer to that question.
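The claim that the sample files produce no output can be checked directly. The following is only a sketch: it copies the data from post #1 into temporary files (the temporary directory and the `matches` variable are illustrative and not part of the thread) and counts how many position/range pairs satisfy `POS >= txstart && POS <= txend`.

```shell
#!/bin/sh
# Sample data copied from post #1; the temporary directory is an assumption.
dir=$(mktemp -d)
cat > "$dir/pre_snpinfo_tumor.txt" <<'EOF'
POS ID
78599583 rs987435
33395779 rs345783
189807684 rs955894
33907909 rs6088791
75664046 rs11180435
218890658 rs17571465
127630276 rs17011450
90919465 rs6919430
EOF
cat > "$dir/refFlat.txt" <<'EOF'
genename name chrom strand txstart txend
CDK1 NM_001786 chr10 + 62208217 62224616
CALB2 NM_001740 chr16 + 69950116 69981843
STK38 NM_007271 chr6 - 36569637 36623271
YWHAE NM_006761 chr17 - 1194583 1250306
SYT1 NM_005639 chr12 + 77782579 78369919
ARHGAP22 NM_001347736 chr10 - 49452323 49534316
PRMT2 NM_001535 chr21 + 46879934 46909464
CELSR3 NM_001407 chr3 - 48648899 48675352
EOF

# Count every SNP position that falls inside some txstart..txend range.
matches=$(awk '
FNR == 1 { next }                    # skip the header line of each file
FNR == NR { POS[++k] = $1; next }    # first file: remember the positions
{ for (i = 1; i <= k; i++)
        if (POS[i] >= $5 && POS[i] <= $6) n++ }
END { print n + 0 }' "$dir/pre_snpinfo_tumor.txt" "$dir/refFlat.txt")

echo "matching SNP/gene pairs: $matches"    # prints: matching SNP/gene pairs: 0
rm -r "$dir"
```

The largest txend in the gene file is 78369919, and the only positions below it (33395779, 33907909, 75664046) fall in gaps between the listed ranges, so the count is zero.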

If this doesn't work for your real data, you might consider giving us some sample input that you think should produce some output and actually show us what output you are trying to produce from those inputs.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

marwah · January 19, 2017, 9:52pm

The output is file2, which is the gene info, with the SNP ID added to it:

**

 names seqnames**** start****** end**  GENEID 
* rs3753344**** chr1** 1142150** 1142150** ** TNFRSF18******
* rs3753344**** chr1** 1142150** 1142150 **** NA
 rs12191877**** chr6* 31252925* 31252925  HLA-B******* 
** rs881375**** chr9  123652898 123652898 *** NA

Don_Cragun · January 19, 2017, 9:58pm

One last time: Please show us exactly what output you want your code to produce when given the input files you provided in post #1 in this thread. If you are unwilling to do that, I'll close the thread.

marwah · January 19, 2017, 10:01pm

No, please! I have added the data I want to see as output.

The output is file2, which is the gene info, with the SNP ID added to it:

names        seqnames  start      end        GENEID
rs3753344    chr1      1142150    1142150    TNFRSF18
rs3753344    chr1      1142150    1142150    NA
rs12191877   chr6      31252925   31252925   HLA-B
rs881375     chr9      123652898  123652898  NA

I don't know where the stars came from but this is the data without the stars

Don_Cragun · January 19, 2017, 10:24pm

No you have not. None of the data shown in the columns of the output you said you wanted in post #5 and in post #7 (even if the asterisks are removed) in this thread:

genename   SNPID   chrom   position   txstart   txend

show up anywhere in either of the sample input files shown in post #1 except for the chrom field.

This thread is closed!

Please consider opening a new thread where you show us two small sample input files and show us the exact output that you want your script to produce from those sample input files. Make sure that the data given in those files includes data that tests all of your corner cases and be sure that your description clearly specifies what should happen if multiple values match, what should happen if no values match, what should happen if one value matches, and any special cases that haven't been identified so far by this list.
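As an illustration of the kind of test case requested here, the sketch below uses two tiny made-up input files: one gene containing two SNPs, one gene containing none, and one SNP that lies outside every gene. None of the file names, gene names, or values come from the thread, and `map_snps` is a hypothetical helper. Printing NA for a gene without SNPs is an assumption based on the description in post #3, not a confirmed requirement.

```shell
#!/bin/sh
# Hypothetical sample files; these names and values are not from the thread.
dir=$(mktemp -d)
cat > "$dir/snps.txt" <<'EOF'
POS ID
150 rsA
160 rsB
900 rsC
EOF
cat > "$dir/genes.txt" <<'EOF'
genename name chrom strand txstart txend
G1 NM_1 chr1 + 100 200
G2 NM_2 chr2 - 300 400
EOF

# map_snps SNPFILE GENEFILE (hypothetical helper)
# Prints: genename SNPID chrom position txstart txend
map_snps() {
    awk '
    FNR == 1 { next }               # skip the header line of each file
    FNR == NR {                     # first file: remember every SNP
        POS[++k] = $1
        ID[k] = $2
        next
    }
    {                               # second file: one gene per line
        found = 0
        for (i = 1; i <= k; i++)
            if (POS[i] >= $5 && POS[i] <= $6) {
                print $1, ID[i], $3, POS[i], $5, $6
                found = 1
            }
        if (!found)                 # assumption: a gene without SNPs gets NA
            print $1, "NA", $3, "NA", $5, $6
    }' "$1" "$2"
}

map_snps "$dir/snps.txt" "$dir/genes.txt"
# Output:
# G1 rsA chr1 150 100 200
# G1 rsB chr1 160 100 200
# G2 NA chr2 NA 300 400
```

Note that rsC at position 900 is silently dropped because no gene range contains it, and that the match ignores chromosomes because the SNP file has no chrom column. Both are choices a full problem statement would need to spell out.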