Script to search and extract the gene sub-location from a gff file.

reena2305 · June 27, 2011, 11:21am

Hi, my problem is that I have two files. File no. 1 is a gff text file (say gi1) that has gene information like:

********************

     gene            39389788..39395643
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     mRNA            join(39389788..39389839,39390696..39390861,
                     39391681..39391799,39393855..39394100,39394750..39394878,
                     39394997..39395162,39395375..39395643)
                     /gene="RPSA"
                     /product="ribosomal protein SA, transcript variant 1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NM_002295.4"
                     /db_xref="GI:70609879"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     mRNA            join(39390696..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395643)
                     /gene="RPSA"
                     /product="ribosomal protein SA, transcript variant 2"
                     /exception="unclassified transcription discrepancy"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NM_001012321.1"
                     /db_xref="GI:59859884"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     CDS             join(39390729..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395469)
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /codon_start=1
                     /product="40S ribosomal protein SA"
                     /protein_id="NP_001012321.1"
                     /db_xref="GI:59859885"
                     /db_xref="CCDS:CCDS2686.1"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     CDS             join(39390729..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395469)
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /codon_start=1
                     /product="40S ribosomal protein SA"
                     /protein_id="NP_002286.2"
                     /db_xref="GI:9845502"
                     /db_xref="CCDS:CCDS2686.1"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     gene            39391466..39391614
                     /gene="SNORA6"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /db_xref="GeneID:574040"
                     /db_xref="HGNC:32591"
     ncRNA           39391466..39391614
                     /gene="SNORA6"
                     /ncRNA_class="snoRNA"
                     /product="small nucleolar RNA, H/ACA box 6"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NR_002325.1"
                     /db_xref="GI:68510025"
                     /db_xref="GeneID:574040"
                     /db_xref="HGNC:32591"
     gene            39394155..39394308
                     /gene="SNORA62"
                     /note="Derived by automated computational analysis using...

*****************************************

Now, file no. 2 is a mapped text file like:

*********************************

 Gene_input_file: f3

sno_input_file: chr3


319 found_in_gene 52698648..52707224 at 52704105 and_count: 5457
68 found_in_gene 52698648..52707224 at 52705463 and_count: 6815
82 found_in_gene 52698648..52707224 at 52701967 and_count: 3319
124 found_in_gene 39793218..40244467 at 40222682 and_count: 429464
202 found_in_gene 9443305..10558922 at 10110734 and_count: 667429
228 found_in_gene 46262602..46896241 at 46629723 and_count: 367121
..and so on.

**************************************

So, I need to extract the region from file 2, say 52698648..52707224 for id 319, which begins at position 52704105 in the gff file, and then search for it in file 1 to find the sub-location of this gene, say whether it's in cDNA, mRNA etc. If it's not found, the output should be:

'319 not found Intron'

Otherwise, if it's found, the output should be:

'319 found_in mRNA.'

Please help me with a shell or Perl script (or both). I am new to the Linux world.
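A minimal sketch of the reporting step described above, in Python rather than shell or Perl. It assumes the "at" position is the coordinate to test, that ranges are inclusive, and that the trailing period in the sample output is only punctuation. The function name `classify`, the hit ids 1 and 2, and the test positions are hypothetical; the two segments are the first two mRNA segments of RPSA from the sample.

```python
def classify(hit_id, position, features):
    """Build the output line for one hit.

    features: list of (feature_type, [(start, end), ...]) for one gene block.
    """
    for feature_type, segments in features:
        # The position is inside a feature if any segment covers it (inclusive).
        if any(start <= position <= end for start, end in segments):
            return f"{hit_id} found_in {feature_type}"
    # No sub-feature covers the position, so report it as intronic.
    return f"{hit_id} not found Intron"


# First two segments of RPSA transcript variant 1, copied from the sample.
rpsa_features = [
    ("mRNA", [(39389788, 39389839), (39390696, 39390861)]),
]

print(classify(1, 39389800, rpsa_features))  # inside the first segment
print(classify(2, 39390000, rpsa_features))  # in the gap between the segments
```

The sketch reports the first feature that matches, so the order of `features` decides whether an overlapping hit is labelled mRNA or CDS.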

panyam · June 27, 2011, 11:50am

Neither your statements nor your sample data explain the problem fully.

Please use code tags when you post the sample data.

reena2305 · June 28, 2011, 12:24am

@panyam

Sorry, this was my first post, so I didn't know about that.

Regarding the problem:

gene            39389788..39395643

This is the position of a particular gene in the whole genome. The gene is made up of CDS, mRNA, introns etc., and that information is listed right below it, like:

mRNA            join(39389788..39389839,39390696
 CDS             join(39390729..39390861,39391681..39..

and so on, until the information for the next gene begins,
like (say gene2):

gene            39391466..39391614

So I have a file with these 'gene' locations, and I need to extract the sub-location of each one, i.e. whether it's in CDS, mRNA or Intron (in case no match is found).
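The layout described above (a 'gene' line, then its mRNA/CDS entries until the next 'gene' line) could be read roughly as follows. This is a sketch in Python under stated assumptions, not a full parser: it only recognises locations of the form A..B or join(...), it skips every qualifier line, and the name `parse_gene_blocks` is hypothetical. The sample lines are copied from the file excerpt in the first post.

```python
import re

# A feature line is a key (gene, mRNA, CDS, ...) followed by a location.
FEATURE_RE = re.compile(r"^\s*(\w+)\s+((?:join\()?\d+\.\.\d+[\d.,()]*)\s*$")


def parse_gene_blocks(lines):
    """Group each feature under the 'gene' line that precedes it."""
    blocks = []                # [{"gene": location, "features": [(key, location)]}]
    key, location = None, ""   # feature currently being read

    for line in lines:
        if key is None:
            m = FEATURE_RE.match(line)
            if not m:
                continue       # qualifier lines such as /gene="RPSA"
            key, location = m.group(1), m.group(2)
        else:
            location += line.strip()   # continuation of an open join(...)

        if location.count("(") != location.count(")"):
            continue           # join(...) carries on to the next line

        if key == "gene":
            blocks.append({"gene": location, "features": []})
        elif blocks:
            blocks[-1]["features"].append((key, location))
        key, location = None, ""
    return blocks


sample = '''\
     gene            39389788..39395643
                     /gene="RPSA"
     mRNA            join(39390696..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395643)
                     /gene="RPSA"
     gene            39391466..39391614
                     /gene="SNORA6"
     ncRNA           39391466..39391614
'''

blocks = parse_gene_blocks(sample.splitlines())
for block in blocks:
    print(block["gene"], [key for key, _ in block["features"]])
```

Counting parentheses is what lets the multi-line join(...) be stitched back into one location string before it is stored.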

The locations of the genes that we need to find are in a separate file:

Gene_input_file: f3
sno_input_file: chr3
319 found_in_gene  52698648..52707224 at 52704105 and_count: 5457
68 found_in_gene  52698648..52707224 at 52705463 and_count: 6815
82 found_in_gene  52698648..52707224 at 52701967 and_count: 3319
124 found_in_gene  39793218..40244467 at 40222682 and_count: 429464
202 found_in_gene  9443305..10558922 at 10110734 and_count: 667429
228 found_in_gene  46262602..46896241 at 46629723 and_count: 367121
..and so on.

I have to read it line by line, extract the gene position, then search for it in the main gene info (gff) file, like:

Match 52698648..52707224 (from file 2) in file 1 and print its sub-location.
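Reading file 2 line by line and pulling out the gene position could look like the sketch below (Python, since no language is fixed in the thread). The field names `id`, `gene`, `at` and `count` and the function name `parse_hits` are hypothetical; the line layout is assumed to be exactly as in the sample, with any amount of whitespace between fields.

```python
import re

# id, gene range, hit position and count, as laid out in the sample lines.
HIT_RE = re.compile(
    r"^\s*(\d+)\s+found_in_gene\s+(\d+)\.\.(\d+)\s+at\s+(\d+)\s+and_count:\s*(\d+)"
)


def parse_hits(lines):
    """Return one dict per 'found_in_gene' line; other lines are ignored."""
    hits = []
    for line in lines:
        m = HIT_RE.match(line)
        if m is None:
            continue  # header lines such as 'Gene_input_file: f3'
        hit_id, start, end, position, count = map(int, m.groups())
        hits.append(
            {"id": hit_id, "gene": (start, end), "at": position, "count": count}
        )
    return hits


sample = [
    "Gene_input_file: f3",
    " sno_input_file: chr3",
    " 319 found_in_gene  52698648..52707224 at 52704105 and_count: 5457",
    "68 found_in_gene  52698648..52707224 at 52705463 and_count: 6815",
    " 124 found_in_gene  39793218..40244467 at 40222682 and_count: 429464",
]

hits = parse_hits(sample)
for hit in hits:
    print(hit["id"], hit["gene"], hit["at"])
```

Each `hit["gene"]` tuple can then be compared with the range on a 'gene' line of file 1 to find the block whose sub-features need checking.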

Note: '..' denotes a range FROM position 52698648 TO position 52707224.
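The '..' notation can be turned into numeric ranges with a short helper. This is a sketch in Python; the name `parse_location` is hypothetical, and both end points are assumed to be part of the range.

```python
import re


def parse_location(text):
    """Turn 'A..B' or 'join(A..B,C..D,...)' into a list of (start, end) tuples."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]


gene = parse_location("52698648..52707224")
mrna = parse_location("join(39389788..39389839,39390696..39390861)")

print(gene)  # [(52698648, 52707224)]
print(mrna)  # [(39389788, 39389839), (39390696, 39390861)]

# A position lies in the range when start <= position <= end.
start, end = gene[0]
print(start <= 52704105 <= end)  # True
```

Because a join(...) is just a comma-separated list of the same A..B pairs, one regular expression covers both the plain and the joined form.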