How to get exact match sentences?

vanitham · August 16, 2008, 12:23am

Hi,

I have sentences like this:

$sent=

Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform for nucleic acid interactions.

Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.

Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.

PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.

The user enters a query term such as "RNA binding protein", and I read it like this:

$word=param('query');
print "\n$word\n";

What I want is for it to also pick up the sentences that contain "RNA-binding protein"!

How do I write a regular expression so that $word picks up the sentences containing either "RNA-binding protein" or "RNA binding protein"?

With regards
Archana

redoubtable · August 16, 2008, 5:53am

The regular expression you want could look like this:

/RNA(?:-| )binding protein/

I don't know which language you want this in, but you could also do something like

/RNA[-\s]binding protein/

era · August 16, 2008, 10:58am

In the advanced course we will bump into sentences like "RNA-bound protein". In the Nobel Laureate course we will handle text in German and Chinese as well.

Seriously, you could try to generalize your search patterns somewhat (specify all possible verb tenses, etc) but the general problem of language parsing has not been solved completely yet.

Search engines map down each word token to a normalized form so you can find "found" in Google when searching for "find". In some contexts, this is a misfeature -- when you know exactly what you want, you don't want the "sugary" matches at all.

In the meantime, maybe it would be enough to replace all spaces with dots in your regular expressions ...

vanitham · August 17, 2008, 11:51pm

redoubtable:

the regular expression you wish could be like:
/RNA(?:-| )binding protein/
I don't know exactly what language you want this but you could also do something like
/RNA[-\s]binding protein/

Hi,

Thanks for the reply!!
I have this expression, but I don't know how to apply it using $word.

With regards
Vanitha

era · August 18, 2008, 2:13am

Assuming it is Perl we are talking about here:

if ($word =~ /RNA[-\s]binding protein/) {
   print "we have a match: $word";
}

vanitham · August 18, 2008, 3:58am

Hi,

That is only one example. Whatever the user enters has to match and retrieve the sentences, and I don't understand how to write an expression based on $word to retrieve the matching sentences.

Another example: "Transcription-factor" and "Transcription factor". Many terms will be like that!

How do I match terms like this in a generalized way?

With regards
Vanitha

redoubtable · August 18, 2008, 5:41am

To retrieve the sentence that matches a certain pattern, you can do this:

if ($word =~ /(.*?RNA[-\s]binding protein.*?)$/) { 
      print "$1\n"; 
}

If you have multiple patterns you either put them all on a list and check one by one or create an expression that allows spaces or '-' between words (but that could be faulty and you would lose track of things)
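
The "list of patterns, checked one by one" option could be sketched roughly as follows. This is only an illustration: the sub name `matching_lines`, the contents of `@patterns`, and the sample text are assumptions made up for the example, not something from the posts above.

```perl
use strict;
use warnings;

# Hypothetical pattern list; each entry allows a hyphen or a whitespace
# character between the words.
my @patterns = (
    qr/RNA[-\s]binding protein/,
    qr/Transcription[-\s]factor/i,
);

# Return every line of $text that matches at least one of the patterns.
sub matching_lines {
    my ($text, @pats) = @_;
    my @hits;
    for my $line (split /\n/, $text) {
        for my $pat (@pats) {
            if ($line =~ $pat) {
                push @hits, $line;
                last;    # one hit per line is enough
            }
        }
    }
    return @hits;
}

my $sample = join "\n",
    'a family of RNA binding proteins (RBPs)',
    'the PUF family of RNA-binding proteins',
    'an unrelated sentence',
    'a transcription factor binds DNA';

print "$_\n" for matching_lines($sample, @patterns);
```

Keeping the patterns in a list makes it easy to see which terms are covered, at the cost of writing one pattern per term by hand.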

era · August 18, 2008, 5:55am

Or do you mean extract lines matching (the query derived from) $word from all the lines in $sent?

# parse query and make a (slightly generalized) regex out of it
my $regex = $word;
$regex =~ s/\s+/[- \\s]/g;

# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }
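
One caveat with building a regex straight from user input is that characters such as `.`, `(` or `*` in the query are treated as regex syntax. A possible variation, assuming you want the query words taken literally, escapes each word with Perl's built-in `quotemeta` before joining them. The sub name `query_to_regex` and the sample sentences are made up for this sketch.

```perl
use strict;
use warnings;

# Build a regex from a user query. Each word is escaped with quotemeta,
# and in the text the words may be separated by hyphens or whitespace.
sub query_to_regex {
    my ($query) = @_;
    my @words = grep { length } split /\s+/, $query;
    my $pattern = join '[-\s]+', map { quotemeta($_) } @words;
    return qr/$pattern/i;
}

my @sentences = (
    'which in other RNA binding protein functions as a platform',
    'a member of the PUF family of RNA-binding proteins',
    'a sentence about something else entirely',
);

my $query_regex = query_to_regex('RNA binding protein');

# Print only the sentences that contain the query in either spelling.
for my $sentence (@sentences) {
    print "$sentence\n" if $sentence =~ $query_regex;
}
```

The `/i` flag is a design choice for this sketch; drop it if the search should be case sensitive.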

vanitham · August 19, 2008, 12:15am

redoubtable:

to retrieve the sentence you want which matches a certain pattern you do like so:
if ($word =~ /(.*?RNA[-\s]binding protein.*?)$/) { 
   print "$1\n"; 
}
If you have multiple patterns you either put them all on a list and check one by one or create an expression that allows spaces or '-' between words (but that could be faulty and you would lose track of things)

Hi

I have to match $sentences with $word and $word can have "RNA binding protein" or RNA-binding protein.

How to match $sentences?

if ($sentences =~ /$word/)   # $word can be "RNA binding protein" or "RNA-binding protein"
{

}

I want to check $word for these conditions and match.

How should I do that?

vanitham · August 19, 2008, 12:24am

redoubtable:

to retrieve the sentence you want which matches a certain pattern you do like so:
if ($word =~ /(.*?RNA[-\s]binding protein.*?)$/) { 
   print "$1\n"; 
}
If you have multiple patterns you either put them all on a list and check one by one or create an expression that allows spaces or '-' between words (but that could be faulty and you would lose track of things)

Hi,

I want to match $sentences with $word like this:


if($sentences=~/$word/)
{

}

Here $word refers to "RNA binding protein" OR "RNA-binding protein".

How can I write an expression such that $sentences matches $word (it should match both "RNA binding protein" and "RNA-binding protein")?

$word should work for both conditions and $sentences should match $word!!

How can i do that?

Annihilannic · August 19, 2008, 1:01am

As far as I can see redoubtable's post already answers your question. Does it not work for you?

redoubtable · August 19, 2008, 6:08am

I'm not sure I understand what the original poster wants but I think this should answer any possible question he/she might have.

$sentences = "

Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform
for nucleic acid interactions.

Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.

Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.

PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.";

$word = '(.*?RNA[-\s]binding protein.*?)(?:\n|$)';

while ($sentences =~ /$word/g)
{
        print "$1\n";
}

This will print all matches of the specified pattern $word in $sentences.

vanitham · August 19, 2008, 11:51pm

era:

Or do you mean extract lines matching (the query derived from) $word from all the lines in $sent?
# parse query and make a (slightly generalized) regex out of it
my $regex = $word;
$regex =~ s/\s+/[- \\s]/g;

# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }

Hi ,

Thank you very much!!!

It's working!!!

vanitham · August 20, 2008, 1:14am

era:

Or do you mean extract lines matching (the query derived from) $word from all the lines in $sent?
# parse query and make a (slightly generalized) regex out of it
my $regex = $word;
$regex =~ s/\s+/[- \\s]/g;

# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }

Hi,

I have to highlight these terms in the sentences (the terms include "RNA binding proteins", "RNA binding protein", "RNA-binding protein" and "RNA-binding proteins").

How do I highlight all of these terms in the sentences?

I tried it like this, but it highlights them in only 2 sentences.


$sentences=~s/(\b$regex\b)/<span style="background-color:#E1FF77">$1<\/span>/img;

How to highlight all words?

With regards
Vanitha

era · August 20, 2008, 10:23pm

Your code looks okay, are you sure \b$regex\b really matches more than twice?

vanitham · August 20, 2008, 11:31pm

Hi,

No, that's the problem!

I can't work out how to make it match!

Is there any other way to do it?

With regards
Vanitha

era · August 20, 2008, 11:38pm

I'm afraid it's still not entirely clear what the problem is. Can you show the $regex and the text you are matching it against, and where you are expecting it to match?

vanitham · August 22, 2008, 12:03am

Hi,

Yes, sure, here it is:

$word="RNA binding protein";


$sent="Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform
for nucleic acid interactions.

Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.

Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.

PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.";

my $regex = $word;

$regex =~ s/\s+/[- \\s]/g;

# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }

# highlight every occurrence of $regex
$sent =~ s/(\b$regex\b)/<span style="background-color:#E1FF77">$1<\/span>/img;

print $sent;

It has to highlight $word ("RNA binding protein", "RNA binding proteins", "RNA-binding protein" or "RNA-binding proteins") in the sentences!

How can that be done?

With regards
Vanitha

era · August 22, 2008, 12:09am

So you want to tack on a trailing s? on the regex, I guess?

$regex =~ s/$/s?/ unless $regex =~ /[xzcs]$/;

The "unless" condition is not strictly necessary, and you might want to take it out; I just wanted to highlight a possible complication. For this particular case you'd probably rather accept the risk of a (highly unlikely) false positive than make the regex too sophisticated.

What if the user types in "RNA-binding proteins" as the input, do you want to normalize that back to "RNA[\s-]binding[\s-]proteins?" as well?
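
Putting the pieces together, normalizing hyphens in the input and allowing an optional plural might look something like the sketch below. The sub names `query_to_plural_regex` and `highlight` are made up for this example, and stripping a trailing "s" from the last query word is a deliberately naive assumption about English plurals.

```perl
use strict;
use warnings;

# Build a regex from the query. Words may be separated by spaces or hyphens
# in both the query and the text, and a trailing "s" is optional.
sub query_to_plural_regex {
    my ($query) = @_;
    my @words = grep { length } split /[-\s]+/, $query;
    $words[-1] =~ s/s$// if @words;    # drop a plural "s" typed by the user
    my $pattern = join '[-\s]+', map { quotemeta($_) } @words;
    return qr/\b${pattern}s?\b/i;
}

# Wrap every match in the highlighting span used earlier in the thread.
sub highlight {
    my ($text, $regex) = @_;
    $text =~ s{($regex)}{<span style="background-color:#E1FF77">$1</span>}g;
    return $text;
}

my $plural_regex = query_to_plural_regex('RNA-binding proteins');
print highlight(
    'a family of RNA binding proteins (RBPs) and one RNA-binding protein',
    $plural_regex
), "\n";
```

Because the optional `s?` sits inside the `\b ... \b` pair, both the singular and the plural form are highlighted in full.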

vanitham · August 22, 2008, 11:43pm

era:

So you want to tack on a trailing s? on the regex, I guess?
$regex =~ s/$/s?/ unless $regex =~ /[xzcs]$/;
The "unless" condition is not strictly necessary, and you might want to take it out; I just wanted to highlight a possible complication. For this particular case you'd probably rather accept the risk of a (highly unlikely) false positive than make the regex too sophisticated.

What if the user types in "RNA-binding proteins" as the input, do you want to normalize that back to "RNA[\s-]binding[\s-]proteins?" as well?

Hi,

No, in that case it is not required.

If the user enters "RNA binding protein*", then I have to pick up and highlight the terms "RNA binding protein", "RNA binding proteins", "RNA-binding protein" and "RNA-binding proteins".

With regards
Vanitha
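
For the trailing `*` case described above, one possible approach is to translate a `*` at the end of a query word into `\w*`. The following is a minimal sketch under that assumption; the sub name `wildcard_query_to_regex` is hypothetical, and note that `\w*` also matches longer words such as "proteinase", not just the plural.

```perl
use strict;
use warnings;

# Translate a query into a regex. A "*" at the end of a word becomes \w*,
# so "protein*" matches "protein", "proteins", and any longer word that
# starts with "protein". Words in the text may be joined by spaces or hyphens.
sub wildcard_query_to_regex {
    my ($query) = @_;
    my @parts;
    for my $raw (grep { length } split /\s+/, $query) {
        my $word = $raw;
        my $wild = ($word =~ s/\*$//) ? '\w*' : '';
        push @parts, quotemeta($word) . $wild;
    }
    my $pattern = join '[-\s]+', @parts;
    return qr/\b$pattern\b/i;
}

my $wild_regex = wildcard_query_to_regex('RNA binding protein*');
my $demo = 'the PUF family of RNA-binding proteins and one RNA binding protein';

# Highlight every match with the span used earlier in the thread.
$demo =~ s{($wild_regex)}{<span style="background-color:#E1FF77">$1</span>}g;
print "$demo\n";
```

Without the `*`, the same sub produces an exact-word match, so "RNA binding protein" would then not pick up "RNA binding proteins".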