How to get exact match sentences?

Hi,

I have sentences like this:

$sent=

Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform for nucleic acid interactions.

Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.

Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.

PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.

The user enters a query term such as "RNA binding protein", and I read it like this:

$word=param('query');
print "\n$word\n";

What I want is for it to also pick up the sentences that contain "RNA-binding protein"!!

How do I write a regular expression so that $word picks up the sentences containing either "RNA-binding protein" or "RNA binding protein"?

With regards
Archana

The regular expression you want could look like this:

/RNA(?:-| )binding protein/

I don't know exactly which language you are using, but you could also do something like

/RNA[-\s]binding protein/

In the advanced course we will bump into sentences like "RNA-bound protein". In the Nobel Laureate course we will handle text in German and Chinese as well.

Seriously, you could try to generalize your search patterns somewhat (specify all possible verb tenses, etc) but the general problem of language parsing has not been solved completely yet.

Search engines map down each word token to a normalized form so you can find "found" in Google when searching for "find". In some contexts, this is a misfeature -- when you know exactly what you want, you don't want the "sugary" matches at all.

In the meantime, maybe it'd be enough to replace all spaces with dots in your regular expressions ...
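A minimal sketch of the "spaces to dots" idea, assuming Perl and a hard-coded query in place of param('query'). The sample lines are shortened from the sentences above, and the names $pattern, @lines and @hits are only for illustration. Note that '.' matches any single character, not just a space or a hyphen, and that the query is not escaped here.

```perl
use strict;
use warnings;

# Hypothetical query; in the thread it comes from param('query').
my $word = 'RNA binding protein';

# Replace each run of whitespace with '.', which matches any single
# character (a space or a hyphen, but also any letter or digit).
(my $pattern = $word) =~ s/\s+/./g;

my @lines = (
    'a family of RNA binding proteins (RBPs)',
    'sequence-specific RNA-binding protein that regulate target mRNAs',
    'an unrelated sentence about DNA',
);

# Keep only the lines that match the generalized pattern.
my @hits = grep { /$pattern/ } @lines;
print "$_\n" for @hits;
```

This prints the first two lines and skips the third.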

Hi,

Thanks for the reply!!
I got this expression, but I don't know how to check it against $word?

With regards
Vanitha

Assuming it is Perl we are talking about here:

# note: the match operator is =~, not ~
if ($word =~ /RNA[-\s]binding protein/) {
   print "we have a match: $word";
}

Hi,

That is one example for $word, but whatever the user enters has to match and retrieve sentences in the same way, and I can't work out how to write an expression for $word that retrieves the matching sentences.

Another example: "Transcription-factor" and "Transcription factor". There will be many words like that!!!

How do I match words like this in a generalized way?

With regards
Vanitha

To retrieve the sentence that matches a certain pattern, you can do this:

if ($word =~ /(.*?RNA[-\s]binding protein.*?)$/) { 
      print "$1\n"; 
}

If you have multiple patterns, you can either put them all in a list and check them one by one, or create a single expression that allows spaces or '-' between words (but that could produce false matches, and you would lose track of which pattern matched).
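The "list of patterns" option could be sketched roughly like this, assuming Perl. The list @patterns and the helper matching_patterns are made up for illustration; the second pattern is taken from the "Transcription factor" example mentioned earlier.

```perl
use strict;
use warnings;

# Hypothetical list of phrases to look for; each is compiled once with qr//.
my @patterns = (
    qr/RNA[-\s]binding protein/,
    qr/transcription[-\s]factor/i,
);

# Return every pattern from @patterns that matches the given sentence.
sub matching_patterns {
    my ($sentence) = @_;
    return grep { $sentence =~ $_ } @patterns;
}

my $sentence = 'Pumilio2 is a member of the PUF family of RNA-binding proteins.';
my @found = matching_patterns($sentence);
print "matched: $_\n" for @found;
```

Checking the patterns one by one keeps track of which pattern matched, which a single combined expression would not tell you.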

Or do you mean extract lines matching (the query derived from) $word from all the lines in $sent?

# parse query and make a (slightly generalized) regex out of it
my $regex = $word;
$regex =~ s/\s+/[- \\s]/g;

# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }
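If the query comes straight from the user, it may contain characters that mean something in a regex. Here is a slightly more defensive sketch of the same idea, assuming Perl. The helper name query_to_regex and the case-insensitive match are my own assumptions, not something the original poster asked for, and the sample text is shortened.

```perl
use strict;
use warnings;

# Build a regex from a user query: escape metacharacters in each word,
# then allow hyphens or whitespace between the words.
sub query_to_regex {
    my ($query) = @_;
    my @words = grep { length } split /[-\s]+/, $query;
    my $body  = join '[-\s]+', map { quotemeta } @words;
    return qr/$body/i;    # case-insensitive matching is an assumption
}

my $sent = "other RNA binding protein functions as a platform\n"
         . "hnRNP proteins coat nascent pre-mRNAs\n"
         . "a family of sequence-specific RNA-binding protein\n";

my $regex = query_to_regex('RNA binding protein');

# Split the text into lines and keep the ones that match.
my @lines = grep { /$regex/ } split /\n/, $sent;
print "$_\n" for @lines;
```

Because the query is split on both hyphens and whitespace, "RNA-binding protein" and "RNA binding protein" produce the same regex.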

Hi

I have to match $sentences with $word and $word can have "RNA binding protein" or RNA-binding protein.

How to match $sentences?

if($sentences=~/$word/) # $word can be "RNA binding protein" or RNA-binding protein.
{

}
I want to check $word against both of these forms and match.

How should I do that?

Hi,

I want to match $sentences with $word like this:


if($sentences=~/$word/)
{

}

Here $word refers to "RNA binding protein" OR "RNA-binding protein".

How can I write the expression so that $sentences matches $word (it should match both "RNA binding protein" and "RNA-binding protein")?

$word should work for both cases, and $sentences should match $word!!

How can I do that?

As far as I can see redoubtable's post already answers your question. Does it not work for you?

I'm not sure I understand what the original poster wants but I think this should answer any possible question he/she might have.

$sentences = "

Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform
for nucleic acid interactions.

Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.

Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.

PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.";

$word = '(.*?RNA[-\s]binding protein.*?)(?:\n|$)';

while ($sentences =~ /$word/g)
{
        print "$1\n";
}

This will print all matches of the specified pattern $word in $sentences.

Hi ,

Thank you very much!!!

It's working!!!

Hi,

I have to highlight these words in the sentences (the words include "RNA binding proteins", "RNA binding protein", "RNA-binding protein" and "RNA-binding proteins").

How do I highlight all of these words in the sentences?

I tried it like this, but it only highlights them in 2 sentences.


$sentences=~s/(\b$regex\b)/<span style="background-color:#E1FF77">$1<\/span>/img; 

How to highlight all words?

With regards
Vanitha

Your code looks okay, are you sure \b$regex\b really matches more than twice?

Hi,

No, that's the problem!!!!

I can't figure out how to match!

Is there any other way to do it??

With regards
Vanitha

I'm afraid it's still not entirely clear what the problem is. Can you show the $regex and the text you are matching it against, and where you are expecting it to match?

Hi,

Yes, sure, I will show you!!

$word="RNA binding protein";


$sent="Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform
for nucleic acid interactions.

Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.

Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.

PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.";

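my $regex = $word;

# allow a hyphen or whitespace wherever the query has whitespace
$regex =~ s/\s+/[- \\s]/g;

# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }

# highlight after the loop: modifying $sent inside it would reset the /g position
$sent=~s/(\b$regex\b)/<span style="background-color:#E1FF77">$1<\/span>/img;

print $sent;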

It has to highlight $word ("RNA binding protein", "RNA binding proteins", "RNA-binding protein" or "RNA-binding proteins") in the sentences!!

How can that be done?

With regards
Vanitha

So you want to tack on a trailing s? on the regex, I guess?

$regex =~ s/$/s?/ unless $regex =~ /[xzcs]$/;

The "unless" condition is not strictly necessary, and you might want to take it out; I just wanted to highlight a possible complication. For this particular case you'd probably prefer to take the risk of a (highly unlikely) false positive rather than make it too sophisticated.
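To see why the trailing s? matters, here is a small check, assuming Perl and the generalized $regex from earlier in the thread. The hash %result is only there to make the outcome easy to inspect. The \b after "protein" cannot match when the next character is "s", so the plural forms are skipped unless the s? is added.

```perl
use strict;
use warnings;

# The generalized regex as built earlier in the thread.
my $regex   = 'RNA[- \s]binding[- \s]protein';
my @phrases = ('RNA binding protein', 'RNA-binding proteins');

my %result;
for my $phrase (@phrases) {
    $result{$phrase} = {
        # \b right after "protein" fails when an "s" follows
        strict => ($phrase =~ /\b$regex\b/      ? 1 : 0),
        # the optional s lets \b match after the plural ending
        plural => ($phrase =~ /\b${regex}s?\b/ ? 1 : 0),
    };
    printf "%-22s without s?: %d  with s?: %d\n",
        $phrase, $result{$phrase}{strict}, $result{$phrase}{plural};
}
```

The singular phrase matches both ways; the plural one only matches once the s? is in place.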

What if the user types in "RNA-binding proteins" as the input, do you want to normalize that back to "RNA[\s-]binding[\s-]proteins?" as well?

Hi,

No, in that case it is not required.

If the user enters "RNA binding protein*", then I have to pick up and highlight the words "RNA binding protein", "RNA binding proteins", "RNA-binding protein" and "RNA-binding proteins".

With regards
Vanitha
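One way the "protein*" case could be handled, as a rough sketch assuming Perl. The helpers query_to_regex and highlight are made-up names, treating a trailing * as \w* is my reading of the request, and the case-insensitive match follows the /i in the highlighting code above. Note that \w* would also match longer words such as "proteinase".

```perl
use strict;
use warnings;

# Turn a query such as "RNA binding protein*" into a regex:
#  - each word is escaped with quotemeta
#  - a trailing * on a word becomes \w* (any run of word characters)
#  - words may be separated by hyphens or whitespace
sub query_to_regex {
    my ($query) = @_;
    my @parts;
    for my $raw (grep { length } split /[-\s]+/, $query) {
        my $word = $raw;
        my $wild = ($word =~ s/\*$//) ? '\w*' : '';
        push @parts, quotemeta($word) . $wild;
    }
    my $body = join '[-\s]+', @parts;
    return qr/\b$body\b/i;
}

# Wrap every match in the highlighting span used earlier in the thread.
sub highlight {
    my ($text, $regex) = @_;
    $text =~ s{($regex)}{<span style="background-color:#E1FF77">$1</span>}g;
    return $text;
}

my $regex = query_to_regex('RNA binding protein*');
my $out   = highlight(
    'a family of RNA-binding proteins and one RNA binding protein', $regex);
print "$out\n";
```

Both the hyphenated plural and the plain singular end up wrapped in a span.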