$sent=
Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform for nucleic acid interactions.
Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.
Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.
PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.
The user enters a query term such as "RNA binding protein", and I read it like this:
$word=param('query');
print "\n$word\n";
What I want is for it to also pick up the sentences that contain "RNA-binding protein"!
How do I write a regular expression so that $word picks up the sentences containing either "RNA-binding protein" or "RNA binding protein"?
In the advanced course we will bump into sentences like "RNA-bound protein". In the Nobel Laureate course we will handle text in German and Chinese as well.
Seriously, you could try to generalize your search patterns somewhat (specify all possible verb tenses, etc) but the general problem of language parsing has not been solved completely yet.
Search engines map down each word token to a normalized form so you can find "found" in Google when searching for "find". In some contexts, this is a misfeature -- when you know exactly what you want, you don't want the "sugary" matches at all.
In the meantime, maybe it'd be enough to replace all spaces with dots in your regular expressions, since a dot matches any single character, including a space or a hyphen ...
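A minimal sketch of the spaces-to-dots idea, in case it helps. The names $query, @lines and @hits are made up for this example, and the sample lines are shortened fragments of the text above plus one unrelated line. The query is run through quotemeta first so that any regex metacharacters the user types are taken literally.

```perl
use strict;
use warnings;

# Hypothetical query and sample lines, for illustration only.
my $query = 'RNA binding protein';
my @lines = (
    'a family of RNA binding proteins (RBPs)',
    'the PUF family of RNA-binding proteins',
    'an unrelated DNA repair enzyme',
);

# quotemeta escapes every non-word character, so a space becomes "\ ".
# Turn each escaped space into a dot, which matches any single character.
(my $pattern = quotemeta $query) =~ s/\\ /./g;

# Keep only the lines that contain the pattern.
my @hits = grep { /$pattern/ } @lines;
print "$_\n" for @hits;
```

Keep in mind that the dot is deliberately loose: it would also accept "RNAxbinding protein", so this trades precision for simplicity.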
This is just one example of a query. Whatever the user enters has to match and retrieve the right sentences, and I don't see how to write an expression for $word that retrieves the matching sentences.
Another example: "Transcription-factor" and "Transcription factor". Many terms vary like that!
How can I match terms like these in a general way?
To retrieve the sentence that matches a certain pattern, you can do something like this:
if ($word =~ /(.*?RNA[-\s]binding protein.*?)$/) {
print "$1\n";
}
If you have multiple patterns, you can either put them all in a list and check them one by one, or create a single expression that allows a space or '-' between words (but that could be error-prone, and you would lose track of which term matched).
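The list approach could look roughly like the sketch below. The variable names and the two extra sample sentences are invented for illustration; only the first sentence comes from the text in this thread.

```perl
use strict;
use warnings;

# One precompiled pattern per term; each allows a hyphen or whitespace
# between the first two words.
my @patterns = (
    qr/RNA[-\s]binding protein/,
    qr/Transcription[-\s]factor/i,
);

# The second and third sentences are hypothetical.
my @sentences = (
    'PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.',
    'A transcription factor binds DNA.',
    'Nothing relevant here.',
);

my @matched;
for my $sentence (@sentences) {
    for my $re (@patterns) {
        if ($sentence =~ $re) {
            push @matched, $sentence;
            last;    # one hit is enough, skip the remaining patterns
        }
    }
}
print "$_\n" for @matched;
```

Checking patterns one by one is slower than a single combined expression, but it makes it easy to record which term produced the hit if that is needed later.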
Or do you mean extract lines matching (the query derived from) $word from all the lines in $sent?
# parse query and make a (slightly generalized) regex out of it
my $regex = $word;
$regex =~ s/\s+/[- \\s]/g;
# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n"; }
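One caveat with building a regex straight from user input is that characters such as "(" or "+" in the query would be interpreted as regex syntax. A slightly more defensive sketch of the same idea, assuming the query is split on whitespace and each word is escaped with quotemeta (the sample text is shortened, and @lines is a name made up here):

```perl
use strict;
use warnings;

my $word = 'RNA binding protein';    # stands in for param('query')
my $sent = join "\n",
    'which in other RNA binding protein functions as a platform',
    'form a family of RNA binding proteins (RBPs)',
    'a member of the PUF family of RNA-binding proteins',
    'a conserved family of sequence-specific RNA-binding protein';

# Split the query on whitespace, escape each word, and rejoin the words
# with a class that accepts a hyphen or a whitespace character.
my $regex = join '[-\s]', map { quotemeta($_) } split ' ', $word;

# Collect every line of $sent that contains the pattern.
my @lines = grep { /$regex/ } split /\n/, $sent;
print "$_\n" for @lines;
```

Because "proteins" contains "protein", all four sample lines are selected here even without any plural handling.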
I'm not sure I understand what the original poster wants but I think this should answer any possible question he/she might have.
$sentences = "
Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform
for nucleic acid interactions.
Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.
Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.
PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.";
$word = '(.*?RNA[-\s]binding protein.*?)(?:\n|$)';
while ($sentences =~ /$word/g)
{
print "$1\n";
}
This will print all matches of the specified pattern $word in $sentences.
I have to highlight these terms in the sentences: "RNA binding proteins", "RNA binding protein", "RNA-binding protein" and "RNA-binding proteins".
How can I highlight all of them?
I tried it like this, but it highlights only 2 sentences.
I'm afraid it's still not entirely clear what the problem is. Can you show the $regex and the text you are matching it against, and where you are expecting it to match?
$sent="Protein modeling studies reveal that the RG-rich region is part of a three to four strand antiparallel beta-sheet, which in other RNA binding protein functions as a platform
for nucleic acid interactions.
Heterogeneous nuclear ribonucleoparticle (hnRNP) proteins form a family of RNA binding proteins (RBPs) that coat nascent pre-mRNAs.
Finally, we have found that Pumilio2, a member of the PUF family of RNA-binding proteins, is highly concentrated at the vertebrate neuromuscular junction.
PUF proteins comprise a highly conserved family of sequence-specific RNA-binding protein that regulate target mRNAs.";
my $regex = $word;
$regex =~ s/\s+/[- \\s]/g;
# print all lines matching $regex
while ($sent =~ m/(.*$regex.*)/go) { print "$1\n";
$sent=~s/(\b$regex\b)/<span style="background-color:#E1FF77">$1<\/span>/img;
print $sent;
}
It has to highlight $word ("RNA binding protein", "RNA binding proteins", "RNA-binding protein" or "RNA-binding proteins") in the sentences!
So you want to tack on a trailing s? on the regex, I guess?
$regex =~ s/$/s?/ unless $regex =~ /[xzcs]$/;
The "unless" condition is not strictly necessary, and you might want to take it out; I just wanted to highlight a possible complication. For this particular case you'd probably rather accept the risk of a (highly unlikely) false positive than make the regex too sophisticated.
What if the user types in "RNA-binding proteins" as the input, do you want to normalize that back to "RNA[\s-]binding[\s-]proteins?" as well?
If the user enters "RNA binding protein*", then I have to pick up and highlight "RNA binding protein", "RNA binding proteins", "RNA-binding protein" and "RNA-binding proteins".
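One possible way to handle that, as a sketch only: treat a trailing "*" on a query word as "any further word characters", and treat spaces and hyphens in the query as interchangeable. The helper name query_to_regex and the shortened sample text are invented for this example, and reading "*" as \w* is an assumption about what the wildcard should mean.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a user query into a compiled regex.
sub query_to_regex {
    my ($query) = @_;
    my @parts;
    # Split on runs of hyphens/whitespace and drop empty leading fields.
    for my $raw (grep { length } split /[-\s]+/, $query) {
        my $token = $raw;
        my $wild  = ($token =~ s/\*$//);    # strip and remember a trailing *
        my $piece = quotemeta $token;       # take the rest literally
        $piece .= '\w*' if $wild;           # assumed meaning of the wildcard
        push @parts, $piece;
    }
    my $body = join '[-\s]+', @parts;
    return qr/\b$body\b/i;
}

my $regex = query_to_regex('RNA binding protein*');

my $sent = "which in other RNA binding protein functions as a platform\n"
         . "a member of the PUF family of RNA-binding proteins, is highly concentrated";

# Highlight on a copy, outside any matching loop, so $sent stays untouched.
(my $html = $sent) =~ s{($regex)}{<span style="background-color:#E1FF77">$1</span>}g;
print "$html\n";
```

Doing the substitution once on a copy, rather than inside a while loop that is still matching against the same string, avoids wrapping text that has already been wrapped.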