Regex to identify a full-stop as a sentence delimiter

gimley · July 27, 2012, 11:27pm

Hello,
Splitting text into sentences at the full-stop, question mark or exclamation mark is a common device. The question mark and the exclamation mark do not pose much of a problem, but the full-stop raises certain issues as a sentence delimiter because of its varied uses, for example in temperatures, currency amounts, acronyms and dates, just to name a few.

Standard parsers such as the Stanford parser do not handle this correctly and treat the full-stop as a delimiter wherever it occurs.
A Perl script would do the job, but since I am working on dynamic data where on-the-fly detection is needed, I am looking for a regex that correctly ignores the above cases and identifies only the valid sentence delimiters.
Using proximity, i.e. ignoring a full-stop if only a couple of words separate it from the next full-stop, is a possibility, but it does not work in all cases.
Does anyone know of a solution to this thorny issue? Many thanks in advance for your help.

spacebar · July 28, 2012, 1:06am

Do you only want to match the period that is at the end before the (xxxxx)?

Chirel · July 28, 2012, 3:51am

Hi,

The input and output you want are not clear to me, but about parsing the full-stop:

Maybe you could say that the full-stop must be followed by a \s and a capital letter, or by the end of the file?

gimley · July 28, 2012, 3:58am

Hello,
Maybe I was not very clear. What I want is a regex that identifies the full-stop as the end of a sentence and excludes all the other full-stops listed in my post, which are not sentence delimiters but belong to entities such as temperatures, currency amounts, acronyms, dates etc.
Many thanks once again

Chirel · July 28, 2012, 4:08am

Hmm, I guess that when I write in English it's not clear, so let's talk regex.

I said:

That could mean something like: '\.\s[A-Z]'
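
The rule described here (a full-stop followed by whitespace and a capital letter, or by the end of the input) might be sketched in Java roughly as follows, written with `\s` for the whitespace. The class and method names are made up for illustration, and this is only a sketch of the heuristic: an abbreviation followed by a capitalised word, such as "Dr. Smith", would still be reported as a sentence end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library mentioned in the thread.
class FullStopFinder {
    // A full-stop counts as a delimiter only when it is followed by whitespace
    // plus a capital letter, or by optional whitespace and the end of the input.
    private static final Pattern SENTENCE_END =
            Pattern.compile("\\.(?=\\s+[A-Z]|\\s*\\z)");

    // Returns the character offsets of the full-stops judged to end a sentence.
    static List<Integer> delimiterOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = SENTENCE_END.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "The temperature was 32.8 degrees Celsius. "
                + "His B.Sc. degree was deemed insufficient.";
        System.out.println(delimiterOffsets(text));
    }
}
```

The lookahead keeps the capital letter out of the match, so only the full-stop itself is consumed. The decimal point in "32.8" and the full-stops in "B.Sc." are skipped because neither is followed by whitespace and a capital letter.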

gimley · July 28, 2012, 6:42am

Hi, many thanks.
I tried the regex you had provided.
Here is the input:

What I need is a regex that identifies only sentences delimited by a full-stop.
The expected output would be:

and not for example

The regex which you furnished, and which I applied as a Unix regex, gave me the following:

I tried quite a few tweaks but they made it worse.
Any workarounds, please? I have a huge database of strings of this type and need to identify the valid strings.
Many thanks

rangarasan · July 28, 2012, 6:57am

Hi,

Try this one,

sed -e 's/\. \([A-Z]\)/.\n\1/g' file

I think this will help you.
Cheers,
Ranga:-)

gimley · July 29, 2012, 11:50am

Sorry, my net was down and I could not acknowledge your answer.
Many thanks for the script. The only hassle is that it needs to be a regex, since I need to process data on the fly, dynamically, and not off-line using sed.
Any suggestions?
I did tweak your regex to suit my needs but drew a blank.

Chirel · July 30, 2012, 8:33am

Hi,

A regex will match something, then what?
I don't understand what you mean by a regex to process on the fly, dynamically.

Can you give me an example, please?

Even if it's "dynamic on the fly", the goal is to replace the right full-stop with full-stop <new-line>.

I don't get how you can do that with only a regex. Are you using Perl?

Perl:

perl -pe 's/\. ([A-Z])/.\n$1/g'

$ perl -pe 's/\. ([A-Z])/.\n$1/g' input-file
The temperature was 32.8 degrees Celsius.
His B.Sc. degree was deemed insufficient.
He owed the bank USD 4000.50 which he had not paid back.
On 27.07.2004 a major earthquake occurred.
It was 17.05 by the clock.
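
For readers who need the same substitution from Java rather than from the command line, a minimal sketch follows. It assumes the whole text is already in a `String`. The class and method names are invented for illustration, and the pattern is simply the Perl one above carried over to `String.replaceAll`.

```java
// Hypothetical helper class; it mirrors the Perl one-liner above.
class PerlStyleSplit {
    // Equivalent of s/\. ([A-Z])/.\n$1/g: a full-stop, one space and a
    // capital letter become full-stop, newline and the same capital letter.
    static String breakSentences(String text) {
        return text.replaceAll("\\. ([A-Z])", ".\n$1");
    }

    public static void main(String[] args) {
        String text = "The temperature was 32.8 degrees Celsius. "
                + "His B.Sc. degree was deemed insufficient. "
                + "He owed the bank USD 4000.50 which he had not paid back. "
                + "On 27.07.2004 a major earthquake occurred. "
                + "It was 17.05 by the clock.";
        System.out.println(breakSentences(text));
    }
}
```

As with the Perl version, this relies on the next sentence starting with a capital letter, so a full-stop after an abbreviation that precedes a proper noun would also be turned into a line break.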

gimley · July 30, 2012, 8:46am

Hi,
Many thanks for the regex. I will try it out and get back to you. By "on the fly", I meant that the regex is inserted within a Java string, which in turn interrogates a website and returns full sentences for searching and indexing.
This is why a Perl script would not help, since it would mean calling the script. I will try to see whether the script can be called from Java, but the open source software we are using demands a regex, hence the request.
Many thanks
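
Since the tool described here takes a single regex inside a Java string, one possible shape is a lookaround-only boundary pattern, which can be handed to anything that splits on a regex. The sketch below is an assumption-laden illustration, not a tested solution for the actual software: the class name is invented, and the abbreviation list (Dr, Mr, Mrs, Prof) is a hypothetical sample that would have to be extended for real data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; the abbreviation list is only a sample.
class LookaroundSplitter {
    private static final Pattern BOUNDARY = Pattern.compile(
            "(?<=[.?!])"                        // just after ., ? or !
          + "(?<!\\b(?:Dr|Mr|Mrs|Prof)\\.)"     // but not after a listed abbreviation
          + "\\s+"                              // the whitespace that is split away
          + "(?=[A-Z])");                       // next sentence starts with a capital

    // Splits text into sentences at the boundaries matched by BOUNDARY.
    static List<String> sentences(String text) {
        return Arrays.asList(BOUNDARY.split(text.trim()));
    }

    public static void main(String[] args) {
        String text = "Dr. Smith paid USD 4000.50 on 27.07.2004. "
                + "He left at 17.05. Was it late? Yes!";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

Because only whitespace is consumed, the terminating punctuation stays attached to its sentence. Decimals, dates and times are left alone since their full-stops are not followed by whitespace, while abbreviations outside the sample list that precede a capitalised word would still cause a split.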