Help with awk regular expression for RS record separator

1Brajesh · August 7, 2017, 4:57pm

Hi,

I'm using gawk to read a text file and count the sentences.
I want the record separator to be a period, an exclamation mark, or a question mark.

The problem is that the file contains words like "Mr. Smith" so the periods in the appellation are tripping my record separator.

This is my code snippet:

BEGIN {
       RS="[.?!]"
}

This actually works fine until the file contains words like Mr. Smith.

So I tried like this:

RS="[^Mr.][.?!]"

Or like this:

RS="!Mr.[.?!]"

Or like this:

RS = "!(Mr.)[.?!]"

But I couldn't get any of them to work.

Any ideas how I can do this?

ctac · August 7, 2017, 5:43pm

Hi, try

RS="[^\"Mr.\"][.?!]"

1Brajesh · August 7, 2017, 6:25pm

No, it didn't work.

It broke the file that was working.
I have a file without any "Mr." words.

By adding your suggestion, even the file without any "Mr." words stops working.

For example, it reads "one." as "on", "two." as "tw", and "three!" as "thre".

This is the same as what was happening with my attempt below too.

---------- Post updated at 06:22 PM ---------- Previous update was at 06:18 PM ----------

Here's my full code:

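#!/usr/bin/gawk -f
# Run as: gawk -f thisfile inputfile  (the shebang path is an assumption; adjust for your system)

BEGIN {
    RS = "[.?!;:]"    # There is a problem with Mr. and Mrs.
    maxWords = 0
}

{
    # Remember the record with the most fields (words).
    if (maxWords < NF) {
        maxWords = NF
        longestSentence = $0
    }

    # Count how often each word occurs.
    for (i = 1; i <= NF; i++)
        a[$i]++
}

END {
    i = 1
    for (k in a) {
        print i, k, a[k]
        i++
    }
    print ""    # blank line; a bare print would output $0 instead
    print "There were", NR, "sentences and the longest sentence had", maxWords, "words and there were", length(a), "unique words"
    print "The longest sentence was:", longestSentence
}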

---------- Post updated at 06:25 PM ---------- Previous update was at 06:22 PM ----------

Here is the test file I'm using. It works fine with the code above, but as soon as I change the RS expression, even this file, which has no "Mr.", stops working.

----start of file----

one.
two two. 
three three three!
four four four four five five five five five.
six six six six six six?

------end of file---------

rdrtx1 · August 7, 2017, 6:51pm

awk '
BEGIN {
   eol="[.?!;:]$" # sentence punctuation at the end of a word
   maxWords=0
}

{
   if (NF>maxWords) {
      maxWords=NF
      longestSentence = $0
   }

   for (i=1;i<=NF;i++) {
      sub(eol, "", $i) # strip trailing punctuation so "two." and "two" count as one word
      a[$i]++
   }
}

END{
   for(k in a) print ++ii, k, a[k];
   print ""
   print("There were", NR, "sentences and the longest sentence had", maxWords, "words and there were", length(a), "unique words")
   print ("The longest sentence was:", longestSentence)
}
' infile

1Brajesh · August 7, 2017, 8:07pm

hmmm...interesting...isn't the record separator a newline now?
What if one sentence spans multiple newlines? Won't it be counted as two or more sentences?

Also, I don't understand exactly what the sub command is doing?

thank you

Don_Cragun · August 7, 2017, 9:08pm

If what you want to do is separate records at points where the last character on a line is a <period>, <question-mark>, or <exclamation-point>, you probably want to use:

RS="[.?!]$"

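as rdrtx1 suggested. (One caution: the gawk manual says that ^ and $ in RS match the start and end of the whole input, not of each line, so if that RS does not split at line ends for you, write the newline explicitly: RS="[.?!]\n".)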

Using RS="[.?!;:]" splits records on <period>, <question-mark>, <exclamation-point>, <semicolon>, and <colon> anywhere on a line.

Using RS="[^\"Mr.\"][.?!]" splits records on any two-character sequence where the first character is not a <double-quote>, <uppercase-M>, <lowercase-r>, or <period> (inside the awk string, each \" is just a <double-quote>) and the second character is a <period>, <question-mark>, or <exclamation-point>. Because the separator swallows the character before the punctuation, "one." is read as "on". This ERE makes no sense to me for this use.
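To see that effect directly, here is a minimal sketch. The function name show_records is made up for illustration, and it assumes an awk (such as gawk or mawk) that treats a multi-character RS as an ERE. It prints every non-empty record in brackets:

```shell
# RS is the suggestion from post #2; inside the awk string, \" is a plain double quote.
show_records() {
  awk 'BEGIN { RS = "[^\"Mr.\"][.?!]" }
       NF    { $1 = $1; print "[" $0 "]" }' "$@"   # $1 = $1 squeezes newlines and blanks
}

printf 'one.\ntwo two.\n' | show_records
```

With that two-line input this should print [on] and [two tw]: the letter in front of each period has become part of the record separator.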

If, in addition to splitting when a set of characters is found at the end of a line, you also wanted to find that set of characters followed by two <space> characters (which is the common way of separating sentences in old-fashioned text files), you could use:

RS="[.?!](  |$)"
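A rough sketch of how that behaves on text containing "Mr." follows. The function name split_sentences and the sample text are invented for the example, it assumes an awk with ERE support in RS, and the line end is written as an explicit \n (an assumption that avoids relying on how $ behaves in RS):

```shell
split_sentences() {
  awk 'BEGIN { RS = "[.?!](  |\n)" }   # terminator followed by two spaces or a newline
       NF    { $1 = $1; print ++n ": " $0 }' "$@"
}

printf 'Mr. Smith went home.  He slept!\nDid he dream?\n' | split_sentences
```

The period in "Mr." is followed by a single space, so it does not end a record; the sample should come out as three numbered sentences without their terminating characters.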

Note that most of the above is talking about gawk and does not necessarily apply to other standards-conforming versions of awk. The standards state that if more than one character is assigned to RS, it is unspecified whether RS is treated as a multi-character ERE that acts as the record separator or only the 1st character assigned to RS acts as the record separator. If RS is set to an empty string, the record separator is a sequence of two or more adjacent <newline> characters.
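If you are not sure which of the two behaviours your awk has, one way to check is to feed it input that only splits into three records when RS is an ERE. This is just a sketch, and the function name rs_is_ere is hypothetical:

```shell
# Succeeds if this awk treats a multi-character RS as an ERE.
rs_is_ere() {
  n=$(printf 'a.b!c\n' | awk 'BEGIN { RS = "[.!]" } END { print NR }')
  [ "$n" -eq 3 ]   # ERE: records a, b, c; first-character-only RS ("["): one record
}

if rs_is_ere; then
  echo "RS is treated as an ERE"
else
  echo "only the first character of RS is used"
fi
```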

The default record separator is a <newline>. When RS is set to something other than a <newline>, <newline> (in addition to whatever FS is set to) is a field separator.
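That last point is what lets a sentence span several lines and still have its words counted correctly. A small sketch, assuming an awk with ERE support in RS (the function name words_per_sentence and the sample text are made up):

```shell
words_per_sentence() {
  awk 'BEGIN { RS = "[.?!]" }
       NF    { print NF }' "$@"    # skip the empty record left by the final newline
}

printf 'A sentence that\nspans two lines. Short one!\n' | words_per_sentence
```

The first sentence is split across two lines, yet it should be reported as one record with 6 fields, followed by 2 for the second sentence.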

1Brajesh · August 7, 2017, 11:42pm

Hi Don,

I'm trying to capture an English sentence in a record.
This sentence could be very long and span multiple lines in a file.

My perfect record separator would be a period, exclamation point, question mark, semicolon or colon.

However, when my code sees the word "Mr.", it thinks that's the end of the sentence because of the period that is part of Mr. So I want it to detect that "Mr." is NOT part of the record separator.

Semantically:
Not (Mr.) but ok with any of these [.!?;:]

But syntactically I don't know how to do this, I'm trying like this:

 RS = (^Mr. | [.!?;:])

But it's not working.

itkamaraj · August 8, 2017, 12:32am

Why don't you just change Mr. to Mr# using sed, then pass the file to awk, and afterwards change Mr# back to Mr.?

Can you give a sample input file and the expected output?

sed 's/Mr\./Mr#/g' input | awk -F"[.!?;:]" '{do whatever...}'   | sed 's/Mr#/Mr./g'
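A slightly fuller sketch of the same idea, under a few assumptions: "#" does not otherwise occur in the text, the awk in use accepts an ERE in RS, and RS is used instead of -F so that each sentence becomes a record. The function name protect_and_split and the extra Mrs. rule are only for illustration:

```shell
protect_and_split() {
  sed -e 's/Mr\./Mr#/g' -e 's/Mrs\./Mrs#/g' "$@" |
  awk 'BEGIN { RS = "[.?!;:]" }
       NF    { gsub(/#/, "."); $1 = $1; print }'   # restore the periods, one sentence per line
}

printf 'Mr. Smith met Mrs. Jones. They talked; then\nthey left!\n' | protect_and_split
```

Restoring the periods inside awk saves the second sed call; if the periods are not needed at all, the gsub() can simply be dropped.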

1Brajesh · August 8, 2017, 12:48am

Yes, I think that's a great idea. The input file is under my control, so I could even just replace "Mr." with "Mr". Thanks for this suggestion; at least I can move on now!

The input is for a neural network, so retaining the period after the Mr is not even important and it can be discarded.

Continuing this just for academic discussion, is it not possible to do a regular expression for what I want?

thank you

Don_Cragun · August 8, 2017, 2:27am

It's not just Mr.; it's also Mrs., Ms., Dr., Sr., Jr., and hundreds of other abbreviations. And these abbreviations don't always appear at the start of a sentence. Or maybe you thought that the caret in (^Mr. | [.!?;:]) means "not". It doesn't; it anchors that part of the ERE to the start of a string. And the <space> before the bracket expression is a literal <space> that must be matched exactly (and that <space> would never appear before a sentence-terminating character in English text).

If your sentences all end at the end of a line, anchoring (i.e. [.!?]$ as I suggested in post #6 in this thread) should work for you. If you have sentences that span multiple lines or multiple sentences on a line, AND every sentence that does not end at the end of a line has its sentence-terminating character immediately followed by two <space> characters, then the RS value I suggested in post #6 (i.e.

RS="[.?!](  |$)"

with exactly two spaces before the vertical bar in that ERE) should give you records that are sentences (without the character that terminates the sentence).

But if you have abbreviations followed by a single space and sentence-terminating characters followed by a single space (not a double space) and not appearing at the end of a line, you are going to find it very difficult to guess which periods terminate abbreviations and which periods terminate sentences. (Note that it is also possible for an abbreviation to appear at the end of a sentence.)

And, semicolons and colons do not end English sentences. I don't understand why you're including them in your EREs.
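For single-spaced text, about the best a simple script can do is a heuristic. The sketch below keeps the default RS (so it does not depend on ERE support in RS), walks through the words, and ends a sentence at any word that finishes with a terminator unless the word is on a known-abbreviation list. The list is deliberately short and hypothetical, the function name split_with_abbrevs is made up, and an abbreviation that really does end a sentence will be missed:

```shell
split_with_abbrevs() {
  awk '
  BEGIN {
    # Hypothetical, far from complete abbreviation list; extend as needed.
    n = split("Mr. Mrs. Ms. Dr. Sr. Jr.", list, " ")
    for (i = 1; i <= n; i++) abbrev[list[i]] = 1
  }
  {
    for (i = 1; i <= NF; i++) {
      sentence = (sentence == "" ? $i : sentence " " $i)
      # A word ending in . ? or ! closes the sentence unless it is an abbreviation.
      if ($i ~ /[.?!]$/ && !($i in abbrev)) {
        print sentence
        sentence = ""
      }
    }
  }
  END { if (sentence != "") print sentence }   # trailing text without a terminator
  ' "$@"
}

printf 'Dr. Jones met Mr. Smith\nat noon. Was it late? No!\n' | split_with_abbrevs
```

Because the sentence is accumulated across input lines, line breaks in the middle of a sentence do not matter here.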

1Brajesh · August 8, 2017, 10:00am

Hi Don,

Thank you for your detailed reply and examples of RS expressions.
You are correct, I would need to consider all the other abbreviations as well.

I'm using these sentences for training a neural net.

The training corpus would be classic books like Pride and Prejudice from Project Gutenberg, for example: Pride and Prejudice by Jane Austen - Free Ebook

Looking at this example above:

-Not all sentences end on a newline.
-There aren't two spaces after the end of a sentence
-The reason I'm considering : and ; is that, even though they don't end complete sentences, they generally end complete "thoughts", so I can approximate them as sentence terminators and reduce the complexity of the neural net.

I think the simplest solution for me is to massage the input file with a search and replace of all the exceptions, as itkamaraj suggested.

If, after looking at this example text, you have another suggestion, please let me know.

thank you all !

drl · August 8, 2017, 10:19am

Hi.

If you can use perl instead of awk , see thread Normalizing files for sentence count post 6.

Best wishes ... cheers, drl

1Brajesh · August 8, 2017, 11:10am

thanks drl I will take a look