Hi.
This is a perl approach to the problem. CPAN has a module for this, Lingua::EN::Sentence. I won't post the perl code, p1, unless necessary; it is fewer than 40 lines. Here is a sample use on a small data file:
#!/usr/bin/env bash
# @(#) s1 Demonstrate identifying English sentences, perl modules.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C perl divepm
pl " perl modules:"
divepm Sentence Slurp
FILE=${1-data1}
pl " Input data file $FILE:"
cat "$FILE"
pl " Results:"
./p1 "$FILE"
exit 0
producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
perl 5.10.0
divepm (local) 1.2
-----
perl modules:
Note - /usr/lib/perl/5.10 points to 5.10.0
Note - /usr/share/perl/5.10 points to 5.10.0
0.25 Lingua::EN::Sentence
0.03 Perl6::Slurp
-----
Input data file data1:
Now is the time
for all good men
to come to the aid
of their country.
Gobble, gobble.
Mr. Erickson said to Dr.
Olson, "Three, e.g.".
The AAA came out to change my tire! Isn't that great?
-----
Results:
1) Now is the time
for all good men
to come to the aid
of their country.
1 [ \n to space ]) Now is the time for all good men to come to the aid of their country.
2) Gobble, gobble.
3) Mr. Erickson said to Dr.
Olson, "Three, e.g.".
3 [ \n to space ]) Mr. Erickson said to Dr. Olson, "Three, e.g.".
4) The AAA came out to change my tire!
5) Isn't that great?
Found 5 sentences in data1
For the 60957 lines in the file at the posted link, it found 31017 sentences in 260 seconds (roughly 234 lines per second), so it's not the fastest code, but it seems to get the job done.
Obviously this is of little value if the OP wants awk. However, the regular expression might be reusable, along with the perl module's algorithm: mark the possible sentence ends, then check them against exceptions such as a list of known abbreviations.
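To give an idea of how the mark-then-check approach might look in awk, here is a minimal sketch wrapped in a bash function. It is not the code from the perl module and not my p1. It is a simplified, word-based approximation. The function name split_sentences and the short abbreviation list are made up for illustration, and the list is nowhere near complete. It only handles double quotes and parentheses around sentence ends.

```shell
#!/usr/bin/env bash
# Sketch only: a word-based approximation of "mark candidates, then check
# exceptions". It is NOT the Lingua::EN::Sentence algorithm.
# split_sentences is a hypothetical helper name; the abbreviation list is
# illustrative and far from complete.
split_sentences() {
  awk '
    BEGIN {
      # Abbreviations that should not end a sentence (assumed sample list).
      n = split("mr. mrs. ms. dr. prof. e.g. i.e. etc. vs.", a, " ")
      for (i = 1; i <= n; i++) abbr[a[i]] = 1
    }
    # Collect every whitespace-separated word, so newlines act like spaces.
    { for (i = 1; i <= NF; i++) word[++nw] = $i }
    END {
      count = 0; sent = ""
      for (i = 1; i <= nw; i++) {
        w = word[i]
        sent = (sent == "" ? w : sent " " w)
        # Step 1: mark a candidate end: . ! or ? plus optional closing " or ).
        if (w !~ /[.!?][")]*$/) continue
        # Step 2: reject the candidate if the word is a known abbreviation.
        key = tolower(w)
        if (key in abbr) continue
        # Step 3: the next word, if any, must start with a capital letter.
        if (i < nw && word[i + 1] !~ /^["(]*[A-Z]/) continue
        printf "%d) %s\n", ++count, sent
        sent = ""
      }
      # Print any trailing text that never reached a sentence end.
      if (sent != "") printf "%d) %s\n", ++count, sent
    }
  ' "$@"
}
```

After sourcing the function, "split_sentences data1" should print one numbered sentence per line, with the embedded newlines already turned into spaces. The weak spot is the same as in any such scheme: an abbreviation missing from the list, or one that really does end a sentence, will be split wrongly.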
Best wishes ... cheers, drl