Normalizing files for sentence count

A-V · November 23, 2012, 10:52am

I have files in many different formats, with line breaks in odd places. I want to normalize them so that I can count the sentences in each file.

1: I want to count sentences if they end with ! . or ?
2: But I don't want a full stop to count if there is no space after it, e.g. S.O.L

I have the following code, but I don't know how to make it work with the second condition:

FILES="basic/*"
for X in $FILES
do
	name=$(basename $X) 
	sed -n -e ":a" -e "$ s/\n/ /gp;N;b a" $X| tr '\. ' '\n '| tr '\? ' '\n '|tr '\! ' '\n '| grep -v "^[[:blank:]]*$" | wc -l > count/${name}
done

Can someone please help me in this regard?

jim_mcnamara · November 23, 2012, 11:50am

I am translating your requirement to mean: count all of the . ! and ? characters in a file.
This is only part of what it means to find sentences. It will have problems, e.g. with text containing decimal numbers, and with sentences that end in an ellipsis.... < that is one! Neat. I made a self-referential sentence.

awk '{ total += gsub(/[.?!]/, "") }
END { print "total sentences=", total + 0 }' somefile.txt

You have to decide on the correctness of your approach, based on your data.
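As a rough sketch of how the second condition from the question could be folded into the same idea: the variant below only counts a run of . ! ? when it is followed by a blank or by the end of the line. The function name count_sentence_ends is made up for illustration, and abbreviations such as "Mr." are still counted, so treat it as an approximation only.

```shell
# Hypothetical helper: count sentence terminators that are followed by
# a blank or the end of a line. Reads text on stdin.
count_sentence_ends() {
    awk '
        {
            line = $0
            # A run such as "...." counts once; "S.O.L" and "3.1415" do not
            # match because no blank follows their dots.
            total += gsub(/[.!?]+([ \t]|$)/, " ", line)
        }
        END { print total + 0 }
    '
}

printf '%s\n' 'It is S.O.L for us. Is it?  Yes!' | count_sentence_ends   # prints 3
```

To use it on a file: count_sentence_ends < somefile.txt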

A-V · November 23, 2012, 12:14pm

Thank you very much for the code
I also have to break the files into one sentence per line, and I don't want it to split a line if a word or number follows the "." directly, so I need to know how to identify that case.
Can you explain this bit, please?

Yoda · November 23, 2012, 1:21pm

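$0 represents the whole current record (normally one input line). This is the syntax of the gsub function:

gsub(regexp, replacement, target)

The gsub function replaces every match of regexp in target and returns the number of substitutions made, which is why adding its result to total counts the matching characters.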
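A small illustration of that return value, as a sketch only (the function name demo_gsub and the sample text are made up for the example):

```shell
# Hypothetical demo: show what gsub() returns and what it leaves behind.
demo_gsub() {
    awk '{
        line = $0                      # work on a copy so $0 stays intact
        n = gsub(/[.!?]/, "", line)    # delete every terminator in the copy
        print n ": " line              # number of substitutions, then the result
    }'
}

printf '%s\n' 'Stop. Go! Why?' | demo_gsub   # prints "3: Stop Go Why"
```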

Don_Cragun · November 23, 2012, 1:31pm

The normal way of doing this is to change spaces and tabs to newlines and then count the number of lines that end in ., !, or ?. Note that tr reads standard input, so the file has to be redirected into it:

tr ' \t' '\n\n' < file | grep -c '[.!?]$'
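To tie this back to the loop in the original question, here is a sketch that runs the same tr/grep pipeline over every file in one directory and writes each count to a file of the same name in another. The function names count_file and count_all are invented for the example, and the basic/ and count/ directory names are taken from the question. A token such as "S.O.L" is not counted because it does not end in a terminator, but abbreviations like "Mr." still are.

```shell
# Hypothetical helper: count whitespace-separated tokens ending in . ! or ?
count_file() {
    tr ' \t' '\n\n' < "$1" | grep -c '[.!?]$'
}

# Hypothetical helper: write one count file per input file.
# $1 = source directory, $2 = destination directory.
count_all() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*; do
        [ -f "$f" ] || continue            # skip directories and unmatched globs
        count_file "$f" > "$dst/$(basename "$f")"
    done
}
```

With the layout from the question this would be called as: count_all basic count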

drl · November 24, 2012, 7:25am

Hi.

I have been looking at the topic of processing English sentences lately. Here is a demonstration of a Perl script that places sentences on separate lines (minimal version):

#!/usr/bin/env bash

# @(#) s1	Demonstrate minimal English sentence separation.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C perl divepm 

pl " Perl modules:"
divepm -q --input=minimal-sese

FILE=${1-data5}

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
./minimal-sese -d $FILE

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
perl 5.10.0
divepm (local) 1.2

-----
 Perl modules:
 1.04	strict
 1.06	warnings
 0.03	Perl6::Slurp
 0.25	Lingua::EN::Sentence

-----
 Input data file data5:
Now is the time
for all good men
to come to the aid
of their country.
Gobble, gobble.
Mr. Erickson said to Dr.
Olson, "Pi is approximated by 3.1415, that's S.O.P.". The AAA
came out to change my tire!  Isn't that great?

-----
 Results:
1) Now is the time
for all good men
to come to the aid
of their country.
Now is the time for all good men to come to the aid of their country.
2) Gobble, gobble.
Gobble, gobble.
3) Mr. Erickson said to Dr.
Olson, "Pi is approximated by 3.1415, that's S.O.P.".
Mr. Erickson said to Dr. Olson, "Pi is approximated by 3.1415, that's S.O.P.".
4) The AAA
came out to change my tire!
The AAA came out to change my tire!
5) Isn't that great?
Isn't that great?

The file uploaded needs to be copied to a file named minimal-sese and then made executable. The Perl module Lingua/EN/Sentence.pm may be available in your repository; otherwise it needs to be copied from the URL noted in the script comments.
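For comparison, a much cruder shell-only splitter can be sketched with awk. This is not what the Perl module does: it simply joins the lines and breaks after any run of . ! ? that is followed by a space, so unlike the output above it would also break after "Mr." and "Dr.". The function name split_sentences is made up for the example.

```shell
# Hypothetical helper: naive one-sentence-per-line splitter. Reads stdin.
split_sentences() {
    awk '
        { text = (NR > 1 ? text " " : "") $0 }   # join all lines with spaces
        END {
            gsub(/[ \t]+/, " ", text)            # squeeze runs of blanks
            gsub(/[.!?]+ /, "&\n", text)         # newline after terminator + space
            gsub(/ \n/, "\n", text)              # drop the space before each break
            sub(/ $/, "", text)                  # drop a trailing space, if any
            print text
        }
    '
}
```

Usage would be along the lines of: split_sentences < data5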

Posting samples of your input and desired output will help invite on-point solutions.

Best wishes ... cheers, drl

A-V · November 26, 2012, 10:30am

Thank you very much... I will give this a go and ask if I have any questions.