Using awk to find sentences.

danbroz · November 19, 2012, 8:46am

I am trying to print out sentences that match a regular expression in awk (I'm open to using other tools, too).

I got the regular expression I want to use,

"([^)(]\+ [0-9]\{4\})"

from user ripat in a grep forum. Unfortunately with grep I couldn't print only the sentence.

While searching for awk solutions as suggested I came across this: filtering - How to print out a specific field in AWK? - Stack Overflow
which states "

awk '/word1/'

will print out the whole sentence". I also looked elsewhere, including the man page, for direct sentence retrieval, with no luck.

I tried

$ awk '/([^)(]\+ [0-9]\{4\})/' BioPsych10.txt

and

$ awk '/"([^)(]\+ [0-9]\{4\})"/' BioPsych10.txt

but they returned nothing.

The text I am searching through is at http://dl.dropbox.com/u/4235339/BioPsych10.txt

Thank you,
DanBroz

PikK45 · November 19, 2012, 9:17am

I think you are missing the printing part in awk

Why don't you try like this??

 awk '/pattern/ {print $0}' file

Jotne · November 19, 2012, 9:18am

I cannot download the file.
Do you want to find an exact match of the string "([^)(]\+ [0-9]\{4\})", or is this a search pattern?

Example string:

sfgs grg wrrefg wreg wre "([^)(]\+ [0-9]\{4\})"rwe trt wre 
wretrtwretwret  ret rt wretwret wret 
wt wret wretrt wret wret  rt wretw rett254t 5 tt

and you want line #1 to be the hit?

danbroz · November 19, 2012, 9:49am

It is a search pattern.
Try downloading the file here:
http://dl.dropbox.com/u/4235339/BioPsych10.txt

Scrutinizer · November 19, 2012, 9:53am

The regex: "([^)(]\+ [0-9]\{4\})" is not extended regex but GNU basic regex...
This is the same with POSIX extended regex:

grep -Eo '\([^)]+ [0-9]{4}\)' infile

If your grep supports the -o option, it will return only the occurrences that fit on a single line. Without the -o option (just grep -E), it will return the whole line the pattern was found on...

Likewise awk:

awk '/\([^)]+ [0-9]{4}\)/' infile

would return the whole line the pattern was found on.

Try something like this for multi-line results:

awk 'NR>1 && $1~/^[^)]+ [0-9]{4}$/{print RS $1 FS}' RS=\( FS=\) infile

GNU awk prior to 4.0 does not support repetition ( {4} ) by default. Use (g)awk --posix
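
To make the difference between the two commands concrete, here is a minimal sketch run against a two-line sample. The sample text and the function names only_matches and whole_lines are invented for illustration, and grep -o is assumed to be available (it is a common extension, not required by POSIX). The awk version spells out four digit classes instead of {4} so that it does not depend on interval support.

```shell
#!/bin/sh
# Hypothetical sample: one line with a "(Name YYYY)" citation, one without.
sample='Stress alters memory (Smith 2004) in rats.
No citation on this line.'

# grep -Eo prints only the matched text, one match per output line.
only_matches() { grep -Eo '\([^)]+ [0-9]{4}\)'; }

# awk with a bare /regex/ prints every whole line containing a match.
# [0-9][0-9][0-9][0-9] replaces {4}, which older gawk ignores by default.
whole_lines() { awk '/\([^)]+ [0-9][0-9][0-9][0-9]\)/'; }

printf '%s\n' "$sample" | only_matches
printf '%s\n' "$sample" | whole_lines
```

Neither command sees a citation that is split across two input lines, which is why the multi-line variant above changes RS and FS.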

danbroz · November 19, 2012, 5:07pm

This seems to not return the complete sentence.

It would work if I managed the carriage returns of the file by first deleting them and then adding a newline
after every ")." The very talented user ripat showed me the first part:

 tr -d '\n' < file |

tl;dr: I need code to add a newline after every ")."
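
One possible way to do both steps is sketched below; it is not taken from ripat's answer. It assumes the input may contain DOS-style \r characters, and it turns newlines into spaces rather than deleting them (a deliberate change from tr -d '\n') so that words at line ends do not run together. The function name split_after_citation is hypothetical.

```shell
#!/bin/sh
# Join all input lines into one long line, then start a new line
# after every ")." so each citation-terminated sentence stands alone.
split_after_citation() {
    tr -d '\r' |                 # drop carriage returns, if any
    tr '\n' ' ' |                # newline -> space (keeps words apart)
    awk '{ gsub(/\)\. */, ").\n"); print }'
}

# Possible use on the file from this thread (not run here):
#   split_after_citation < BioPsych10.txt | grep -E '\([^)]+ [0-9]{4}\)'
```

Text that follows the last ")." stays together on the final output line, and sentences that end without a citation are not split at all.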

jim_mcnamara · November 19, 2012, 6:17pm

Please post relevant text on this site, not on external sites. The external site will age out the posted text, and then some future searcher will not have a clue as to what this thread is really about.

In fact it went to lala land (404) just now..... Nobody can effectively help you now.

Thank you.

danbroz · November 19, 2012, 9:49pm

I'm going to start a new thread with sample text and a simpler request. Thank you for the guidance.

drl · November 20, 2012, 7:36am

Hi.

This is a perl approach to this problem. One of the modules at CPAN is Lingua::EN::Sentence. I won't post the less-than-40-line perl code, p1, unless necessary. Here is a sample use on a small data file:

#!/usr/bin/env bash

# @(#) s1	Demonstrate identifying English sentences, perl modules.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C perl divepm
pl " perl modules:"
divepm Sentence Slurp

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
./p1 $FILE

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
perl 5.10.0
divepm (local) 1.2

-----
 perl modules:
 Note - /usr/lib/perl/5.10 points to 5.10.0
 Note - /usr/share/perl/5.10 points to 5.10.0
 0.25	Lingua::EN::Sentence
 0.03	Perl6::Slurp

-----
 Input data file data1:
Now is the time
for all good men
to come to the aid
of their country.
Gobble, gobble.
Mr. Erickson said to Dr.
Olson, "Three, e.g.".
The AAA came out to change my tire!  Isn't that great?

-----
 Results:
1) Now is the time
for all good men
to come to the aid
of their country.
1 [ \n to space ]) Now is the time for all good men to come to the aid of their country.
2) Gobble, gobble.
3) Mr. Erickson said to Dr.
Olson, "Three, e.g.".
3 [ \n to space ]) Mr. Erickson said to Dr. Olson, "Three, e.g.".
4) The AAA came out to change my tire!
5) Isn't that great?
 Found 5 sentences in data1

For the 60957 lines in the posted link, it found 31017 sentences in 260 seconds, so it's not the fastest code, but it seems to get the job done.

Obviously this is of little value if the OP desires awk, although the regular expression might still be usable, along with the perl module's algorithm of marking the possible sentences and then checking for exceptions such as the list of known abbreviations.
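
As a rough illustration of that mark-then-check idea in awk, here is a minimal sketch. It is not the algorithm of Lingua::EN::Sentence; the function name split_sentences and the five-entry abbreviation list are assumptions made for the example. A word ending in ., ! or ? (optionally followed by a closing quote or parenthesis) is treated as a candidate sentence end, and the candidate is rejected when the word is a known abbreviation.

```shell
#!/bin/sh
# Naive sentence splitter: candidate boundaries minus known abbreviations.
split_sentences() {
    awk '
    BEGIN {
        # Hypothetical, deliberately tiny abbreviation list.
        n = split("Mr. Mrs. Dr. e.g. i.e.", a, " ")
        for (i = 1; i <= n; i++) abbr[a[i]] = 1
    }
    {
        for (i = 1; i <= NF; i++) {
            # Append the word to the sentence collected so far.
            sent = (sent == "" ? $i : sent " " $i)
            # Candidate end: . ! or ? then optional closing " or ).
            if ($i ~ /[.!?][")]*$/ && !($i in abbr)) {
                print sent
                sent = ""
            }
        }
    }
    END { if (sent != "") print sent }   # flush an unterminated tail
    '
}
```

Because the sentence is carried across input lines, the line breaks in the data do not matter. A sentence that genuinely ends with a listed abbreviation will be merged with the next one, which is the kind of case a fuller exception list has to handle.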

Best wishes ... cheers, drl