Extracting text between two strings

JamesForeman · June 27, 2010, 1:52am

Hi,

I've looked at a few existing posts on this, but they don't seem to work for my inputs.

I have a text file where I want to extract all the text between two strings, every time that occurs.

E.g. my input file is:

Anna said that she would fetch the bucket.
Anna and Ben moved the bucket.
I would not like Anna to do it.

I was expecting that

sed -n '/Anna/,/would/p' inputfile > outputfile

would give me

said that she
and Ben moved the bucket.
I

But instead I get back

Anna said that she would fetch the bucket.
Anna and Ben moved the bucket.
I would not like Anna to do it.

What am I missing?

Thanks

bartus11 · June 27, 2010, 4:02am

Try

perl -0777 -ne '/(?<=Anna).*(?=would)/s;print $&;' file

or

perl -0777 -ne '/(?<=Anna).*?(?=would)/s;print $&;' file

Scrutinizer · June 27, 2010, 4:36am

sed -n '/Anna/,/would/p' inputfile > outputfile

prints every whole line, from a line that contains "Anna" up to and including the next later line that contains "would". The addresses select whole lines, never parts of a line.
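A quick way to see this on the sample from the first post is sketched below. The temporary file and the variable name `range_out` are just illustrative.

```shell
# Recreate the three-line sample from the first post.
tmp=$(mktemp)
printf '%s\n' \
  'Anna said that she would fetch the bucket.' \
  'Anna and Ben moved the bucket.' \
  'I would not like Anna to do it.' > "$tmp"

# Line 1 contains "Anna", so the range opens there. sed only starts looking
# for the end pattern on the following line, so the "would" on line 1 is
# ignored and the range closes on line 3.
range_out=$(sed -n '/Anna/,/would/p' "$tmp")
printf '%s\n' "$range_out"

rm -f "$tmp"
```

All three lines come back unchanged, which is exactly what the original poster saw.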

bartus11 · June 27, 2010, 5:23am

From what OP wrote, he already tried that code, and its result didn't meet his needs.

Scrutinizer · June 27, 2010, 5:30am

Hi Bartus11, I know. I wasn't trying to provide a solution, just to explain what the sed construct he used actually does, since it did not work as he expected (which is actually what he was asking).

bartus11 · June 27, 2010, 5:37am

Sorry for misunderstanding your post.

JamesForeman · June 27, 2010, 7:06am

Thanks all, now I have a slightly improved understanding of sed (and perl as well).

Bartus11's second bit of perl gives me almost what I want: it gives me the text between the first instance of 'Anna' and the first 'would' after that. But if I have multiple occurrences of 'Anna' and 'would' in my file, how do I get all of them?

Just to clarify, if the text file was

Anna A would Anna B would Anna C would

then I'd want the output to be
A
B
C

and not
A
AB
B
BC
C

or any similar permutation. Should I just get rid of the first occurrence in the file and then run Bartus11's second script again (and again and again) until I get no more output? Or is there an elegant way to avoid doing that? (Not that it has to be elegant: I'm quite happy with brute force.)

bartus11 · June 27, 2010, 7:26am

Perl can do it for you:

perl -0777 -ne 'while (/(?<=Anna).*?(?=would)/s){print $& . "\n"; s/Anna.*?would//s}' file

Another way:

perl -0777 -ne 'print $1 . "\n" while s/Anna(.*?)would//s' file

Scrutinizer · June 27, 2010, 10:08am

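Perl's lazy matching capability is a real advantage here. With sed and awk you could use the following (the `\|` alternation and the `\n` in the replacement are GNU sed extensions):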

sed 's/\(Anna\|would\)/\n&\n/g' infile |
awk '/would/{p=0; printf "%s", s; s=""} p{s=s $0 "\n"} /Anna/{p=1}'

or

sed 's/\(Anna\|would\)/\n&\n/g' infile |
awk '/would/{p=0; printf "%s", s; s=""} p{$1=$1; s=s $0 "\n"} /Anna/{p=1}'

to strip the leading and trailing spaces as well.

JamesForeman · June 27, 2010, 10:13am

Terrific! Thanks very much!