Running sed from a script query

bgnersoon2be_1 · July 25, 2013, 11:24am

Hello!

I'm trying to run this code, which should print the body of an HTML document (all text between <body> and </body>), from a script file, but I'm unsure how to call it from the command line.

/<body>/,/<\/body>/
1s/.*<body>//
$s/<\/body>.*//p

I have tried to call it using this:

sed -n -f sedscript1.sed test.txt

test.txt being where the HTML text is stored.
I get the error message:

sed: file sedscript1.sed line 2: unknown command: `
'

when I try to run it.

What am I doing wrong!?

Thanks for your help!

Corona688 · July 25, 2013, 11:59am

1) Your first line is missing a command. It says "for all lines between <body> and </body>, do something", but the "something" is missing. /<body>/,/<\/body>/ p is more complete; p means print.

2) It probably won't work as hoped. It matches whole lines, from the one containing <body> to the one containing </body>, so anything else on those two lines, before <body> or after </body>, comes along too. If the HTML is one giant line, it will print everything.
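
For illustration, here is one way point 1 could look once a command is attached to the range. This is a sketch rather than the original poster's script: the temporary directory, the file names body.sed and sample.html, and the sample HTML are all made up, and it assumes <body> and </body> sit on different lines, as point 2 warns.

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)

# A tiny made-up HTML sample with <body> and </body> on different lines.
printf '%s\n' \
  '<html><head><title>t</title></head><body><ul>' \
  '<li>one</li>' \
  '</ul></body></html>' > "$dir/sample.html"

# An address range must be followed by a command; here it is a { } group
# that trims the text around the tags and then prints the line.
cat > "$dir/body.sed" <<'EOF'
/<body>/,/<\/body>/{
s/.*<body>//
s/<\/body>.*//
p
}
EOF

# -n turns off automatic printing, so only the explicit p produces output.
out=$(sed -n -f "$dir/body.sed" "$dir/sample.html")
printf '%s\n' "$out"
```

On this sample the output should be the three lines from <ul> to </ul>, with the surrounding head and html tags trimmed off.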

You can do similar things in awk, but you get to tell it what a 'line' is, which is useful for matching one tag per 'line'.

This will match tags more properly, and also split each tag onto a line:

awk -v RS="<" '/^[bB][oO][dD][yY]/,/^\/[bB][oO][dD][yY]/ { $1="<"$1 ; print }' file.html
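
To see what RS="<" does, here is a small sketch that runs the same awk program on a made-up one-line document (the sample HTML is an assumption, not from the thread). Each record starts right after a "<", so a pattern anchored with ^ matches the tag name itself.

```shell
# Feed a one-line, made-up HTML document to the awk program from above.
# With RS="<", the records are: "", "html>", "body>", "p>hi", "/p>", ...
out=$(printf '%s\n' '<html><body><p>hi</p></body></html>' |
  awk -v RS="<" '/^[bB][oO][dD][yY]/,/^\/[bB][oO][dD][yY]/ { $1="<"$1 ; print }')

# One tag per output line, from <body> through </body>.
printf '%s\n' "$out"
```

Note that assigning to $1 makes awk rebuild the record, so runs of whitespace inside a record collapse to single spaces.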

bgnersoon2be_1 · July 25, 2013, 3:10pm

Thanks for the help! But when I use it like this (from the command line):

sed -n '/<body>/,/<\/body>/p' test.txt | sed -e '1s/.*<body>//' -e '$s/<\/body>.*//'

it will take the input:

<!DOCTYPE html><html lang="en">
<head><title>Images</title></head><body><ul>
<li><a href="IMG_1389.JPG">IMG_1389.JPG<\a> (1.7M)<\li>
<li><a href="IMG_1390.JPG">IMG_1390.JPG<\a> (1.5M)<\li>
<li><a href="IMG_1391.JPG">IMG_1391.JPG<\a> (1.4M)<\li>
</ul></body></html>

and output exactly what I need:

<ul>
<li><a href="IMG_1389.JPG">IMG_1389.JPG<\a> (1.7M)<\li>
<li><a href="IMG_1390.JPG">IMG_1390.JPG<\a> (1.5M)<\li>
<li><a href="IMG_1391.JPG">IMG_1391.JPG<\a> (1.4M)<\li>
</ul>

But I need to be able to call it from a script...

Corona688 · July 25, 2013, 3:16pm

Well, you could paste that command into a script?

bgnersoon2be_1 · July 25, 2013, 3:19pm

Trouble is, it doesn't accept the commas and I thought each expression had to be written on a new line?

Corona688 · July 25, 2013, 3:23pm

No, I mean paste the whole command line you gave into a shell script file. Otherwise you're going to need more than one sed script file to feed all those sed | sed | sed.
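
As a rough sketch of that suggestion: the pipeline goes into a shell script unchanged, with the input file taken from the first argument. The script name bodytext.sh, the scratch directory, and the sample HTML are hypothetical.

```shell
# Hypothetical scratch directory, for illustration only.
dir=$(mktemp -d)

# bodytext.sh is a made-up name; "$1" is the HTML file to read.
# The quoted 'EOF' keeps $1 and $s literal while the file is written.
cat > "$dir/bodytext.sh" <<'EOF'
#!/bin/sh
sed -n '/<body>/,/<\/body>/p' "$1" |
  sed -e '1s/.*<body>//' -e '$s/<\/body>.*//'
EOF

# A small made-up input file.
printf '%s\n' \
  '<html><head></head><body><ul>' \
  '<li>one</li>' \
  '</ul></body></html>' > "$dir/test.txt"

# Run the script through sh; after chmod +x it could be called directly.
out=$(sh "$dir/bodytext.sh" "$dir/test.txt")
printf '%s\n' "$out"
```

The commas and the -e expressions need no changes here, because the shell script holds a normal command line rather than a sed script.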

awk '/^[uU][lL]/,/^\/[uU][lL]/ { $1="<"$1 ; print }; END { printf("\n"); }' RS="<" ORS="" FS="" OFS="" inputfile
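
A sketch of how that one-liner behaves on a small made-up file (the scratch directory, file name, and HTML are assumptions). It relies on FS="" splitting a record into single characters, which gawk and mawk do but which, as far as I know, POSIX does not guarantee.

```shell
# Hypothetical scratch directory and sample file, for illustration only.
dir=$(mktemp -d)
printf '%s\n' \
  '<html><body><ul>' \
  '<li>a</li>' \
  '</ul></body></html>' > "$dir/sample.html"

# RS="<" makes each tag start a record; ORS="" and OFS="" stop awk from
# adding separators, so the original line breaks inside the range survive.
# The END block adds the one trailing newline that ORS="" would omit.
out=$(awk '/^[uU][lL]/,/^\/[uU][lL]/ { $1="<"$1 ; print }; END { printf("\n"); }' \
  RS="<" ORS="" FS="" OFS="" "$dir/sample.html")
printf '%s\n' "$out"
```

Unlike the RS="<" version without ORS="", this keeps the layout of the input rather than putting one tag on each line.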

Corona688 · July 25, 2013, 3:29pm

Try this in your sed file, still calling it with sed -n -f sedscript1.sed test.txt:

s/.*<body>//
s/<\/body>.*//
/<ul>/,/<\/ul>/p

Beware that if your HTML changes slightly, it will break down.
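
Put together, a run could look like the sketch below. It uses the sample input posted earlier in the thread and the thread's file names; the scratch directory is an assumption. The -n option matters: without it sed also prints every line automatically, so the lines in the range would appear twice and the rest of the file would leak through.

```shell
# Hypothetical scratch directory, for illustration only.
dir=$(mktemp -d)

# The sample input from earlier in the thread, copied as posted.
cat > "$dir/test.txt" <<'EOF'
<!DOCTYPE html><html lang="en">
<head><title>Images</title></head><body><ul>
<li><a href="IMG_1389.JPG">IMG_1389.JPG<\a> (1.7M)<\li>
<li><a href="IMG_1390.JPG">IMG_1390.JPG<\a> (1.5M)<\li>
<li><a href="IMG_1391.JPG">IMG_1391.JPG<\a> (1.4M)<\li>
</ul></body></html>
EOF

# The three-line sed script: trim around the body tags on every line,
# then print only the lines from <ul> through </ul>.
cat > "$dir/sedscript1.sed" <<'EOF'
s/.*<body>//
s/<\/body>.*//
/<ul>/,/<\/ul>/p
EOF

out=$(sed -n -f "$dir/sedscript1.sed" "$dir/test.txt")
printf '%s\n' "$out"
```

The fragility is easy to trigger: an opening tag written as <ul class="x">, for example, no longer matches /<ul>/ and nothing is printed.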