How to remove an HTML tag that spans multiple lines in the shell?

I want to clean up an HTML file.

I am trying to remove the script sections from the HTML, then strip the remaining tags and the empty lines.

The command I am using is:

sed '/<script/,/<\/script>/d' webpage.html | sed -e 's/<[^>]*>//g' | sed '/^\s*$/d' > output.txt

However, this approach cannot handle a tag whose attributes are spread over multiple lines, like:

<body class="three-col logged-out Streams" 
data-fouc-class-names="swift-loading"
 dir="ltr">

Is there another way to handle this kind of case in the shell?

There are better tools than sed for editing HTML files. However, for a sed solution, this might point you in the right direction:

sed '/<script/,/<\/script>/d; s/<[^>]*>//g; /</{:L;N;/>/!bL;d}; /^\s*$/d' file 

Could you please explain what the third part does? Thanks.
Also, it does not seem to work on my machine.

This /</{:L;N;/>/!bL;d}; is a loop: it is entered when a line still contains a < (that is, after the single-line tags have been removed), appends the following lines until a > is read, then deletes the pattern space. It is far from bulletproof, since it does not account for nested tags, for example, but it should give you an idea of how you could proceed.
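
To make the loop easier to follow, here is a minimal sketch that runs the tag-stripping part of the command on a small sample. It assumes GNU sed (for \s and for labels terminated by a semicolon) and a Bourne-style shell; the file name sample.html and its contents are made up for illustration.

```shell
# Hypothetical sample input: the <body> tag is spread over three lines.
cat > sample.html <<'EOF'
<html>
<body class="three-col"
data-fouc-class-names="swift-loading"
 dir="ltr">
<p>Hello</p>
</body>
</html>
EOF

# 1. s/<[^>]*>//g   removes every tag that opens and closes on one line
# 2. /</{...}       a leftover "<" means a tag is still open:
#      :L           define label L
#      N            append the next input line to the pattern space
#      />/!bL       if there is still no ">", branch back to L
#      d            once ">" has been read, delete the whole pattern space
# 3. /^\s*$/d       drop lines that are now empty or whitespace only
result=$(sed 's/<[^>]*>//g; /</{:L;N;/>/!bL;d}; /^\s*$/d' sample.html)
echo "$result"
```

Because d discards the entire pattern space, any text that shares a line with the start or the end of a multi-line tag is discarded along with the tag.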

If you want further help, you need to be far more specific (details, samples, error messages, ...).

The error is:
bL: Event not found.

Check the quoting ('...') of the sed script.

It's still the same error. I am using tcsh. Is there anything wrong with that?

I can't say anything about tcsh. Please post your command line exactly as you typed it.

The command line is as follows:

% sed '/<script/,/<\/script>/d; s/<[^>]*>//g; /</{:L;N;/>/!bl;d}; /^\s*$/d' webpage.html > test.txt
bl: Event not found.

Sorry, can't help. You may want to scrutinize your shell's man pages to find out what causes the error ...

BTW - WHY is that a lower case l (L)? This is not what I posted!

Sorry, that was my mistake. I also tried it with the uppercase "L", and I get the same error.

And again, thanks for your help! You gave me an idea of how to solve it. I will explore further in this direction.

I have solved it! csh treats the ! character as a history substitution, even inside single quotes, so a backslash (\) must be added in front of it. With that change it works well. Thanks!
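
For anyone hitting the same "Event not found" error: csh and tcsh perform history substitution on ! even inside single quotes, so on the tcsh command line the ! in the script has to be written as \!. An alternative that was not tried in this thread is to keep the sed script in a file and run it with sed -f, so the ! never appears on the command line. The sketch below assumes GNU sed; the file name strip-tags.sed and the sample page are made up for illustration. The here-documents are only there to make the example self-contained in sh or bash; in tcsh it is probably safer to create the script file in an editor, since I have not checked how tcsh treats ! inside a here-document.

```shell
# Hypothetical script file holding the same commands, one per line.
cat > strip-tags.sed <<'EOF'
/<script/,/<\/script>/d
s/<[^>]*>//g
/</{
:L
N
/>/!bL
d
}
/^\s*$/d
EOF

# Hypothetical sample page with a script block and a multi-line tag.
cat > webpage.html <<'EOF'
<html>
<script>
var a = 1;
</script>
<body class="three-col"
 dir="ltr">
<p>Hello</p>
</body>
</html>
EOF

# No "!" on the command line, so there is nothing for csh/tcsh to expand.
sed -f strip-tags.sed webpage.html > test.txt
cat test.txt
```

The one-per-line layout also avoids relying on semicolons after labels, which not every sed accepts.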