I want to clean an HTML file.
I am trying to remove the script parts of the HTML, then strip the remaining tags and the empty lines.
The code I am using is the following:
sed '/<script/,/<\/script>/d' webpage.html | sed -e 's/<[^>]*>//g' | sed '/^\s*$/d' > output.txt
However, this method cannot handle a tag whose attributes span multiple lines, like:
<body class="three-col logged-out Streams"
data-fouc-class-names="swift-loading"
dir="ltr">
Are there any other methods to deal with this kind of case in the shell?
RudiC
November 30, 2015, 11:34am
2
To edit HTML files there are better tools than sed out there. However, for a sed solution, this might point you in the right direction:
sed '/<script/,/<\/script>/d; s/<[^>]*>//g; /</{:L;N;/>/!bL;d}; /^\s*$/d' file
Could you please explain what the third part is doing? Thanks.
Also, it does not seem to work on my machine.
RudiC
November 30, 2015, 11:55am
4
This /</{:L;N;/>/!bL;d}; is a loop, entered when a < is encountered, appending the next lines until a > is read, then deleting the pattern space. It is far from bulletproof, not accounting for e.g. nested tags, but should give you an idea of how you could proceed.
If you want further help you need to be way more specific (details, samples, error msgs, ...).
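One side effect of the loop above is that it deletes everything it has accumulated, including any text that follows the closing > on the same line. As a rough sketch of a variation (not what was posted above), the lines could instead be joined until the tag is closed and then passed to the usual tag-stripping substitution. The function name strip_multiline_tags is made up for illustration, the syntax is POSIX sh, and it assumes a sed in which [^>] also matches an embedded newline (GNU sed does). It is just as fragile as the original with a stray < in ordinary text.

```shell
# Hypothetical helper: reads HTML on stdin, writes the stripped text on stdout.
strip_multiline_tags() {
  # :join          loop label
  # /<[^>]*$/      the pattern space ends inside an unclosed tag
  # $!{ N; b join} unless this is the last line, append the next line and retest
  # s/<[^>]*>//g   strip complete tags, including ones that now span lines
  # last -e        drop lines that are empty or whitespace only
  sed -e ':join' \
      -e '/<[^>]*$/{' -e '$!{' -e 'N' -e 'b join' -e '}' -e '}' \
      -e 's/<[^>]*>//g' \
      -e '/^[[:space:]]*$/d'
}

# Usage with the multi-line body tag from the first post:
printf '%s\n' \
  '<body class="three-col logged-out Streams"' \
  'data-fouc-class-names="swift-loading"' \
  'dir="ltr">Hello' \
  '<p>world</p>' | strip_multiline_tags
# prints:
# Hello
# world
```

The script is split into separate -e pieces so that the label and the braces do not rely on the GNU-only ";" separators.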
The error is:
bL: Event not found.
RudiC
November 30, 2015, 12:14pm
6
Check the quoting ('...') of the sed script.
It's still the same error. I use tcsh. Is there anything wrong with this?
RudiC
November 30, 2015, 1:13pm
8
I can't make any statement on tcsh. Please post your command line exactly as is.
The command line is as follows:
% sed '/<script/,/<\/script>/d; s/<[^>]*>//g; /</{:L;N;/>/!bl;d}; /^\s*$/d' webpage.html > test.txt
bl: Event not found.
RudiC
November 30, 2015, 1:20pm
10
Sorry, can't help. You may want to scrutinize your shell's man pages to find out what causes the error ...
---------- Post updated at 19:20 ---------- Previous update was at 19:19 ----------
BTW - WHY is that a lower case l (L)? This is not what I posted!
Sorry, that was my typo. I also tried the upper-case "L", and it gives the same error.
And again, thanks for your help! You gave me an idea of how to solve it. I will explore further in this direction.
---------- Post updated at 11:46 PM ---------- Previous update was at 01:23 PM ----------
I have solved it! The problem is that csh treats (!) as a history substitution even inside single quotes. A backslash (\) has to be added before it. Then it works well! Thanks!
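Besides escaping the !, another option might be to write the loop so that it contains no ! at all, which leaves csh/tcsh history substitution nothing to act on. The sketch below is a rewording of the loop posted earlier, not the original command: after N, the pattern space is deleted as soon as a > shows up, and otherwise the script branches back to the label. The file names sample.html and cleaned.txt are placeholders, and [[:space:]] stands in for \s so that it does not depend on GNU sed.

```shell
# Build a small sample file (placeholder name) containing a script block
# and a body tag whose attributes span several lines.
printf '%s\n' \
  '<html>' \
  '<script type="text/javascript">' \
  'var x = 1;' \
  '</script>' \
  '<body class="three-col logged-out Streams"' \
  'data-fouc-class-names="swift-loading"' \
  'dir="ltr">' \
  '<p>Hello world</p>' \
  '</body>' \
  '</html>' > sample.html

# Same steps as the one-liner in the thread, with the "/>/!bL;d" test
# reworded as "/>/d; bL" so that no "!" appears on the command line.
sed -e '/<script/,/<\/script>/d' -e 's/<[^>]*>//g' -e '/</{' -e ':L' -e 'N' -e '/>/d' -e 'bL' -e '}' -e '/^[[:space:]]*$/d' sample.html > cleaned.txt

cat cleaned.txt
# prints:
# Hello world
```

Because d ends the current cycle immediately, bL is only reached when no > has been read yet, so the behaviour should match the original loop.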