sed to extract HTML content

Hiya,

I am trying to extract a news article from a web page. The sed I have written brings back a lot of JavaScript code and sometimes advertisements too. Can anyone please help with this one? I need to fix this sed so it picks up the article ONLY (don't worry about the title or date; I got those using a separate sed).

The sed I am running is:

tr -d '\n' < 03climate.html | sed -e 's/’//g' -e 's/.*nyt_text[^;]*;//' -e 's/<\/p>.//g' -e 's/<[^>]*>//g' -e 's/[&][#]//g' -e 's/<[^>]*>//g' >> articletest
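
Looking at it again, I think the problem is that the first expression only deletes everything *before* the nyt_text marker; nothing ever cuts off the scripts and adverts that come *after* the article. Something along these lines is probably closer, though NYT_TEXT_END below is only a placeholder for whatever marker actually ends the article block in the attached file:

# Rough sketch (untested): cut off everything after the article as well as
# everything before it, then strip the remaining tags and numeric entities.
# NYT_TEXT_END is a placeholder -- replace it with the real end marker
# from 03climate.html.
tr -d '\n' < 03climate.html \
  | sed -e 's/.*nyt_text[^;]*;//' \
        -e 's/NYT_TEXT_END.*//' \
        -e 's/<[^>]*>//g' \
        -e 's/&#[0-9]*;//g' \
  > articletest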

The file I am trying to extract from (03climate.html) and the result (articletest.txt) are both attached to this post.

Thanks.
SG

IMO, it's really a bad idea to parse HTML files using sed/awk.

Your best bet is to use Perl:

HTML::TreeBuilder - Parser that builds a HTML syntax tree - search.cpan.org
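
For example, something along these lines (a rough sketch, assuming HTML::TreeBuilder is installed from CPAN) renders the whole page as plain text; from there, look_down() can be used to pick out just the element that holds the article:

# Dump the parsed page as plain text (install the module with: cpan HTML::TreeBuilder)
perl -MHTML::TreeBuilder -e 'print HTML::TreeBuilder->new_from_file(shift)->as_text' 03climate.html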

This is an interesting post...

I am wondering if there is a shell utility, like the Poppler tools (which convert PDF to text), that converts HTML to text? Maybe I am asking the same question; please advise.
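
There are a few standalone converters that work much the same way pdftotext does for PDFs; any one of these, if installed, should do the job:

# Each of these renders an HTML file to plain text (redirected here into article.txt).
lynx -dump -nolist 03climate.html > article.txt
w3m -dump 03climate.html > article.txt
html2text 03climate.html > article.txt

lynx and w3m actually render the page, so paragraphs and lists survive better than with a pure tag-stripping approach.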