HTML code remove

nrbhole · April 7, 2010, 3:45am

Hello,

I have a file into which HTML web pages have been inserted at intervals.
I would like to remove all text between the <html xmlns="http://www.w3.org/1999/xhtml"> and </html> tags.
Can anyone please suggest a sed regular expression for this?
Thanks

vino · April 7, 2010, 4:06am

You could do this, provided the opening and closing tags are on the same line:

sed -e "s/\(.*\)\(<html xmlns=.http:..www.w3.org.1999.xhtml.>\).*\(<\/html\)/\1\2\3/" "$file"
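
To see what that substitution does, here is a small sketch on a made-up line that holds both tags (the sample text is an assumption, not taken from the real file). It is run without -n and the p flag, so any line that does not match would also pass through unchanged. Note that it keeps the two tags and removes only what lies between them.

```shell
# Hypothetical one-line input: real data on both sides of an inserted page.
line='abc<html xmlns="http://www.w3.org/1999/xhtml">junk</html>def'

# Keep the text before the opening tag (\1), the opening tag (\2) and the
# closing tag (\3); drop whatever sits between the two tags.
cleaned=$(printf '%s\n' "$line" |
  sed 's/\(.*\)\(<html xmlns=.http:..www.w3.org.1999.xhtml.>\).*\(<\/html\)/\1\2\3/')
printf '%s\n' "$cleaned"
```

Because sed works one line at a time, this form cannot match when the opening and closing tags sit on different lines.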

edidataguy · April 7, 2010, 4:23am

Try one of these:

sed '/<html /,/<\/html>/d' inputfile
 
sed '/<html xmlns="http:\/\/www.w3.org\/1999\/xhtml/,/<\/html>/d' inputfile
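
A quick way to check the range form before running it on the real file is a throwaway sample. This is only a sketch: the temporary file and its contents are hypothetical stand-ins for the real input.

```shell
# Build a small hypothetical input: two real lines with an HTML page wedged between them.
tmp=$(mktemp)
printf '%s\n' \
  'record 1' \
  '<html xmlns="http://www.w3.org/1999/xhtml">' \
  '<body>stray page</body>' \
  '</html>' \
  'record 2' > "$tmp"

# Delete every line from the one containing "<html " through the one containing "</html>".
result=$(sed '/<html /,/<\/html>/d' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The range address deletes whole lines, so any real data sharing a line with either tag is removed too. If a closing </html> is missing, the range runs to the end of the file.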

nrbhole · April 7, 2010, 5:29am

My issue was resolved with the help of the regular expressions above.
Thanks to all of you for your help.