Remove a large portion of web page code between two tags

georgi58 · April 29, 2012, 11:17am

Hi everybody,

I am trying to remove a bunch of lines from web pages between two tags:
one is <h1> and the other is <table

it looks like

<h1>Anniversary cards roses</h1>
many
lines here
<table summary="Free anniversary greeting cards." cellspacing="8" cellpadding="8" width="70%">

My goal is to delete everything from <h1> onward, including the <h1> line itself, but keep this line untouched:

<table summary="Free anniversary greeting cards" cellspacing="8" cellpadding="8" width="70%">

any help is greatly appreciated.

bartus11 · April 29, 2012, 11:23am

perl -lp0e 's/<h1>.*<table/<table/s' infile > outfile

jim_mcnamara · April 29, 2012, 11:27am

awk 'BEGIN{ok=1}
       /^<h1>/ {ok=0}
       /<^table summary="Free anniversary greeting cards" cellspacing="8" cellpadding="8" width="70%"> {ok=1}
      ok==1 {print}
      ok==0 {next} '   inputfile > outputfile

This clobbers everything from the FIRST <h1> tag onward, including the tag itself. It stops clobbering at the exact table summary statement you gave.

georgi58 · April 29, 2012, 12:39pm

bartus11, your command works great.

perl -lp0e 's/<h1>.*<table/<table/s' infile > outfile

But would you please advise how to make the changes in place and create a backup file?

thank you very much.

jim, when I ran your code it came up with an error message:

awk: cmd. line:3:  ^ unterminated regexp

but i also thank you for your time and effort.
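
The "unterminated regexp" error most likely comes from the third line of the awk script: the pattern there has no closing slash, and the ^ anchor sits after the < instead of before it. A corrected sketch, assuming the <h1> and <table tags each start at the beginning of a line; the sample file and the skip variable are made-up names for illustration:

```shell
# Hypothetical sample page modelled on the example in the first post.
dir=$(mktemp -d)
printf '%s\n' '<html>' '<h1>Anniversary cards roses</h1>' 'many' 'lines here' \
  '<table summary="Free anniversary greeting cards." cellspacing="8" cellpadding="8" width="70%">' \
  '<tr><td>card</td></tr>' '</table>' > "$dir/sample.html"

# Start skipping at the <h1> line, stop skipping at the <table line,
# and print every line seen while not skipping.
awk '/^<h1>/   {skip=1}
     /^<table/ {skip=0}
     !skip' "$dir/sample.html" > "$dir/sample.out"
```

Matching on /^<table/ rather than the full tag avoids depending on the exact attribute text, at the cost of stopping at any table that starts a line.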

bartus11 · April 29, 2012, 12:40pm

perl -i.bak -lp0e 's/<h1>.*<table/<table/s' infile

georgi58 · April 29, 2012, 1:46pm

thank you again.

really appreciate it