How to remove multiline HTML tags from a file?

threesixtyfive · September 8, 2015, 12:03pm

I am trying to remove a multiline HTML tag and its contents from a few HTML files that follow the same basic pattern. So far, my attempts with regex and sed have been unsuccessful. The HTML has a basic structure like this (with the usual HTML around it):

<div id="div1">
 <div class="div2">
  <other random tags here></other random tags here>
 </div><a id="a1" href="#"><span>Some text</span></a>
</div>

I would like to remove this tag and its contents, but I don't know how to do it with scripts or command-line programs.

RavinderSingh13 · September 8, 2015, 12:22pm

Hello threesixtyfive,

Welcome to the forum, and special thanks for using code tags for the input/HTML in your post. The following may help.
Let's say we have the following input file:

cat test12121344
<html>
<title>
test
</title>
<body>
I am testing here, R. Singh
<div id="div1">
 <div class="div2">
  <other random tags here></other random tags here>
 </div><a id="a1" href="#"><span>Some text</span></a>
</div>
</body>
</html>

The following code may help:

awk '{if($0 ~ /<div id=\"div1\">/){getline;if($0 ~ / <div class=\"div2\">/){getline;if($0 ~ /  <other random tags here><\/other random tags here>/){getline;if($0 ~ / <\/div><a id=\"a1\" href=\"#\"><span>Some text<\/span><\/a>/){getline;if($0 ~ /<\/div>/){next}}}}}}{print}' test12121344

Output will be as follows.

<html>
<title>
test
</title>
<body>
I am testing here, R. Singh
</body>
</html>

Please make sure that every HTML file you run this on contains exactly the block you showed; otherwise the code needs to be modified accordingly. Hope this helps. The following is the block that the code leaves out of the output:

<div id="div1">
 <div class="div2">
  <other random tags here></other random tags here>
 </div><a id="a1" href="#"><span>Some text</span></a>
</div>

EDIT: Adding a multi-line form of the solution here.

awk '{
  if ($0 ~ /<div id=\"div1\">/) {
    getline
    if ($0 ~ / <div class=\"div2\">/) {
      getline
      if ($0 ~ /  <other random tags here><\/other random tags here>/) {
        getline
        if ($0 ~ / <\/div><a id=\"a1\" href=\"#\"><span>Some text<\/span><\/a>/) {
          getline
          if ($0 ~ /<\/div>/) { next }
        }
      }
    }
  }
}
{
  print
}' test12121344

Thanks,
R. Singh

threesixtyfive · September 8, 2015, 12:28pm

What I meant by the

<other random tags here></other random tags here>

was that there could be other divs, a span, or nothing at all. How can I make it so that, where <other random tags here> is, it accepts anything until it finds those last two HTML tags specified in the awk script?

RavinderSingh13 · September 8, 2015, 1:56pm

Hello threesixtyfive,

Could you please try the following and let me know if it helps?

awk 'BEGIN{A=1}/<\/div>/{A=1;next} /<div.*>/{A=0}; A{print}' test12121344

Where test12121344 is the input file, as follows.

cat test12121344
<html>
<title>
test
</title>
<body>
I am testing here, R. Singh
<div id="div1">
 <div class="div2">
  <other random tags here></other random tags here>
 </div><a id="a1" href="#"><span>Some text</span></a>
</div>
</body>
</html>

The following is the output after running the code.

awk 'BEGIN{A=1}/<\/div>/{A=1;next} /<div.*>/{A=0}; A{print}' test12121344
<html>
<title>
test
</title>
<body>
I am testing here, R. Singh
</body>
</html>
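
For anyone following along, here is the same flag logic spread over several lines with comments. It is only a sketch: the function name `strip_divs` and the variable name `show` are made up for illustration and are not part of the one-liner. Note that it drops every div in the file, not just div1, and that printing resumes after the first closing `</div>` it meets, so it relies on nothing printable sitting between the inner and the outer `</div>`, as in the sample file.

```shell
# Hypothetical wrapper around the flag-based one-liner (names are illustrative).
strip_divs() {
  awk '
  BEGIN      { show = 1 }          # print by default
  /<\/div>/  { show = 1; next }    # closing div: drop this line, resume printing afterwards
  /<div.*>/  { show = 0 }          # opening div: stop printing from this line on
  show                             # print the line while the flag is set
  ' "$@"
}
```

It can be called as `strip_divs test12121344`, or fed from a pipe.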

Thanks,
R. Singh

RudiC · September 8, 2015, 3:59pm

Try

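awk '
/<div id="div1">/       {P=1            # entering the target div: depth 1
                         next
                        }
/<div/ && P             {P++            # nested opening div: one level deeper
                         next
                        }
/<\/div/ && P           {P--            # closing div: one level up
                         next
                        }
!P                                      # print only while outside the target div
' file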
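
The counter above changes the depth by at most one per line, so a nested div that opens and closes on the same line (for example ` <div class="x">text</div>`) is counted as an open only, and the depth never returns to zero. A possible variant, as a sketch only, counts every opening and closing div on each line using the return value of `gsub`. The function name `strip_div1` and the variable names are made up for illustration. It assumes the target `<div id="div1">` starts on its own line, and it still drops whole lines, so any text following the final `</div>` on the same line is lost too.

```shell
# Hypothetical helper: remove <div id="div1"> up to its matching </div>,
# counting all div tags per line. Whole lines are dropped.
strip_div1() {
  awk '
  # Start of the target block (only when not already inside one).
  !inside && /<div id="div1">/ { inside = 1 }

  inside {
      line = $0
      # gsub returns the number of matches; "&" puts each match back unchanged.
      depth += gsub(/<div[ >]/, "&", line)
      depth -= gsub(/<\/div>/, "&", line)
      if (depth <= 0) { inside = 0; depth = 0 }   # matching close reached
      next                                        # never print lines of the block
  }

  { print }
  ' "$@"
}
```

Usage would be `strip_div1 file > file.new`, checking the result before replacing the original.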