Remove HTML tags with bash

dejavu88 · May 22, 2008, 7:50am

Hello,

Is there a way to go through a file and remove certain HTML tags with bash? If it needs sed or awk, that's fine too.

I want this because I have a monitoring script that generates a logfile in HTML, and every time it generates the logfile, the closing tags are written again. The tags I want removed are </body> and </html>; they are the last two lines of the HTML file.

I found similar topics, but none of them do what I need.

Thanks in advance for the help.

Franklin52 · May 22, 2008, 8:09am

Try this:

awk '/<\/body>/ || /<\/html>/{next}1' file
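For reference, the awk program skips (next) every line that contains </body> or </html>, and the trailing 1 prints all remaining lines. Since sed was mentioned as acceptable, here is a rough sed equivalent as a sketch. It assumes each tag sits on its own line, as described in the question; the function name strip_closing_tags and the sample input are made up for illustration.

```shell
# Hypothetical helper: drop every line containing </body> or </html>.
# With no file arguments, sed reads from standard input.
strip_closing_tags() {
    sed -e '/<\/body>/d' -e '/<\/html>/d' "$@"
}

# Try it on a small made-up sample fed through stdin.
printf '%s\n' '<html>' '<body>' '<p>log entry</p>' '</body>' '</html>' | strip_closing_tags
```

Like the awk version, this deletes the whole line, so any other content on the same line as a closing tag would be lost as well.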

Regards

dejavu88 · May 22, 2008, 12:53pm

It kind of works, but I have to redirect the output to a new file:

awk '/<\/body>/ || /<\/html>/{next}1' file.html > file2.html

Is there a way to make it write the output back to the original file (file.html)?

When I use:

awk '/<\/body>/ || /<\/html>/{next}1' file.html > file.html

I get a blank file.

All the code before the </body> and </html> tags should remain in the file.

Thanks

Franklin52 · May 22, 2008, 1:36pm

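You can't redirect the output to the input file. The shell truncates file.html when it sets up the > redirection, before awk starts reading, which is why you end up with a blank file. Redirect the output to a temporary file and move it over the original file, something like this: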

awk '/<\/body>/ || /<\/html>/{next}1' file.html > file1.html

mv file1.html file.html
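A slightly more defensive variant of the same temporary-file approach might look like the sketch below. It only replaces the original if awk succeeded, and it uses mktemp so the temporary name cannot collide with an existing file. The function name strip_tags_in_place and the sample file are hypothetical, not part of the original script.

```shell
# Hypothetical helper: strip the closing tags from a file "in place" by
# writing to a temporary file first and replacing the original only if
# awk exited successfully.
strip_tags_in_place() {
    file=$1
    # Create the temporary file next to the original so mv stays on the
    # same filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if awk '/<\/body>/ || /<\/html>/{next}1' "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage on a made-up sample file in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '<html>' '<body>' '<p>log entry</p>' '</body>' '</html>' > "$dir/file.html"
strip_tags_in_place "$dir/file.html"
cat "$dir/file.html"
```

Note that mktemp creates the temporary file with restrictive permissions, so after the mv the file may have a different mode than the original had; a chmod afterwards may be needed if other users or a web server must read it.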

Regards

dejavu88 · May 22, 2008, 1:58pm

I figured it out a few minutes before you replied, and did it the same way as your code. Thanks for all the help.