Remove HTML tags and consecutive duplicate lines

I need help with a script that removes all HTML tags from an HTML document, removes any consecutive duplicate lines, and saves the result as a text document. The user should be able to pass the name of an HTML file as an argument to the script; if none is provided, the script should prompt the user for the file name.

So far I have

sed -n '/^$/!{s/<[^>]*>//g;p;}' file_name.html

I'm not sure how to combine that with code to remove consecutive duplicate lines.

To remove consecutive duplicate lines, pipe the output through uniq:

... | uniq
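
As a quick check of what uniq does here: it only collapses repeats that sit directly next to each other, so a line that comes back later is kept. A tiny sketch with made-up input (the letters are just placeholders):

```shell
# Only the adjacent pair of "a" lines is collapsed; the later "a" is kept.
printf 'a\na\nb\na\n' | uniq
# prints:
# a
# b
# a
```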

Please provide the input and the expected output.

sed 's/<[^>]*>//g' yourfile.html | uniq >newfile.txt
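
For the file-name part of the question, here is a minimal sketch, assuming a POSIX shell. The function names strip_html and html2txt are hypothetical, and the output naming (a trailing .html replaced by .txt) is my assumption, since the question does not say what the text file should be called.

```shell
#!/bin/sh
# strip_html FILE: print FILE with tags removed and adjacent duplicate
# lines collapsed (same pipeline as above).
strip_html() {
    sed 's/<[^>]*>//g' "$1" | uniq
}

# html2txt [FILE]: use FILE if given, otherwise prompt for a name.
# Output name is an assumption: page.html -> page.txt.
html2txt() {
    file=$1
    if [ -z "$file" ]; then
        printf 'HTML file name: ' >&2
        read -r file
    fi
    if [ ! -r "$file" ]; then
        echo "cannot read: $file" >&2
        return 1
    fi
    out="${file%.html}.txt"
    strip_html "$file" > "$out"
    echo "wrote $out" >&2
}
```

To use it as a script, add html2txt "$@" as the last line. Note that a tag split across two lines will not be removed, because sed works one line at a time.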

input

<html><head><title>CIS013: Operating System - Unix</title></head>
<body>
<h1>Week 1</h1>
<h2>Chapter 1</h2>
<h3>Getting Started With Unix</h3>
<p>Getting Started With Unix</p>
<h1>Week 2</h1>
<h2>Chapter 2</h2>
<h3>Using Directories and Files</h3>
<p>Using Directories and Files</p>
<h2>Chapter 3</h2>
<h3>Working with Your Shell</h3>
<p>Working with Your Shell</p>
<h1>Week 3</h1>
<h2>Chapter 4</h2>
<h3>Creating and Editing Files</h3>
<p>Creating and Editing Files</p>
<h2>Chapter 5</h2>
<h3>Controlling Ownership and Permissions</h3>
<p>Controlling Ownership and Permissions</p>
<h1>Week 4</h1>
<h2>Chapter 6</h2>
<h3>Manipulating Files</h3>
<p>Manipulating Files</p>
<h2>Chapter 7</h2>
<h3>Getting Information About the System</h3>
<p>Getting Information About the System</p>
<h1>Week 5</h1>
<h2>Chapter 8</h2>
<h3>Configuring Your Unix Environment</h3>
<p>Configuring Your Unix Environment</p>
<h2>Chapter 9</h2>
<h3>Running Scripts and Programs</h3>
<p>Running Scripts and Programs</p>
<h1>Week 6</h1>
<h2>Chapter 10</h2>
<h3>Writing Basic Scripts</h3>
<p>Writing Basic Scripts</p>
</body>
</html>
 Week 1

  Chapter 1

  Getting Started With Unix

  Getting Started With Unix
  Week 2

  Chapter 2

  Using Directories and Files

  Using Directories and Files
  Chapter 3


output

Week 1

Chapter 1

Getting Started With Unix

Week 2

Chapter 2

Using Directories and Files

Chapter 3


pipeline... I remember that now

If only the <p>...</p> elements contain the duplicate values, then try:

sed -n '/<p>/!s/<[^>]*>//gp' inputfile > outfile
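
To see what that command does, here is a small sketch run on three lines taken from the sample input. The function name strip_skip_p is hypothetical, just a wrapper so the command can be reused. Lines containing <p> are skipped entirely, and because of -n with the p flag, only lines where a tag was actually removed are printed, so blank or tag-free lines are dropped too.

```shell
# strip_skip_p: hypothetical wrapper around the sed command above.
strip_skip_p() {
    sed -n '/<p>/!s/<[^>]*>//gp' "$@"
}

printf '%s\n' \
    '<h2>Chapter 1</h2>' \
    '<h3>Getting Started With Unix</h3>' \
    '<p>Getting Started With Unix</p>' | strip_skip_p
# prints:
# Chapter 1
# Getting Started With Unix
```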

Try this and let me know if it works:

perl -00 -F'<\w+>|</\w+>' -i.bak -lane 'foreach(@F){if ($_=~/\w+/ && ($a ne $_)){print "$_";$a=$_;}}' Input.txt
sed 's!\(<[a-z][.0-9]>\)!!g;s!\(<[a-z]>\)!!;s!\(<.*>\)!!g' input.txt|uniq|sed 'G'

Thanks
Sha