I need help with a script that removes all HTML tags from an HTML document, removes any consecutive duplicate lines, and saves the result as a text document. The user should be able to pass the name of an HTML file as an argument to the script; if none is provided, the script should prompt the user for the file name.
So far I have
sed -n '/^$/!{s/<[^>]*>//g;p;}' file_name.html
but I'm not sure how to combine that with code to remove consecutive duplicate lines.
remove consecutive duplicate lines :
... | uniq
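One thing to keep in mind: uniq only collapses repeats that sit directly next to each other, which is what was asked for here. A small sketch with made-up sample lines (not taken from the poster's file) shows that a line repeated later in the stream is kept:

```shell
# Made-up sample: "Intro" repeats back to back, "Week 1" repeats further down.
result=$(printf 'Week 1\nIntro\nIntro\nWeek 1\n' | uniq)

# Only the adjacent "Intro" pair is collapsed; the second "Week 1" survives.
printf '%s\n' "$result"
```

If duplicates anywhere in the file had to go, sort -u would do that, at the cost of losing the original line order.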
Please provide sample input and the expected output.
sed 's/<[^>]*>//g' yourfile.html | uniq >newfile.txt
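To cover the other half of the question (file name as an argument, otherwise a prompt), the pipeline above could be wrapped roughly like this. It is only a sketch: the function names strip_html and main, the prompt text, and the rule that page.html is saved as page.txt are my own assumptions, not something from the original post.

```shell
#!/bin/sh
# Sketch: strip HTML tags, drop consecutive duplicate lines, save as text.

# Remove anything that looks like a tag, then collapse adjacent duplicates.
strip_html() {
    sed 's/<[^>]*>//g' "$1" | uniq
}

main() {
    file=${1:-}
    # No argument given: ask for the file name instead.
    if [ -z "$file" ]; then
        printf 'HTML file name: ' >&2
        read -r file || return 1
    fi
    if [ ! -r "$file" ]; then
        printf 'Cannot read file: %s\n' "$file" >&2
        return 1
    fi
    # Assumed naming rule: page.html -> page.txt, anything else gets .txt appended.
    case $file in
        *.html|*.htm) out=${file%.*}.txt ;;
        *)            out=$file.txt ;;
    esac
    strip_html "$file" > "$out" && printf 'Saved %s\n' "$out"
}

# Assumption: only prompt when there is a terminal to prompt on.
if [ "$#" -gt 0 ] || [ -t 0 ]; then
    main "$@"
fi
```

Saved under a name of your choice and made executable, it can be run either with the file name as its argument or with no argument at all, in which case it asks for one. Lines that held nothing but tags come out as blank lines, and the regex will not cope with tags that span several lines.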
input
<html><head><title>CIS013: Operating System - Unix</title></head>
<body>
<h1>Week 1</h1>
<h2>Chapter 1</h2>
<h3>Getting Started With Unix</h3>
<p>Getting Started With Unix</p>
<h1>Week 2</h1>
<h2>Chapter 2</h2>
<h3>Using Directories and Files</h3>
<p>Using Directories and Files</p>
<h2>Chapter 3</h2>
<h3>Working with Your Shell</h3>
<p>Working with Your Shell</p>
<h1>Week 3</h1>
<h2>Chapter 4</h2>
<h3>Creating and Editing Files</h3>
<p>Creating and Editing Files</p>
<h2>Chapter 5</h2>
<h3>Controlling Ownership and Permissions</h3>
<p>Controlling Ownership and Permissions</p>
<h1>Week 4</h1>
<h2>Chapter 6</h2>
<h3>Manipulating Files</h3>
<p>Manipulating Files</p>
<h2>Chapter 7</h2>
<h3>Getting Information About the System</h3>
<p>Getting Information About the System</p>
<h1>Week 5</h1>
<h2>Chapter 8</h2>
<h3>Configuring Your Unix Environment</h3>
<p>Configuring Your Unix Environment</p>
<h2>Chapter 9</h2>
<h3>Running Scripts and Programs</h3>
<p>Running Scripts and Programs</p>
<h1>Week 6</h1>
<h2>Chapter 10</h2>
<h3>Writing Basic Scripts</h3>
<p>Writing Basic Scripts</p>
</body>
</html>
Week 1
Chapter 1
Getting Started With Unix
Getting Started With Unix
Week 2
Chapter 2
Using Directories and Files
Using Directories and Files
Chapter 3
output
Week 1
Chapter 1
Getting Started With Unix
Week 2
Chapter 2
Using Directories and Files
Chapter 3
pipeline... I remember that now
If only the <p>...</p> elements contain the duplicate values, then try:
sed -n '/<p>/!s/<[^>]*>//gp' inputfile > outfile
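For anyone unsure what that one-liner does: it skips every line containing <p>, strips the tags from the remaining lines, and (because of -n plus the p flag) prints only lines where a tag was actually removed. A quick sketch on a few sample lines; the "plain line" entry is made up for illustration:

```shell
# Four sample lines: a tag-only line, a heading, a <p> line, and plain text.
sample='<body>
<h3>Manipulating Files</h3>
<p>Manipulating Files</p>
plain line'

result=$(printf '%s\n' "$sample" | sed -n '/<p>/!s/<[^>]*>//gp')

# Prints an empty line (from <body>) followed by "Manipulating Files".
# The <p> line is skipped, and "plain line" is dropped because nothing
# was substituted on it.
printf '%s\n' "$result"
```

So this variant avoids the duplicates without uniq, but it also drops any line that had no tags at all, and tag-only lines still come out as blank lines.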
Try this and let me know if this works
perl -00 -F'<\w+>|</\w+>' -i.bak -lane 'foreach(@F){if ($_=~/\w+/ && ($a ne $_)){print "$_";$a=$_;}}' Input.txt
sed 's!\(<[a-z][.0-9]>\)!!g;s!\(<[a-z]>\)!!;s!\(<.*>\)!!g' input.txt|uniq|sed 'G'
Thanks
Sha