I am cleaning up forum posts to convert them into an offline reading version with clean HTML text. All files have the .html extension and reside in one folder. There is some JavaScript I would like to remove, which looks like this:
<script LANGUAGE="JavaScript1.1">
<!--
function mMz()
{
var mPz = "";
for(var prop in this) {
if ((prop.charAt(0) == '_' && prop.charAt(prop.length-1)=='_')
|| ((typeof this[prop]) == 'function'))
continue;
if (mPz != "") mPz += '&';
mPz += prop + ':' + escape(this[prop]);
}
var cookie = this.gHa + '=' + mPz;
if (this._expiration_)
cookie += '; expires=' + this._expiration_.toGMTString();
if (this._path_) cookie += '; path=' + this._path_;
if (this._domain_) cookie += '; domain=' + this._domain_;
if (this._secure_) cookie += '; secure';
this._document_.cookie = cookie;
}
//-->
</script>
I tried the following sed command:
sed -i.bak '/\<script LANGUAGE="JavaScript\1\.\1"\>/,/<\/script\>/d' *.html
but it results in an error:
sed: -e expression #1, char 40: Invalid back reference
Any help will be greatly appreciated.
Welcome to the forum.
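First off, I never recommend editing in-place, but you've had the wisdom to use backup files, so that's good.
You don't need to escape the numbers, and that is almost certainly why it's complaining about backreferences: \1 means "whatever the first \(...\) group matched", and your expression has no such group. You don't need to escape <> either. In GNU sed, \< and \> are word-boundary anchors, not literal angle brackets.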
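Here is a small sketch of the difference, assuming GNU sed. The sample line and the variable names are made up for the demonstration; the first expression is a cut-down version of yours, the second leaves the digits alone and escapes only the dot.

```shell
# Made-up sample line, same shape as the opening tag in the posts.
line='<script LANGUAGE="JavaScript1.1">'

# \1 is a backreference to a \(...\) group. No group exists here,
# so sed rejects the whole expression and exits non-zero.
if printf '%s\n' "$line" | sed -n '/JavaScript\1\.\1/p' 2>/dev/null; then
    bad_status=0
else
    bad_status=1
fi

# Plain digits need no backslash; only the dot is escaped so it
# matches a literal '.' rather than any character.
good=$(printf '%s\n' "$line" | sed -n '/<script LANGUAGE="JavaScript1\.1">/p')

printf 'backreference version failed: %s\n' "$bad_status"
printf 'plain version matched: %s\n' "$good"
```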
Another trick is to put a character inside [] to 'escape' it, since sed treats most special characters inside a bracket expression as literal. In some situations this is easier to read.
This seems to work.
$ cat data
BEFORE
<script LANGUAGE="JavaScript1.1">
<!--
function mMz()
{
var mPz = "";
for(var prop in this) {
if ((prop.charAt(0) == '_' && prop.charAt(prop.length-1)=='_')
|| ((typeof this[prop]) == 'function'))
continue;
if (mPz != "") mPz += '&';
mPz += prop + ':' + escape(this[prop]);
}
var cookie = this.gHa + '=' + mPz;
if (this._expiration_)
cookie += '; expires=' + this._expiration_.toGMTString();
if (this._path_) cookie += '; path=' + this._path_;
if (this._domain_) cookie += '; domain=' + this._domain_;
if (this._secure_) cookie += '; secure';
this._document_.cookie = cookie;
}
//-->
</script>
AFTER
$ sed '/<script LANGUAGE="JavaScript1[.]1">/,/<[/]script>/d' data
BEFORE
AFTER
$
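To run the same expression over every file in the folder while keeping backups, something like the following sketch should do. It assumes GNU sed; the scratch directory and the file name post1.html are invented for the demonstration, so point the glob at your real folder once you are happy with the result.

```shell
# Scratch directory with one made-up sample file.
workdir=$(mktemp -d)
printf '%s\n' \
    'BEFORE' \
    '<script LANGUAGE="JavaScript1.1">' \
    'var x = 1;' \
    '</script>' \
    'AFTER' > "$workdir/post1.html"

# -i.bak edits each file in place and keeps the original as <name>.bak.
# The range address deletes everything from the opening tag through </script>.
sed -i.bak '/<script LANGUAGE="JavaScript1[.]1">/,/<[/]script>/d' "$workdir"/*.html

result=$(cat "$workdir/post1.html")
printf '%s\n' "$result"
```

Check a few of the cleaned files against their .bak copies before deleting the backups.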
You are great! Thank you so much.
Would you also be willing to give me some more help? Here is the point:
I would like to clear the content between particular HTML tags and leave the tags themselves intact. Example:
-------------quote
<td bgcolor="#eeeeee"><b><a href="Spiritual Treasures - Kriya Yoga download">vandool</a></b> </td>
--------------unquote
I would like to leave only
<td bgcolor="#eeeeee"></b> </td>
There may be several such occurrences in a page.
Best regards