searching & replacing/removing only certain HTML tags

naphelge · April 20, 2010, 10:23pm

I generally save a lot of web pages for reading offline, which works out great for school. Now I have to spend a lot of time on the bus, and I am looking for the best way to read some of these web pages on my Nokia 7610.

I have uploaded the files to my phone, but they are deadly slow to open and very clunky to navigate using the phone's keys.

I have since been copying and pasting content from the web pages into gedit and adding some basic HTML tags for simple formatting that makes the content layout somewhat pleasant. It looks good and navigates much quicker than the original web pages viewed on the phone.

But now I realize I need a more automated way to add, remove, search and replace HTML. So I am wondering what tools are available to the Ubuntu/Xubuntu user for searching and replacing certain tags while leaving other tags intact.

For example:

<tr><td class="padleft12"><i>And so she did. (3.3.18)</i></td></tr>
<tr><td class="padleft6"><b>Thought:</b> When Iago wants to make Othello ... observe her well with Cassio;</i></td></tr>
<tr><td class="padleft12"><i>Wear your eye thus, not jealous nor secure:</i></td></tr>

Using gedit I have no problem searching for all instances of:

padleft12"><i>Thought:</i>

and replacing with:

padleft12"><b>Thought:</b>

But now I find I have to remove the </i> tags at the end of that same HTML row. I wonder if there is a way, or an application, to select one tag and tell it to search for the next instance of a character such as '<'.

So in this example:

<tr><td class="padleft6"><b>Thought:</b> When Iago wants to make Othello ... observe her well with Cassio;</i></td></tr>

I would like to search for the next occurrence of '</i>' after the '</b>' tag while ignoring all regular text in between. Is that possible?

I hope that made sense.
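
One way this kind of search could be expressed is with a sed substitution. The following is only a sketch under assumptions: the variable names row and fixed are made up for illustration, and the sample row is copied from the example above. The bracket expression [^<]* skips over plain text (anything that is not '<'), so the match only succeeds when the first tag after </b> is </i>.

```shell
# Hypothetical sample row, copied from the example in the post.
row='<tr><td class="padleft6"><b>Thought:</b> When Iago wants to make Othello ... observe her well with Cassio;</i></td></tr>'

# \( ... \) captures "<b>Thought:</b>" plus the plain text that follows it.
# [^<]* matches any run of characters other than '<', so the capture ends
# at the next tag; if that tag is </i>, it is dropped and the capture kept.
fixed=$(printf '%s\n' "$row" | sed 's|\(<b>Thought:</b>[^<]*\)</i>|\1|')

printf '%s\n' "$fixed"
```

Rows that do not contain <b>Thought:</b> are left untouched, and '|' is used as the delimiter so the slashes in the closing tags need no escaping.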

naphelge · April 21, 2010, 11:16pm

Since posting I looked harder into sed, and it seems capable of doing everything I need and then some. However, I think I might be on the brink of suicide here. I have a command that looks like it should work, but it does not. I am certain it is user error, so if anyone can help me with this one, my sanity would sure appreciate it.

sed "/Thought\:/s/<\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>\n\t\t\t<tr><td /<\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>\n\t\t\t<tr><td /g" oth1.html > oth2.html

I thought for sure it had to do with escaping the newlines or tabs, but I have made this work fine:

sed "/Thought\:/s/<\/tr>/<\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>/g" oth1.html > oth2.html

I have made some pretty dumb mistakes so far trying to learn how to tame sed, but I cannot see why the first example above does not work.

BTW, is there a way to show the current column position of the cursor in either xterm or GNOME Terminal? That might be a big help.

cheers,
nap
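
A likely reason the first command never matches: sed reads its input one line at a time and strips the trailing newline before applying the script, so a search pattern containing \n has nothing to match against. The second command works because there the \n appears only in the replacement. The sketch below illustrates the difference; the two input lines and the variable names input, plain and joined are made up for illustration and are not taken from oth1.html.

```shell
# Two hypothetical lines standing in for consecutive rows of a file.
input=$(printf '%s\n' '<tr><td>a</td></tr>' '<tr><td>b</td></tr>')

# Line by line: the pattern contains \n, but no single line does,
# so the substitution never fires and the text comes out unchanged.
plain=$(printf '%s\n' "$input" | sed 's|</tr>\n<tr>|</tr><tr>|')

# N appends the next input line to the pattern space, joined by a
# newline, so the same pattern can now match across the line break.
joined=$(printf '%s\n' "$input" | sed 'N; s|</tr>\n<tr>|</tr><tr>|')

printf '%s\n' "$plain"
printf '%s\n' "$joined"
```

Used this way, N consumes lines in pairs, so this is only a demonstration of the mechanism rather than a drop-in replacement for the command above.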

naphelge · April 23, 2010, 1:23am

Hey guys, things are coming along slowly but surely with sed. I now have sed script files that do everything I need, but still with quite a bit of manual effort.

I am trying to use a sed script inside a bash for loop to substitute each $name variable automatically, but I am having problems because the output file is overwritten on each iteration of the for loop, so by the end only the last name in the for list is changed in the file.

Can someone please help me fine-tune this so that the for loop changes one name, saves the result, and then changes the next name using the file saved by the previous iteration? I think it needs a counter, but I have played with counters and cannot seem to get the desired result.

#!/bin/bash
for name in BAPTISTA TRANIO HORTENSIO GREMIO GRUMIO PETRUCHIO WIDOW
do
	#append a colon after any line that contains only the name
	#put the name on a new line whenever a period immediately precedes it
	#put the name on a new line whenever a closing bracket immediately precedes it
	#all instances of '�' need to be replaced with '"'
	#remove any characters that come after the colon following the name
	sed -e "/^$name$/s/$name/$name\:/g" -e "/$name/s/\.$name/.\n$name\:/g" -e "/$name/s/)$name/)\n$name\:/g" -e "/$name/s/\"$name/\"\n$name\:/g" -e "s/\($name\:\).*/\1/g" -e "/$name/s/$name\:/<tr><td class=\"padleft6\"><b>$name\:<\/b><\/td><\/tr>/g" $filename1 > $filename2
done

All of the sed substitutions work fine if I call them from a sed script file. I am pretty sure they should work here as is, since in the end result the name WIDOW gets changed in the file as desired. I think the rest of the names do too, but then get overwritten.

I know that I still need to manually enter the names into the bash script, but doing that and having the for loop correct will still save me a schwack of time.

Thanks for any help. I have never done any programming before, so there are lots of new ideas here that probably need modification.

cheers,
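
The overwriting described above happens because every pass of the loop reads the untouched $filename1 and writes a fresh $filename2. One possible fix, sketched here under assumptions, is to copy the input once and then have each iteration read and rewrite that same working copy, so no counter is needed. The temporary files, the three sample lines and the result variable are hypothetical stand-ins, and only the first substitution from the script is shown.

```shell
#!/bin/bash
# Hypothetical stand-ins: the original script does not show how
# filename1 and filename2 are set, so temporary files are used here.
filename1=$(mktemp)
filename2=$(mktemp)
printf '%s\n' 'BAPTISTA' 'Some speech.' 'TRANIO' > "$filename1"

# Copy the input once; every iteration then edits this same working copy.
cp "$filename1" "$filename2"

for name in BAPTISTA TRANIO
do
	# Only the first substitution from the original script is shown;
	# the other -e expressions would be listed in the same command.
	# Writing to a temporary file and moving it back avoids reading
	# and truncating the same file at once.
	sed -e "s/^$name\$/$name:/" "$filename2" > "$filename2.tmp" &&
		mv "$filename2.tmp" "$filename2"
done

result=$(cat "$filename2")
printf '%s\n' "$result"
```

With GNU sed, the temporary file and mv could probably be replaced by sed -i, which edits the file in place; the explicit version is shown because it does not depend on that option.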