noob question - is awk the tool to clean dirty text files?

Hi,

Never mind, I think I've found the answer. It appears I was looking for the awk functions index, match, sub, and gsub.

I want to write a shell script that will clean the HTML out of a bunch of files and format the data for import into Excel.

Awk seems like a powerful tool, but it seems oriented toward text that is already formatted and delimited. From my cursory study, awk seems able to access only lines and words. Is there a way to find and manipulate chunks of text within an awk "word"?
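For what it's worth, here is a rough sketch of how gsub, match, and substr can reach inside a line or field. The function names cells_to_tsv and first_number and the sample table row are made up for illustration. It assumes each tag starts and ends on the same line, since awk reads input one line at a time, so it is not a general HTML parser.

```shell
#!/bin/sh

# cells_to_tsv (hypothetical name): turn one-line HTML table rows into
# tab-separated text that a spreadsheet can import.
cells_to_tsv() {
    awk '{
        gsub(/<\/td>[ \t]*<td[^>]*>/, "\t")  # cell boundary becomes a tab
        gsub(/<[^>]*>/, "")                  # delete every remaining <...> tag
        if ($0 != "") print                  # skip lines that held only tags
    }'
}

# first_number (hypothetical name): print the first run of digits found
# anywhere in each line, even in the middle of a "word".
first_number() {
    awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }'
}

# Sample input is invented for the example.
printf '<tr><td>Alice</td><td>42</td></tr>\n' | cells_to_tsv
printf 'id=abc123xyz\n' | first_number
```

gsub replaces every match on the line, while sub replaces only the first. match sets RSTART and RLENGTH, which substr can then use to cut out the matched chunk. For the sample input above, the first command prints Alice and 42 separated by a tab.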

Or perhaps there are better tools...?

html2text exists