Harvesting posts from HTML code

audiophile · December 11, 2010, 2:36pm

How could I use sed to find a string and then write the contents of the next line to a new file? I want to collect data from a thread. Looking at the HTML for the page, it seems I can cut out all the junk by keying on the phrase <div class="postmsg"> and printing the line after it to a new file, which I can then refine further with sed. How is this best accomplished with just bash?

Corona688 · December 11, 2010, 5:46pm

With just bash? That'll be painful. So would sed, for that matter, or any other line-based tool. It'd be hard even in awk: nothing tracks tag nesting for you, so you wouldn't know which </div> closes the post, and tracking the nesting yourself is a mountain of work...

Using a Perl module that actually parses HTML is much more reliable than trying to sed/grep for something in a file that doesn't even have proper lines.
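
For what it's worth, here is a rough sketch of the parser-based idea, written in Python instead of Perl and using only the standard library's html.parser. The names PostExtractor and extract_posts are made up for illustration, and the sketch assumes each post body sits in a <div> whose class list includes postmsg, as described in the question. It counts nested <div> tags so it knows which </div> actually closes the post, which is the part a line-based tool can't do.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text inside every <div class="postmsg"> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.posts = []      # finished post texts, in document order
        self._depth = 0      # 0 = outside a post; otherwise <div> nesting level
        self._chunks = []    # text pieces of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return           # only <div> nesting matters for finding the end
        if self._depth:
            self._depth += 1  # a <div> nested inside the current post
        elif "postmsg" in (dict(attrs).get("class") or "").split():
            self._depth = 1   # entering a new post
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "div" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:  # this </div> closes the post itself
            self.posts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each postmsg div found in an HTML string."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up sample input, not taken from any real forum page.
sample = """<div class="postmsg">
<p>First post, with a <div class="quote">nested quote</div> inside.</p>
</div>
<div class="sig">ignored</div>
<div class="postmsg"><p>Second post</p></div>"""

for post in extract_posts(sample):
    print(post)
```

Because the parser works on tags and not on lines, it doesn't matter whether the post text is on the line after the opening <div>, on the same line, or spread over several lines. To use it on a saved page, read the file into a string, pass it to extract_posts, and write the results wherever you like.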