Parse HTML tag parameters and text

senszey · November 5, 2009, 7:48pm

Hi!

I have a bunch of HTML files that I want to convert to CSV files. Every page has a table in it, and I need to turn each row into a CSV record.

With awk and sed, I managed to put every table row on its own line, so my file looks like this:

<TR> .... </TR>
<TR> .... </TR>
...

One line looks like this:

<TR><A NAME="1,1"><TD CLASS="small" WIDTH="30" ALIGN="right" VALIGN="top">1,1</TD><TD WIDTH="380" ALIGN="left" VALIGN="top">
<FONT COLOR="black">Here is a text part</FONT></TD>
    <TD BGCOLOR="green" WIDTH="1px"></TD>
    <TD BGCOLOR="white" WIDTH="1px"></TD>
    <TD BGCOLOR="white" WIDTH="1px"></TD>
    <TD BGCOLOR="white" WIDTH="1px"></TD>
    <TD CLASS="small" ALIGN="left" VALIGN="top">
    <A TARGET='index' CLASS='small' HREF='target.php?newtab=1&from=1,1&b=19&ch=121&v=2&SID=...'>Textlink1</A>; <A TARGET='index' CLASS='small' HREF='target.php?newtab=1&from=1,1&b=19&ch=146&v=6-8&SID=...'>Textlink2</A></TD>
<TD BGCOLOR="white" WIDTH="1px"></TD><TD BGCOLOR="white" WIDTH="1px"></TD><TD CLASS="small" ALIGN="left" VALIGN="top"></TD></TR>

I need this information:
<A NAME="1,1">
Here is a text part
1,1,19,121,2
1,1,19,146,6-8

name(1),name(2),between font tags,atarget1,atarget2...atargetN
NUMBER,NUMBER,TEXTPART,LINK1,LINK2,...,LINKN
where LINKi is like:
from(1),from(2),b,ch,v

A row can have zero or more links. I don't know the maximum.

Can you help me extract this information? I can find these parts with a regexp, but I don't know how to capture the values into variables or how to do it for every line. The number of links is unknown, but that's fine, I can parse the CSV afterwards.

Thx,

Andras

RandiR · December 13, 2009, 1:56pm

Sounds like this question is still unanswered.

Here is a possible solution. The script SS_WebPageToCSV ( http://www.biterscripting.com/SS_WebPageToCSV.html ) does exactly what you need. It takes a URL and a table number, and extracts the data in that table into CSV. By default the output is written to the screen, but you can redirect the CSV data to a file. Here are a couple of example commands.

script ss_webpagetocsv.txt page("http://finance.yahoo.com/q?s=YHOO") number(1)

Or,

script ss_webpagetocsv.txt page("http://finance.yahoo.com/q?s=YHOO") number(1) > "Output.CSV"

The first command shows the output on screen. The second command creates the CSV file "Output.CSV" (in the current directory) with the data from the table.

The number of the table you want to extract (an HTML document may have more than one table) is supplied through the number() argument to the script. The URL is supplied through the page() argument. It can extract tables from many document types - .html, .php, .asp, etc.
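
For the record layout asked about in the original question (the NAME anchor, the text between the FONT tags, and the from/b/ch/v parameters of each link), a short script can also build the record directly with regular expressions. Below is a minimal Python sketch, not something tested against the real files. It assumes each <TR>...</TR> row sits on a single line, as described in the question. The names row_to_record, rows_to_csv and SAMPLE_ROW are made up for illustration, and SAMPLE_ROW is a shortened version of the row posted above.

```python
import csv
import re
import sys
from urllib.parse import parse_qs, urlsplit

# Patterns for the three pieces of a row (case-insensitive, like the HTML).
NAME_RE = re.compile(r'<A\s+NAME="([^"]*)"', re.IGNORECASE)
FONT_RE = re.compile(r"<FONT[^>]*>(.*?)</FONT>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""HREF=['"]([^'"]*)['"]""", re.IGNORECASE)

# Shortened copy of the sample row from the question (assumption: one row per line).
SAMPLE_ROW = (
    '<TR><A NAME="1,1"><TD CLASS="small">1,1</TD>'
    '<TD><FONT COLOR="black">Here is a text part</FONT></TD>'
    "<TD><A TARGET='index' HREF='target.php?newtab=1&from=1,1&b=19&ch=121&v=2&SID=...'>Textlink1</A>; "
    "<A TARGET='index' HREF='target.php?newtab=1&from=1,1&b=19&ch=146&v=6-8&SID=...'>Textlink2</A></TD></TR>"
)


def row_to_record(row_html):
    """Return [name1, name2, text, from1, from2, b, ch, v, ...] for one row."""
    name = NAME_RE.search(row_html)
    font = FONT_RE.search(row_html)
    # NAME="1,1" becomes two fields; missing pieces become empty fields.
    record = name.group(1).split(",") if name else ["", ""]
    record.append(font.group(1).strip() if font else "")
    # Zero or more links: each one appends from(1), from(2), b, ch, v.
    for href in HREF_RE.findall(row_html):
        query = parse_qs(urlsplit(href).query)
        record.extend(query.get("from", [""])[0].split(","))
        for key in ("b", "ch", "v"):
            record.append(query.get(key, [""])[0])
    return record


def rows_to_csv(lines, out):
    """Write one CSV record per non-empty input line to the file object out."""
    writer = csv.writer(out)
    for line in lines:
        if line.strip():
            writer.writerow(row_to_record(line))


if __name__ == "__main__":
    rows_to_csv([SAMPLE_ROW], sys.stdout)
```

To run it over a real file, open the file and pass it in place of the list, for example rows_to_csv(open("rows.txt"), sys.stdout), where rows.txt is a placeholder name. The csv module quotes the text field automatically if it contains a comma, so the varying number of link columns is the only thing the consumer of the CSV has to handle.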