HTML table to CSV

Hi!
I have HTML tables from which I want to generate graphs, but to create the graphs I need the data in CSV format. Can anyone please help me convert my HTML table file to CSV?
Thanks in advance.

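Try this Perl script (tested with an input HTML file containing only one HTML table). Note that it only picks up whole numbers inside plain <TD> tags without attributes, and it expects the closing </TR> on the same line as the last <TD> of the row: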

#!/usr/bin/perl
# csv_from_table.pl
use strict;
use warnings;

my ($html_file, $csv_file) = @ARGV;
die "Usage: $0 table.html output.csv\n"
    unless defined $html_file && defined $csv_file;

# open the input first, so a bad input path does not leave an empty CSV behind
open (my $html, "<", $html_file)  or  die "Failed to read file $html_file : $!";
open (my $csv,  ">", $csv_file)   or  die "Failed to write to file $csv_file : $!";

# read html file line by line
while (<$html>) {
    # keep searching for numbers within TD tags, with an optional /TR tag at the end
    while (m#<TD>\s*(\d+)\s*</TD>\s*(</TR>)*#gi) {
        if (! $2) {
            # this TD is not the last TD in the TR, so write comma after this number
            print $csv "$1,";
        }
        else {
            # this is the last TD in the TR, so write newline after this number
            print $csv "$1\n";
        }
    }
}
close ($html);
close ($csv)  or  die "Failed to close file $csv_file : $!";

Run this script as:

perl csv_from_table.pl table_data.html newfile.csv

HTML file that I used as input (table_data.html):

<HTML>
<HEAD>
<TITLE>Table with numeric data</TITLE>
</HEAD>
<BODY>
<TABLE border="1">
  <TR> <TD>5</TD> <TD>4</TD>
 <TD>23</TD> </TR> <TR> <TD>10</TD> <TD>3</TD> <TD>24</TD> </TR>
  <TR> <TD>6</TD> <TD>12</TD> <TD>28</TD> </TR>
  <TR> <TD>17</TD> <TD>20</TD> <TD>32</TD> </TR>
</TABLE>
</BODY>
</HTML>
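
The regular expression in the Perl script matches only digits between bare <TD> and </TD> tags. If the cells may also hold text, or the tags carry attributes, the same idea can be sketched with tr and sed in plain sh. This is only a rough sketch, not a real HTML parser: the function name html_table_to_csv is made up for this example, and it assumes a single table with no nested tables, no colspan/rowspan, and no commas or quotes inside the cells.

```shell
#!/bin/sh
# html_table_to_csv: hypothetical helper (the name is invented for this sketch).
# Reads HTML containing ONE simple table on stdin, writes CSV to stdout.
# Assumes no nested tables and no commas or quotes inside cells.
html_table_to_csv() {
    # 1. join all input into one line, so a row may span several lines
    { tr '\n\t\r' '   '; echo; } |
    # 2. drop everything up to the opening TABLE tag, then break the
    #    line after every closing TR tag (one table row per line)
    sed -e 's/^.*<[Tt][Aa][Bb][Ll][Ee][^>]*>//' \
        -e 's/<\/[Tt][Rr]>/\
/g' |
    # 3. per row: cell boundaries become commas, remaining tags are
    #    removed, stray spaces are trimmed, empty lines are dropped
    sed -e 's/<\/[Tt][DdHh]>[ ]*<[Tt][DdHh][^>]*>/,/g' \
        -e 's/<[^>]*>//g' \
        -e 's/ *, */,/g' \
        -e 's/^[ ]*//' -e 's/[ ]*$//' \
        -e '/^$/d'
}

# Small demo with an attribute on one TD and padding inside a cell.
printf '<TABLE><TR><TD>5</TD> <TD>4</TD></TR>\n<TR><TD class="x"> 10 </TD><TD>3</TD></TR></TABLE>\n' |
    html_table_to_csv
```

With the function defined in the current shell, it could be run on the sample file as: html_table_to_csv < table_data.html > newfile.csv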

Hi.

If you have the lynx command (a text-mode browser) installed, it does a good job of removing markup tags:

% cat s1
#!/usr/bin/env sh

# @(#) s1       Demonstrate lynx -dump to eliminate html tags.

set -o nounset
echo

debug=":"
debug="echo"

## Use local command version for the commands in this demonstration.

echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash lynx sed tr

echo

FILE=${1-data1.html}

echo " Input data:"
cat "$FILE"

echo
echo " Final results:"

lynx -dump "$FILE" |
tee t1 |
sed -e 's/^ *//' |
tr -s ' ' ','

echo
echo " Intermediate results from lynx:"
cat t1

exit 0

Producing:

% ./s1

(Versions displayed with local utility "version")
GNU bash 2.05b.0
Lynx Version 2.8.5rel.1 (04 Feb 2004)
GNU sed version 4.1.2
tr (coreutils) 5.2.1

 Input data:
<HTML>
<HEAD>
<TITLE>Table with numeric data</TITLE>
</HEAD>
<BODY>
<TABLE border="1">
  <TR> <TD>5</TD> <TD>4</TD>
 <TD>23</TD> </TR> <TR> <TD>10</TD> <TD>3</TD> <TD>24</TD> </TR>
  <TR> <TD>6</TD> <TD>12</TD> <TD>28</TD> </TR>
  <TR> <TD>17</TD> <TD>20</TD> <TD>32</TD> </TR>
</TABLE>
</BODY>
</HTML>

 Final results:

5,4,23
10,3,24
6,12,28
17,20,32

 Intermediate results from lynx:

   5  4  23
   10 3  24
   6  12 28
   17 20 32

The lynx -dump output needs only a little massaging to get it into CSV format. See man lynx for details ... cheers, drl
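
For reference, the massage step can be pulled out of s1 into a small function and tried on the intermediate text shown above, without lynx installed. The name columns_to_csv is made up for this sketch. It assumes the cells themselves contain no spaces or commas, since every run of spaces is turned into a single comma. Unlike s1, it also drops the blank lines that lynx prints around the table.

```shell
#!/bin/sh
# columns_to_csv: hypothetical helper (the name is invented for this sketch).
# Turns space-aligned columns, as printed by "lynx -dump" for a simple
# numeric table, into CSV. Assumes cells contain no spaces or commas.
columns_to_csv() {
    # strip leading and trailing spaces, drop lines that are now empty
    sed -e 's/^[ ]*//' -e 's/[ ]*$//' -e '/^$/d' |
    # translate spaces to commas and squeeze repeated commas into one
    tr -s ' ' ','
}

# Demo on two lines of the intermediate results from lynx.
printf '\n   5  4  23\n   10 3  24\n' | columns_to_csv
```

With lynx available, the whole conversion would then be something like: lynx -dump table_data.html | columns_to_csv > newfile.csv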