Hi!
I have HTML tables from which I want to generate graphs, but the graphing step needs the data in CSV format. Can anyone please help me convert my HTML table file to CSV?
Thanks in advance.
Try this Perl script (tested with an input HTML file containing only one HTML table). Note that it only picks up cells that contain whole numbers:
#!/usr/bin/perl
# csv_from_table.pl
use strict;
use warnings;

my ($html_file, $csv_file) = @ARGV;
die "Usage: $0 input.html output.csv\n" unless defined $csv_file;

open(my $csv,  '>', $csv_file)  or die "Failed to write to file $csv_file : $!";
open(my $html, '<', $html_file) or die "Failed to read file $html_file : $!";

# read html file line by line
while (<$html>) {
    # keep searching for whole numbers within TD tags, with an optional /TR tag at the end
    while (m#<TD>\s*(\d+)\s*</TD>\s*(</TR>)?#gi) {
        if (!$2) {
            # this TD is not the last TD in the TR, so write comma after this number
            print {$csv} "$1,";
        }
        else {
            # this is the last TD in the TR, so write newline after this number
            print {$csv} "$1\n";
        }
    }
}

close($html);
close($csv) or die "Failed to close file $csv_file : $!";
Run this script as:
perl csv_from_table.pl table_data.html newfile.csv
HTML file that I used as input (table_data.html):
<HTML>
<HEAD>
<TITLE>Table with numeric data</TITLE>
</HEAD>
<BODY>
<TABLE border="1">
<TR> <TD>5</TD> <TD>4</TD>
<TD>23</TD> </TR> <TR> <TD>10</TD> <TD>3</TD> <TD>24</TD> </TR>
<TR> <TD>6</TD> <TD>12</TD> <TD>28</TD> </TR>
<TR> <TD>17</TD> <TD>20</TD> <TD>32</TD> </TR>
</TABLE>
</BODY>
</HTML>
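One caveat on the script above: its pattern only matches a bare <TD> around whole digits, so cells with attributes (<TD align="right">), decimals, text, or <TH> headers are skipped. Here is a rough sketch of a more tolerant variant, written as a shell function around awk so it needs no extra modules. The name html_table_to_csv is made up for this post, and it assumes simple, well-formed tables with no nesting; HTML entities such as &amp; are left as they are.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not from the original posts): print one CSV line
# per <TR> found in the HTML read from stdin.
# Assumes simple tables with no nested tables; entities are not decoded.
html_table_to_csv() {
    awk '
    { html = html " " $0 }                      # join lines; a row may span several
    END {
        nrows = split(html, rows, /<[Tt][Rr]([ \t][^>]*)?>/)
        for (r = 2; r <= nrows; r++) {          # rows[1] is the text before the first <TR>
            ncells = split(rows[r], cells, /<[Tt][DdHh]([ \t][^>]*)?>/)
            line = ""
            for (c = 2; c <= ncells; c++) {     # cells[1] is the text before the first cell
                cell = cells[c]
                if (match(cell, /<\/[Tt][DdHh]>/))      # cut at the closing </TD> or </TH>
                    cell = substr(cell, 1, RSTART - 1)
                gsub(/<[^>]*>/, "", cell)               # drop any markup inside the cell
                gsub(/^[ \t]+|[ \t]+$/, "", cell)       # trim surrounding blanks
                if (cell ~ /[",]/) {                    # quote cells holding a comma or quote
                    gsub(/"/, "\"\"", cell)
                    cell = "\"" cell "\""
                }
                line = line (c > 2 ? "," : "") cell
            }
            if (ncells > 1) print line          # skip rows that have no cells
        }
    }'
}

# Small demonstration with a row that spans two lines.
printf '<TR> <TD>5</TD> <TD>4</TD>\n<TD>23</TD> </TR> <TR> <TD>10</TD> <TD>3</TD> <TD>24</TD> </TR>\n' |
html_table_to_csv
```

With the function defined in your shell, something like "html_table_to_csv < table_data.html > newfile.csv" should give the same CSV as the Perl script for the sample table. For anything messier than that, a real HTML parser is the safer route.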
Hi.
If you have the lynx command (a text-mode browser) installed, it does a good job of removing markup tags:
% cat s1
#!/usr/bin/env sh
# @(#) s1 Demonstrate lynx -dump to eliminate html tags.
set -o nounset
echo
debug=":"
debug="echo"
## Use local command version for the commands in this demonstration.
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash lynx sed tr
echo
FILE=${1-data1.html}
echo " Input data:"
cat "$FILE"
echo
echo " Final results:"
lynx -dump "$FILE" |
tee t1 |
sed -e 's/^ *//' |
tr -s ' ' ','
echo
echo " Intermediate results from lynx:"
cat t1
exit 0
Producing:
% ./s1
(Versions displayed with local utility "version")
GNU bash 2.05b.0
Lynx Version 2.8.5rel.1 (04 Feb 2004)
GNU sed version 4.1.2
tr (coreutils) 5.2.1
Input data:
<HTML>
<HEAD>
<TITLE>Table with numeric data</TITLE>
</HEAD>
<BODY>
<TABLE border="1">
<TR> <TD>5</TD> <TD>4</TD>
<TD>23</TD> </TR> <TR> <TD>10</TD> <TD>3</TD> <TD>24</TD> </TR>
<TR> <TD>6</TD> <TD>12</TD> <TD>28</TD> </TR>
<TR> <TD>17</TD> <TD>20</TD> <TD>32</TD> </TR>
</TABLE>
</BODY>
</HTML>
Final results:
5,4,23
10,3,24
6,12,28
17,20,32
Intermediate results from lynx:
5 4 23
10 3 24
6 12 28
17 20 32
The lynx -dump output needs only a little massaging to get it into CSV format: here sed strips the leading spaces and tr -s turns each run of spaces into a single comma. See man lynx for details ... cheers, drl
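For anyone who wants just the massage step without the demo scaffolding, here is a minimal sketch of it as a reusable filter. The function name to_csv is my own invention, and it assumes no cell contains a space or a comma (true for the numeric sample, not for tables with text). Compared with the pipeline in s1 it also trims trailing blanks and drops empty lines, which lynx may emit around a table.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of s1): turn space-aligned columns on
# stdin into CSV. Assumes no cell contains a space or a comma.
to_csv() {
    sed -e 's/^ *//' -e 's/ *$//' -e '/^$/d' |   # trim both ends, drop blank lines
    tr -s ' ' ','                                # each run of spaces becomes one comma
}

# Sample input shaped like the intermediate lynx output shown above.
printf '   5 4 23\n   10 3 24\n' | to_csv
```

With lynx installed, the whole conversion would then be along the lines of "lynx -dump table_data.html | to_csv > newfile.csv".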