Extract/Parse information from html (website)

Hello,

I want to extract some information from an HTML page (http://www.energiecontracting.de/7-mitglieder/von-A-Z.php?a_z=B&seite=2) and save it in a predefined format (.csv). However, the markup on that website is rather messy and I can't find a way to handle it properly.

All the information is on a single line. Here is an example (copy/paste the raw data into your favorite text editor):

http://pastebin.com/DL1KERT4

So I've reformatted it by hand to give you a better idea of what information I need and where the problem lies:

or

and I need the following (all) information:

status (Partnerunternehmen, Contractor etc. )
company name (BRANDES GmbH, BRASST Energiedienstleistungen GmbH etc.)
company address (13088 Berlin etc.)
company contact person (Karin Brandes etc.)
telephone, email, web URL

As I mentioned above, I can't find a way to extract the info properly because of how the code is formatted. I can't see any usable start/end points because the information differs from entry to entry: sometimes there's no email, no website, no contact person, etc.

I'd be grateful for any help. I'm pretty sure one of the experts here has the knowledge required to beat it :)

---------- Post updated at 12:33 PM ---------- Previous update was at 12:31 AM ----------

Hmm, so nobody good enough to give it a try?

Hi TehOne,

I want to give it a try, of course. It's an interesting problem, but parsing HTML is not an easy task. I will post a solution if I can solve it, but try it yourself too.

Bumping up posts or double posting is not permitted in these forums.

Please read the rules, which you agreed to when you registered, if you have not already done so.

You may receive an infraction for this. If so, don't worry, just try to follow the rules more carefully. The infraction will expire in the near future.

Thank You.

The UNIX and Linux Forums.

One way:

$ cat script.pl 
use warnings;
use strict;
use WWW::Mechanize;
use HTML::TokeParser;
use HTML::Entities;

my $uri = q[http://www.energiecontracting.de/7-mitglieder/von-A-Z.php?a_z=B&seite=1];

## Get the agent to explore the web page.
my $mech = WWW::Mechanize->new();
$mech->agent_alias( q[Linux Mozilla] );
$mech->get( $uri );

## Get last page.
my @c = $mech->find_all_links(
                q[url_regex] => qr/(?i:seite)=/,
);
my %d = map { $_->[0] => do { $_->[0] =~ m/(\d+)\Z/; $1 } } @c;
my $last_page = (sort { $b <=> $a } values %d)[0];

my @text;
for my $page ( 1 .. $last_page ) {

        my $tp = HTML::TokeParser->new( \$mech->content() ) or die qq[ERROR in HTML::TokeParser\n];
        $tp->get_tag( q );

        while ( 1 ) {
                my $t = $tp->get_text();
                if ( $t ) {
                        last if $t =~ m/\A(?i)seite/;
                        push @text, $t;
                }
                my $token = $tp->get_token or last;    # stop at end of document
                if ( $token->[0] eq q[E] && $token->[1] eq q[p] ) {
                        printf qq[%s\n], join q[,], @text;
                        @text = ();
                        next;
                }
                if ( $token->[0] eq q[E] && $token->[1] eq q[div] ) {
                        last;
                }
        }

        $uri =~ s/(\d+)\Z/$1 + 1/e;
        $mech->get( $uri );
}

exit 0;
$ perl script.pl
Siegelträger,Badische Kraftwerk GmbH & Co. KG,76532 Baden-Baden
Contractor,Bayerische Elektrizitätswerke GmbH,86150 Augsburg,Tel.: +49 (0821) 328 - 0,Fax: +49 (0821) 328 - 4160,undine.maidl@lew.de,www.bew-augsburg.de
Siegelträger,BayWa Energie Dienstleistungs GmbH,81925 München,Projekte dieser Firma ansehen
Siegelträger,BEG Energiegesellschaft mbH,12681 Berlin
Partnerunternehmen,Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme,Dipl.-Ing. Günter Schlagowski,28213 Bremen,Tel.: +49 (0421) 211210,Fax: +49 (0421) 212772,g.s.nestwaerme@t-online.de,www.schlagowski.de,Weitere Informationen
Interessent,Bernd Wiggenhauser,78234 Engen
Interessent,Berndorff Contracting GmbH,50674 Köln
Contractor,beta GmbH Betrieb energietechnischer Anlagen,30451 Hannover,Tel.: +49 (0511) 45001109,Fax: +49 (0511) 497574,brosziewski@beta-energie.de,www.beta-energie.de
Siegelträger,BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG,23684 Schulendorf
Siegelträger,BHK-Systeme GmbH,10243 Berlin
Interessent,Bi. En GmbH & Co. KG,24109 Kiel
Siegelträger,BIBER Biomasse GmbH,94333 Geiselhöring,Projekte dieser Firma ansehen
Siegelträger,Bio Wärme Rhön GmbH & Co. KG,36145 Hofbieber-Obernüst,Projekte dieser Firma ansehen
Siegelträger,Bio-Wärme-Innovation GmbH,06449 Aschersleben
Interessent,Bioenergie-Regional GmbH,74199 Untergruppenbach
Siegelträger,Bioenergiehof Böhme GmbH,01762 Obercarsdorf
Siegelträger,Bisser-Putz re-Solution Energietechnik GbR,78606 Seitingen-Oberflacht
Siegelträger,Blume Wärmelieferungs GmbH,14728 Rhinow
Siegelträger,Bosch Energy and Building Solutions GmbH,70499 Stuttgart
Partnerunternehmen,Bosch Thermotechnik GmbH Buderus Deutschland,Dipl.-Ing. Jens Gierok,21035 Hamburg,Tel.: +49 (040) 73417 - 0,Fax: +49 (040) 73417 - 267,jens.gierok@buderus.de,www.buderus.de,Weitere Informationen
Partnerunternehmen,BRANDES GmbH,Karin Brandes,23701 Eutin,Tel.: +49 (04521) 807 - 0,Fax: +49 (04521) 807 - 77,karin.brandes@brandes.de,www.brandes.de,Weitere Informationen
Contractor,BRASST Energiedienstleistungen GmbH,13088 Berlin,Tel.: +49 (030) 556885 - 0,Fax: +49 (030) 556885 - 99,brasst@bln.de,www.brasst.de
Contractor,BTB  Blockheizkraftwerks- Träger- und Betreiberges. mbH Berlin,10589 Berlin,Tel.: +49 (030) 349907 - 61,Fax: +49 (030) 349907 - 88,karl.meyer@btb-berlin.de,www.btb-berlin.de,Projekte dieser Firma ansehen

Genuine HTML parsing is preferable, I think, but FWIW here is a bit of awk, using the Pastebin sample from the first post as the input file:

awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,ORS,$1); print $1}' RS= 
Siegelträger
Badische Kraftwerk GmbH & Co. KG
76532 Baden-Baden


Contractor
Bayerische Elektrizitätswerke GmbH
86150 Augsburg
Tel.: +49 (0821) 328 - 0
Fax: +49 (0821) 328 - 4160


Siegelträger
BayWa Energie Dienstleistungs GmbH
81925 München


Siegelträger
BEG Energiegesellschaft mbH
12681 Berlin


Partnerunternehmen
Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme
Dipl.-Ing. Günter Schlagowski
28213 Bremen
Tel.: +49 (0421) 211210
Fax: +49 (0421) 212772


Interessent
Bernd Wiggenhauser
78234 Engen


Interessent
Berndorff Contracting GmbH
50674 Köln


Contractor
beta GmbH Betrieb energietechnischer Anlagen
30451 Hannover
Tel.: +49 (0511) 45001109
Fax: +49 (0511) 497574


Siegelträger
BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG
23684 Schulendorf


Siegelträger
BHK-Systeme GmbH
10243 Berlin

or

awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,"|",$1); print $1}' RS=
|Siegelträger|Badische Kraftwerk GmbH & Co. KG|76532 Baden-Baden|
|Contractor|Bayerische Elektrizitätswerke GmbH|86150 Augsburg|Tel.: +49 (0821) 328 - 0|Fax: +49 (0821) 328 - 4160|
|Siegelträger|BayWa Energie Dienstleistungs GmbH|81925 München|
|Siegelträger|BEG Energiegesellschaft mbH|12681 Berlin|
|Partnerunternehmen|Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme|Dipl.-Ing. Günter Schlagowski|28213 Bremen|Tel.: +49 (0421) 211210|Fax: +49 (0421) 212772|
|Interessent|Bernd Wiggenhauser|78234 Engen|
|Interessent|Berndorff Contracting GmbH|50674 Köln|
|Contractor|beta GmbH Betrieb energietechnischer Anlagen|30451 Hannover|Tel.: +49 (0511) 45001109|Fax: +49 (0511) 497574|
|Siegelträger|BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG|23684 Schulendorf|
|Siegelträger|BHK-Systeme GmbH|10243 Berlin|

Damn, that looks great, thanks a lot!

Hmm, but it also gets me thinking: how would I parse the output to arrange all that information into the appropriate columns, e.g. to match the .csv format? If something is not available, it would have to be represented by an empty field, like:

Status|Company Name|Company Address|Contact|Telephone|Fax|Email|Weburl
----------------------------------------------------------------------------
Interessent|Berndorff Contracting GmbH|50674 Köln|||||
Contractor|beta GmbH Betrieb energietechnischer Anlagen|30451 Hannover||Tel.: +49 (0511) 45001109|Fax: +49 (0511) 497574||

etc..

While getting the status, address, phone, fax and email is easy, the contact and company name are not: both are just [a-zA-Z], so they are hard to tell apart, especially since not every company has a "GmbH" etc. in its name.
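
One possible approach, as a rough sketch rather than a definitive answer: in the sample output posted in this thread, the contact person (when there is one) always sits between the company name and the postcode line, so the two can be separated by position instead of by content. The function name to_columns is made up for illustration, and the field patterns (five-digit postcode, "Tel.:", "Fax:", "@", "www.") are assumptions based only on the records shown here. It expects one "|"-separated record per line, with or without the leading and trailing "|".

```shell
#!/bin/sh
# Hypothetical helper: pad variable-length records to 8 fixed columns:
# Status|Company Name|Company Address|Contact|Telephone|Fax|Email|Weburl
to_columns() {
    awk -F'|' -v OFS='|' '
    {
        # Drop the leading/trailing "|" that the awk one-liner produces.
        sub(/^[|]/, ""); sub(/[|]$/, "")
        if (NF < 2) next                      # skip blank or broken lines

        status = $1; name = $2                # assumed: always the first two fields
        contact = addr = tel = fax = email = web = ""

        for (i = 3; i <= NF; i++) {
            if      ($i ~ /^[0-9][0-9][0-9][0-9][0-9] /) addr  = $i
            else if ($i ~ /^Tel\.:/)                     tel   = $i
            else if ($i ~ /^Fax:/)                       fax   = $i
            else if ($i ~ /@/)                           email = $i
            else if ($i ~ /^www\./)                      web   = $i
            # Free text before the postcode is taken as the contact person;
            # free text after it (link captions) is dropped.
            else if (addr == "" && contact == "")        contact = $i
        }
        print status, name, addr, contact, tel, fax, email, web
    }'
}

# Demo on two records from the output posted in this thread.
printf '%s\n' \
    '|Interessent|Bernd Wiggenhauser|78234 Engen|' \
    'Partnerunternehmen|BRANDES GmbH|Karin Brandes|23701 Eutin|Tel.: +49 (04521) 807 - 0|Fax: +49 (04521) 807 - 77|karin.brandes@brandes.de|www.brandes.de|Weitere Informationen' |
    to_columns
```

This would break if a company name is ever split over two fields, since the second part would then be taken for the contact person, so it is worth checking the result against a few known entries.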