Indexing a list of words in a file

Johanni · November 24, 2011, 5:36am

Hey all,

I'm currently working on a project and want to index the words in a webpage.
There is one file with the webpage content and one file with a list of words. I want an output file of true/false values, one per word, showing which words exist in the webpage.

example:

Webpage content (data.html):

References

   1. http://console.online.net/
   2. http://webmail.online.net/
   3. http://console.online.net/assistance/
   4. http://www.online.net/
   5. http://www.online.net/nom-de-domaine/comparatif-des-extensions-geographiques.xhtml
   6. http://www.online.net/nom-de-domaine/comparatif-des-extensions-geographiques.xhtml
   7. http://console.online.net/commande/index/
   8. http://www.online.net/hebergement-mutualise/comparatif-des-offres-pour-site-internet.xhtml
   9. http://www.online.net/hebergement-mutualise/comparatif-des-offres-pour-site-internet.xhtml
  10. http://www.online.net/hebergement-mutualise/offre-online-basic.xhtml
  11. http://www.online.net/hebergement-mutualise/offre-online-pro.xhtml
  12. http://www.online.net/hebergement-mutualise/offre-online-illimite.xhtml
  13. http://www.online.net/serveur-dedie/comparatif-offres-serveur-dedie.xhtml
  14. http://www.online.net/serveur-dedie/comparatif-serveur-dedie-start.xhtml
  15. http://www.online.net/serveur-dedie/offre-dedibox-sc.xhtml
  16. http://www.online.net/serveur-dedie/offre-dedibox-classic.xhtml
  17. http://www.online.net/serveur-dedie/offre-dedibox-dc.xhtml
  18. http://www.online.net/serveur-dedie/offre-dedibox-qc.xhtml
  19. http://www.online.net/serveur-dedie/comparatif-serveur-dedie-pro.xhtml
  20. http://www.online.net/serveur-dedie/offre-dedibox-pro-r210.xhtml
  21. http://www.online.net/serveur-dedie/offre-dedibox-pro-r410.xhtml
  22. http://www.online.net/serveur-dedie/offre-dedibox-pro-r510.xhtml
  23. http://www.online.net/serveur-dedie/offre-dedibox-storage.xhtml
  24. http://www.online.net/serveur-dedie/offre-dedibox-housing-dedirack.xhtml
  25. http://www.online.net/serveur-dedie/offre-dedibox-housing-dedirack.xhtml
  26. http://www.iliad-entreprises.fr/
  27. http://www.online.net/infogerance-serveur/infogerance-serveur-dedie.xhtml
  28. http://www.iliad-datacenter.fr/
  29. https://console.online.net/commande/server/?server=110
  30. http://www.online.net/
  31. http://console.online.net/assistance/
  32. http://twitter.com/online_fr
  33. http://www.online.net/hebergement-mutualise/comparatif-des-offres-pour-site-internet.xhtml
  34. https://console.online.net/commande/index/
  35. http://www.online.net/serveur-dedie/comparatif-serveur-dedie-start.xhtml
  36. https://console.online.net/commande/server/?server=110
  37. http://www.online.net/fiche-tarifaire.pdf
  38. http://www.online.net/cgv.pdf
  39. http://www.online.net/document-legal/mentions-legales.xhtml
  40. http://www.online.net/

List of words (words.dat):

online
hebergement
ftp
35
php
.fr
.se

Output file (output.dat), with true or false showing whether each word exists:

true
false
true
false
true
false

Thanks

Franklin52 · November 24, 2011, 6:53am

You can try something like this:

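awk '
NR==FNR{a[$1]; next}                               # first file: remember each word
{for(i in a)a[i]+=gsub(i,x)}                       # second file: count matches of each word (used as a regex)
END{for(i in a){print i,a[i]==0?"false":"true"}}   # note: the order of "for(i in a)" is unspecified
' words.dat data.html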
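
A variation on the same idea, offered only as a sketch: the one-liner above prints the words in whatever order awk iterates the array, and it treats each word as a regular expression, so ".se" would also match "/se" in "/serveur-dedie". The version below keeps the word-list order and uses index() for a literal substring test. The function name check_words is made up for this example, and blank lines in the word list are skipped, which is an assumption about what is wanted. It also assumes the word list is not empty.

```shell
#!/bin/sh
# check_words WORDLIST PAGE   (hypothetical helper name)
# Prints "true" or "false" for each non-blank line of WORDLIST, in order,
# depending on whether it occurs anywhere in PAGE as a literal substring.
check_words() {
    awk '
        # First file: store the words in input order.
        NR == FNR { if ($0 != "") words[++n] = $0; next }
        # Second file: literal substring test, so "." is not a wildcard.
        {
            for (i = 1; i <= n; i++)
                if (index($0, words[i]) > 0)
                    found[i] = 1
        }
        # Report one line per word, in the same order as the word list.
        END { for (i = 1; i <= n; i++) print ((i in found) ? "true" : "false") }
    ' "$1" "$2"
}
```

It would be called as `check_words words.dat data.html > output.dat`, which gives one true/false line per word.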

m.d.ludwig · November 24, 2011, 7:31am

I read the problem as determining whether each word from the list exists in the content of the webpage:

#! /usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

$\ = "\n";

my $url  = shift(@ARGV);

unless (defined $url && '' lt $url) {
    print STDERR $0, ': missing url';
    exit(1);
}

my $content = get($url);

unless (defined $content) {
    print STDERR $url, ': has no content';
    exit(1);
}

while (<>) {
    chomp;
    print index($content, $_) < 0 ? 'false' : 'true';
}

which would be invoked as:

perl hindex http://www.online.net/serveur-dedie/offre-dedibox-qc.xhtml wordlist

which results in:

true
true
false
false
false
true
false

Note that this returns 'true' or 'false' depending on whether a sequence of characters exists anywhere in the raw content of the webpage. It does not split the text into words, remove HTML tags, and the like. You would need something like HTML::Parser to do that:

#! /usr/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::Parser;

$\ = "\n";

my %WORDS = ();

sub text {
    my $text = shift(@_);
    return unless defined $text;
    foreach my $w (split ' ', $text) { $WORDS{lc $w}++; }
}

my $url = shift(@ARGV);

unless (defined $url && '' lt $url) {
    print STDERR 'USAGE: ', $0, ' <url> [<wordlist>]';
    exit(1);
}

my $content = get($url);

unless (defined $content) {
    print STDERR $url, ': has no content';
    exit(1);
}

# eof tells the parser the document is complete so any buffered text is flushed
HTML::Parser->new(text_h => [ \&text, 'text' ])->parse($content)->eof;

while (<>) {
    chomp;
    # page words were stored lowercased, so lowercase the lookup key too
    print exists $WORDS{lc $_} ? 'true' : 'false';
}

Which, when invoked, returns:

true
false
false
false
false
false
false

Please note that this example does not skip over the contents of <script> tags and the like.
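
For anyone who only has the saved page on disk and no Perl modules, the whole-word idea can be roughly approximated in awk. This is only a sketch under several assumptions: tags are removed with a crude pattern that only handles tags opening and closing on the same line, every non-alphanumeric ASCII character is treated as a word separator (so an entry like ".fr" can never match), matching is case-insensitive, and the word list is not empty. Like the Perl version, it does not skip the contents of <script> tags. The function name check_whole_words is made up for this example.

```shell
#!/bin/sh
# check_whole_words WORDLIST PAGE   (hypothetical helper name)
# Prints "true" or "false" for each non-blank line of WORDLIST, in order,
# depending on whether it appears in PAGE as a whole word outside of tags.
check_whole_words() {
    awk '
        # First file: store the lowercased words in input order.
        NR == FNR { if ($0 != "") words[++n] = tolower($0); next }
        # Second file: strip tags, split into alphanumeric tokens, record them.
        {
            line = tolower($0)
            gsub(/<[^>]*>/, " ", line)       # crude: only tags contained in one line
            gsub(/[^a-z0-9]+/, " ", line)    # anything else becomes a separator
            m = split(line, tok, " ")
            for (j = 1; j <= m; j++) seen[tok[j]] = 1
        }
        # Report one line per word, in the same order as the word list.
        END { for (i = 1; i <= n; i++) print ((words[i] in seen) ? "true" : "false") }
    ' "$1" "$2"
}
```

It would be called the same way as the other examples: `check_whole_words words.dat data.html`.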