Search Engine in C

semash · August 31, 2009, 11:38pm

Hello everybody,

I need help with this,
I need to design a CGI search engine in C, but I have no idea what to do or how to do it.

Do I have to open all the HTML files one by one and search for the given strings? I think this process would be slow and would take too much of the server's processing resources.

Please give me some examples, some source code I can study. All I have found are Perl scripts, a language which I don't like and do not understand.

Thanks!

pludi · September 1, 2009, 2:03am

Obvious question first: Why re-invent the wheel? ht://Dig might do what you need.

Other than that, go by logic. If you have to search for something often, build a database (an index) of possible matches at regular intervals. That way you don't have to open all the files for every search.

And what do you not like about Perl? It was written for effective text processing, so it might just be the right tool for a job like this.

semash · September 1, 2009, 6:29am

Thank you very much for your reply,

I tried htdig, but it has a problem with the customized search/results pages: it doesn't load the ".css" file, so the page shows the results without the styles specified in the CSS file. That's why I wanted to create a search engine on my own.

Believe me, I'm not crazy. I've searched all over the internet and there are no solutions. Do you know how to solve the .css problem?

And thanks A LOT again man, I appreciate it.

pludi · September 1, 2009, 7:22am

According to this, the configuration for the pages is pretty simple. Are you sure that the CSS file could be found?

But if that's your only problem, why start a new search engine? ht://Dig is Open Source, so you can modify it to your heart's desire. Or just use the code as a starting point for your own.

semash · September 1, 2009, 6:58pm

Yes, the page is correctly configured. If I load it directly by URL it displays perfectly, but when it's loaded by htsearch.cgi, it doesn't show the CSS styles...

I know it's kinda insane to "reinvent the wheel", but it's my desperate solution. I've tried everything with htdig: modified header.html, footer.html, wrapper.html and nomatch.html, changed the $(common_dir) variable, copied the .css header everywhere, everything!

I think it might be that htsearch.cgi doesn't recognize the HTML type="text/css" attribute, or something like that... I've set the link href="file.css" in hundreds of ways... tried the direct path "/srv/www/htdocs/htdig/file.css", the local path "file.css", etc., and copied it to the root directory, the cgi-bin directory, the htdig directory, pff...

And I can't use it as an example because it's written in C++, and I don't have a clue about it...

Thank you very much pludi.

pludi · September 2, 2009, 1:35am

htsearch.cgi probably couldn't care less about any HTML tags, as they are meant to be interpreted by the browser. When you copied file.css around, did you try to access it directly from your browser? Did it load OK?

If you don't know C++, you can still use it as a starting point, as long as you can read it. You can at least get some ideas on how the search algorithm works and how the database is created/used.

Neo · September 2, 2009, 3:04am

If I were doing it I would use wget or curl and a bit of logic.

Most search engines index the pages off-line, not in real-time as the pages are downloaded.

It is not trivial to write text classification code that indexes well for search and retrieval. If it were that easy, Google would not be so successful and the competition would be much greater.

semash · September 2, 2009, 8:50am

Hello pludi, Neo,

Yes, if I access it directly from the browser it shows the page OK, with the CSS styles and everything, but obviously with the variables unset, for example:

'$&(LOGICAL_WORDS)' was not found.

When htdig is the one serving the page, it shows with the variables set, for example:

Search results for 'car crash'

But it doesn't load the CSS styles at all.

Neo,
I don't think I need to download pages, since the search engine runs locally and only for local search, so they're always going to be "online" and available.

I've been reading the source code for htdig, and it's complex as hell: it uses dictionaries, word exceptions, a database, and more... I don't think I can write a search engine on my own in the time required.

But I'm wondering something... could the problem be related to the directory paths?
I mean, the htdig directory is /srv/www/htdocs/htdig, and the search engine, htsearch, is at /srv/www/cgi-bin/htsearch.

Thanks a lot for your answers and please excuse my poor English.
This is driving me insane.

Corona688 · September 2, 2009, 11:07am

Have you tried accessing just the .css file? It may not be where you think it is, URL-wise.

semash · September 3, 2009, 10:31am

Hey Corona688,
You always have an answer =)

When I go to the following address:

http://www.example.com/cgi-bin/htsearch

I get the page without the styles from file.css, which is in the same directory as htsearch. But, and here's the interesting thing, when I go to the following address:

http://www.example.com/cgi-bin/file.css

I get this:

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, you@example.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Obviously, if I can't see it, I guess htsearch can't either. But what's going on then? What's wrong? file.css has 777 permissions. Or is something wrong with the cgi-bin directory?

Thanks!

Corona688 · September 3, 2009, 10:39am

The cgi-bin directory contains executable scripts. It will assume it should execute files with executable permissions instead of just handing out their content. Which for a css file, isn't gonna work. If you insist, you might be able to get away with it by removing the executable bits from file.css but it really doesn't belong there.

semash · September 3, 2009, 1:28pm

Hey Corona688, you're GOD.

I didn't know that about the cgi-bin directory. Knowing that, I just changed the path to file.css in the htdig headers like this:

<... href="../file.css" ...>

I moved the file up one directory and now it works GREAT.

Thank you Corona688, Thank you VERY MUCH pludi, and Neo.