Hey guys. I have a quick question. My friends and I are working on a search engine project that will hopefully be up and running by December of 2011. Here's my concern: what programs should I use to create the search engine? [Note: I have already been recommended to use PHP.] Thanks guys!
PHP is just part of the web front end. What sort of data do you have to search? I made a front end for CScope so I could search files and source code, for instance. You could run grep -E or fgrep over the files, display the hits, and show the file behind whichever hit gets clicked. Is there an open source Google out there? Probably not, or just a primitive early version!
Thanks for replying. I actually don't want to search "files". Instead, I want to search the internet. [Nothing complex. Just being able to find links works for me!] Any ideas?
I'd say you need (at least) 3 components:
- A crawler that downloads pages, and follows links on those pages.
- An indexer that builds a list of words used on each page (maybe in relation to other words nearby), and saves that to a database.
- A front-end to query the database.
For the crawler you can use just about any language, since the main limitation is the network speed. For the indexer I'd recommend either C/C++ (for speed) or a language geared towards text processing (like Perl). For the front-end you can again choose whatever language you're comfortable with.
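To make the three components concrete, here is a minimal sketch in Python (picked only for brevity, any language would do). Every name in it (PageParser, fetch, crawl, build_index, search) is made up for illustration, the "database" is just an in-memory dict, and it ignores robots.txt, rate limiting, and ranking, so treat it as a starting point rather than a design.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects outgoing links and text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Note: this also picks up script/style text; fine for a sketch.
        self.text.append(data)


def fetch(url):
    """Download one page (hypothetical helper, no retries or robots.txt)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch=fetch):
    """Component 1: breadth-first crawler. Returns {url: page text}."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        parser = PageParser(url)
        parser.feed(html)
        pages[url] = " ".join(parser.text)
        for link in parser.links:
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Component 2: indexer. Maps each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Component 3: query. Returns the URLs that contain every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Usage would be something like search(build_index(crawl(start_url)), "some words"). The fetch parameter is there so the crawler can be tried against canned pages without touching the network.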
Why write when you can download (and if necessary modify and contribute)?
List of search engines - Wikipedia, the free encyclopedia
Search Tools with Open Source Code
A Comparison of Free Search Engine Software by Yiling Chen on SearchTools.com
Thank you to Pludi and DG Pickett. I want to write, though.
Can anyone link the software I need?
What do you mean by "I want to write though"? And you have to decide what software you'll need. If you want to use a scripting language you'll need the interpreter for that. If you want to use C/C++ you'll need a compiler for that.
Let me ask you a question: have you, as of now, written a program more complex than a Fibonacci number calculator before?
- I want to write the program myself. I don't want to download, modify, and contribute.
- I think I'll go with the interpreter.
- I have not written a program more complex than a Fibonacci number calculator before.
Well, it is a project. You have to have an acquisition engine to find target documents. You need a repository that is friendly to search. You need a user interface to submit searches and present finds, and an admin interface for submitting new target areas to the acquisition engine. You need a computer, usually on a network and usually with lots of storage.
You want a data structure that expands, updates, and deletes in a way that is invisible to other users, like leaf-to-root modification: build the replacement from the leaf up and switch the root last. Users coming down the old tree are not bothered by the new trees or subtrees you build to replace it.
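One way to read "leaf-to-root modification" is path copying: an update never touches existing nodes, it builds new copies of the nodes along the search path and hands back a new root. A rough Python sketch follows, assuming an unbalanced binary search tree keyed by word; Node, insert, and lookup are hypothetical names, and a real index would want a balanced or B-tree-like structure.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node: a word and the tuple of URLs that contain it."""
    key: str
    value: tuple
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value):
    """Return a new root. Only nodes on the search path are copied;
    everything else is shared with the old tree and never mutated."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)


def lookup(node, key):
    """Ordinary binary search from whichever root the reader holds."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None
```

A reader that grabbed the old root keeps seeing a consistent old tree; the writer publishes the update by swapping in the new root once it is fully built.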
A lot of code and thought goes into dealing with kill-words (often called stop words): words and phrases that occur so often you never want to index them. You can discover them as they hit a frequency threshold, or just trim them as needed for space.
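The threshold idea could look roughly like this in Python. The function name find_kill_words and the 0.8 cutoff are arbitrary choices for illustration, and it only handles single words, not phrases.

```python
import re
from collections import Counter


def find_kill_words(documents, threshold=0.8):
    """Return words that appear in more than `threshold` of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document, however often it repeats.
        doc_freq.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    limit = threshold * len(documents)
    return {word for word, count in doc_freq.items() if count > limit}
```

The indexer would then skip anything in the returned set. The cutoff is a tuning knob: lower it to save more space, raise it to keep more words searchable.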
Java using persistent objects may work well for this. You might want to make your own persistent objects out of memory-mapped flat files.
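For a feel of what objects backed by a mapped flat file can look like, here is a small sketch. It is in Python rather than Java purely to keep it short, and the record layout (a document id plus a hit count) and the RecordFile class are invented for the example.

```python
import mmap
import os
import struct

# Hypothetical fixed-size record: document id and hit count, 4 bytes each.
RECORD = struct.Struct("<II")


class RecordFile:
    """Fixed-size records stored in a memory-mapped flat file."""

    def __init__(self, path, capacity):
        # capacity must be at least 1; an empty file cannot be mapped.
        size = capacity * RECORD.size
        with open(path, "ab") as f:
            if os.path.getsize(path) < size:
                f.truncate(size)  # grow the file; new bytes read as zero
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), size)
        self.capacity = capacity

    def put(self, slot, doc_id, hits):
        """Write one record straight into the mapped region."""
        RECORD.pack_into(self._map, slot * RECORD.size, doc_id, hits)

    def get(self, slot):
        """Read one record back as a (doc_id, hits) tuple."""
        return RECORD.unpack_from(self._map, slot * RECORD.size)

    def close(self):
        self._map.flush()  # push pending changes to disk
        self._map.close()
        self._file.close()
```

Records written with put are still there after the process closes the file and maps it again later, which is the whole point of the persistent-object approach.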