Nutch
Crawler
Indexer
Search Application
Luke
Installation
The Running Example
Nutch
It's possible to set up your own Open Source search engine, and
you may be wondering how to do this. The solution is Java based,
so it's cross platform: you don't have to do it on a PC. You can
use your standard Linux box or whatever else you have at hand.
The solution I'm proposing here is to use Lucene and Nutch. Lucene
is the indexer, and Nutch is the search engine piece, which does
page ranking and includes a Crawler.
Crawler
Crawler - this does exactly what it says on the tin: it crawls your chosen site(s). The solution I'm proposing is really suited to an intranet, but it can be extended to full Internet crawling. Crawling the entire Web requires server farms and the like, but Nutch will do it if configured properly.
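To make the crawling idea concrete, here is a minimal sketch in plain Java. It is not Nutch code and says nothing about how Nutch works internally. It assumes a hypothetical in-memory link graph (`links`) standing in for real page fetches, and a hypothetical `crawl` helper that walks it breadth-first up to a depth limit while ignoring any URL outside an allowed prefix, which is roughly what restricting a crawl to an intranet means.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy breadth-first crawler over an in-memory link graph.
// Illustration only: names and structure are assumptions, not Nutch's API.
class ToyCrawler {

    // Visit pages reachable from seed within maxDepth link hops,
    // skipping any URL that does not start with allowedPrefix.
    static List<String> crawl(Map<String, List<String>> links,
                              String seed, String allowedPrefix, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        if (seed.startsWith(allowedPrefix)) {
            frontier.add(seed);
            seen.add(seed);
        }
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                visited.add(url); // a real crawler would fetch and parse here
                for (String out : links.getOrDefault(url, List.of())) {
                    // seen.add returns false for URLs already queued
                    if (out.startsWith(allowedPrefix) && seen.add(out)) {
                        next.add(out);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://intranet/", List.of("http://intranet/a", "http://example.com/"),
            "http://intranet/a", List.of("http://intranet/b", "http://intranet/"),
            "http://intranet/b", List.of("http://intranet/c"));
        // prints [http://intranet/, http://intranet/a, http://intranet/b]
        System.out.println(crawl(links, "http://intranet/", "http://intranet/", 2));
    }
}
```

The depth limit and the prefix check are the two knobs that keep a crawl small enough for a single box. Dropping the prefix check is what turns an intranet crawl into an open-ended Web crawl.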
Indexer
An Indexer - Lucene does this job. An excellent book I
bought on this is Lucene in Action (In Action series). The job of
the indexer is to build the index that is searched to service the
user query. The Crawler sources the pages and breaks them into
indexable content. There is also a ranking engine which determines
the most suitable response.
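The core idea behind an indexer can be sketched with a toy inverted index in plain Java. This is an illustration under stated assumptions, not Lucene's API: the `ToyIndex` class, its `add` and `search` methods, and the simple tokenizer are all hypothetical, and ranking is left out entirely (matches come back in document order).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents containing it.
// Illustration only: Lucene's real index format and API are far richer.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Store the document and record it under every term it contains.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Return the ids of documents containing every query term (AND semantics).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> hits = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add("Lucene is a search library");          // id 0
        idx.add("Nutch is a crawler built on Lucene");  // id 1
        idx.add("Tomcat serves the search application"); // id 2
        System.out.println(idx.search("lucene"));        // prints [0, 1]
        System.out.println(idx.search("search lucene")); // prints [0]
        System.out.println(idx.search("missing"));       // prints []
    }
}
```

The point of the structure is that a query only touches the posting lists for its own terms, rather than scanning every page the crawler collected.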
Search Application
A Search Application. This is achieved by downloading Tomcat and installing the Search WAR file. Tomcat is free and can also be installed on your PC, Linux box, etc. To run on a PC, you'll need Cygwin, which is a PC-friendly set of packages that allows you to run commands like tar.
Luke
There's also a cool indexing utility called Luke, which I used to look at the generated index, plus many others highlighted in the book. If you're interested in Search Technologies, I'd recommend this book and downloading Nutch and all the Open Source utilities.
Luke is useful because it allows you to look at what is inside the Lucene index.
Nutch Installation
The version of Nutch I chose was 0.7.2 because 0.8 just plain
broke my heart. It didn't work for me. 0.7.2 is a dream in comparison.
You can try 0.8 but it's up to you.
Install instructions for Nutch 0.7.2
It's all Open Source and a great introduction to Search Engines. You can also use Heritrix for crawling, but I went with Nutch as it was recommended in the Lucene in Action book.
The Running Example
I've set up Nutch on my Kattare instance, where I get my Web hosting. The package comes with its own crawler, which uses a Lucene index, and gives you a Tomcat WAR file. It's all Open Source, so it's worth a look. I crawled the Lucene Wiki page (I may change this later, but as I type, if you search you'll get Lucene results). The running instance can be accessed at http://www.seadarie.kattare.com/nutch/en/