7 May 2007

Document Search

by mo

How to build your own document search!

  • Lucene.NET
  • Seekafile Server
  • Google Desktop Search

The purpose of this document is to describe how to build a document search using the different APIs available. All code snippets are taken from sample projects and written in C#. Some sample code was removed for clarity!

Lucene.NET: (http://www.dotlucene.net/)

Lucene.NET is an open source library that lets the consumer search index files and format the results. (Useful for searching for documents!)

Some notable features are:

  • Ranked search results
  • Query highlighting
  • .NET assembly. (No reference to COM)
  • API available for building your own index files.

For the time being… let’s assume we have an index created that we can use to search from. (We will cover creating an index when we get to Seekafile Server!)

The guts of Lucene.NET come from a few Mr. Do-it-all objects named:

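  • IndexSearcher: takes in a query and searches the index files for results. It returns a “Hits” object.
  • Query: the parsed form of a search that the IndexSearcher runs.
  • QueryParser: turns a query string typed by the user into a Query.
  • QueryHighlightExtractor: highlights the matched search terms in the result text.
  • Hits: object contains an “ArrayList” of documents. (Kind of poorly crafted; I would have preferred a strongly typed generic list… of, say, document objects?)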

[Image: Lucene Hits]

The basic usage of Lucene is as follows…

[Image: basic Lucene usage]
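As a rough sketch of that basic usage, assuming the Lucene.Net 2.x-era API (where a search returns a Hits object) and an index whose documents carry a "text" field. The field name is an assumption and depends on how the index was built. The sketch also copies the hits into a strongly typed list, which is the shape I would have preferred from Hits in the first place.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class LuceneSearchSketch
{
    // indexDirectory: folder that holds the Lucene index files.
    // "text" is an ASSUMED name for the field that stores document content.
    public static List<Document> Search(string indexDirectory, string queryText)
    {
        IndexSearcher searcher = new IndexSearcher(indexDirectory);
        try
        {
            // Turn the user's query string into a Query object.
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            Query query = parser.Parse(queryText);

            // Run the search; Hits exposes the matches in ranked order.
            Hits hits = searcher.Search(query);

            // Copy the matches into a strongly typed list before closing the searcher.
            List<Document> documents = new List<Document>();
            for (int i = 0; i < hits.Length(); i++)
            {
                documents.Add(hits.Doc(i));
            }
            return documents;
        }
        finally
        {
            searcher.Close();
        }
    }
}
```

Each returned Document can then be read with Get, for example `doc.Get("path")` if the index stores a "path" field (again an assumed name). Later Lucene.Net releases dropped Hits in favour of TopDocs, so this sketch only fits the older API.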

There is also a QueryHighlightExtractor object that wraps the matched search terms in HTML markup for display on the web.

Seekafile Server

  • Source: http://www.dotlucene.net/seekafile-server-open-source-indexing-server

To build an index to search from, I used the Seekafile Server open source project. It is currently at version 1.5 beta 3.

Seekafile Server is a Windows service that searches through specified directories and creates a search index. To configure the “index” directory and the “documents” directory you can either use the Seekafile Server manager, or edit the settings in the config.xml file located in the Seekafile Server installation directory.

[Image: Seekafile Server]

You can include additional IFilters to search different document types like “.rtf”, “.pdf”, “.vsd” etc.

This was pretty easy to install and set up, but during my test development I found that I had to stop the Seekafile Server service, wipe out the contents of the index and lock directories, and restart the service. I have not determined what was causing this problem.

Google Desktop Search

Google Desktop Search is an application that you can download from Google. It indexes files on your PC and/or network and allows you to search for them through your browser. It has built-in support for searching file types such as…

[Image: Google Desktop Search file types]

The Component Object Model Way!

The Google Desktop Search SDK (Software Development Kit) is available for use, with documentation available at http://desktop.google.com/dev/index.html

The SDK ships with 3 “.idl” files:

  • GoogleDesktopActionAPI.idl
  • GoogleDesktopAPI.idl
  • GoogleDesktopDisplayAPI.idl

The first requirement when using the SDK is to register your application with the Google Desktop Search application loaded on the machine. I was unable to get this to work; when testing, I kept receiving a COMException.

I’m not a big fan of COM, probably because my experience is limited, and I have access to the power of .NET!

The .NET Way! (Without the use of Interoperation)

A much easier way I found to interact with Google Desktop Search was through a web request. Google Desktop receives all requests through a local URL:

[Image: Google Desktop localhost URL]

By sending a request to this URL with a specially crafted query string, I am able to retrieve search results in the form of XML.

The URL and port number can be retrieved from the Windows Registry.

[Image: get Google Desktop URL]

If Google Desktop Search is installed on the machine it stores this Uri in the registry.

[Image: search registry entry]
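A minimal sketch of that lookup. The key path and value name used here (HKEY_CURRENT_USER\Software\Google\Google Desktop\API, value "search_url") are recalled from the Google Desktop SDK documentation rather than taken from the sample project, so treat them as assumptions and confirm them with regedit on your own machine.

```csharp
using Microsoft.Win32;

public static class GoogleDesktopLocator
{
    // ASSUMED registry location; verify it on your own installation.
    private const string ApiKeyPath = @"Software\Google\Google Desktop\API";
    private const string SearchUrlValue = "search_url";

    // Returns the stored search Uri, or null when the key or value is missing
    // (for example when Google Desktop Search is not installed).
    public static string GetSearchUrl()
    {
        using (RegistryKey key = Registry.CurrentUser.OpenSubKey(ApiKeyPath))
        {
            if (key == null)
            {
                return null;
            }
            return key.GetValue(SearchUrlValue) as string;
        }
    }
}
```

A null return is the cue to tell the user that Google Desktop Search could not be found, instead of failing later on the web request.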

Once we have the Uri to connect to Google Desktop we can start sending search requests. Google Desktop takes care of managing the search index and creates a layer of abstraction for us by allowing us to make direct queries to it and retrieve results.

The query string parameters:

  • “&format=xml” This tells Google Desktop to return the results in an Xml format.
  • “&num=10” This tells Google Desktop to return 10 results.
  • “&in=C:\Documents and Settings\mkhan\My Documents” This tells Google Desktop to return only results from “My Documents” directory.
  • “&q=Microsoft” This tells Google Desktop to search for the word “Microsoft”.

There are a number of other parameters that we can send. In fact, just about every option in the Advanced search has an equivalent query string parameter.

For the time being we can use the above parameters to send a request to Google Desktop. We will do this using the WebClient object.
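Putting the parameters and the WebClient together could look roughly like this. The class and method names are made up for illustration, and the sketch assumes that the Uri read from the registry accepts the "&name=value" pairs appended to it exactly as they are listed above, and that the reply is UTF-8 encoded.

```csharp
using System;
using System.Net;
using System.Text;

public static class GoogleDesktopClient
{
    // Builds the "&name=value" pairs described in the text. Values are escaped
    // so that spaces and backslashes in the folder path survive the request.
    public static string BuildParameters(string searchTerm, int maxResults, string folder)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("&format=xml");
        sb.Append("&num=").Append(maxResults);
        if (!string.IsNullOrEmpty(folder))
        {
            // Optional: restrict the results to a single directory.
            sb.Append("&in=").Append(Uri.EscapeDataString(folder));
        }
        sb.Append("&q=").Append(Uri.EscapeDataString(searchTerm));
        return sb.ToString();
    }

    // baseUrl is the Uri read from the registry. ASSUMPTION: the parameters
    // can simply be appended to it.
    public static string Search(string baseUrl, string searchTerm, int maxResults, string folder)
    {
        using (WebClient client = new WebClient())
        {
            // ASSUMPTION: Google Desktop replies in UTF-8.
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(baseUrl + BuildParameters(searchTerm, maxResults, folder));
        }
    }
}
```

Keeping the parameter building separate from the request makes it easy to check the query string without Google Desktop running.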

Google Desktop will reply with a stream of Xml that looks like…
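As a rough sketch of consuming such a reply, the snippet below reads the XML into a strongly typed list. The element names it looks for (results, result, title, url, snippet) and the SearchResult class are assumptions for illustration; check them against the XML your own installation actually returns.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical container for one search result.
public class SearchResult
{
    public string Title;
    public string Url;
    public string Snippet;
}

public static class ResultParser
{
    // Parses a reply shaped like
    // <results><result><title/><url/><snippet/></result>...</results>.
    // The element names are ASSUMED, not confirmed by the article.
    public static List<SearchResult> Parse(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<SearchResult> results = new List<SearchResult>();
        foreach (XmlNode node in doc.SelectNodes("/results/result"))
        {
            SearchResult result = new SearchResult();
            result.Title = ChildText(node, "title");
            result.Url = ChildText(node, "url");
            result.Snippet = ChildText(node, "snippet");
            results.Add(result);
        }
        return results;
    }

    // Returns the text of a child element, or an empty string when it is absent.
    private static string ChildText(XmlNode parent, string name)
    {
        XmlNode child = parent.SelectSingleNode(name);
        return child == null ? string.Empty : child.InnerText;
    }
}
```

Missing elements come back as empty strings, so a result without a snippet does not break the loop.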

We can now parse out these results and format them in whatever context we like. This is usable in a Web application as well as a Desktop application. However, it may be a little trickier with a Desktop application: we would have to expose a web service, or some other means of delivering the search results from Google Desktop to external clients.

For example, if all the documents were stored on a central server and Google Desktop was installed on that server, we would have to create a web service that takes in search requests, passes each request to Google Desktop and returns the XML results. The Desktop application could then search for those documents from wherever it has access to that web service.

If the web service were exposed to the internet, this potentially means that a user could search for documents stored on the business server from home.