The Google Desktop Search (GDS) engine is a tool created by Google that indexes all of the files on your (Microsoft-Windows-based) computer and then provides the ability to search those files. The types of files that it indexes include all files written to disk (text files, web pages, media files, etc.), email, instant messages, and web pages and media files visited on the Web. GDS creates a deskbar in the toolbar which enables quick searching on criteria you specify. It returns search results by directing your default browser to a web server running on your machine. The browser-based search interface has an obviously Googlish look and feel. Interestingly, if you have GDS installed, you will have a Desktop search option (in addition to Web, Images, Groups, News, Froogle, and Local) when you visit Google. When you perform a search on the main Google page, GDS matches for that search may also show up in the form of "<Some number of> results stored on your computer" as the first search result.
As cool as this is, an even cooler aspect of GDS is that it is an extensible framework. Google has released an SDK so developers can write plugins for GDS. One such plugin is Kongulo, a web spider. Kongulo provides a command-line interface to crawl, starting at a specified URL, and index the resources it finds there within GDS. Command-line options include depth, URL match, loop, sleep time between loops, and passwords. Kongulo can be a useful tool for indexing intranets or private wikis ... or to see an example of a good plugin written for GDS.
How does a plugin tie into GDS? COM. As I mentioned above, GDS is an application for MS Windows systems. On Friday, May 27, 2005, Google released the source code for Kongulo. Here is the meat of how Kongulo pushes spidered web pages to GDS. (The pieces of the code that pertain specifically to spidering are interesting, but this article won't detail that aspect of Kongulo.)
First, Kongulo creates an event factory object attached to the Crawler object, like this:
self.event_factory = win32com.client.Dispatch('GoogleDesktopSearch.EventFactory')
Note that Kongulo uses the win32com libraries, so if you plan on running the source code, install the Win32 extensions for Python or use the ActiveState Python distribution.
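As a minimal sketch of that dependency, the factory lookup can be wrapped so that a missing win32com install fails with a readable message. The function name make_event_factory and its dispatch parameter are hypothetical and not part of Kongulo; the parameter exists only so the logic can be exercised on a machine without COM.

```python
def make_event_factory(dispatch=None):
    """Return the GDS event factory COM object.

    `dispatch` is a hypothetical injection point (not in Kongulo). When it is
    omitted, the function falls back to win32com.client.Dispatch, which
    requires Windows and the Win32 extensions for Python.
    """
    if dispatch is None:
        try:
            from win32com.client import Dispatch as dispatch
        except ImportError as exc:
            raise RuntimeError(
                "win32com not found: install the Win32 extensions for Python"
            ) from exc
    # The ProgID is the one shown in the Kongulo snippet above.
    return dispatch('GoogleDesktopSearch.EventFactory')
```

On a Windows machine with GDS installed, calling make_event_factory() with no arguments should behave like the one-line Dispatch call shown earlier.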
Next, every time Kongulo wants GDS to index a page, it has to create an event from the event factory like this:
event = self.event_factory.CreateEvent(_GUID, 'Google.Desktop.WebPage')
The first argument the crawler passes into CreateEvent is the GUID that Kongulo registers for itself the first time it runs. The second argument is a text string containing the fully qualified name of the type of event. Kongulo only uses Google.Desktop.WebPage, but other options include Google.Desktop.Indexable, which is the parent of the other indexable resource types.
The next steps entail adding properties. The event object has an AddProperty method that takes two arguments: a property name and a property value. The crawler adds the following four properties to all pages it finds:
event.AddProperty('format', doctype)
event.AddProperty('content', content)
event.AddProperty('uri', url)
event.AddProperty('last_modified_time', pywintypes.Time(time.time() + time.timezone))
doctype is the document type, pulled from the HTTP headers; Kongulo indexes only certain document types. content is the body of the web page. uri is the web location of the resource. last_modified_time is actually the current local time, but a note in the source code says to use the Last-Modified HTTP header.
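The note about the Last-Modified header could be addressed along these lines. This is only a sketch using the standard library, not code from Kongulo: the function name and the now parameter are hypothetical, and converting the result into whatever time type the COM call expects is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified_or_now(header_value, now=None):
    """Parse an HTTP Last-Modified header, falling back to the current time.

    `header_value` is the raw header string, or None if the server sent none.
    `now` is a hypothetical override that makes the fallback testable.
    """
    fallback = now if now is not None else datetime.now(timezone.utc)
    if not header_value:
        return fallback
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Malformed dates raise TypeError or ValueError depending on the
        # Python version, so both are treated as "no usable header".
        return fallback
    if parsed.tzinfo is None:
        # HTTP dates are defined in GMT, so treat a naive result as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Falling back to the current time keeps the behavior the article describes for pages whose servers send no usable header.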
For HTML pages that contain a title, the crawler adds one more property holding that title.
Interestingly, Kongulo uses regular expressions to find titles, frames, and links, as opposed to using an HTML parser. The Kongulo team felt this would provide a less strict processing of web pages.
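To illustrate why a regular expression can be more forgiving than a parser, here is a small sketch of title extraction. The pattern and the extract_title helper are my own illustration, not the expressions Kongulo actually uses.

```python
import re

# Case-insensitive, spans newlines, and tolerates attributes on the tag.
# The closing tag is matched without its final '>' so sloppy markup still works.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title', re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    title = ' '.join(match.group(1).split())
    return title or None
```

Because the pattern only looks for the title element itself, unclosed tags or a missing head section elsewhere in the page do not prevent a match.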
The final step is to send the page to GDS by calling the event's Send method.
Send expects a bitwise OR of the following flags:
EventFlagIndexable = 0x00000001
EventFlagHistorical = 0x00000010
EventFlagIndexable just indicates an event that GDS should index, and EventFlagHistorical indicates a historical event (as opposed to an event that is currently happening in real time). The Kongulo source code indicates that if the crawler passes in the historical flag, GDS will not process the event until the user's system becomes idle.
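Combining the flags is ordinary bit arithmetic. The sketch below uses the two values listed above; the constant names and the send_flags helper are hypothetical conveniences, not names from the GDS SDK or from Kongulo.

```python
EVENT_FLAG_INDEXABLE = 0x00000001   # GDS should index this event
EVENT_FLAG_HISTORICAL = 0x00000010  # event happened in the past


def send_flags(historical=False):
    """Build the integer to pass to Send for an indexable event."""
    flags = EVENT_FLAG_INDEXABLE
    if historical:
        # Bitwise OR sets the historical bit without disturbing the first.
        flags |= EVENT_FLAG_HISTORICAL
    return flags
```

A crawler that wants its pages processed only while the system is idle would pass send_flags(historical=True), which is 0x11, to Send.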
At this point, GDS has the web page and it is available for searching. That's all there is to it.
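Putting the steps together, the whole exchange might look like the sketch below. It runs against a recording stand-in rather than the real COM objects, so RecordingEvent, RecordingFactory, and index_page are all hypothetical names. The 'title' property name and the single-argument Send call are assumptions based on the description above, not code copied from Kongulo.

```python
EVENT_FLAG_INDEXABLE = 0x00000001


class RecordingEvent:
    """Hypothetical stand-in for a GDS event; records calls instead of using COM."""

    def __init__(self, guid, schema):
        self.guid = guid
        self.schema = schema
        self.properties = {}
        self.sent_flags = None

    def AddProperty(self, name, value):
        self.properties[name] = value

    def Send(self, flags):
        self.sent_flags = flags


class RecordingFactory:
    """Hypothetical stand-in for the GoogleDesktopSearch.EventFactory object."""

    def CreateEvent(self, guid, schema):
        return RecordingEvent(guid, schema)


def index_page(factory, guid, url, doctype, content, modified, title=None):
    """Create a web page event, attach its properties, and send it."""
    event = factory.CreateEvent(guid, 'Google.Desktop.WebPage')
    event.AddProperty('format', doctype)
    event.AddProperty('content', content)
    event.AddProperty('uri', url)
    event.AddProperty('last_modified_time', modified)
    if title:
        event.AddProperty('title', title)  # property name is an assumption
    event.Send(EVENT_FLAG_INDEXABLE)
    return event
```

Swapping RecordingFactory for the object returned by win32com's Dispatch should, under those assumptions, push the page to a real GDS install.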
The GDS team has done an excellent job of providing a great tool that is easy to extend. The more I play with GDS, the more it impresses me. Of course, I would play with it more if it ran on Linux (hint, hint). Likewise, the Kongulo team has done an excellent job of providing a useful plugin to GDS, but more importantly, of providing clean, readable source code (being written in Python doesn't hurt its readability) to serve as an example of how to write a plugin for GDS. While there are plenty of plugins already available for GDS, this ease of creating a plugin makes me expect many more in the future.
Jeremy Jones is a software engineer who works for Predictix. His weapon of choice is Python.
Copyright © 2009 O'Reilly Media, Inc.