O'Reilly Network    

 Published on The O'Reilly Network (http://www.oreillynet.com/)
 http://www.oreillynet.com/pub/wlg/6475

The Greatest Test of Open Source: Beating Google

by Steve Mallett
Feb. 12, 2005

In the last couple of years one of the greatest software engineering projects has surfaced and become a household name. Google. One thing powers its greatness. Software.


The open source world prides itself rightly on its incredible successes. Apache Server, Linux, all email software worth mentioning, and recently Firefox. These are, have been, and will forever be marvelous feats.


What real technological competition have they been up against? Firefox vs. the long-abandoned IE. Apache against IIS. Linux vs. Windows Server (Unix is technologically great, but cut its own throat to succeed en masse). Frankly, these successes have hinged more on getting the word out that they exist, disarming FUD, and the willingness of people to try something new.


Google. Its technological greatness is revered by all. Others like Yahoo are chasing it, but at best they'll do nothing more than chase it. They have no real advantage over Google.


A search site takes a lot more than just bitchin' software. There are a lot of costs. Bandwidth for crawling is the biggie, followed by serving results, hardware, and people.


Enter Nutch. Nutch is an open source search engine: crawler, indexer, etc. The project appears to have been a bit dormant since its first media splash a few years ago, but has just recently entered incubation at the Apache Software Foundation.


As I write this I have Nutch crawling a few sites just to test it out on my own. It's the fifth of my tests. I'm increasing the search depth, and playing with a few of its knobs & buttons. The first few tests worked, but weren't terribly compelling. Not that the Nutch site doesn't give you the straight goods upfront. Their site says, "Nutch has not yet been tuned for quality. There are ten or twenty knobs that we can twiddle to adjust the ranking formula. We are developing software to do this tuning automatically, but the current code just contains guesses. With a little tuning we should be able to get results that are competitive with those of major search engines."


Attract some more developers and I bet this happens sooner rather than later.


I think a commercial search engine based on Nutch could be a huge deal. Such an operation requires a ton of money for equipment and bandwidth, so it would have to pay its own bills. However, the open source software component would give such an operation a scrappy little advantage. If open source can take on a truly great competitor, the operation would have the distinct advantage of better results without the personnel overhead that Google, Yahoo!, and their ilk carry. The new search site would want to hire key people so they don't have to worry about paying the rent and feeding the kids, but that still leaves a lot more talent available for less.


I think, and I'm really only guessing, that Nutch hasn't prospered to where I would like it to be because of the costs of running the operation. To truly test the system you need a big index. You need to spend a lot of money crawling. To test it against Google anyway. How big? Well, the Internet Archive hosts "some work" of Nutch's. They seem to have more bandwidth than the average bear.


Back to the main point... given the resources, could a search engine based on open source software beat a great proprietary competitor? There's only one way to find out, and what counts most of all is real results.

Steve Mallett is the founder and managing editor of OSDir.com (Open Source Directory), and does a bunch of other stuff.

oreillynet.com Copyright © 2006 O'Reilly Media, Inc.