Bosworth's Web of Data

by Daniel H. Steinberg

In his Thursday morning keynote at the MySQL Users Conference 2005, Google's Adam Bosworth suggested that we "do for information what HTTP did for user interface." Ten years ago, when he first started paying attention to the web, he was drawn to the idea of zero-install applications that could be accessed from anywhere at any time. He said that, to him, a personal computer is like a phone: a useful access point, but not where he stores his stuff.

How the Web Happened

Bosworth explained that the key factors that enabled the web began with simplicity. HTTP was simple enough that any "P" language or JavaScript programmer could build applications. On the consumption side, web browsers such as Internet Explorer 4 were committed to rendering whatever they got. This meant that people could be sloppy and they didn't need to be high priests of syntax. Because it was a sloppy standard, people who otherwise couldn't have authored content did. The fact that it was a standard allowed this single, simple, sloppy, open wire format to run on every platform.

The final "S" word that Bosworth attributes to the success of the web is scale. Consider the number of hits a site takes after being Slashdotted. What allows the web to scale is DNS, which lets a site distribute information across many machines, together with effective caching, which lets it weather a spike without hitting the database. Also, coarse-grained interactions reduce the number of queries to the database.
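To make the caching point concrete, here is a minimal sketch of a front end that keeps rendered pages in memory for a short time. The class name `CachingFrontEnd`, the 60-second lifetime, and the URL are all hypothetical; the talk did not describe any particular cache design.

```python
import time


class CachingFrontEnd:
    """Hypothetical front end that keeps rendered pages in memory."""

    def __init__(self, render_from_database, ttl_seconds=60.0, clock=time.monotonic):
        self._render = render_from_database  # the expensive call we want to avoid
        self._ttl = ttl_seconds              # how long a cached page stays fresh
        self._clock = clock                  # injectable so the cache can be tested
        self._cache = {}                     # url -> (expires_at, page)
        self.database_hits = 0

    def get(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]                  # still fresh: served from memory
        self.database_hits += 1              # missing or expired: go to the database
        page = self._render(url)
        self._cache[url] = (now + self._ttl, page)
        return page


def simulate_spike(requests=100_000):
    """Ask for the same story many times and report how often the database was hit."""
    front_end = CachingFrontEnd(lambda url: f"<html>story at {url}</html>")
    for _ in range(requests):
        front_end.get("/story/42")
    return front_end.database_hits


if __name__ == "__main__":
    print("database hits during the spike:", simulate_spike())
```

As long as the whole loop finishes within the page's lifetime, every request after the first is answered from memory, which is the effect Bosworth described: the spike arrives at the web tier but not at the database.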

As a result of a simple, sloppy, standards-based, scalable platform, we have information at our fingertips from Google, Amazon, eBay, and Salesforce. Bosworth's own company, Google, gets hundreds of millions of hard queries a day. He said they see it as putting Ph.D.s in tanks to drive through walls rather than around them.

In addition to the advantages in software, there have been great gains in hardware. Bosworth said that one million dollars buys you five hundred machines with 2TB of in-memory data, a petabyte of on-disk data, and a reasonable throughput of fifty thousand requests per second. This amounts to one billion requests per day.
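The hardware figures quoted above can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1,000 GB) and an even split of memory, disk, and load across the machines, none of which the talk specified. At a sustained fifty thousand requests per second, a full day works out to roughly 4.3 billion requests, so one billion per day sits comfortably inside the quoted capacity.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(requests_per_second):
    """Daily total if the given rate were sustained around the clock."""
    return requests_per_second * SECONDS_PER_DAY


def average_rate_for(daily_requests):
    """Average requests per second needed to reach a daily total."""
    return daily_requests / SECONDS_PER_DAY


# Figures as reported from the keynote.
BUDGET_DOLLARS = 1_000_000
MACHINES = 500
MEMORY_TB = 2
DISK_TB = 1_000            # one petabyte, assuming decimal units
QUOTED_RATE = 50_000       # requests per second

if __name__ == "__main__":
    print("sustained requests/day:", requests_per_day(QUOTED_RATE))           # 4320000000
    print("avg rate for 1e9/day:  ", round(average_rate_for(1_000_000_000)))  # 11574
    print("dollars per machine:   ", BUDGET_DOLLARS / MACHINES)               # 2000.0
    print("requests/s per machine:", QUOTED_RATE / MACHINES)                  # 100.0
    print("memory per machine, GB:", MEMORY_TB * 1_000 / MACHINES)            # 4.0
    print("disk per machine, TB:  ", DISK_TB / MACHINES)                      # 2.0
```

Read this way, each machine is a roughly two-thousand-dollar commodity box with a few gigabytes of memory handling about a hundred requests per second.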

Serving Up Information

Having this sort of power changes the way you think. For example, organizing things into folders declines in importance. You can't remember which folder you put something in, and searches are more efficient ways of finding things. The challenge is to take a database and do for the web what was done for content. Bosworth explained that you "need a model that allows for massively linear scalability and federation of information that can spread effortlessly across a federated web."

The first solutions suggested were XML and XQuery. The problem with XML is that, unlike HTML, it does not have a single grammar. This removes the simple and sloppy aspects of the web. The problem with XQuery is the time it took to finish the specification. Bosworth noted that it took more than four years and that "anything that takes four years is not worth doing. It is over-designed. Instead, take six months and learn from customers."

The next solution was web services, which began as an easy idea: you send an XML request and you get XML back. Instead, the collection of WS-* specs became huge and, again, overly complicated. Bosworth said that this was deliberate: the companies that control the specs, like IBM and Microsoft, made the specifications hard because then only they could deliver the technology to implement them.

Bosworth cautioned the audience that MySQL should not be pushed to become an open source version of Oracle. Recent additions to the 5.0 release include triggers, views, and stored procedures, which all support centralizing the processing logic in the database. Bosworth said, "This isn't good because it doesn't scale. True power comes from decentralization and open standards. Centralization is going to give you anti-scale." He asked the audience what they would do when they find that Oracle isn't big enough.

An Open Model

Bosworth advocated an open model for data. Although he was not referring to open source, he used it as an example, explaining that customers like open source software because of its transparency. Many feel they know what they are getting because they can read the source. For the most part they never actually read it, but it is comforting to know that if the software doesn't work, you or someone else can fix the code.

When you look at data, the wire-level protocol to the database is not open. There isn't a standard. That is what prevents query engines from running directly across multiple stores. The database community can learn from the lesson of the last ten years that when you open up your formats directly, you get an explosion in the data field.

Imagine if you could query any data that is available anywhere in the world. Bosworth said that what this requires is a single, simple, open wire format for items. The format needs to be simple enough for any "P" language programmer to deliver and any JavaScript programmer to consume. He also pointed out that "complex things tend to break and simple things tend to work." Google has the simplest query language in the world. There is no structure and no syntax.

Bosworth predicts that RSS 2.0 and Atom will be the lingua franca that will be used to consume all data from everywhere. These are simple formats that are sloppily extensible. Anyone who wants to can use these formats to consume content or to author content. Contrast this with the Semantic Web, which requires that you get a large group of people to agree on the schema of everything.
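As an illustration of what "sloppily extensible" can mean in practice, the sketch below publishes two database-style rows as RSS 2.0 items and adds extra fields in a separate XML namespace. The feed, the `inv` namespace URI, and the `price` and `inStock` elements are invented for this example; they are not part of RSS 2.0 or of anything shown in the talk.

```python
import xml.etree.ElementTree as ET

# A hypothetical feed: ordinary RSS 2.0 items plus made-up inventory fields.
FEED = """<rss version="2.0" xmlns:inv="http://example.com/ns/inventory">
  <channel>
    <title>Parts catalog</title>
    <link>http://example.com/parts</link>
    <description>Database rows published as items</description>
    <item>
      <title>Widget</title>
      <link>http://example.com/parts/1</link>
      <inv:price>9.95</inv:price>
      <inv:inStock>12</inv:inStock>
    </item>
    <item>
      <title>Gadget</title>
      <link>http://example.com/parts/2</link>
      <inv:price>24.50</inv:price>
    </item>
  </channel>
</rss>
"""

INV = "{http://example.com/ns/inventory}"  # ElementTree's notation for a namespace


def read_titles(feed_text):
    """A consumer that knows only core RSS; extension elements are never looked at."""
    root = ET.fromstring(feed_text)
    return [item.findtext("title") for item in root.iter("item")]


def read_items(feed_text):
    """A consumer that also understands the price extension, if an item carries it."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iter("item"):
        record = {"title": item.findtext("title"), "link": item.findtext("link")}
        price = item.findtext(INV + "price")
        if price is not None:
            record["price"] = float(price)
        items.append(record)
    return items


if __name__ == "__main__":
    print(read_titles(FEED))
    print(read_items(FEED))
```

Neither consumer needs to agree with the publisher on a complete schema. Each one picks out the elements it understands and skips the rest, which is the contrast with the Semantic Web that Bosworth drew.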

If you build an open source stack that delivers globally available information, how do you massively distribute it and cause it to scale? Bosworth said you need to limit your queries to those that can be easily implemented by everybody and those that can be handled by a single machine. This requires that your queries run at the item level. This might feel odd to those used to dealing with databases, as this means you are not likely to perform joins, aggregations, or subqueries. There is plenty of SQL that cannot be supported.
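A rough sketch of what an item-level query might look like follows. Each store answers by testing its own items one at a time, so no machine ever needs data held by another, and the caller simply concatenates the answers. The function names, the equality-only criteria, and the sample data are assumptions made for illustration, not a design presented in the keynote.

```python
def matches(item, criteria):
    """True if one item satisfies every field == value test in the criteria."""
    return all(item.get(field) == value for field, value in criteria.items())


def query_store(store, criteria):
    """Runs on a single machine and inspects one item at a time."""
    return [item for item in store if matches(item, criteria)]


def federated_query(stores, criteria):
    """Send the same query to every store and concatenate the results.

    There are no joins, aggregations, or subqueries: nothing here requires
    one store to know what another store contains.
    """
    results = []
    for store in stores:
        results.extend(query_store(store, criteria))
    return results


if __name__ == "__main__":
    # Two hypothetical stores, which could live on different machines.
    store_a = [
        {"title": "Widget", "color": "red", "city": "Santa Clara"},
        {"title": "Gadget", "color": "blue", "city": "Santa Clara"},
    ]
    store_b = [
        {"title": "Sprocket", "color": "red", "city": "Boston"},
    ]
    for hit in federated_query([store_a, store_b], {"color": "red"}):
        print(hit["title"])
```

Because every store evaluates the query independently, adding capacity means adding stores. The price is the one named above: anything that has to combine items, such as a join or an aggregate, falls outside what this kind of query can express.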

Bosworth concluded his keynote by saying the potential is that "you guys can handle hundreds of millions of queries per day and scale up and out in ways that Oracle can only dream of. You will be able to effortlessly support hard questions."

Daniel H. Steinberg is the editor for the new series of Mac Developer titles for the Pragmatic Programmers. He writes feature articles for Apple's ADC web site and is a regular contributor to Mac Devcenter. He has presented at Apple's Worldwide Developer Conference, MacWorld, MacHack and other Mac developer conferences.
