 Published on OpenP2P.com (http://www.openp2p.com/)


O'Reilly Book Excerpts: Peer-to-Peer: Harnessing the Power of Disruptive Technologies

The Power of Metadata

Related Reading

Peer-to-Peer
Harnessing the Power of Disruptive Technologies
By Nelson Minar, Marc Hedlund, Clay Shirky, Tim O'Reilly, Dan Bricklin, David Anderson, Jeremie Miller, Adam Langley, Gene Kan, Alan Brown, Marc Waldman, Lorrie Faith Cranor, Aviel Rubin, Roger Dingledine, Michael Freedman, David Molnar, Rael Dornfest, Dan Brickley, Theodore Hong, Richard Lethin, Jon Udell, Nimisha Asthagiri, Walter Tuvell, Brandon Wiley

by Rael Dornfest and Dan Brickley

This essay is an excerpt from the forthcoming book Peer-to-Peer: Harnessing the Power of Disruptive Technologies. It presents the goals that drive the developers of the best-known peer-to-peer systems, the problems they've faced, and the technical solutions they've found. Dornfest and Brickley will speak at the O'Reilly Peer-to-Peer Conference, February 14-16 in San Francisco.

Today's Web is a great, big, glorious mess. Spiders, robots, screen-scraping, and plain text searches are standard practices that indicate a desperate attempt to draw arbitrary distinctions between needles and hay. And they go only so far as the data we've taken the trouble to make available online.

Now peer-to-peer promises to turn your desktop, laptop, palmtop, and fridge into peers, chattering away with one another and making swaths of their data stores available online. Of course, if every single device on the network exposes even a small percentage of the resources it manages, it will exacerbate the problem by piling on more hay and needles in heaps. How will we cope with the sudden exponential influx of disparate data sources?

The new protocols being developed at breakneck speed for peer-to-peer applications also add to the mess by disconnecting data from the fairly bounded arena of the Web and the ubiquitous port 80. Loosening the hyperlinks that bind all these various resources together threatens to scatter hay and needles to the winds. Where previously we had application user interfaces for each and every information system, the Web gave us a single user interface -- the browser -- along with an organizing principle -- the hyperlink -- that allowed us to reach all the material, at least in theory. Peer-to-peer might undo all this good and throw us back into the dark ages of one application for each application type or application service. We already have Napster for MP3s and work has begun on Docster for documents -- can JPEGster and Palmster be very far off?

And how shall we search these disparate, transitory clumps of data, winking in and out of existence as our devices go on and offline, to say nothing of finding the clumps in the first place? Napster is held up as a reassurance that everything can work out on its own. The inherent ubiquity of any one MP3 track gets around the problem of resource transience. However, isn't this abundance simply the direct result of its rather constrained problem space? MP3 files are popular, and MP3 rippers make it easy for huge numbers of people to create decent-quality files.

As industry attention turns to peer-to-peer technologies, and as the content within these systems becomes more heterogeneous, the technology will have to accommodate content that is harder to accumulate and less popular; the critical mass of replicated files will not be attained. Familiar problems associated with finding a particular item may reemerge, this time in a decentralized environment rather than around the familiar Web hub.

Whether or not peer-to-peer fares any better than the Web, it certainly presents a new challenge for people concerned with describing and classifying information resources. Peer-to-peer provides a rich environment and a promising early stage for putting in place all we've learned about metadata over the past decade.

So, before we go much further, what exactly is metadata?

Data about data

Metadata is the stuff of card catalogues, television guides, Rolodexes, taxonomies, tables of contents -- to borrow a Zen concept, the finger pointing at the moon. It is labels like "title," "author," "type," "height," and "language" used to describe a book, person, television program, species, etc. Metadata is, quite simply, data about data.

There are communities of specialists who have spent years working on -- and indeed solving some of -- the hard problems of categorizing, cataloguing, and making it possible to find things. Even in the early days of the Web, developers enlisted the help of these information scientists and architects, realizing that otherwise we'd be in for quite a mess. The Dublin Core Metadata Initiative (DCMI) [1] is just such an effort. An interdisciplinary, international group founded in 1994, the DCMI's charter is to use a minimal set of metadata constructs to make it easier to find things on the Web. We'll take a closer look at Dublin Core in a moment.

Yet, while well-understood systems exist for cataloguing and classifying some classic types of information, such as books (e.g., MARC records and the Dewey Decimal System), equivalent facilities were late to arrive on the Web -- some would say far too late. They are emerging, however, just in time for peer-to-peer.

Metadata lessons from the Web

Peer-to-peer's power lies in its willingness to rethink old assumptions and reinvent the way we do things. This can be quite constructive, even revolutionary, but it also risks being hugely destructive in that we can throw out lessons previously learned from the Web experience. In particular, we know that the Web suffered because metadata infrastructure was added relatively late (1997+), an add-on situation that had an impact on various levels.

The Web burst onto the scene before we managed to agree on common descriptive practices -- ways of describing "stuff." Consequently, the vast majority of web-related tools lack any common infrastructure for specifying or using the properties of web content. WYSIWYG HTML editors don't go out of their way to make their metadata support (if they have any) visible, nor do they request metadata for a document when authors press the "Save" button. Search engines provide little room for registering metadata along with their associated sites. Robots and spiders often discard any metadata in the form of HTML <meta> tags they might find. This has resulted in an enormous hodgepodge of a data set with little rhyme or reason. The Web is hardly the intricately organized masterpiece represented by its namesake in nature.

Early peer-to-peer applications come from relatively limited spheres (MP3 file-sharing, messaging, Weblogs, groupware, etc.) with pretty well understood semantics and implicit metadata -- we know it's an MP3 because it's in Napster. These communities have the opportunity, before heterogeneity and ubiquity muddy the waters, to describe and codify their semantics to allow for better organization, extraction, and search functionality down the road. Yet even at this early stage, we're already seeing the same mistakes creeping in.

Resource description

Until recently, the means available to content providers for describing the resources they make available on the Web have been inconsistent at best. About the only consistent metadata in an HTML document is the <title> element, which offers no more than a hint as to the content of the page. HTML's <meta> element is supposed to provide a method for embedding arbitrary metadata -- but that creates more of a problem than a solution, because applications, books, articles, tutorials, and standards bodies alike offer little guidance as to what good metadata should look like and how best to express it.

The work of the aforementioned Dublin Core offers a wonderful start. The Dublin Core Metadata Element Set is a set of 15 elements (title, description, creator, date, publisher, etc.) that are useful in describing almost any web resource. Rather than attempt to define semantics for specific instances and situations, the DCMI focused on the commonalities found in resources of various shapes and flavors. The Dublin Core may just as easily be used to describe "a journal article in PDF format," "an MPEG encoding of an episode of Buffy the Vampire Slayer recorded on a hacked TiVo," or "a healthcare speech given by the U.S. President on March 2, 2000."

Example 1 shows a typical appearance of Dublin Core metadata in a fragment of HTML. Each <meta> tag contains an element of metadata defined by Dublin Core.

Example 1: Dublin Core metadata in an HTML document


<html>
  <head>
    <title>Distributed Metadata</title>
    <meta name="description" content="This article addresses...">
    <meta name="subject" content="metadata, rdf, peer-to-peer">
    <meta name="creator" content="Dan Brickley and Rael Dornfest">
    <meta name="publisher" content="O'Reilly &amp; Associates">
    <meta name="date" content="2000-10-29T00:34:00+00:00">
    <meta name="type" content="article">
    <meta name="language" content="en-us">
    <meta name="rights" content="Copyright 2000, O'Reilly &amp; Associates, Inc.">
    ...
  </head>
  ...

While useful up to a point, the original HTML mechanism for embedding metadata has proven limited. There is no built-in convention to control the names given to the various embedded metadata fields. As a consequence, HTML <meta> tags can be ambiguous: we don't know which sense of "title" or "date" is being used.
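To make the ambiguity concrete, here is a minimal Python sketch of what a robot might do with such tags. It uses the standard library's html.parser; the class name MetaCollector and the helper extract_meta are names invented for this illustration, and the sample markup is a shortened copy of Example 1. Notice that the collector ends up with bare field names such as "date" and "type," with nothing to say which vocabulary defines them.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is not None and content is not None:
            # Repeated names are appended rather than overwritten.
            self.fields.setdefault(name.lower(), []).append(content)


SAMPLE_HTML = """
<html>
  <head>
    <title>Distributed Metadata</title>
    <meta name="creator" content="Dan Brickley and Rael Dornfest">
    <meta name="date" content="2000-10-29T00:34:00+00:00">
    <meta name="type" content="article">
  </head>
</html>
"""


def extract_meta(html_text):
    """Return a dict mapping each meta name to its list of content values."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.fields


meta_fields = extract_meta(SAMPLE_HTML)
for name, values in meta_fields.items():
    print(name, values)
```

The keys that come back are just strings chosen by the page author. Two pages that both use "type" may mean entirely different things by it, and the parser has no way to tell.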

XML represents another evolution in web architecture, and along with XML come namespaces. Example 2 illustrates some namespaces in use. Like peer-to-peer, namespaces exemplify decentralization. We can now mix descriptive elements defined by independent communities, without fear of naming clashes, since each piece of data is tied to a URI that provides a context and definition for it.

Example 2: Dublin Core metadata in an XML document


<?xml version="1.0" encoding="iso-8859-1"?>
 
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns="http://purl.org/rss/1.0/"
>
...
  <item rdf:about="http://www.oreillynet.com/.../metadata.html">
    <title>Distributed Metadata</title>
    <link>http://www.oreillynet.com/.../metadata.html</link>
    <dc:description>This article addresses...</dc:description>
    <dc:subject>metadata, rdf, peer-to-peer</dc:subject>
    <dc:creator>Dan Brickley and Rael Dornfest</dc:creator>
    <dc:publisher>O'Reilly &amp; Associates</dc:publisher>
    <dc:date>2000-10-29T00:34:00+00:00</dc:date>
    <dc:type>article</dc:type>
    <dc:language>en-us</dc:language>
    <dc:format>text/html</dc:format>
    <dc:rights>Copyright 2000, O'Reilly &amp; Associates, Inc.</dc:rights>
    ...
  </item>
  ...

In the example above, Dublin Core elements are prefixed with the namespace prefix dc:. The prefix is associated with the URI http://purl.org/dc/elements/1.1/ by the xmlns:dc construct at the beginning of the document. dc:subject is therefore understood to mean "the subject element in the dc namespace as defined at http://purl.org/dc/elements/1.1/".
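A short Python sketch may help show what a namespace-aware parser does with such a document. It uses the standard library's xml.etree.ElementTree, which expands each prefixed name into the form {namespace-URI}local-name. The document below is a cut-down version of Example 2; the example.org address and the helper name expanded_names are placeholders introduced here, not part of the original.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
RSS_NS = "http://purl.org/rss/1.0/"

# Cut-down version of Example 2; the rdf:about URL is a placeholder.
SAMPLE_RDF = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/metadata.html">
    <title>Distributed Metadata</title>
    <dc:subject>metadata, rdf, peer-to-peer</dc:subject>
    <dc:creator>Dan Brickley and Rael Dornfest</dc:creator>
  </item>
</rdf:RDF>
"""


def expanded_names(xml_text):
    """Map each child of the first RSS item to its text, keyed by expanded name."""
    root = ET.fromstring(xml_text)
    item = root.find(f"{{{RSS_NS}}}item")
    return {child.tag: child.text for child in item}


names = expanded_names(SAMPLE_RDF)
for tag, text in names.items():
    print(tag, "->", text)
```

The prefix dc: never reaches the application. What the parser hands over is the full URI plus the local name, so a "subject" from Dublin Core cannot be confused with a "subject" defined by some other community.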

Namespaces let each author weave additional semantics required by particular types of resources or appropriate to a specific realm with the more general resource description such as that provided by the Dublin Core. In the book world, an additional definition might be the ISBN or Library of Congress number, while in the music world, it might be some form of compact disc identifier.

Now, we're not insisting that each and every document be described using all 15 Dublin Core elements and along various other lines as well. Something to keep in mind, however, is that every bit of metadata provides an outsized increase in available semantics, making resources less ambiguous and easier to find. Peer-to-peer application developers may then use the descriptions provided by a resource rather than having to resort to guesswork or such extremes as sequestering resources of a certain type to their own network.

Searching

Searching is the bane of the Web's existence, despite the plethora of search tools -- Yahoo currently lists 193 registered web search engines.[2] Search engines typically suffer from a lack of semantics on both the gathering and querying ends. On the gathering side, search engines typically utilize one of two methods:

On the querying end, while some sites do make an attempt to narrow the context for particular word searches (using such categories as "all the words," "any of the words," or "in the title"), successful searching still comes down to keywords and best guess. It's virtually impossible to remove the ambiguity between concepts like "by" and "about" -- "find me all articles written by Andy Oram" versus "find me anything about Andy Oram." Queries like "find me anything on Perl written by the person whose e-mail address is larry@wall.org" are out of the question.
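The difference between "by" and "about" is easy to see once records carry even a little structure. The following Python sketch is only an illustration: the two records, their titles, and the function names keyword_search and field_search are invented here, with field names borrowed from Dublin Core.

```python
# Two invented records; "creator" and "subject" follow Dublin Core naming.
RECORDS = [
    {"title": "Article A", "creator": ["Andy Oram"], "subject": ["peer-to-peer"]},
    {"title": "Article B", "creator": ["Another Author"], "subject": ["Andy Oram"]},
]


def keyword_search(records, text):
    """Plain text match across every field, as a keyword engine would do."""
    hits = []
    for record in records:
        values = [record["title"], *record["creator"], *record["subject"]]
        if any(text.lower() in value.lower() for value in values):
            hits.append(record["title"])
    return hits


def field_search(records, field, value):
    """Match only within one named metadata field."""
    return [r["title"] for r in records if value in r.get(field, [])]


print(keyword_search(RECORDS, "Andy Oram"))           # both records match
print(field_search(RECORDS, "creator", "Andy Oram"))  # written by
print(field_search(RECORDS, "subject", "Andy Oram"))  # written about
```

The keyword search cannot separate the two articles, while the field-aware search can, simply because the description says which role the name plays.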

While the needs of users clearly call for semantically rich queries, some peer-to-peer applications and systems are doing little to provide even the simplest of keyword searches. While Freenet does provide the boon of an optional metadata file to accompany any resource added to the cloud, this is currently of minimal use, since no guidance exists on what this metadata file should contain and there is currently no search functionality. Gnutella's InfraSearch allows for a wonderfully diverse interpretation and subsequent processing of search terms: While a dictionary node sees "country" as a term to be looked up, an MP3 node may see it as a music genre. Unfortunately, however, the InfraSearch user interface still provides only a simple text entry field and little chance for the user to be an active participant in defining the parameters of his or her search.

Hopefully we'll see peer-to-peer applications emerging that empower both the content provider and end user by providing semantically rich environments for the description and subsequent retrieval of content. This should be reflected both in the user interface and in the engine itself.

Resources and relationships: A historical overview

So where does this all leave us? How do we infuse our peer-to-peer applications with the metadata lessons learned from the Web?

The core of the World Wide Web Consortium's (W3C) metadata vision is a concept known as the Semantic Web. This is not a separate Web from the one we currently weave and wander, but a layer of metadata providing richer relationships between the ostensibly disparate resources we visit with our mouse clicks. While HTML's hyperlinks are simple linear paths lacking any obvious meaning, such semantics do exist and need only a means of expression.

Enter the Resource Description Framework[3] (RDF) -- a data model and XML serialization syntax for describing resources both on and off the Web. RDF turns those flat hyperlinks into arcs, allowing us to label not only the endpoints, but the arc itself -- in other words, ascribe meaning to the relationship between the two resources at hand. A simple link between Andy Oram's homepage and an article on the O'Reilly Network provides little insight into the relationship between the two. RDF disambiguates the relationship: "Andy wrote this particular article" versus "this is an article about Andy" versus "Andy found this article rather interesting."
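One way to picture labeled arcs is as a list of (subject, predicate, object) statements. The Python sketch below is a simplification, not an RDF library: the example.org URIs, the foundInterestingBy property, and the function names are all made up for illustration, while the two Dublin Core property URIs are real.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One labeled arc: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


ANDY = "http://example.org/people/andy"  # placeholder URI
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
FOUND_INTERESTING_BY = "http://example.org/terms/foundInterestingBy"  # invented

TRIPLES = [
    Triple("http://example.org/articles/1", DC_CREATOR, ANDY),
    Triple("http://example.org/articles/2", DC_SUBJECT, ANDY),
    Triple("http://example.org/articles/3", FOUND_INTERESTING_BY, ANDY),
]


def plain_links_to(triples, obj):
    """What an unlabeled hyperlink gives us: everything connected to obj."""
    return [t.subject for t in triples if t.obj == obj]


def related_by(triples, predicate, obj):
    """What a labeled arc gives us: only resources related in one specific way."""
    return [t.subject for t in triples if t.predicate == predicate and t.obj == obj]


print(plain_links_to(TRIPLES, ANDY))
print(related_by(TRIPLES, DC_CREATOR, ANDY))
```

With the arc label discarded, all three articles look alike. With it, "Andy wrote this" and "this is about Andy" become separate, queryable facts.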

RDF's history itself shows how emerging peer-to-peer applications can benefit from a generalized and consistent metadata framework. RDF has roots in an earlier effort, the Platform for Internet Content Selection, or PICS. One of the original goals for PICS was to facilitate a wide range of rating and filtering services, particularly in the areas of child protection and filtering of pornographic content. It defined a simple metadata "label" format that could encode a variety of classification and rating vocabularies (e.g., RSACi, MedPICS[4]). It included the goal of allowing diverse communities to create their own content rating languages and networked metadata services for distributing these descriptive labels. While originally it defined a pretty comprehensive set of tools for rating and filtering systems, PICS as initially defined did not play well with other metadata applications. The protocols, data formats, and accompanying infrastructure were too tightly coupled to one narrow application -- it wasn't general enough to be useful for everyone.

One critical piece PICS lacked was a namespaces mechanism that would allow a single PICS label to draw upon multiple, independently managed vocabularies. The designers of PICS eventually realized that all the work they had put into a well-designed query protocol, a digital signatures system, vocabularies, and so forth risked being reinvented for various other, non-PICS-specific metadata applications.

The threat of such duplication led to the invention of RDF. Unlike PICS, RDF has a highly general information model designed from the ground up to allow diverse applications to create data that can be easily intermingled. However diverse, RDF applications all share a common strategy: They talk about unambiguously named properties of unambiguously named resources. To eliminate ambiguous interpretations of properties such as "type" or "format," RDF rests on unique identifiers.

Foundations of resource description: Unique identifiers

Unique identification is the critical empowering technology for metadata. We benefit from having unique identifiers for both the things we describe (resources), and the ways we describe them (properties). In RDF, we call the things we're describing resources regardless of whether they're people, places, documents, movies, images, databases, etc. All RDF applications adopt a common convention for identifying these things (regardless of what else they disagree about!).

We identify the things we're describing with Uniform Resource Identifiers, or URIs.[5] You're most probably familiar with one subset of URIs, the Uniform Resource Locator or URL. While URLs are concerned with the location and retrieval of resources, URIs more generally are unique identifiers for things that may not necessarily be retrievable.

We also need clarity concerning properties, which are how we describe our resources. To say that something is of a particular type, or has a certain relationship to another resource, or has some specified attribute, we need to uniquely identify our descriptive concepts. RDF uses URIs for these too. Different communities can invent new descriptive properties (such as person, employee, price, and classification) and assign URIs to these properties.

Since the assignment of URIs is decentralized, we can be sure that uniquely named descriptive properties don't get mixed up when we integrate metadata from multiple sources. An auto-maker's concept of "type" is different from a cheese-maker's. The use of URIs such as http://webuildcars.org/descriptions/types and http://weagecheese.org/descriptions/type/ serves to uniquely identify the particular "type" we're using to describe a resource.
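A small Python sketch shows what goes wrong without URIs and what goes right with them. The merge function and the values "sedan" and "cheddar" are invented for this illustration; the two property URIs are the ones used in the text above.

```python
CAR_TYPE = "http://webuildcars.org/descriptions/types"
CHEESE_TYPE = "http://weagecheese.org/descriptions/type/"


def merge(*descriptions):
    """Combine property/value descriptions from several sources."""
    merged = {}
    for description in descriptions:
        for prop, value in description.items():
            merged.setdefault(prop, []).append(value)
    return merged


# Bare property names: both sources say "type", so the values collide.
ambiguous = merge({"type": "sedan"}, {"type": "cheddar"})

# URI-named properties: each value stays attached to its own definition.
unambiguous = merge({CAR_TYPE: "sedan"}, {CHEESE_TYPE: "cheddar"})

print(ambiguous)
print(unambiguous)
```

After the first merge there is a single "type" holding two unrelated values and no way to recover which vocabulary each came from. After the second, the two senses of "type" sit side by side without interfering.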

One critical lesson we can take away from the PICS story is that, when it comes to metadata, it is very hard to partition the problem space. The things we want to describe, the things we want to say about them, and the things we want to do with this data are all deeply entangled. RDF is an attempt to provide a generalized framework for all types of metadata. By providing a consistent abstraction layer that goes below surface differences, we gain an elegant core architecture on which to build. There is no limit to the material or applications RDF supports: Through different URIs and namespaces, different groups can extend the common RDF model to describe the needs of the peer-to-peer application at hand. No standards committee or centralized initiative gets to decide how we describe things. Applications can draw upon multiple descriptive vocabularies in a consistent, principled manner. The combination of these two attributes -- consistent framework and decentralized descriptive concepts -- is a powerful architecture for the peer-to-peer applications being built today.

When it comes to metadata, the network becomes a poorer information resource whenever we create artificial boundaries between metadata applications. The Web's own metadata system, RDF, was built in acknowledgment of this. There is little reason to suppose peer-to-peer content is different in this regard since we're talking about pretty much the same kind of content, albeit in a radically new environment.

A contrasting evolution: MP3 and the metadata marketplace

The alternatives to erecting a rigorous metadata architecture like RDF can be illustrated by the most popular decentralized activity on the Internet today: MP3 file exchange.

How do people find out the names of songs on the CDs they're playing on their networked PCs? One immediate problem is that there is nothing resembling a URI scheme for naming CDs; this makes it difficult to agree on a protocol for querying metadata servers about the properties of those CDs. While one might imagine taking one of the various CDDB-like algorithms and proposing a URI scheme for universal adoption (for instance, cd:894120720878192091), in practice this would be time-consuming and somewhat politicized. Meanwhile, peer-to-peer developers just want to build killer apps; they don't want to spend 18 months on a standards committee specifying the identifiers for compact discs (or people or films...). Most of us can't afford the time to create metadata tags, and if we could, we'd doubtless think of more interesting ways of using that time.

What to do? Having just stressed the importance of unique names when describing content, can we get by without them? Actually, it appears so.

Every day thousands of MP3 users work around the unique identification problem without realizing it. Their CD rippers inspect the CD, compute one of several identifying properties for the CD they're digitizing, and use this uniquely identifying property to consult a networked metadata service. This is metadata in action on a massive scale. But it also smacks of the PICS problem. MP3 listeners have settled on an application-specific piece of infrastructure rather than a more useful, generalized approach.

These metadata services exist and operate very successfully today, despite the lack of any canonical "standard" identifier syntax for compact discs. The technique they use to work around the standards bottleneck is simple, being much the same as saying things like "the person whose personal mailbox is..." or "the company whose corporate homepage is...". Being simple, it can (and should) be applied in other contexts where peer-to-peer and web applications want to query networked services for metadata. There's no reason to use a different protocol when asking for a CD track list and when asking for metadata describing any other kind of thing.

The basic protocol being used in CD metadata query is both simple and general: "Tell me what you know about the resource whose CD checksum is some-huge-number" -- a protocol reminiscent of the PICS label bureau protocol. The MP3 community could build enormously useful services on top of this, even without adopting a more general framework such as that provided by RDF, but they have stopped short of the next step.
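The shape of that query can be sketched in a few lines of Python. Everything here is an assumption made for illustration: disc_checksum is a stand-in that hashes track offsets and is not the algorithm any real CD database uses, MetadataService is an in-memory toy rather than a network protocol, and the offsets, track names, and mailbox address are invented.

```python
import hashlib


def disc_checksum(track_offsets):
    """Hypothetical identifying property: a hash of the disc's track offsets."""
    joined = ",".join(str(offset) for offset in track_offsets)
    return hashlib.sha1(joined.encode("ascii")).hexdigest()


class MetadataService:
    """Toy service answering: what do you know about the resource whose P is V?"""

    def __init__(self):
        self._descriptions = {}

    def register(self, property_name, value, description):
        self._descriptions[(property_name, value)] = dict(description)

    def tell_me_about(self, property_name, value):
        # An unknown resource yields an empty description, not an error.
        return self._descriptions.get((property_name, value), {})


service = MetadataService()
offsets = [150, 20512, 41277, 60102]  # invented sample data
service.register(
    "cd-checksum",
    disc_checksum(offsets),
    {"tracks": ["Track One", "Track Two", "Track Three", "Track Four"]},
)
# The same question works for things that are not CDs at all.
service.register("mailbox", "mailto:someone@example.org", {"name": "Example Person"})

print(service.tell_me_about("cd-checksum", disc_checksum(offsets)))
print(service.tell_me_about("mailbox", "mailto:someone@example.org"))
```

The service never needs a canonical name for the disc. It only needs some property whose value picks out one resource, which is why the same interface serves "the CD whose checksum is..." and "the person whose mailbox is..." equally well.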

On the contrary, while MP3 CD rippers currently embed lots of descriptive information (track listings) right into the encoding, they omit the most crucial piece of data from a fan's point of view: the CD and track identifiers. The simple unique identifier for a song on a CD, while only a tiny fragment of data, could allow both peer-to-peer and web applications to hook into a marketplace of descriptive services. How could MP3 services use this information?

One application is to update the metadata inside MP3 files, either to correct errors or to add additional information. If we don't know which CD an MP3 file was derived from, it becomes hard to know which MP3 files to update when we learn more about that CD. MP3s of collected works (i.e., compilations) typically have very poor embedded metadata. Artist names often appear inside the track name, for example. This makes for difficulties in finding information: If I want to generate a browsable listing organized alphabetically by artist, I don't want half the songs filed away under "Various Artists," nor do I want to find dozens of artist names in the "By Track Title" listings. Embedding unique identifiers in MP3s would allow this mess to be fixed at a later date.
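As a rough Python sketch of that repair step, suppose each MP3 carried the identifier of the CD and track it came from. The field names cd_id and track_no, the sample library, and the function apply_corrections are all hypothetical; no existing tag format is implied.

```python
def apply_corrections(library, corrections):
    """Update entries whose (cd_id, track_no) appears in corrections.

    Returns the number of entries that were updated.
    """
    updated = 0
    for entry in library:
        key = (entry.get("cd_id"), entry.get("track_no"))
        if key in corrections:
            entry.update(corrections[key])
            updated += 1
    return updated


# Invented sample data: one file kept its identifiers, one did not.
LIBRARY = [
    {"file": "track03.mp3", "cd_id": "cd-0001", "track_no": 3,
     "artist": "Various Artists", "title": "Some Artist - Some Song"},
    {"file": "unknown.mp3", "cd_id": None, "track_no": None,
     "artist": "Various Artists", "title": "Other Artist - Other Song"},
]

# Better metadata learned later about track 3 of the same CD.
CORRECTIONS = {
    ("cd-0001", 3): {"artist": "Some Artist", "title": "Some Song"},
}

fixed_count = apply_corrections(LIBRARY, CORRECTIONS)
print(fixed_count, LIBRARY[0]["artist"], LIBRARY[1]["artist"])
```

The file that kept its identifiers can be corrected automatically. The one without them stays filed under "Various Artists," because nothing connects it to the improved description.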

Another example can be found in the practice of sharing playlists: Given some convention for identifying songs and tracks, we can describe virtual, personalized compilation albums that another listener can re-create on his personal system by asking a peer-to-peer network for files representing those tracks. Unique identification strategies would provide the architectural glue that would allow us to reconnect fragmented information resources. Were someone to put a unique identification service in place, we could soon expect all kinds of new applications built on top:

The lesson for peer-to-peer metadata architecture is simple. Unique identifiers create markets. If you want to build interesting peer-to-peer applications that hook into a wide range of additional services, adopt the same strategy for uniquely identifying things that others are using.
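The playlist-sharing idea mentioned earlier is one small instance of this. In the Python sketch below, a playlist is nothing more than a list of track identifiers, and each listener resolves it against whatever files are available to him. The identifier format, the sample index, and the function resolve_playlist are assumptions made for illustration only.

```python
def resolve_playlist(playlist, local_index):
    """Split a shared playlist into files we have and identifiers still to fetch."""
    found, missing = [], []
    for track_id in playlist:
        if track_id in local_index:
            found.append(local_index[track_id])
        else:
            missing.append(track_id)
    return found, missing


# A shared playlist: (cd identifier, track number) pairs, all invented.
PLAYLIST = [("cd-0001", 3), ("cd-0002", 1), ("cd-0003", 7)]

# One listener's local files, indexed by the same identifiers.
LOCAL_INDEX = {
    ("cd-0001", 3): "music/track03.mp3",
    ("cd-0003", 7): "music/track07.mp3",
}

found, missing = resolve_playlist(PLAYLIST, LOCAL_INDEX)
print(found)    # files that can be played right away
print(missing)  # identifiers to ask the peer-to-peer network for
```

Because both listeners name tracks the same way, the playlist itself contains no audio and no filenames, yet it is enough to rebuild the compilation on another machine.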

Conclusion

Metadata applied at a fundamental level, early in the game, will provide rich semantics upon which innovators can build peer-to-peer applications that will amaze us with their flexibility. While the symmetry of peer-to-peer brings about a host of new and interesting ways of interacting, there's no substitute for taking the opportunity to rethink our assumptions and learn from the mistakes made on the Web. Let's not continue the screen-scraping modus operandi; rather, let's replace extrapolation with forethought and rich assertions.

To summarize with a call to action for peer-to-peer architects, project leaders, developers, and end users:

1. Dublin Core Metadata Initiative; "Metadata With a Mission: Dublin Core"; Dublin Core Metadata Element Set, Version 1.1.

2. Yahoo's "Search Engines" category

3. Resource Description Framework

4. Links to PICS vocabularies and W3C specifications, "Metadata, PICS and Quality" (1997).

5. URI defines a simple text syntax for URLs, URNs and similar controlled names for use on the Internet (http://www.w3.org/Addressing).


Rael Dornfest is Founder and CEO of Portland, Oregon-based Values of n. Rael leads the Values of n charge with passion, unearthly creativity, and a repertoire of puns and jokes — some of which are actually good. Prior to founding Values of n, he was O'Reilly's Chief Technical Officer, program chair for the O'Reilly Emerging Technology Conference (which he continues to chair), series editor of the bestselling Hacks book series, and instigator of O'Reilly's Rough Cuts early access program. He built Meerkat, the first web-based feed aggregator, was champion and co-author of the RSS 1.0 specification, and has written and contributed to six O'Reilly books. Rael's programmatic pride and joy is the nimble, open source blogging application Blosxom, the principles of which you'll find in the Values of n philosophy and embodied in Stikkit: Little yellow notes that think.



 

Copyright © 2009 O'Reilly Media, Inc.