PHP DevCenter

Using PHP 5's SimpleXML

XML Namespaces

SimpleXML even makes processing RSS 1.0 feeds easy. RSS 1.0 uses XML namespaces, which can present a bit of a headache during parsing. With XML namespaces, each element lives under a URL, which acts as a package name. This allows you to distinguish between, say, the HTML <title> element and the RSS <title> element.



All of a sudden things became more complex. You can no longer refer to title, since an unadorned title doesn't let the processor know which <title> you mean. You could be thinking of the RSS item <title>, but there's also an HTML <title> in the document.

As a result, there's now {http://www.w3.org/1999/xhtml}:title and also {http://purl.org/rss/1.0/}:title instead. XML uses the colon (:) as a demarcation character between the namespace part and the plain tag name. In technical language, the complete name is called the qualified name, or the qname for short. (Really!)

Since URLs are long, you can map a short word to the URL. So, you frequently end up referring to these elements as <xhtml:title> and <rss:title>. These short names are known as namespace prefixes. However, it's the URL that's important, so prefixes like xhtml and rss are conventions, not actual namespaces. (It's important to mention that the URL doesn't have to resolve to a web page; it's just an easy way for people to create non-conflicting namespaces.)
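To see that the prefix really is just a local alias, here is a minimal sketch using PHP 5's DOM extension. The two one-element documents and the feed prefix are made up for illustration; only the namespace URL comes from the RSS 1.0 discussion above.

```php
<?php
// Two hypothetical documents: same namespace URL, different prefixes.
$a = new DOMDocument();
$a->loadXML('<rss:title xmlns:rss="http://purl.org/rss/1.0/">News</rss:title>');

$b = new DOMDocument();
$b->loadXML('<feed:title xmlns:feed="http://purl.org/rss/1.0/">News</feed:title>');

$first  = $a->documentElement;
$second = $b->documentElement;

// The prefixes differ...
print $first->prefix . ' vs. ' . $second->prefix . "\n";

// ...but the namespace URL and the local name are identical,
// so a namespace-aware parser treats these as the same element name.
$same = $first->namespaceURI === $second->namespaceURI
     && $first->localName === $second->localName;
print $same ? "same element name\n" : "different element names\n";
```

The comparison deliberately ignores the prefix, which is what a namespace-aware processor does as well.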

SimpleXML likes the world to be simple, so it pretends the namespaces don't exist. (I know a whole crowd of readers feel this cure is worse than the disease. Remember, however, this is SimpleXML. If you're worried about namespace clashes, use DOM.)

Here's the same data as before, encoded as RSS 1.0 and saved as rss-1.0.xml:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/"
>
<channel rdf:about="http://www.php.net/">
    <title>PHP: Hypertext Preprocessor</title>
    <link>http://www.php.net/</link>
    <description>The PHP scripting language web site</description>
</channel>

<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <link>http://www.php.net/downloads.php</link>
    <description>
    PHP 5.0 Beta 3 has been released. The third beta of PHP is 
    also scheduled to be the last one (barring unexpected surprises).
    </description>
    <dc:date>2004-01-02</dc:date>
</item>

<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <link>http://shiflett.org/archive/19</link>
    <description>
    Members of the PHP community are seeking volunteers to help 
    develop the first web site that is created both by the community and for 
    the community.
    </description>
    <dc:date>2003-12-18</dc:date>
</item>

</rdf:RDF>

This XML document uses three different namespaces. At the top of the file, two of them are given explicit prefix mappings: xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" and the line after it associate those URLs with the prefixes rdf and dc. You can see the rdf:RDF element, the rdf:about attribute, and the dc:date element within the document.

RDF is "Yet Another XML Spec" (YAXMLS). I won't go into it here, but you can learn more on the W3C RDF site and in Tim Bray's article, What is RDF?, on XML.com. O'Reilly also has a book on RDF titled Practical RDF.

There's also one namespace declaration without a prefix, xmlns="http://purl.org/rss/1.0/". That's the default namespace, since there's no colon and prefix after xmlns. Elements without a prefix, like item and title, live in the default namespace. This is different from RSS 0.91, where elements do not live in any namespace.
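If you want to check which namespaces a document declares, the following sketch lists them. It assumes a released PHP 5 build (5.1.2 or later) where SimpleXMLElement::getDocNamespaces() is available, and it uses a trimmed-down copy of the feed embedded as a string so the example runs on its own.

```php
<?php
// Trimmed-down copy of the RSS 1.0 feed, embedded for illustration.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <dc:date>2004-01-02</dc:date>
</item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Returns an array of prefix => URL pairs declared on the root element.
// The default namespace shows up under the empty-string key.
$namespaces = $s->getDocNamespaces();

foreach ($namespaces as $prefix => $url) {
    $label = ($prefix === '') ? '(default)' : $prefix;
    print "$label => $url\n";
}
```

The entry labeled (default) is the xmlns declaration without a prefix; the other two are the rdf and dc mappings.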

To search for elements in a namespace under DOM, you need to switch to a new set of methods, where you pass in the tag and the namespace. As I said earlier, SimpleXML just barges forward with its head down. You can use the exact same syntax with RSS 1.0 as earlier:

$s = simplexml_load_file('rss-1.0.xml');

foreach ($s->item as $item) {
    print $item->title . "\n";
}

PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced

This is not a problem because, despite all the namespace vigilance, there are no name clashes in the document.
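For comparison, here is a rough sketch of the namespace-aware DOM approach mentioned above, where every lookup takes both the namespace URL and the tag name. It uses a trimmed-down copy of the feed embedded as a string; the variable names are arbitrary.

```php
<?php
// Trimmed-down copy of the RSS 1.0 feed, embedded for illustration.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$rss    = 'http://purl.org/rss/1.0/';
$titles = array();

// Each lookup names the namespace URL explicitly, then the local tag name.
foreach ($dom->getElementsByTagNameNS($rss, 'item') as $item) {
    $title    = $item->getElementsByTagNameNS($rss, 'title')->item(0);
    $titles[] = $title->nodeValue;
    print $title->nodeValue . "\n";
}
```

It prints the same two titles as the SimpleXML loop, at the cost of spelling out the namespace on every call.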

XML Namespaces and XPath

However, SimpleXML is not completely naive. It recognizes the potential for problems with this attitude. Therefore, you can distinguish between two namespaced elements with XPath, but you need to use namespace prefixes.

SimpleXML automatically registers all the non-default namespace prefixes, but you need to handle the default namespace. (This lack of default namespace mapping is a deficit in XPath 1.0, not SimpleXML.)

To find and print all rss:title entries:

$s = simplexml_load_file('rss-1.0.xml');
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xpath('//rss:item/rss:title');

foreach ($titles as $title) {
    print "$title\n";
}

PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced

After loading the file, manually register a namespace prefix to go with http://purl.org/rss/1.0/. You're free to select any prefix you want, but rss is a natural choice.

The new XPath query now looks for //rss:item/rss:title instead of plain old //item/title, since it needs namespace prefixes. It's a little funny that there's no way to define a default namespace prefix for an XPath search, but that's how it is. Even though these elements don't have explicit prefixes in the document, they need prefixes in the XPath query.
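The sketch below illustrates both points: an unprefixed query finds nothing, and the prefix you register can be any name you like. It assumes a released PHP 5 build, where the relevant SimpleXML methods are registerXPathNamespace() and xpath(); the feed prefix is an arbitrary choice, and the feed is a trimmed-down copy embedded as a string.

```php
<?php
// Trimmed-down copy of the RSS 1.0 feed, embedded for illustration.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Any prefix works as long as it maps to the right URL.
$s->registerXPathNamespace('feed', 'http://purl.org/rss/1.0/');

// Unprefixed names in XPath 1.0 only match elements in no namespace.
$plain      = $s->xpath('//item/title');
$plainCount = is_array($plain) ? count($plain) : 0;

// Prefixed names match the elements in the default RSS namespace.
$prefixed      = $s->xpath('//feed:item/feed:title');
$prefixedCount = is_array($prefixed) ? count($prefixed) : 0;

print "$plainCount match(es) without a prefix\n";
print "$prefixedCount match(es) with a prefix\n";
```

The is_array() checks are there only as a guard in case xpath() reports an error instead of returning a result list.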

You can use XPath to take advantage of the additional data in the RSS feed. For instance, to find and print all the entries from January 2004:

$s = simplexml_load_file('rss-1.0.xml');
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xpath('//rss:item[
               starts-with(dc:date, "2004-01-")]/rss:title');

foreach ($titles as $title) {
    print "$title\n";
}

PHP 5.0.0 Beta 3 Released

The first two lines are the same, but I've modified the XPath query to filter the results. In XPath, you can request a subset of elements in a level by requiring them to match a test inside of square brackets ([]). This test requires the dc:date element under the current rss:item to begin with the string 2004-01-. If so, starts-with() returns true, and XPath knows to include it in the results. (These dates are part of the Dublin Core Metadata specification, hence the prefix of dc.)

This prints only one title because the Community Site item was posted in December, while Beta 3 came out in January. (Actually, it came out at the end of December, but it makes the example easier to explain.)
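XPath isn't the only way to reach the dc:date values. As a sketch, assuming a released PHP 5 build where SimpleXMLElement::children() accepts a namespace URL, you can read the Dublin Core elements of each item directly. The feed here is a trimmed-down copy embedded as a string.

```php
<?php
// Trimmed-down copy of the RSS 1.0 feed, embedded for illustration.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
XML;

$s    = simplexml_load_string($xml);
$dcNs = 'http://purl.org/dc/elements/1.1/';

$dates = array();
foreach ($s->item as $item) {
    // Ask for the item's children that live in the Dublin Core namespace.
    $dc      = $item->children($dcNs);
    $dates[] = (string) $dc->date;
    print $item->title . ' (' . $dc->date . ")\n";
}
```

Passing the URL rather than the dc prefix keeps the code working even if a feed maps Dublin Core to a different prefix.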

Other Features

SimpleXML has a few more features: you can edit elements and attributes in place by assigning them a new value. Then, you can save the modified XML document to a file or store it in a PHP variable. Additionally, you can validate XML documents using XML Schema.
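Here is a minimal sketch of editing in place and saving the result. The configuration document, its element names, and the new values are all made up for illustration; asXML() is the SimpleXML method that serializes the document in released PHP 5.

```php
<?php
// A small, hypothetical configuration document.
$xml = '<config><database host="localhost"><name>test</name></database></config>';

$config = simplexml_load_string($xml);

// Edit an element and an attribute by assigning new values.
$config->database->name    = 'production';
$config->database['host']  = 'db.example.com';

// Store the modified document in a PHP variable...
$modified = $config->asXML();
print $modified;

// ...or pass a filename to write it to disk instead:
// $config->asXML('config.xml');
```

Called without arguments, asXML() returns the document as a string; with a filename, it writes the file.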

Besides RSS, SimpleXML is also perfect for parsing configuration files and consuming web services with REST. Additionally, I'm sure that as PHP 5 evolves, SimpleXML will gain even more functionality. Keep an eye peeled for the announcements and enjoy playing with SimpleXML.

Adam Trachtenberg is the manager of technical evangelism for eBay and is the author of two O'Reilly books, "Upgrading to PHP 5" and "PHP Cookbook." In February he will be speaking at Web Services Edge 2005 on "Developing E-Commerce Applications with Web Services" and at the O'Reilly booth at LinuxWorld on "Writing eBay Web Services Applications with PHP 5."

