 Published on ONLamp.com (http://www.onlamp.com/)



Using PHP 5's SimpleXML

by Adam Trachtenberg, coauthor of PHP Cookbook
01/15/2004

XML is great, but I've constantly wondered why it's so difficult to parse. Most languages provide you with three options: SAX, DOM, and XSLT. Each has its own problems.

SimpleXML is a new and unique feature of PHP 5 that solves these problems by turning an XML document into a data structure you can iterate through like a collection of arrays and objects. It excels when you're only interested in an element's attributes and text and you know the document's layout ahead of time. SimpleXML is easy to use because it handles only the most common XML tasks, leaving the rest for other extensions.

This article shows how to use SimpleXML to read an XML file, parse the results into a useful form, and query the document with XPath. I use RSS for the examples, since some versions of RSS are nice and easy. Then there's RSS 1.0. It uses RDF, multiple namespaces, and defines a default namespace for its elements. (Not so nice and easy.)

Along the way, there's a brief discussion on XML namespaces and XPath, since they're necessary to process XML documents that expand beyond the basics. In particular, to handle RSS 1.0, you need to work with these XML specifications.

To try SimpleXML, you need a copy of PHP 5 Beta 3, as not everything described here works in earlier versions. SimpleXML also requires libxml2, an open source XML parsing library that all of PHP 5's XML extensions now use. SimpleXML support is enabled by default, so it's automatically installed when you build PHP 5.

Like PHP 5, SimpleXML is beta quality. There are still a few bugs, memory leaks, and unimplemented features, but overall it's coming together nicely.

Reading XML

The first set of examples uses the following chunk of RSS, which is stored in rss-0.91.xml:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="0.91">
<channel>
    <title>PHP: Hypertext Preprocessor</title>
    <link>http://www.php.net/</link>
    <description>The PHP scripting language web site</description>
</channel>

<item>
    <title>PHP 5.0.0 Beta 3 Released</title>
    <link>http://www.php.net/downloads.php</link>
    <description>PHP 5.0 Beta 3 has been released. The third beta 
    of PHP is also scheduled to be the last one (barring unexpected 
    surprises).</description>
</item>
<item>
    <title>PHP Community Site Project Announced</title>
    <link>http://shiflett.org/archive/19</link>
    <description>
    Members of the PHP community are seeking volunteers to help 
    develop the first web site that is created both by the community and for 
    the community.</description>
</item>
</rss>

To begin, create a new SimpleXML object. For XML on disk, use simplexml_load_file('/path/to/file.xml'). If it's stored in a PHP variable, use simplexml_load_string($xml). So, to load the RSS, do:

$s = simplexml_load_file('rss-0.91.xml');
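Both loader functions return false when the document cannot be parsed, so it is worth checking the result before touching it. The following is a minimal sketch, assuming a released version of PHP (5.3 or later for the nowdoc syntax) rather than Beta 3; the inline $xml string is a hypothetical stand-in for rss-0.91.xml.

```php
<?php
// Hypothetical inline document standing in for rss-0.91.xml.
$xml = <<<'XML'
<rss version="0.91">
<channel>
    <title>PHP: Hypertext Preprocessor</title>
</channel>
</rss>
XML;

// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$s = simplexml_load_string($xml);
if ($s === false) {
    // Report every parse error libxml recorded, then stop.
    foreach (libxml_get_errors() as $error) {
        print 'XML error: ' . trim($error->message) . "\n";
    }
    exit(1);
}

// The document parsed, so its elements are safe to read.
$channelTitle = (string) $s->channel->title;
print $channelTitle . "\n";
```

The same check applies unchanged to simplexml_load_file().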

Element text is accessed like object properties:

print $s->channel->title . "\n";

PHP: Hypertext Preprocessor

If there's more than one element at the same level in the document, they're placed inside an array. In this example, there's only one <channel>, but two <item>s. To access an <item>, use its location in the array:

print $s->item[0]->title . "\n";

PHP 5.0.0 Beta 3 Released

To print all titles, use a foreach loop:

foreach ($s->item as $item) {
    print $item->title . "\n";
}

PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced
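To parse the results into a more useful form, you can copy the values into ordinary PHP arrays as you loop. This is a sketch under the assumption of a released PHP version; the $items array and its keys are names chosen for illustration. Each property is itself a SimpleXML object, so the sketch casts to string before storing.

```php
<?php
// Hypothetical inline feed with the same two items as rss-0.91.xml.
$xml = <<<'XML'
<rss version="0.91">
<item>
    <title>PHP 5.0.0 Beta 3 Released</title>
    <link>http://www.php.net/downloads.php</link>
</item>
<item>
    <title>PHP Community Site Project Announced</title>
    <link>http://shiflett.org/archive/19</link>
</item>
</rss>
XML;

$s = simplexml_load_string($xml);

$items = array();
foreach ($s->item as $item) {
    // Cast to string: each property is a SimpleXML object, not plain text.
    $items[] = array(
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
    );
}

print count($items) . " items\n";
print $items[1]['title'] . "\n";
```

Once the data is in a plain array, it can be sorted, serialized, or passed to code that knows nothing about XML.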

Use array notation to read element attributes:

print $s['version'] . "\n";

0.91
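Attribute access has a few companions in released versions of PHP that are handy in practice. The sketch below, which assumes such a version, casts an attribute to a string, tests for a missing attribute with isset(), and walks every attribute with the attributes() method; the one-line document is made up for the example.

```php
<?php
$s = simplexml_load_string('<rss version="0.91"><channel/></rss>');

// Array notation returns an object; cast it before strict comparisons.
$version = (string) $s['version'];

// isset() reports whether the attribute exists on the element.
$hasEncoding = isset($s['encoding']);

// attributes() iterates over every attribute as name => value pairs.
$all = array();
foreach ($s->attributes() as $name => $value) {
    $all[$name] = (string) $value;
}

print $version . "\n";
print ($hasEncoding ? 'has' : 'no') . " encoding attribute\n";
```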

Other XML features, like comments and processing instructions, are unsupported. You can't (yet) access these nodes. However, since most XML documents don't place vital information in comments or use processing instructions, this isn't a big drawback.

Querying with XPath

SimpleXML uses XPath to allow you to gather information from a document. Find and print all the text inside title elements with:

foreach ($s->xsearch('//title') as $title) { 
    print "$title\n";
}

PHP: Hypertext Preprocessor
PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced

The xsearch() method searches a SimpleXML object and returns an array of matching nodes. Pass your XPath query as the argument. In this case, //title finds all title elements regardless of location in the tree. Or, restrict the search to only <title>s inside of <item>s with //item/title.
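In released versions of PHP the equivalent method is named xpath() rather than the xsearch() of Beta 3. As a sketch assuming such a version, here are both queries side by side against a hypothetical inline copy of the feed:

```php
<?php
// Hypothetical inline feed: one channel title and two item titles.
$xml = <<<'XML'
<rss version="0.91">
<channel>
    <title>PHP: Hypertext Preprocessor</title>
</channel>
<item>
    <title>PHP 5.0.0 Beta 3 Released</title>
</item>
<item>
    <title>PHP Community Site Project Announced</title>
</item>
</rss>
XML;

$s = simplexml_load_string($xml);

// Every <title>, wherever it sits in the tree.
$allTitles = $s->xpath('//title');

// Only the <title>s that are direct children of an <item>.
$itemTitles = array();
foreach ($s->xpath('//item/title') as $title) {
    $itemTitles[] = (string) $title;
}

print count($allTitles) . " titles in total\n";
print implode("\n", $itemTitles) . "\n";
```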

If you've used XSLT, you're familiar with XPath. XSLT templates use XPath expressions to determine when to process a node. For more on XPath, read John E. Simpson's XPath and XPointer (O'Reilly) or John's XML.com article, Top Ten Tips to Using XPath and XPointer. Additionally, Chapter 9 of XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means (O'Reilly), covers XPath and is available free online.

While these examples are somewhat trivial, XPath is quite useful with complex documents, as you can create sophisticated queries to return finely tuned results.

XML Namespaces

SimpleXML even makes processing RSS 1.0 feeds easy. RSS 1.0 uses XML namespaces, which can present a bit of a headache during parsing. With XML namespaces, each element lives under a URL, which acts as a package name. This allows you to distinguish between, say, the HTML <title> element and the RSS <title> element.

All of a sudden things became more complex. You can no longer refer to title, since an unadorned title doesn't let the processor know which <title> you mean. You could be thinking of the RSS item <title>, but there's also an HTML <title> in the document.

As a result, there's now {http://www.w3.org/1999/xhtml}:title and also {http://purl.org/rss/1.0/}:title instead. XML uses the colon (:) as a demarcation character between the URL and the plain tag name. In technical language, the complete name is called the qualified name, or the qname for short. (Really!)

Since URLs are long, you can map a short word to the URL. So, you frequently end up referring to these elements as <xhtml:title> and <rss:title>. These short names are known as namespace prefixes. However, it's the URL that's important, so prefixes like xhtml and rss are conventions, not actual namespaces. (It's important to mention that the URL doesn't have to resolve to a web page; it's just an easy way for people to create non-conflicting namespaces.)

SimpleXML likes the world to be simple, so it pretends the namespaces don't exist. (I know a whole crowd of readers feel this cure is worse than the disease. Remember, however, this is SimpleXML. If you're worried about namespace clashes, use DOM.)

Here's the same data as before, encoded as RSS 1.0 and saved as rss-1.0.xml:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/"
>
<channel rdf:about="http://www.php.net/">
    <title>PHP: Hypertext Preprocessor</title>
    <link>http://www.php.net/</link>
    <description>The PHP scripting language web site</description>
</channel>

<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <link>http://www.php.net/downloads.php</link>
    <description>
    PHP 5.0 Beta 3 has been released. The third beta of PHP is 
    also scheduled to be the last one (barring unexpected surprises).
    </description>
    <dc:date>2004-01-02</dc:date>
</item>

<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <link>http://shiflett.org/archive/19</link>
    <description>
    Members of the PHP community are seeking volunteers to help 
    develop the first web site that is created both by the community and for 
    the community.
    </description>
    <dc:date>2003-12-18</dc:date>
</item>

</rdf:RDF>

This XML document has three different namespaces. Looking at the top of the file, two namespaces have explicit namespace prefix mappings. That's what xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" and the following line do. They associate those URLs with the prefixes rdf and dc. You can see rdf:RDF, rdf:about, and dc:date elements and attributes within the document.

RDF is "Yet Another XML Spec" (YAXMLS). I won't go into it here, but you can learn more on the W3 RDF site and in Tim Bray's article, What is RDF?, on XML.com. O'Reilly also has a book on RDF titled Practical RDF.

There's also one namespace declaration without a prefix, xmlns="http://purl.org/rss/1.0/". That's the default namespace, since there's no colon after xmlns. Elements without a prefix, like item and title, live in the default namespace. This is different from RSS 0.91, where elements do not live in any namespace.
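If you want to see these mappings from code, released versions of PHP (5.1.2 and later) offer a getDocNamespaces() method. The sketch below assumes such a version and a trimmed, hypothetical copy of the RSS 1.0 document; the default namespace shows up under an empty-string prefix.

```php
<?php
// Trimmed, hypothetical copy of rss-1.0.xml with the same declarations.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/"
>
<channel rdf:about="http://www.php.net/">
    <title>PHP: Hypertext Preprocessor</title>
</channel>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Maps each prefix declared on the root element to its URL.
$namespaces = $s->getDocNamespaces();

foreach ($namespaces as $prefix => $url) {
    // The default namespace has no prefix, so label it for display.
    $label = ($prefix === '') ? '(default)' : $prefix;
    print "$label => $url\n";
}
```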

To search for elements in a namespace under DOM, you need to switch to a new set of methods, where you pass in the tag and the namespace. As I said earlier, SimpleXML just barges forward with its head down. You can use the exact same syntax with RSS 1.0 as earlier:

foreach ($s->item as $item) {
    print $item->title . "\n";
}

PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced

This is not a problem because, despite all the namespace vigilance, there are no name clashes in the document.
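One caveat for later releases: in released versions of PHP, plain property access reaches the unprefixed elements, but prefixed ones such as dc:date and rdf:about are reached by passing the namespace URL to children() or attributes(). The following is a sketch assuming such a version, with a trimmed, hypothetical copy of the feed; the variable names are chosen for illustration.

```php
<?php
// Trimmed, hypothetical copy of rss-1.0.xml.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/"
>
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
XML;

$dcNs  = 'http://purl.org/dc/elements/1.1/';
$rdfNs = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

$s = simplexml_load_string($xml);

$rows = array();
foreach ($s->item as $item) {
    $rows[] = array(
        // Unprefixed elements work with plain property access.
        'title' => (string) $item->title,
        // Prefixed elements: ask for the children in that namespace.
        'date'  => (string) $item->children($dcNs)->date,
        // Prefixed attributes: the same idea with attributes().
        'about' => (string) $item->attributes($rdfNs)->about,
    );
}

foreach ($rows as $row) {
    print $row['date'] . ' ' . $row['title'] . "\n";
}
```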

XML Namespaces and XPath

However, SimpleXML is not completely naive. It recognizes the potential for problems with this attitude. Therefore, you can distinguish between two namespaced elements with XPath, but you need to use namespace prefixes.

SimpleXML automatically registers all the non-default namespace prefixes, but you need to handle the default namespace. (This lack of default namespace mapping is a deficit in XPath 1.0, not SimpleXML.)

To find and print all rss:title entries:

$s = simplexml_load_file('rss-1.0.xml');
$s->register_ns('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xsearch('//rss:item/rss:title');

foreach ($titles as $title) {
    print "$title\n";
}

PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced

After loading the file, manually register a namespace prefix to go with http://purl.org/rss/1.0/. You're free to select any prefix you want, but rss is a natural choice.

The new XPath query now looks for //rss:item/rss:title instead of plain old //item/title, since it needs namespace prefixes. It's a little funny that there's no way to define a default namespace prefix for an XPath search, but that's how it is. Even though these elements don't have explicit prefixes in the document, they need prefixes in the XPath query.
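For readers on a released version of PHP, the same steps are spelled registerXPathNamespace() (PHP 5.2 and later) and xpath(). The sketch below assumes such a version and a trimmed, hypothetical copy of the feed; it also runs the unprefixed query to show that it finds nothing once a default namespace is in play.

```php
<?php
// Trimmed, hypothetical copy of rss-1.0.xml.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
>
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
</item>
<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
</item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Without a prefix, XPath looks for elements in no namespace at all.
$unprefixed = $s->xpath('//item/title');

// Map a prefix of our choosing to the default namespace URL.
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');

$titles = array();
foreach ($s->xpath('//rss:item/rss:title') as $title) {
    $titles[] = (string) $title;
}

print (empty($unprefixed) ? 'no' : 'some') . " unprefixed matches\n";
print implode("\n", $titles) . "\n";
```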

You can use XPath to take advantage of the additional data in the RSS feed. For instance, to find and print all the entries from January 2004:

$s = simplexml_load_file('rss-1.0.xml');
$s->register_ns('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xsearch('//rss:item[
               starts-with(dc:date, "2004-01-")]/rss:title');

foreach ($titles as $title) {
    print "$title\n";
}

PHP 5.0.0 Beta 3 Released

The first two lines are the same, but I've modified the XPath query to filter the results. In XPath, you can request a subset of elements in a level by requiring them to match a test inside of square brackets ([]). This test requires the dc:date element under the current rss:item to begin with the string 2004-01-. If so, starts-with() returns true, and XPath knows to include it in the results. (These dates are part of the Dublin Core Metadata specification, hence the prefix of dc.)

This prints only one title because the Community Site item was posted in December, while Beta 3 came out in January. (Actually, it came out at the end of December, but it makes the example easier to explain.)
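The date filter generalizes to any month. Here is one way it might be wrapped up, assuming a released version of PHP, where the methods are named xpath() and registerXPathNamespace(). The function titlesForMonth() is a hypothetical helper, not part of SimpleXML, and the sketch registers the dc prefix explicitly rather than relying on automatic registration.

```php
<?php
// Trimmed, hypothetical copy of rss-1.0.xml.
$xml = <<<'XML'
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/"
>
<item rdf:about="http://www.php.net/downloads.php">
    <title>PHP 5.0.0 Beta 3 Released</title>
    <dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
    <title>PHP Community Site Project Announced</title>
    <dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
XML;

// Hypothetical helper: titles of items whose dc:date falls in $month (YYYY-MM).
function titlesForMonth(SimpleXMLElement $feed, $month)
{
    // Validate the input so nothing unexpected ends up inside the query.
    if (!preg_match('/^\d{4}-\d{2}\z/', $month)) {
        throw new InvalidArgumentException('Month must look like YYYY-MM');
    }

    $feed->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
    $feed->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    $query = '//rss:item[starts-with(dc:date, "' . $month . '-")]/rss:title';

    $titles = array();
    foreach ($feed->xpath($query) as $title) {
        $titles[] = (string) $title;
    }
    return $titles;
}

$feed = simplexml_load_string($xml);
$january  = titlesForMonth($feed, '2004-01');
$december = titlesForMonth($feed, '2003-12');

print implode("\n", $january) . "\n";
```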

Other Features

SimpleXML has a few more features: you can edit elements and attributes in place by assigning them a new value. Then, you can save the modified XML document to a file or store it in a PHP variable. Additionally, you can validate XML documents using XML Schema.
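As a rough sketch of editing and saving, assuming a released version of PHP where the asXML() method is available: assign new values, then call asXML() with no argument to get a string or with a path to write a file. The document and the temporary file below are made up for the example.

```php
<?php
$s = simplexml_load_string(
    '<rss version="0.91"><channel><title>Old title</title></channel></rss>'
);

// Edit an element and an attribute in place by assignment.
$s->channel->title = 'New title';
$s['version'] = '0.92';

// With no argument, asXML() returns the document as a string.
$xml = $s->asXML();

// With a path, asXML() writes the document to that file instead.
$path = tempnam(sys_get_temp_dir(), 'rss');
$saved = $s->asXML($path);

// Read the file back to confirm the edit survived the round trip.
$reloaded = simplexml_load_file($path);
$reloadedTitle = (string) $reloaded->channel->title;
unlink($path);

print $xml;
```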

Besides RSS, SimpleXML is also perfect for parsing configuration files and consuming web services with REST. Additionally, I'm sure that as PHP 5 evolves, SimpleXML will gain even more functionality. Keep an eye peeled for the announcements and enjoy playing with SimpleXML.

Adam Trachtenberg is the manager of technical evangelism for eBay and is the author of two O'Reilly books, "Upgrading to PHP 5" and "PHP Cookbook." In February he will be speaking at Web Services Edge 2005 on "Developing E-Commerce Applications with Web Services" and at the O'Reilly booth at LinuxWorld on "Writing eBay Web Services Applications with PHP 5."




Copyright © 2009 O'Reilly Media, Inc.