Using PHP 5's SimpleXML
XML Namespaces
SimpleXML even makes processing RSS 1.0 feeds easy. RSS 1.0 uses XML
namespaces, which can present a bit of a headache during parsing. With XML
namespaces, each element lives under a URL, which acts as a package name. This
allows you to distinguish between, say, the HTML <title>
element and the RSS <title> element.
All of a sudden, things become more complex. You can no longer refer to
title, since an unadorned title doesn't let the processor know
which <title> you mean. You could be thinking of the RSS
item <title>, but there's also an HTML
<title> in the document.
As a result, there's now {http://www.w3.org/1999/xhtml}:title
and also {http://purl.org/rss/1.0/}:title instead. XML uses the
colon (:) as a demarcation character between the namespace part and the
plain tag name. In technical language, the complete name is called the
qualified name, or the qname for short. (Really!)
Since URLs are long, you can map a short word to the URL. So, you frequently
end up referring to these elements as <xhtml:title> and
<rss:title>. These short names are known as namespace
prefixes. However, it's the URL that's important, so prefixes like
xhtml and rss are conventions, not actual namespaces.
(It's important to mention that the URL doesn't have to resolve to a web page; it's just an easy way for people to create non-conflicting namespaces.)
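To make the point about prefixes concrete, here is a minimal sketch using PHP 5's DOM extension. The tiny document and its second prefix, h, are made up for illustration; both prefixes are bound to the same XHTML namespace URL, so a lookup by URL should find both elements.

```php
<?php
// Hypothetical document: two different prefixes, xhtml and h,
// are mapped to the same namespace URL.
$xml = <<<XML
<doc xmlns:xhtml="http://www.w3.org/1999/xhtml"
     xmlns:h="http://www.w3.org/1999/xhtml">
  <xhtml:title>First</xhtml:title>
  <h:title>Second</h:title>
</doc>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Look up title elements by namespace URL, not by prefix.
$lines = array();
foreach ($dom->getElementsByTagNameNS('http://www.w3.org/1999/xhtml', 'title') as $el) {
    $lines[] = $el->prefix . ':' . $el->localName . ' is in ' . $el->namespaceURI;
}
print implode("\n", $lines) . "\n";
```

Both elements are reported in the same namespace even though their prefixes differ, which is why the prefix alone never identifies an element.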
SimpleXML likes the world to be simple, so it pretends the namespaces don't exist. (I know a whole crowd of readers feel this cure is worse than the disease. Remember, however, this is SimpleXML. If you're worried about namespace clashes, use DOM.)
Here's the same data as before, encoded as RSS 1.0 and saved as
rss-1.0.xml:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns="http://purl.org/rss/1.0/"
>
<channel rdf:about="http://www.php.net/">
<title>PHP: Hypertext Preprocessor</title>
<link>http://www.php.net/</link>
<description>The PHP scripting language web site</description>
</channel>
<item rdf:about="http://www.php.net/downloads.php">
<title>PHP 5.0.0 Beta 3 Released</title>
<link>http://www.php.net/downloads.php</link>
<description>
PHP 5.0 Beta 3 has been released. The third beta of PHP is
also scheduled to be the last one (barring unexpected surprises).
</description>
<dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
<title>PHP Community Site Project Announced</title>
<link>http://shiflett.org/archive/19</link>
<description>
Members of the PHP community are seeking volunteers to help
develop the first web site that is created both by the community and for
the community.
</description>
<dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
This XML document has three different namespaces. Looking at the top of the
file, two namespaces have explicit namespace prefix mappings. That's what
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" and the
following line do: they associate those URLs with rdf and
dc. You can see rdf:RDF, rdf:about, and
dc:date elements and attributes within the document.
RDF is "Yet Another XML Spec" (YAXMLS). I won't go into it here, but you can learn more on the W3C RDF site and in Tim Bray's article, What is RDF?, on XML.com. O'Reilly also has a book on RDF titled Practical RDF.
There's also one namespace declaration without a prefix,
xmlns="http://purl.org/rss/1.0/". That's the default namespace,
since there's no colon after xmlns. Elements without a prefix,
like item and title, live in the default namespace.
This is different from RSS 0.91, where elements do not live in any
namespace.
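One way to check these mappings from PHP is to ask SimpleXML for the namespaces declared on the root element. The sketch below assumes a PHP version that provides getDocNamespaces() and uses a trimmed-down, inlined stand-in for rss-1.0.xml so it runs on its own; the default namespace should show up under an empty prefix.

```php
<?php
// Abbreviated stand-in for rss-1.0.xml (one item only, for illustration).
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/"
>
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
  <dc:date>2004-01-02</dc:date>
 </item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Returns an array of prefix => URL pairs declared on the root element.
$namespaces = $s->getDocNamespaces();
foreach ($namespaces as $prefix => $url) {
    // The default namespace has no prefix, so its key is the empty string.
    $label = ($prefix === '') ? '(default)' : $prefix;
    print "$label => $url\n";
}
```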
To search for elements in a namespace under DOM, you need to switch to a new set of methods, where you pass in the tag and the namespace. As I said earlier, SimpleXML just barges forward with its head down. You can use the exact same syntax with RSS 1.0 as earlier:
$s = simplexml_load_file('rss-1.0.xml');
foreach ($s->item as $item) {
print $item->title . "\n";
}
PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced
This is not a problem because, despite all the namespace vigilance, there are no name clashes in the document.
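For comparison, here is a rough sketch of the namespace-aware route. It uses DOM's getElementsByTagNameNS() to find the titles, and SimpleXML's children() method to reach the prefixed dc:date element, which plain property access does not return. The feed is an abbreviated, inlined copy of rss-1.0.xml, and the variable names are only illustrative.

```php
<?php
// Abbreviated copy of rss-1.0.xml, inlined so the sketch is self-contained.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/"
>
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
  <dc:date>2004-01-02</dc:date>
 </item>
 <item rdf:about="http://shiflett.org/archive/19">
  <title>PHP Community Site Project Announced</title>
  <dc:date>2003-12-18</dc:date>
 </item>
</rdf:RDF>
XML;

$rss = 'http://purl.org/rss/1.0/';
$dc  = 'http://purl.org/dc/elements/1.1/';

// DOM: the namespace-aware method takes the namespace URL and the local name.
$dom = new DOMDocument();
$dom->loadXML($xml);
$domTitles = array();
foreach ($dom->getElementsByTagNameNS($rss, 'title') as $title) {
    $domTitles[] = $title->nodeValue;
}

// SimpleXML: unprefixed elements work as properties, but prefixed
// elements such as dc:date are reached through children($namespaceUrl).
$s = simplexml_load_string($xml);
$dates = array();
foreach ($s->item as $item) {
    $dates[] = (string) $item->children($dc)->date;
}

print implode(', ', $domTitles) . "\n";
print implode(', ', $dates) . "\n";
```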
XML Namespaces and XPath
However, SimpleXML is not completely naive. It recognizes the potential for problems with this attitude. Therefore, you can distinguish between two namespaced elements with XPath, but you need to use namespace prefixes.
SimpleXML automatically registers all the non-default namespace prefixes, but you need to handle the default namespace. (This lack of default namespace mapping is a deficit in XPath 1.0, not SimpleXML.)
To find and print all rss:title entries:
$s = simplexml_load_file('rss-1.0.xml');
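// Released PHP 5 names: registerXPathNamespace() (5.2+) and xpath();
// pre-release builds used register_ns() and xsearch().
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xpath('//rss:item/rss:title');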
foreach ($titles as $title) {
print "$title\n";
}
PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced
After loading the file, manually register a namespace prefix to go with
http://purl.org/rss/1.0/. You're free to select any prefix you
want, but rss is a natural choice.
The new XPath query now looks for //rss:item/rss:title instead
of plain old //item/title, since it needs namespace prefixes.
It's a little funny that there's no way to define a default namespace prefix
for an XPath search, but that's how it is. Even though these elements don't
have explicit prefixes in the document, they need prefixes in the XPath
query.
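The following sketch illustrates both points, assuming the method names from released PHP 5 (registerXPathNamespace() and xpath()). The prefix feed is deliberately arbitrary, and the feed itself is an abbreviated, inlined copy of rss-1.0.xml. The unprefixed query is expected to match nothing, because in XPath 1.0 an unprefixed name refers to an element in no namespace.

```php
<?php
// Abbreviated copy of rss-1.0.xml, inlined so the sketch is self-contained.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/"
>
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
 </item>
 <item rdf:about="http://shiflett.org/archive/19">
  <title>PHP Community Site Project Announced</title>
 </item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Any prefix works, as long as it maps to the right namespace URL.
$s->registerXPathNamespace('feed', 'http://purl.org/rss/1.0/');

$prefixed   = $s->xpath('//feed:item/feed:title');
$unprefixed = $s->xpath('//item/title');

// Guard against a false return value, which some PHP versions
// may give for an empty result.
$prefixedCount   = $prefixed ? count($prefixed) : 0;
$unprefixedCount = $unprefixed ? count($unprefixed) : 0;

print "With prefix: $prefixedCount\n";
print "Without prefix: $unprefixedCount\n";
```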
You can use XPath to take advantage of the additional data in the RSS feed. For instance, to find and print all the entries from January 2004:
$s = simplexml_load_file('rss-1.0.xml');
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xpath('//rss:item[
starts-with(dc:date, "2004-01-")]/rss:title');
foreach ($titles as $title) {
print "$title\n";
}
PHP 5.0.0 Beta 3 Released
The first two lines are the same, but I've modified the XPath query to
filter the results. In XPath, you can request a subset of elements in a level
by requiring them to match a test inside of square brackets ([]).
This test requires the dc:date element under the current
rss:item to begin with the string 2004-01-. If so,
starts-with() returns true, and XPath knows to include it in the
results. (These dates are part of the Dublin Core Metadata specification, hence
the prefix of dc.)
This prints only one title because the Community Site item was posted in December, while Beta 3 came out in January. (Actually, Beta 3 came out at the end of December, but a January date makes the example easier to explain.)
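If you need the same filter for other months, the query can be wrapped in a small helper. This is only a sketch: titlesForMonth() is a hypothetical function name, it assumes the released PHP 5 method names (registerXPathNamespace() and xpath()), and it relies on SimpleXML registering the dc prefix from the document automatically, as described above. The feed is an abbreviated, inlined copy of rss-1.0.xml.

```php
<?php
// Hypothetical helper: return the titles of all items whose dc:date
// starts with the given month, written as YYYY-MM.
function titlesForMonth(SimpleXMLElement $feed, $month)
{
    // Reject anything that isn't YYYY-MM, since the value goes into the query.
    if (!preg_match('/^\d{4}-\d{2}$/', $month)) {
        return array();
    }
    $feed->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
    $nodes = $feed->xpath(
        '//rss:item[starts-with(dc:date, "' . $month . '-")]/rss:title'
    );
    $titles = array();
    foreach ($nodes ? $nodes : array() as $node) {
        $titles[] = (string) $node;
    }
    return $titles;
}

// Abbreviated copy of rss-1.0.xml, inlined so the sketch is self-contained.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/"
>
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
  <dc:date>2004-01-02</dc:date>
 </item>
 <item rdf:about="http://shiflett.org/archive/19">
  <title>PHP Community Site Project Announced</title>
  <dc:date>2003-12-18</dc:date>
 </item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);
$january  = titlesForMonth($s, '2004-01');
$december = titlesForMonth($s, '2003-12');

print implode("\n", $january) . "\n";
print implode("\n", $december) . "\n";
```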
Other Features
SimpleXML has a few more features: you can edit elements and attributes in place by assigning them a new value. Then, you can save the modified XML document to a file or store it in a PHP variable. Additionally, you can validate XML documents using XML Schema.
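As a rough illustration of in-place editing, the sketch below changes an element and an attribute by assignment and then serializes the result with asXML(). The configuration document and its element names are invented for this example.

```php
<?php
// Hypothetical configuration document, made up for this example.
$xml = '<config><site id="1"><name>Old name</name></site></config>';
$s = simplexml_load_string($xml);

// Assign a new value to an element...
$s->site->name = 'New name';
// ...and to an attribute, using array syntax.
$s->site['id'] = '2';

// With no argument, asXML() returns the document as a string.
$modified = $s->asXML();
print $modified;
```

Passing a filename, as in $s->asXML('config.xml'), should write the document to that file instead of returning it.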
Besides RSS, SimpleXML is also perfect for parsing configuration files and consuming web services with REST. Additionally, I'm sure that as PHP 5 evolves, SimpleXML will gain even more functionality. Keep an eye peeled for the announcements and enjoy playing with SimpleXML.
Adam Trachtenberg is the manager of technical evangelism for eBay and is the author of two O'Reilly books, "Upgrading to PHP 5" and "PHP Cookbook." In February he will be speaking at Web Services Edge 2005 on "Developing E-Commerce Applications with Web Services" and at the O'Reilly booth at LinuxWorld on "Writing eBay Web Services Applications with PHP 5."