The Python community has too many deceptive XML benchmarks

Uche Ogbuji
Jan. 23, 2005 06:17 PM

The Python/XML community has an unfortunately long tradition of dodgy benchmarks. I had a lot to say about probably the most egregious example in my article on PyRXP. PyRXP is called an XML parser, and its developers benchmark it as such against other Python/XML parsers. The problem is that it turns out PyRXP is not an XML parser: it fails conformance on the most fundamental aspect of XML, Unicode support. As a result, a benchmark of PyRXP against an actual XML parser is ludicrously unfair. In that article I went into detail about how poisonous such unfair benchmarks are.
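To make concrete what conformance means here, consider the kind of document any conformant XML 1.0 parser must handle. This is a minimal illustration of my own, not a test from the PyRXP article:

from StringIO import StringIO
from xml import sax

#A well-formed document with a character reference that
#expands to a non-ASCII character; a conformant parser
#must report the content as Unicode text
DOC = '<?xml version="1.0"?><greeting>G&#252;nter</greeting>'

class Handler(sax.ContentHandler):
    def __init__(self):
        sax.ContentHandler.__init__(self)
        self._text = u''
    def characters(self, text):
        self._text += text
    def endDocument(self):
        print repr(self._text)   #u'G\xfcnter' from a conformant parser

parser = sax.make_parser()
parser.setContentHandler(Handler())
parser.parse(StringIO(DOC))

A parser that chokes on, or mangles, such a document is not an XML parser, whatever its other virtues.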

On the less egregious end are benchmarks of libxml2's default Python binding, which is in many ways so gnomic (no pun intended) and treacherous that it's also an unfair comparison against most Pythonic XML tools. It sounds as if Martijn Faassen's lxml is making decent progress towards rectifying this.

But I must say that the benchmarks that were the last straw for me came from an old friend. Fredrik Lundh ("/F") is IMO one of the few XML package developers in the Python community who really understand both Python and XML. This has generally been borne out in his ElementTree library, about which I've always had a lot of good things to say. Then cElementTree came along and suddenly raised the stakes in the Python/XML benchmark sweepstakes once again. As part of promoting cElementTree, /F posted a benchmark on the home page. The benchmarks are very flattering to cElementTree, and it's probably deserving of some such flattery, but as I examined the performance issue a bit more, I came to conclude that his benchmarks are pretty much useless.

The problem is that aside from a performance bug in my own Amara 0.9.2, which /F brought to my notice and which was fixed in the subsequent release, I was unable to reproduce under real-world conditions anything like the proportions implied in /F's benchmarks. Well, /F pretty much admits that all he's doing in his benchmark is reading in a file using each library. Hmm. This is not the stuff of which useful benchmarks are made. Nobody reads in a 3MB XML document just to throw all the data away, least of all Python developers, who have long been vocal about their desire to do as little with XML as possible. Of course I can't be 100% sure of this complaint because I haven't seen the benchmark code, but then again that's just another complaint.
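Since I haven't seen the code, I can only guess, but a read-and-discard benchmark presumably amounts to little more than the following sketch. The harness here is my assumption, not /F's actual code:

import time
import cElementTree

#Hypothetical reconstruction of a parse-and-discard benchmark:
#read the whole document, then do nothing at all with the result
start = time.time()
tree = cElementTree.parse("ot.xml")
print "parse only: %.2f seconds" % (time.time() - start)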

I set out to run at least one real-world benchmark, in order to determine whether there is anything to the no-op benchmarks /F uses. The basics come from this article, where I introduce the Old Testament test. The idea is simply to print all verses containing the word 'begat' in Jon Bosak's Old Testament in XML, a 3.3MB document. A quick note on the characteristics of the file: it contains 23145 v elements, one per Bible verse, each containing only text: no child elements. The v elements and their content represent about 3.2MB of the file's total 3.3MB. In the rest of this article I present the code and results.
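For orientation, the verses in question look roughly like this in the file (verse text per the KJV; ancestor elements and any surrounding structure elided):

<!-- ... -->
<v>And Cainan lived seventy years, and begat Mahalaleel:</v>
<v>And Cainan lived after he begat Mahalaleel eight hundred and forty years, and begat sons and daughters:</v>
<!-- ... -->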

I'm working on a Dell Inspiron 8600 notebook with 2GB RAM. It's a Centrino 1.7GHz, which is about equivalent to a P4-3GHz (modulo the equally wacky world of CPU benchmarks). The OS is Fedora Core 3 Linux, and I've tuned DMA and the like. I'm running Python 2.3.2. The following are my pystone results:

$ python /home/uogbuji/lib/lib/python2.3/test/pystone.py
Pystone(1.1) time for 50000 passes = 2.99
This machine benchmarks at 16722.4 pystones/second

I ran each case 5 times and recorded the high and low run times, according to the UNIX time command. I understand very well that this is not quite statistically thorough, but it's well ahead of all the other such benchmarks I've seen in terms of reproducibility (I present all my code) and usefulness (this is a real-world use case for XML processing).

First up: plain old PySAX. Forget the performance characteristics for a moment: this code was just a pain in the arse to write.

from xml import sax

class OtHandler(sax.ContentHandler):
    def __init__(self):
        #Yes, all this rigmarole *is* required, otherwise
        #you could miss the word "begat" split across
        #multiple SAX events
        self.verse = None
        return

    def startElementNS(self, (ns, local), qname, attrs):
        if local == u'v':
            self.verse = u''
        return

    def endElementNS(self, name, qname):
        if (self.verse is not None
            and self.verse.find(u'begat') != -1):
            print self.verse
        self.verse = None
        return

    def characters(self, text):
        if self.verse is not None:
            #Yeah yeah, probably a tad faster to use the
            #''.join(fragment_list) trick, but not worth
            #the complication with these small verse chunks
            self.verse += text
        return

handler = OtHandler()
parser = sax.make_parser()
parser.setContentHandler(handler)
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.parse("ot.xml")

I get numbers ranging from 2.32 to 3.97 seconds.

Next up is PySAX using a filter to normalize text events, and thus simplify the SAX code a great deal. The filter, amara.saxtools.normalize_text_filter is basically the one I posted here, with some improvements. The code is much less painful than the PySAX example above, but it still demonstrates why SAX turns off people used to Python's simplicity.

from xml import sax
from amara import saxtools

class OtHandler(sax.ContentHandler):
    def characters(self, text):
        if text.find(u'begat') != -1:
            print text
        return

handler = OtHandler()
parser = sax.make_parser()
normal_parser = saxtools.normalize_text_filter(parser)
normal_parser.setContentHandler(handler)
normal_parser.setFeature(sax.handler.feature_namespaces, 1)
normal_parser.parse("ot.xml")

I get numbers ranging from 2.66 to 4.88 seconds.
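For those curious what such a filter does under the hood, the essential trick is to buffer contiguous characters events and flush the buffer at element boundaries. Here's a rough conceptual sketch of mine, not Amara's actual code (a real filter must also flush on every other event type, such as processing instructions):

from xml.sax.saxutils import XMLFilterBase

class text_normalizer(XMLFilterBase):
    def __init__(self, parent=None):
        XMLFilterBase.__init__(self, parent)
        self._buffer = u''

    def _flush(self):
        #Pass the accumulated text downstream as one event
        if self._buffer:
            XMLFilterBase.characters(self, self._buffer)
            self._buffer = u''

    def characters(self, content):
        self._buffer += content

    def startElementNS(self, name, qname, attrs):
        self._flush()
        XMLFilterBase.startElementNS(self, name, qname, attrs)

    def endElementNS(self, name, qname):
        self._flush()
        XMLFilterBase.endElementNS(self, name, qname)

    def endDocument(self):
        self._flush()
        XMLFilterBase.endDocument(self)

You'd wrap a parser with it much as in the example above: text_normalizer(sax.make_parser()).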

Next up is Amara pushdom, which tries to combine some of the performance advantages of SAX with the (relative) ease of DOM.

from amara import domtools

for docfrag in domtools.pushdom(u'v', source='ot.xml'):
    text = docfrag.childNodes[0].firstChild.data
    if text.find(u'begat') != -1:
        print text

I get numbers ranging from 5.83 to 7.11 seconds.

Next up is Amara pushbind, which tries to combine some of the performance advantages of SAX with the most Pythonic (and thus easy) API I can imagine.

from amara import binderytools

for v in binderytools.pushbind(u'v', source='ot.xml'):
    text = unicode(v)
    if text.find(u'begat') != -1:
        print text

I get numbers ranging from 10.46 to 11.40 seconds.

Next up is Amara bindery chunker, which is the basis of pushbind.

from xml import sax
from amara import binderytools

def handle_chunk(docfrag):
    text = unicode(docfrag.v)
    if text.find(u'begat') != -1:
        print text

xpatterns = 'v'
handler = binderytools.saxbind_chunker(xpatterns=xpatterns,
        chunk_consumer=handle_chunk
    )
parser = sax.make_parser()
parser.setContentHandler(handler)
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.parse("ot.xml")

I get numbers ranging from 9.44 to 10.27 seconds.

Finally, I look at /F's cElementTree.

import cElementTree as ElementTree

tree = ElementTree.parse("ot.xml")
for v in tree.findall("//v"):
    text = v.text
    if text.find(u'begat') != -1:
        print text

I get numbers ranging from 1.53 to 3.18 seconds.
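Pulling all the ranges together for easier comparison:

Approach                         Low (sec)   High (sec)
Plain PySAX                      2.32        3.97
PySAX + normalize_text_filter    2.66        4.88
Amara pushdom                    5.83        7.11
Amara pushbind                   10.46       11.40
Amara bindery chunker            9.44        10.27
cElementTree                     1.53        3.18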

So what do I conclude from these numbers? As I've said before, the speed of cElementTree amazes me, but its advantage in the real world is nowhere near as dramatic as /F's benchmarks claim. More relevant to my own vanity, Amara 0.9.3's disadvantage in the real world is nowhere near as dramatic as /F's benchmarks claim. IMHO, it's close enough in performance to all the other options, and offers so many advantages in areas besides performance, that it's a very respectable alternative to any Python/XML library out there.

But the point of this exercise goes far beyond all that. We really need to clean up our act in what is a very strange political battleground in the Python/XML space. If we've decided that MIPS wars are what we're going to be all about in development, then let's benchmark properly. Let's gather some real-world use-cases and normalized test conditions. Let's make sure all our benchmarks are transparent (at least release all the code used), and let's put some statistical rigor behind them (not an easy thing to do, and not something I claim to have done in this article). Let's do all this as a community.
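As a modest gesture in that direction, here's the sort of minimal harness I have in mind. The script names are placeholders for the programs above, and min/max/mean is only a start on real statistical rigor:

import os
import time

#Hypothetical harness: run each benchmark script several times
#and report the spread rather than a single flattering number
SCRIPTS = ['sax_begat.py', 'celementtree_begat.py']  #placeholder names
RUNS = 5

for script in SCRIPTS:
    times = []
    for i in range(RUNS):
        start = time.time()
        os.system('python %s > /dev/null' % script)
        times.append(time.time() - start)
    print '%s: min %.2f  max %.2f  mean %.2f' % (
        script, min(times), max(times), sum(times) / RUNS)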

While we're at it, I'd like to repeat my call for test case diversity from my PyRXP article: [R]un the tests on a variety of hardware and operating systems, and [don't] focus on a single XML file, but rather examine a variety of XML files. Numerous characteristics of XML files can affect parsing and processing speed (see the sketch after this list), including:

  • The preponderance of elements versus attributes versus text (and even comments and processing instructions)
  • Any repetition of element or attribute names, values and text content
  • The distribution of white space
  • The character encoding
  • The use of character and general entities
  • The input source (in-memory, string, file, URL, etc.)
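A shared test suite could generate files that deliberately vary these characteristics. Here's a hypothetical sketch of mine varying just one axis, the balance of element structure versus text content; the file names and proportions are made up:

import codecs

#Hypothetical generator: similar overall size, very different
#balance of element structure versus text content
def make_test_file(fname, records, text_chunk):
    f = codecs.open(fname, 'w', 'utf-8')
    f.write(u'<?xml version="1.0" encoding="utf-8"?>\n<root>\n')
    for i in range(records):
        f.write(u'<rec n="%d">%s</rec>\n' % (i, text_chunk))
    f.write(u'</root>\n')
    f.close()

make_test_file('element-heavy.xml', 100000, u'x')          #mostly markup
make_test_file('text-heavy.xml', 100, u'begat ' * 10000)   #mostly text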

And if we're not willing to do things right, let's stop deceiving users with meaningless benchmarks.

Uche Ogbuji is a Partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies.