
Profiling and Optimizing Python

by Jeremy Jones

Most of the time, code that we write doesn't have to perform as fast as if we wrote it in C. Most of the time, the first pass at writing it is "fast enough" and we don't have to optimize--but there are times when a piece of code just has to meet a certain standard of performance. For those "it's gotta run like a scalded dog" moments, the Python profiler may lend some assistance.

The Python standard library provides three modules for profiling code: profile, hotshot, and timeit. The kind of profiling these modules perform involves tracing method and function calls and recording how long they take to run. profile provides detailed profiling information on every function and method called during a particular run of a code base. hotshot is a replacement for the profile module, implemented largely in C to reduce profiling overhead. timeit provides less detailed information for smaller pieces of code, where "less detailed" means overall execution time.
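To make the profile/pstats workflow concrete, here is a minimal sketch using a deliberately slow toy function. The function name (fib), the stats filename (fib_stats), and the argument 18 are illustrative choices, not anything from ediplex:

```python
import profile
import pstats

def fib(n):
    # deliberately naive recursion so the profiler has plenty of calls to record
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Profile.runcall() runs the function under the profiler; dump_stats() writes
# the raw timing data to a file, much like hotshot's log file
p = profile.Profile()
p.runcall(fib, 18)
p.dump_stats("fib_stats")

# pstats.Stats loads the dump; sort by internal time and print the top entries
stats = pstats.Stats("fib_stats")
stats.sort_stats("time").print_stats(5)
```

The same sort_stats()/print_stats() calls work on the statistics files produced by the other profilers as well.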

Profiling Code

One of the frequently repeated mantras I hear from Python veterans is, "Premature optimization is the root of all evil" (a quote popularized by Donald Knuth, who attributed it to Tony Hoare). The Pythonic approach to writing code that performs acceptably is: write a straightforward implementation first, profile it, and optimize only the bottlenecks that the profiler actually reveals.

The example code in this article comes from ediplex, an open source project centered around parsing and handling EDI data. For those unfamiliar with it, EDI stands for Electronic Data Interchange, an umbrella term for the formats and mechanisms that business partners use to trade electronic business documents. I profiled my EDI parser with hotshot by putting this code in the if __name__ == '__main__': block of the main entry point module,

import hotshot

prof = hotshot.Profile("hotshot_edi_stats")  # log file for the profiling data
prof.run("main()")  # run the parser's entry point under the profiler
prof.close()        # flush and close the log

then calling the module from the command line with the appropriate parameters. Interestingly, running under hotshot takes about 25 seconds, while running without a profiler takes a little over 3 seconds. Here is a run on the command line without the profiler, tracking the time with the Unix time utility:

jmjones@qiwi 10:10PM edi % time python \
    ~/svn/ediplex/src/ediplex/parser/ \
python ~/svn/ediplex/src/ediplex/parser/
    3.09s user 0.08s system 93% cpu 3.380 total

Here is a run on the command line with the profiler:

jmjones@qiwi 10:54PM edi % time python \
    ~/svn/ediplex/src/ediplex/parser/ \
python ~/svn/ediplex/src/ediplex/parser/
    4.11s user 20.99s system 98% cpu 25.381 total

This run with the hotshot profiler created a log file (hotshot_edi_stats) with profiling information of each called method. Here is an IPython session showing how to get the profiling information:

In [1]: from hotshot import stats

In [2]: s = stats.load("hotshot_edi_stats")

In [3]: s.sort_stats("time").print_stats()
         286016 function calls in 3.359 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   134000    1.455    0.000    2.104    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.942    0.942    3.358    3.358 /home/jmjones/svn/ediplex/src/ediplex/parser/util/
   136000    0.658    0.000    0.658    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/example/
     1000    0.203    0.000    0.250    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
     1000    0.034    0.000    0.057    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
     7000    0.032    0.000    0.032    0.000 /usr/lib/python2.4/
     2000    0.009    0.000    0.009    0.000 /usr/lib/python2.4/
     1001    0.006    0.000    0.006    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
     1000    0.005    0.000    0.005    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/example/
     1000    0.005    0.000    0.005    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/example/
     1000    0.005    0.000    0.005    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/example/
     1000    0.005    0.000    0.005    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.000    0.000    3.359    3.359 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.000    0.000    3.359    3.359 /home/jmjones/svn/ediplex/src/ediplex/parser/
        6    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/util/
        1    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/util/
        1    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/util/
        1    0.000    0.000    0.000    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        0    0.000             0.000          profile:0(profiler)

Out[3]: <pstats.Stats instance at 0xb798f52c>

The stats.load() call returns a pstats.Stats object. The only methods that I could find documented in the standard library for this object were print_stats(), sort_stats(), and strip_dirs(). Here, I have used only sort_stats(), which sorts the statistics based on the column specified in the argument list, and print_stats(), which prints out the statistics table displayed above. The strip_dirs() method would have displayed only the module rather than the absolute path to the module. Call me nutty, but I like seeing the absolute path to my modules. With a 17-inch display on my laptop, I don't worry about the extra output width from absolute filenames.

Interpreting Output

The statistical output shows the number of times a method was called (ncalls), the total time spent in the specified method but not in any of its children (tottime), the time spent per call in the method itself (the first percall), the time spent in the method including calls to its children (cumtime), the time per call including child calls (the second percall), and the filename. Given the profiling information, I can see that my three biggest bottlenecks are the body_seg() method, the run() method, and the segment() method.

I sorted the statistics on the tottime column (by passing "time" to the sort_stats() method) to show how well each method performed by itself. The total time spent in the top four methods is about 97 percent of the entire processing time. The top one takes about 43 percent of the entire processing time.
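Those percentages can be reproduced directly from the tottime column of the table above:

```python
# tottime values for the top four methods, taken from the profiler output
top_four = [1.455, 0.942, 0.658, 0.203]
total = 3.359  # total CPU seconds reported by the profiler

share_top_four = sum(top_four) / total  # fraction of time in the top four
share_top_one = top_four[0] / total     # fraction of time in the top one

print("top four: %.0f%%" % (share_top_four * 100))
print("top one:  %.0f%%" % (share_top_one * 100))
```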

I should now ask the question, Is it fast enough? My answer is, Probably, but faster would be nice. It can parse about 300 2K interchanges per second on my laptop, (about 440 per second if I don't take into consideration Python interpreter startup, library imports, and such), which is actually pretty respectable, but I would like to trim whatever time I can. The profiler has pointed out a couple of glaring architectural deficiencies, so some changes are in order.

I just mentioned the parser's performance without Python interpreter startup and library imports. The timeit module is an excellent tool for helping determine the performance of only the interesting piece of code. I put this code:

import timeit
t = timeit.Timer("main()", "from __main__ import main")
res_set = t.repeat(10, 1)
print res_set, "::", min(res_set)

in my if __name__ == "__main__": block and ran it.

jmjones@qiwi  5:39AM parser % python
[3.1825220584869385, 2.4778821468353271, 2.4290919303894043, 2.4358069896697998,
2.4379229545593262, 2.4492578506469727, 2.439661979675293, 2.4325060844421387,
2.4347169399261475, 2.435114860534668] :: 2.42909193039

repeat(10, 1) runs 10 trials, each executing the specified code once. I had the code execute only once per trial because the looping is already handled by parsing a file of 1,000 EDI interchanges. The output of repeat() is a list of runtimes, one per trial. The recommended way to determine the performance of a piece of code with timeit is to take the minimum value from the returned list.
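The same repeat-and-take-the-minimum pattern works with any statement; here is a self-contained sketch with a toy statement standing in for the parser's main():

```python
import timeit

# Timer(stmt): the toy statement here stands in for "main()" in the article
t = timeit.Timer("sum(range(1000))")

# repeat(10, 1): 10 independent trials, each executing the statement once
res_set = t.repeat(10, 1)

# the minimum is the least-disturbed measurement, hence the recommended figure
best = min(res_set)
print(res_set, "::", best)
```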

Finding and Fixing Bottlenecks

A primer on X12 EDI would probably be helpful at this point. EDI documents are character-delimited text files, more machine-readable than human-readable. Each document is made up of one interchange; each interchange is made up of a series of segments that roughly equate to database records; each segment is made up of elements that roughly equate to database columns. The first few characters of each segment (sometimes called the tag), up to the element-separating delimiter, declare what type of segment it is. The first segment in an X12 interchange has an ISA tag. The last segment in an X12 interchange has an IEA tag.
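A toy example may make the structure clearer. The `*` element separator and `~` segment terminator are common X12 choices but are configurable in real documents, and this abbreviated interchange is invented for illustration (real ISA segments are fixed-width):

```python
ELEMENT_SEP = "*"   # common X12 element separator (site-configurable)
SEGMENT_TERM = "~"  # common X12 segment terminator (site-configurable)

# a heavily abbreviated, made-up interchange: ISA header ... IEA footer
interchange = "ISA*00*sender*receiver~ST*850*0001~SE*2*0001~IEA*1*000000001~"

# split the interchange into segments, then each segment into elements
segments = [s for s in interchange.split(SEGMENT_TERM) if s]
for seg in segments:
    elements = seg.split(ELEMENT_SEP)
    tag = elements[0]  # the tag declares the segment type
    print(tag, elements[1:])
```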

First problem: when ediplex parses an EDI document, it first determines whether it has a valid header segment, then proceeds to look for each individual segment and determines whether that segment is the end segment. This is really inefficient and results in an inordinate number of calls to the body_seg() method as it chunks through the document. A better method would be to read in the body of the EDI document in bigger chunks to search specifically for the end segment rather than just the segment delimiter.
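The idea of searching directly for the end segment, rather than visiting every body segment, can be sketched as follows. This is not ediplex's actual code; the function name, the `~` terminator, and the whole-string (rather than chunked-file) scan are simplifying assumptions:

```python
def find_footer(text, footer_tag="IEA", seg_term="~"):
    # Scan for the footer segment directly instead of visiting every segment.
    # str.find() jumps straight to candidate positions at C speed, so the
    # per-segment Python-level loop disappears.
    pos = 0
    marker = seg_term + footer_tag  # the footer always starts a new segment
    while True:
        pos = text.find(marker, pos)
        if pos == -1:
            return -1  # no complete footer found
        end = text.find(seg_term, pos + 1)
        if end != -1:
            return end + 1  # index just past the footer's terminator
        pos += 1

doc = "ISA*00~GS*PO~ST*850~SE*2~GE*1~IEA*1*000000001~"
print(find_footer(doc))
```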

Second problem: aggregate_parser calls a segment() method on the specified EDI handler for each segment it finds, passing it the segment data. For this example, aggregate_parser was using the NoOpEDIHandler, whose segment() method does nothing; its entire body is pass. It's interesting, but not surprising, that a no-op accounts for about 20 percent of the overall processing time; not surprising because the parser called segment() 136,000 times. This is an obvious inefficiency, but it is also an architectural problem. If I write a piece of code that uses this parser but ignores the finer, segment-level detail of an EDI document, handling segment data, even with a no-op, still costs a 20 percent performance hit. A better approach would be to return a segment iterator, so that segments are retrieved only when interested code calls the iterator's next() method.
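The iterator idea can be sketched with a generator. This is illustrative, not the actual ediplex API; callers that don't care about segment-level detail simply never advance the iterator, so they pay nothing for it:

```python
def iter_segments(interchange, seg_term="~"):
    # Yield segments lazily; no work happens until the caller asks for the
    # next segment, unlike the callback design that pushed every segment
    # to the handler unconditionally.
    start = 0
    while True:
        end = interchange.find(seg_term, start)
        if end == -1:
            return
        yield interchange[start:end]
        start = end + 1

doc = "ISA*00~ST*850~SE*2~IEA*1~"
segs = iter_segments(doc)
print(next(segs))  # only this one segment is ever extracted
```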

I have rearchitected ediplex to work more iteratively rather than relying on callbacks. I have also reduced the granularity of what the parser is interested in on its first pass. Now when ediplex parses a file of EDI interchanges, it identifies the header segment, then immediately starts looking for the footer segment rather than each body segment. When it finds a full interchange, it yields a parser object, which can, in turn, return an iterator for the interchange's segments.


The results were surprising. I was hoping for a 50 to 100 percent performance improvement. Here is a time of running the parser with no profiler:

jmjones@qiwi  9:25PM parser % time python \
python ~/svn/ediplex/example_data/big_edi.txt
    0.47s user 0.01s system 99% cpu 0.481 total

Here is the parser run with the hotshot profiler:

jmjones@qiwi  5:43PM parser % time python \
python ~/svn/ediplex/example_data/big_edi.txt
    0.42s user 0.49s system 87% cpu 1.042 total

Without the profiler, this is a performance boost of about 7x. Removing the Python interpreter startup time, library imports, and such from both sides of the equation, it is actually a boost of about 10x. The timeit module shows a best time of 2.3 seconds for parsing the file before optimization and 0.22 seconds after. Here is the statistical data from running the parser:

In [1]: from hotshot import stats

In [2]: s = stats.load("hotshot_edi_stats")

In [3]: s.sort_stats("time").print_stats()
         11005 function calls in 0.436 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.338    0.000    0.374    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
     7000    0.036    0.000    0.036    0.000 /usr/lib/python2.4/
     1001    0.035    0.000    0.428    0.000
     1000    0.013    0.000    0.013    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.008    0.008    0.436    0.436
     1000    0.006    0.000    0.006    0.000 /home/jmjones/svn/ediplex/src/ediplex/parser/
        1    0.000    0.000    0.000    0.000
        1    0.000    0.000    0.000    0.000
        1    0.000    0.000    0.000    0.000
        0    0.000             0.000          profile:0(profiler)

Out[3]: <pstats.Stats instance at 0xb798f4cc>

The profiler shows far fewer methods than before the optimization. The majority of the time is now spent in methods that actually do things rather than in no-ops or state transitions. All of the features that had no-op placeholders are gone and will run only when a user decides to call them.

The profiler also shows far fewer method calls. Before, two methods, body_seg() and segment(), were each called over 130,000 times. Function calls were fairly expensive in previous versions of Python--and I believe they still are, though to a lesser degree--which explains why 130,000 calls to even a no-op created a performance bottleneck.
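The cost of pure call overhead is easy to demonstrate with timeit. This sketch times two otherwise identical loops, one routed through a do-nothing function (like the handler's pass-only segment() method) and one not:

```python
import timeit

# identical loops over 1,000 items, repeated 1,000 times (a million iterations)
with_call = timeit.timeit("for i in r: noop(i)",
                          setup="def noop(x): pass\nr = range(1000)",
                          number=1000)
without_call = timeit.timeit("for i in r: pass",
                             setup="r = range(1000)",
                             number=1000)

# the difference is pure function-call overhead
print("with call:    %.4fs" % with_call)
print("without call: %.4fs" % without_call)
```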

I heavily optimized the method that searched for the end of an EDI interchange (now called is_valid_footer()) but hardly touched the one that identified the beginning of an interchange (now called is_valid_header()). is_valid_header() is now the biggest bottleneck, taking 10 times more time than the next method in the list, which happens to be the Python standard string.strip() method--and about 77 percent of the total processing time. The only section of is_valid_header() that does anything is a for loop that iterates over range(106) and checks the header string for correctness. If I really wanted to optimize is_valid_header(), I would remove the for loop and either do the string check with a (shudder) regex or write a function in C and swig it. I may implement one of these optimizations just because of that nagging bottleneck (and for fun!)--but the 2,080 2K interchanges per second, which includes Python interpreter startup and library imports, or 4,500 2K interchanges per second, excluding Python startup and imports, is faster than I had hoped to take this parser.
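The loop-versus-regex trade-off can be sketched with hypothetical stand-ins for is_valid_header(). The real ediplex checks are unknown to me, so these versions only validate the ISA tag and the fixed 106-character header length:

```python
import re

def header_ok_loop(seg):
    # per-character Python-level loop, in the spirit of the range(106) bottleneck
    if len(seg) < 106:
        return False
    for i in range(3):
        if seg[i] != "ISA"[i]:
            return False
    return True

# one C-level match instead of a Python loop: ISA tag plus 103 more characters
HEADER_RE = re.compile(r"^ISA.{103}")

def header_ok_regex(seg):
    return HEADER_RE.match(seg) is not None

sample = "ISA" + "*" * 103  # a made-up, correctly sized header
print(header_ok_loop(sample), header_ok_regex(sample))
```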

Just for the sake of comparison, here is the timeit run of the optimized code:

jmjones@qiwi  4:33PM parser % python
[0.4770960807800293, 0.47198605537414551, 0.43927288055419922,
0.29295921325683594, 0.21883893013000488, 0.21800088882446289,
0.2185051441192627, 0.21764492988586426, 0.23419904708862305,
0.21747803688049316] :: 0.21747803688


I have attempted to live by the "premature optimization" motto with the activity surrounding this article. I spent very little time up front trying to predict where my code would run slow, and made very few concessions to preemptively make it run faster. I simply used the hotshot profiler, a tool that is part of the Python standard library, found my bottlenecks, rearchitected my parser (which is probably a bit more extreme than what most folks will have to do), made some code changes, and saw a significant performance improvement. For any performance analysis of Python source code, the profiler is an indispensable tool.

Jeremy Jones is a software engineer who works for Predictix. His weapon of choice is Python.


Copyright © 2009 O'Reilly Media, Inc.