The statistical output shows the number of times a method was called (
ncalls), the total time spent in the specified method but not in any of its children (
tottime), the amount of time spent for each call to the specified method (the first
percall), the amount of time spent in a method and calls to its children (
cumtime), the amount of time spent for each call to the specified method including child methods (the second
percall), and the filename. Given the profiling information, I can see that my three biggest bottlenecks are in x12_parser.py (in the
body_seg method), state_machine.py (in the
run method), and edi_handler.py (in the
I sorted the statistics on the
tottime column (by passing "time" to the
sort_stats() method) to show how well each method performed by itself. The total time spent in the top four methods is about 97 percent of the entire processing time. The top one takes about 43 percent of the entire processing time.
I should now ask the question, Is it fast enough? My answer is, Probably, but faster would be nice. It can parse about 300 2K interchanges per second on my laptop, (about 440 per second if I don't take into consideration Python interpreter startup, library imports, and such), which is actually pretty respectable, but I would like to trim whatever time I can. The profiler has pointed out a couple of glaring architectural deficiencies, so some changes are in order.
I just mentioned the parser's performance without Python interpreter startup and library imports. The
timeit module is an excellent tool for helping determine the performance of only the interesting piece of code. I put this code:
import timeit t = timeit.Timer("main()", "from __main__ import main") res_set = t.repeat(10, 1) print res_set, "::", min(res_set)
if __name__ == "__main__": block and ran it.
jmjones@qiwi 5:39AM parser % python aggregate_parser.py ~/svn/ediplex/example_data/edi_1000_interchanges.txt [3.1825220584869385, 2.4778821468353271, 2.4290919303894043, 2.4358069896697998, 2.4379229545593262, 2.4492578506469727, 2.439661979675293, 2.4325060844421387, 2.4347169399261475, 2.435114860534668] :: 2.42909193039
repeat(10, 1) runs 10 iterations of the specified code one time. I specified for the code to execute only once per iteration because I handled the looping by parsing a file of 1,000 EDI interchanges. The output of
repeat() is a list of runtimes for each repeating call to the specified code. The recommended method of determining performance for a piece of code (by using
timeit) is to take the minimum value from the returned list.
Finding and Fixing Bottlenecks
A primer on X12 EDI would probably be helpful at this point. EDI documents are character-delimited text files, which are more machine-readable than human-readable. Each document is made up of one interchange; each interchange is made up of a series of segments that roughly equate to database records; each segment is made up of elements that roughly equate to database columns. The first few characters of each segment (sometimes called the tag), up to the element-separating delimiter, declares what type of segment it is. The first segment in an X12 interchange has an ISA tag. The last segment in an X12 interchange has an IEA tag.
First problem: when ediplex parses an EDI document, it first determines whether it has a valid header segment, then proceeds to look for each individual segment and determines whether that segment is the end segment. This is really inefficient and results in an inordinate number of calls to the
body_seg() method as it chunks through the document. A better method would be to read in the body of the EDI document in bigger chunks to search specifically for the end segment rather than just the segment delimiter.
aggregate_parser calls a
segment() method in the specified EDI handler for each segment it finds, passing it the segment data.
aggregate_parser was using the
NoOpEDIHandler for this example. The
segment() method in this particular handler does nothing. The entirety of this
segment() method is
pass. It's interesting, but not surprising, that a no-op now takes about 20 percent of the overall processing time; it's not surprising because the parser called
segment() 136,000 times. This is an obvious inefficiency, but it is also an architectural problem. If I write a piece of code that uses this parser, which ignores the finer level of detail of an EDI document at the segment level, then handling segment data, even with a no-op, costs a 20 percent performance hit. A better method would be to return a segment iterator such that code will retrieve segments only when the interested code calls the iterator's
I have rearchitected ediplex to work more iteratively rather than relying on callbacks. I have also reduced the granularity of what the parser is interested in on its first pass. Now when ediplex parses a file of EDI interchanges, it identifies the header segment, then immediately starts looking for the footer segment rather than each body segment. When it finds a full interchange, it yields a parser object, which can, in turn, return an iterator for the interchange's segments.