Understanding Network I/O, Part 2
Asynchronous I/O
Asynchronous I/O is a technique specifically targeted at handling multiple I/O requests efficiently. In contrast, threads are a general concurrency mechanism that can be used in situations not related to I/O. Most modern operating systems, such as Linux and Windows, support asynchronous I/O.
Asynchronous I/O works very differently from threads. Instead of having an application spawn multiple tasks (that can then be used to perform I/O), the operating system performs the I/O on the application's behalf. This makes it possible for just one thread to handle multiple I/O operations concurrently. While the application continues to run, the operating system takes care of the I/O in the background.
Due to potentially more efficient, kernel-level I/O processing, the reduction in the total number of threads in the system, and dramatically fewer context switches, asynchronous I/O is sometimes the best method to use. Its major disadvantage is an increase in the complexity of the application's logic — an increase that can be very significant.
Two common ways of asking the operating system to perform asynchronous I/O are the select and poll system calls. While Python provides direct access to these facilities (via the select module), there are easier ways to take advantage of asynchronous I/O in your programs.
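To make the role of select more concrete, here is a minimal sketch using Python's select module. It is written for Python 3 and uses a local socket pair in place of a real network connection; the names left and right are illustrative only. It shows the basic idea: the program asks the operating system which sockets are ready, and reads only from those.

```python
import select
import socket

# Two connected sockets stand in for a client and a remote server.
left, right = socket.socketpair()

# Nothing has been sent yet, so a zero-timeout select reports
# that no socket is ready for reading.
readable, _, _ = select.select([left], [], [], 0)
nothing_ready = (readable == [])

# Once the peer writes, the operating system reports `left` as readable.
right.sendall(b"66.9 F")
readable, _, _ = select.select([left], [], [], 1.0)
data = readable[0].recv(1024) if readable else b""

print(nothing_ready, data)

left.close()
right.close()
```

A real client would pass many sockets to a single select call, which is what lets one thread service several connections.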
In particular, the Twisted framework, as mentioned in the previous article, makes working with asynchronous I/O quite painless in many cases. The next subsection will present a Twisted-based variation on our weather reader example.
The asyncore library provides another alternative to using poll or select directly. This is a lightweight facility that remains sufficiently low-level to give you a good look at the nature of asynchronous I/O. See the Asyncore Library subsection for details.
The Twisted Framework
Twisted is a large, comprehensive framework. It includes many diverse components such as a web server, a news server, and a web spider client. Achieving I/O concurrency with Twisted is not difficult, as the following example illustrates.
Example 6. A Twisted Framework client
# Import the Twisted network event monitoring loop.
from twisted.internet import reactor
# Import the Twisted web client function for retrieving
# a page using a URL.
from twisted.web.client import getPage
import re # Library for finding patterns in text.
# Twisted will call this function to process the retrieved web page.
def process_result(webpage,name,url,nrequests):
    # Pattern which matches text like '66.9 F'. The last
    # argument ('re.S') is a flag, which effectively causes
    # newlines to be treated as ordinary characters.
    match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)
    # Print out the matched text and a descriptive message;
    # if there is no match, print an error message.
    if match == None:
        print 'No temperature reading at URL:',url
    else:
        print 'In '+name+', it is now',match.group(1),'degrees.'
    # Keep a shared count of requests (see article text for details).
    nrequests[0] = nrequests[0] - 1 # Just finished a request.
    if nrequests[0] <= 0: # If this is the last request ...
        reactor.stop() # ... stop the Twisted event loop.
# Twisted will call this function to indicate an error.
def process_error(error,name,url,nrequests):
    print 'Error getting information for',name,'( URL:',url,'):'
    print error
    # Keep a shared count of requests (see article text for details).
    nrequests[0] = nrequests[0] - 1 # Just finished a request.
    if nrequests[0] <= 0: # If this is the last request ...
        reactor.stop() # ... stop the Twisted event loop.
# Three NOAA web pages, showing current conditions in New York,
# London and Tokyo, respectively.
citydata = (('New York','http://weather.noaa.gov/weather/current/KNYC.html'),
            ('London', 'http://weather.noaa.gov/weather/current/EGLC.html'),
            ('Tokyo', 'http://weather.noaa.gov/weather/current/RJTT.html'))
# Initialize the shared count of the number of requests. This will be
# passed as an argument to the callback functions above. It cannot
# be a simple integer (see article text for an explanation).
nrequests = [len(citydata)]
# Tell Twisted to get the above pages; also register our
# processing functions, defined previously.
for name,url in citydata:
    getPage(url).addCallbacks(callback = process_result,
                              callbackArgs = (name,url,nrequests),
                              errback = process_error,
                              errbackArgs = (name,url,nrequests))
# Run the Twisted event loop.
reactor.run()
Here is the output produced by the client:
Example 7. Twisted Framework client output
In London, it is now 46 degrees.
In New York, it is now 37.9 degrees.
In Tokyo, it is now 48 degrees.
Note that -- just like with the thread-based examples -- the output may be in a different order from the inputs. After all, asynchronous I/O is also a concurrency technique. As before, when several requests are performed in parallel, the faster ones will tend to pass the slower ones, finishing earlier.
Even with Twisted hiding the low-level details, the event-driven nature of asynchronous I/O is readily apparent in the example. When certain events take place, Twisted calls the functions we have previously supplied. These functions are known as callbacks, because you call the framework to pass the functions to it, and the framework subsequently calls them back. Callbacks are also common in other event-driven systems, such as GUI libraries.
You may not wait inside your Twisted callbacks; it is important to complete the required processing as fast as possible, and return control back to the framework. Any waiting will suspend the other requests, because only a single thread is doing all of the work.
Twisted defines a special construct, Deferred, for triggering callbacks. In our example, the getPage function actually returns a Deferred object. We then use its addCallbacks method to register our result-processing and error-handling functions.
The last line in our program (reactor.run()) starts the Twisted event loop. This transfer of control is common in event-driven systems; it allows the framework to invoke our callbacks in response to events. Our program will terminate when reactor.run() returns.
Now that we have surrendered control to Twisted, however, how can we make reactor.run() return? The example issues a reactor.stop() from either of our two callbacks. In order to prevent Twisted from exiting prematurely, we keep track of how many requests are left to process and only call reactor.stop() after all the requests have been processed.
We store the count of outstanding requests in a standard Python list that we specify as a parameter to our callbacks. In your own code, you may want to create a counter class for this purpose. Alternately, you can write your callbacks as methods of a class, keeping the count in an attribute. In any case, do not pass a simple integer to the callback. Any changes you make to such a type inside the function will be purely local, and will be discarded when the callback returns. See the Python Reference Manual for a deeper understanding of these issues.
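The difference between passing an integer and passing a list can be seen in a small standalone sketch (Python 3 syntax). The function names and the RequestCounter class are hypothetical illustrations of the alternatives mentioned above, not part of the weather reader.

```python
def finish_with_int(count):
    # Rebinding the parameter changes only this function's local name;
    # the caller's integer is untouched.
    count = count - 1

def finish_with_list(count):
    # Mutating the list changes the one object that the caller
    # and the callback both refer to.
    count[0] = count[0] - 1

class RequestCounter:
    """Hypothetical helper that tracks outstanding requests."""
    def __init__(self, total):
        self.remaining = total

    def finish(self):
        # Record one completed request; report whether it was the last.
        self.remaining -= 1
        return self.remaining <= 0

plain = 3
finish_with_int(plain)      # plain is still 3 afterwards

shared = [3]
finish_with_list(shared)    # shared is now [2]

counter = RequestCounter(3)
done_flags = [counter.finish() for _ in range(3)]

print(plain, shared, done_flags)
```

With a counter object, a callback could call the event loop's stop function as soon as finish() returns True.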
Although all invocations of the callback share the count, we need no locking to protect the value. Only one thread makes every call, so each invocation must complete before the next one can start. This ensures that the count is always consistent, because no one operation on it may preempt another already in progress.
The Asyncore Library
Asyncore is another Python project for dealing with asynchronous I/O. In contrast to the large, comprehensive Twisted, asyncore is small, lightweight, and included as part of the standard Python distribution. You may also be interested in asynchat (also included with Python), which provides extra functionality on top of asyncore. The well-known Zope project, a powerful, sophisticated web-application server, is built using asyncore.
Asyncore's minimalist approach comes at a price. While this library is higher level than using select or poll directly, it does not provide additional facilities such as HTTP protocol support. Asyncore is a fundamental building block, with a tight focus on just the I/O process itself.
The asyncore documentation includes an easy-to-follow web client example. It is immediately clear from this example that we must do all the work pertaining to the HTTP protocol ourselves. Asyncore provides only the I/O channel. The example also illustrates how to use asyncore in our programs: by writing a class that inherits from a base class supplied by the library.
Now we are ready to reimplement our weather reader with asyncore. Ideally, we would like to reuse the code from the Twisted client. After all, neither the logic of our program nor the underlying I/O method will change. In the following example, we substitute our own (asyncore-derived) CustomDispatcher class for the facilities previously provided by Twisted, leaving the rest of our code virtually intact.
Example 8. An Asyncore client
import asyncore # Lightweight library for asynchronous I/O.
import re # Library for finding patterns in text.
# Our asyncore-based dispatcher class.
import CustomDispatcher
# Function to process the retrieved web page.
def process_result(webpage,name,url):
    # Pattern which matches text like '66.9 F'. The last
    # argument ('re.S') is a flag, which effectively causes
    # newlines to be treated as ordinary characters.
    match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)
    # Print out the matched text and a descriptive message;
    # if there is no match, print an error message.
    if match == None:
        print 'No temperature reading at URL:',url
    else:
        print 'In '+name+', it is now',match.group(1),'degrees.'
# Function to indicate an error.
def process_error(error,name,url):
    print 'Error getting information for',name,'( URL:',url,'):'
    print error
# Three NOAA web pages, showing current conditions in New York,
# London and Tokyo, respectively.
citydata = (('New York','http://weather.noaa.gov/weather/current/KNYC.html'),
            ('London', 'http://weather.noaa.gov/weather/current/EGLC.html'),
            ('Tokyo', 'http://weather.noaa.gov/weather/current/RJTT.html'))
# Create one asyncore-based dispatcher for each of the above pages;
# also register our callback functions, defined previously.
for name,url in citydata:
    # No need to save the result of the constructor call, because
    # asyncore keeps a reference to our dispatcher objects.
    CustomDispatcher.CustomDispatcher(url,
                                      process_func = process_result,
                                      process_args = (name,url),
                                      error_func = process_error,
                                      error_args = (name,url))
# Run the asyncore event loop. The loop will terminate automatically
# once all I/O channels have been closed.
asyncore.loop()
The output is the same as in the Twisted example (of course, the order of the results returned may be different for each run). In addition, the code to stop the Twisted reactor is no longer required; asyncore will automatically exit its loop when all I/O channels have closed.
Most of the work required to create the asyncore example actually lies in writing the CustomDispatcher class. Because of the many low-level details it must handle, CustomDispatcher is quite a long piece of code compared to the other programs shown in this article. You can download it from the previous link or read it in the appendix.
CustomDispatcher strives to be a fairly complete example that is also compatible with several versions of Python and asyncore. In addition, the goal is to write simple code that makes it easier to understand the nature of asynchronous I/O, rather than to come up with the most efficient implementation.
As mentioned in the Twisted discussion, programs relying on asynchronous I/O are event-driven by nature. This is certainly different from the threaded examples given earlier. The CustomDispatcher class is sufficiently low-level to clearly bring out these differences.
When using synchronous I/O with threads, the physical layout of the program can correspond closely with its internal logic. For instance, each thread in our multitasking examples performs, in order, the following tasks:
- Send a request to the server.
- Receive the response.
- Process the response and output the results.
All of these operations can be written naturally, from the top down, in the program's source code. While urllib takes care of the first two steps in our examples, it still does so in the context of the threads we create.
As a thread performs the first two steps, it may have to wait an unpredictable amount of time for the network I/O to complete. When a thread is waiting, the operating system will allow other threads to run. Thus, if one of the threads has entered a lengthy sleep (e.g., in step 2), it will not prevent the other threads from performing step 3.
The situation changes completely when we use asynchronous I/O. Now, waiting is not allowed -- there is only one thread doing all the work. Instead, we perform I/O operations only when the operating system tells us that they will succeed immediately.
For each such I/O event, it is entirely likely that we will write less data than is needed to complete our request, or read only part of the incoming reply, etc. The unfinished work will have to be continued when the next I/O event comes. We must therefore store enough state information in order to resume the partially completed operation correctly at a later time.
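The need to carry state across I/O events can be illustrated with a short Python 3 sketch. It assumes a local socket pair instead of a real server, and the PendingSend class is a hypothetical helper, not part of CustomDispatcher. Each time the socket is reported writable, the helper sends whatever the operating system will accept and remembers the remainder for the next event.

```python
import select
import socket

class PendingSend:
    """Hypothetical helper: remembers how much of a message is still unsent."""
    def __init__(self, sock, payload):
        self.sock = sock
        self.unsent = memoryview(payload)  # slicing a memoryview does not copy
        self.calls = 0

    def on_writable(self):
        # send() may accept only part of the data; keep the rest for later.
        try:
            sent = self.sock.send(self.unsent)
        except BlockingIOError:
            sent = 0  # nothing accepted this time; try again on the next event
        self.unsent = self.unsent[sent:]
        self.calls += 1

    def done(self):
        return len(self.unsent) == 0

sender, receiver = socket.socketpair()
sender.setblocking(False)
receiver.setblocking(False)

message = b"x" * (1024 * 1024)  # large enough that one send is unlikely to suffice
pending = PendingSend(sender, message)
received = bytearray()

# One thread drives both ends, acting only when select reports readiness.
while len(received) < len(message):
    want_write = [] if pending.done() else [sender]
    readable, writable, _ = select.select([receiver], want_write, [], 5.0)
    if not readable and not writable:
        break  # timed out; give up rather than wait forever
    if writable:
        pending.on_writable()
    if readable:
        received += receiver.recv(65536)

sender.close()
receiver.close()
print(len(received), pending.calls)
```

The number of calls needed depends on the platform's socket buffer sizes, so it will vary from system to system.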
Asyncore translates the results of the low-level system call (select or poll) into calls to handle_read, handle_write, and so on. We provide these methods in our CustomDispatcher class.
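The following toy loop sketches this translation in Python 3. It is not asyncore's actual implementation and does not import asyncore (the module was removed from the standard library in Python 3.12). The ToyChannel class and toy_loop function are hypothetical; only the handler names handle_read, handle_write, and handle_close mirror asyncore's conventions.

```python
import select
import socket

class ToyChannel:
    """Hypothetical channel with asyncore-style handler names."""
    def __init__(self, sock, outgoing):
        self.sock = sock
        self.outgoing = outgoing
        self.incoming = b""
        self.closed = False

    def fileno(self):
        # Lets select.select() accept this object directly.
        return self.sock.fileno()

    def writable(self):
        return bool(self.outgoing)

    def handle_write(self):
        sent = self.sock.send(self.outgoing)
        self.outgoing = self.outgoing[sent:]
        if not self.outgoing:
            # Tell the peer that no more data is coming.
            self.sock.shutdown(socket.SHUT_WR)

    def handle_read(self):
        data = self.sock.recv(4096)
        if data:
            self.incoming += data
        else:
            self.handle_close()  # empty read: the peer has finished sending

    def handle_close(self):
        self.sock.close()
        self.closed = True

def toy_loop(channels, timeout=1.0):
    # Runs until every channel has closed, or until nothing happens in time.
    while channels:
        writers = [c for c in channels if c.writable()]
        readable, writable, _ = select.select(list(channels), writers, [], timeout)
        if not readable and not writable:
            break
        for channel in writable:
            channel.handle_write()
        for channel in readable:
            channel.handle_read()
        channels[:] = [c for c in channels if not c.closed]

sock_a, sock_b = socket.socketpair()
a = ToyChannel(sock_a, b"ping")
b = ToyChannel(sock_b, b"pong")
channels = [a, b]
toy_loop(channels)
print(a.incoming, b.incoming)
```

As in the asyncore example above, the loop ends by itself once all channels have closed.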
Our class is also a great place to keep state information. In particular, note the __is_header member variable. It is used as a flag to indicate that we have not yet finished reading the HTTP header.
Due to the nature of asynchronous I/O, it is likely that handle_read will be called multiple times before the entire web page is read. In addition, one of these read operations will probably wind up reading the last part of the HTTP header and the first part of the body. After all, the low-level asynchronous I/O routines are not familiar with the HTTP protocol. The transition from header to body is meaningless to them. Our handle_read method must carefully preserve any body content as it discards the header; otherwise part of the information we are interested in would be lost. Keep these sorts of issues in mind when working with asynchronous I/O.
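One way to handle the header-to-body transition is sketched below in Python 3. The BodyCollector class is a hypothetical stand-in, not the actual CustomDispatcher code: it buffers incoming chunks until the blank line that ends the HTTP header has arrived, then keeps everything after it as body content. The response text is made up for the demonstration.

```python
class BodyCollector:
    """Hypothetical helper: discards an HTTP header and keeps the body."""
    SEPARATOR = b"\r\n\r\n"  # the blank line that ends an HTTP header

    def __init__(self):
        self.is_header = True     # True until the whole header has been seen
        self.header_buffer = b""
        self.body = b""

    def feed(self, chunk):
        if not self.is_header:
            self.body += chunk
            return
        # Buffer header data so that a separator split across two
        # chunks is still recognized.
        self.header_buffer += chunk
        _, separator, rest = self.header_buffer.partition(self.SEPARATOR)
        if separator:
            self.is_header = False
            self.header_buffer = b""
            self.body = rest      # body bytes that arrived with the header

response = (b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
            b"<html>66.9 F</html>")

collector = BodyCollector()
# Deliver the response in 7-byte pieces to imitate several read events.
for start in range(0, len(response), 7):
    collector.feed(response[start:start + 7])

print(collector.body)
```

The same body results whatever the chunk size, including chunks that cut the separator in half.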
When writing your own asyncore-based dispatcher classes, you may also want to override the handle_expt, handle_error, and log methods. These methods deal with Out-Of-Band data (OOB), unhandled errors, and logging, respectively. See the asyncore documentation and the library source code itself (file asyncore.py, installed on your hard drive in the same place as the rest of the standard Python library) for more information. The asyncore source code is actually quite easy to read. Also note that OOB is a rarely used feature of the TCP/IP protocol family.
CustomDispatcher uses Python's built-in apply function to call the supplied callback functions. This allows the list of arguments to the callbacks to be generated dynamically. Note, however, that apply has been deprecated in Python version 2.3. Unless you want to support old versions of the language (notably version 1.5), you should use the extended call syntax to achieve the same result. See the documentation of the deprecated apply function for a description of the extended call syntax.
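As a brief illustration of the extended call syntax, the sketch below (Python 3) calls a callback with a dynamically built argument list. The invoke helper, its parameter names, and the example URL are hypothetical; the call inside it plays the role that apply(callback, args, keywords) played in older code.

```python
def process_result(webpage, name, url):
    return "In %s (%s): %d bytes" % (name, url, len(webpage))

def invoke(callback, first, extra_args=(), extra_kwargs=None):
    # The * and ** operators unpack a sequence and a dictionary into
    # positional and keyword arguments, as apply() once did.
    return callback(first, *extra_args, **(extra_kwargs or {}))

message = invoke(process_result, "<html></html>",
                 ("Tokyo", "http://example.invalid/tokyo"))
print(message)
```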
