If you are creating networked applications in Python, there is a powerful, open source tool that can help you. The Twisted framework had its big debut during the 2003 Python Conference in Washington, DC (PyCon DC 2003). Twisted takes care of many networking details, thus reducing the amount of development work required, particularly for complex systems. Here is another implementation of our simple client, this time using Twisted.

    # Import the Twisted network event monitoring loop.
    from twisted.internet import reactor

    # Import the Twisted web client function for retrieving
    # a page using a URL.
    from twisted.web.client import getPage

    import re # Library for finding patterns in text.

    # Twisted will call this function to process the
    # retrieved web page.
    def process_result(webpage):
        # Pattern which matches text like '66.9 F'. The last
        # argument ('re.S') is a flag, which effectively causes
        # newlines to be treated as ordinary characters.
        match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)

        # Print out the matched text and a descriptive message.
        print 'In New York, it is now',match.group(1),'degrees.'
        reactor.stop() # Stop the Twisted event loop.

    # Twisted will call this function to indicate an
    # error.
    def process_error(error):
        print error
        reactor.stop() # Stop the Twisted event loop.

    # The NOAA web page, showing current conditions in New York.
    url = 'http://weather.noaa.gov/weather/current/KNYC.html'

    # Tell Twisted to get the above page; also register our
    # processing functions, defined previously.
    getPage(url).addCallbacks(callback=process_result,
                              errback=process_error)

    # Run the Twisted event loop.
    reactor.run()

In particular, note the error handling function, which is passed to the Twisted framework. Such "hooks" for error handling are very important, if the system is to be extensible. Dealing with errors is the topic of the next section. Make sure to read it before writing your own applications.
One very notable characteristic of the Twisted example is evident in the last line of the code. Twisted's event loop resembles that of the Tkinter example shown previously. In Tkinter, the event loop drives the GUI, responding to inputs from the user. Twisted processes network events in a similar way; like a waiter in a restaurant, the framework can keep track of several "customers" (connections), responding when each one asks for service (i.e., issues an event). This is called asynchronous I/O, and will be covered in more detail in the next article.
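The waiter analogy can be sketched without any network at all. The following is a minimal illustration, not Twisted code: it assumes Python 3 and uses the standard library's asyncio event loop in place of Twisted's reactor, and the customer names and delays are invented for the example. Each simulated connection becomes ready after a different delay, and the loop serves them in order of readiness rather than in the order they were started.

```python
import asyncio


async def customer(name, delay, served):
    # Simulate a connection that becomes ready after 'delay' seconds.
    # While this coroutine sleeps, the event loop is free to serve others.
    await asyncio.sleep(delay)
    served.append(name)


async def serve_all(customers):
    # Start every simulated connection at once and wait for all of them.
    served = []
    await asyncio.gather(
        *(customer(name, delay, served) for name, delay in customers))
    return served


if __name__ == '__main__':
    order = asyncio.run(
        serve_all([('slow', 0.3), ('fast', 0.1), ('medium', 0.2)]))
    print(order)  # prints ['fast', 'medium', 'slow']
```

The whole run takes about as long as the slowest customer, not the sum of all three delays, because the loop never sits idle waiting on a single connection.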
The admonition to handle error cases in your code is heard so often that you may have learned to ignore it. In networked applications, however, error handling is far more critical than you may be used to. Network-related errors are frequent, even commonplace. In such an environment, a "quick and dirty" utility will not survive at all.
Fortunately, there are simple things that can be done to deal with errors in network applications. We will discuss them in this section. In addition, we will modify our original simple client to handle common error conditions.
Security concerns itself with a very specific kind of error: a deliberate assault on your system by a remote attacker. This is a monumental subject that can easily become the focus of your entire life. Nevertheless, any network application ignores security only at great peril. The following list provides several guidelines that should keep you reasonably safe in simple situations.
1. Run your program with the lowest possible privileges that still allow it to work. Even a client performs actions on your system in response to untrusted information received over the network. The less power it has, the less damage attackers (or buggy code) can do.
2. Control the sizes of things, even when using a language such as Python. While Python should protect you against the buffer overflow errors so common in C and C++, unexpectedly large input can still cause unpredictable behavior as system resources are used up. For example, it is not reasonable for a person's last name to be ten megabytes long; you should catch such conditions early, and then either truncate the input or refuse to continue processing altogether.
3. Make sure things are in the right format. A country name should not contain question marks, for example. Regular expressions are very useful for this sort of input validation, provided they are constructed carefully, so that only correctly formatted input will match.
4. Avoid choosing files to manipulate based on a network request. After all, a filename is basically a pointer: it refers to some place in the filesystem, similar to the way a pointer in the C language refers to some place in memory. Reading a file based on a network request can expose private information, while writing a file can overwrite critical data and compromise the system. In contrast, if you locally specify the name of a configuration file to read, or a log file to write to, then using these locally specified files is usually safe.
5. Executing another program with arguments received over the network can lead to very serious security breaches. Consider eliminating such requirements by design, as you would a goto statement in some languages. If you still choose to go ahead, always analyze the input very carefully, in order to avoid feeding malformed data to the other program. Under Unix-like operating systems, for example, these kinds of calls are typically made via the intervening shell program. This is especially dangerous: the shell supports a quite capable programming language, meaning that insufficient input validation will actually allow an attacker to write and (possibly) execute arbitrary code on your system.
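Several of these guidelines translate directly into small checks. The sketch below assumes Python 3; every name in it (validate_country, report_path, echo_safely), the length limit, the ASCII-only country pattern, and the report filenames are hypothetical choices made for illustration, not part of the original client. It covers the size and format guidelines, lets a request select only among locally specified filenames, and shows one way to hand untrusted text to another program as a list of arguments, so that no shell interprets it.

```python
import re
import subprocess
import sys

MAX_NAME_LEN = 100  # Hypothetical upper bound on an acceptable name.

# Deliberately strict, ASCII-only pattern (an assumption for this sketch):
# a letter, followed by letters, spaces, periods, apostrophes, or hyphens.
COUNTRY_PATTERN = re.compile(r"[A-Za-z][A-Za-z .'-]*")

# Filenames are specified locally; a network request may only pick a key.
REPORT_FILES = {'daily': 'daily_report.txt', 'weekly': 'weekly_report.txt'}


def validate_country(text):
    """Return the name if its size and format are acceptable, else None."""
    if len(text) > MAX_NAME_LEN:
        return None  # Refuse oversized input before doing anything else.
    if COUNTRY_PATTERN.fullmatch(text) is None:
        return None  # The whole string must match, not just a part of it.
    return text


def report_path(requested_key):
    """Map a requested key to a locally chosen filename, or None."""
    return REPORT_FILES.get(requested_key)


def echo_safely(untrusted):
    """Pass untrusted text to another program without involving a shell.

    The child here is simply the Python interpreter printing its argument;
    because the arguments are given as a list, shell metacharacters in
    'untrusted' are never interpreted.
    """
    result = subprocess.run(
        [sys.executable, '-c', 'import sys; print(sys.argv[1])', untrusted],
        capture_output=True, text=True, check=True)
    return result.stdout.rstrip('\n')
```

Avoiding the shell removes only one class of problems. The receiving program still sees the raw text, so the careful validation described above remains necessary.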
The following example shows our simple client, modified to perform several important error checks.

    import urllib # Library for retrieving files using a URL.
    import re # Library for finding patterns in text.
    import sys # Library for system-specific functionality.

    # The NOAA web page, showing current conditions in New York.
    url = 'http://weather.noaa.gov/weather/current/KNYC.html'

    # The maximum amount of data we are prepared to read.
    MAX_PAGE_LEN = 20000

    # Open and read the web page; catch any I/O errors.
    try:
        webpage = urllib.urlopen(url).read(MAX_PAGE_LEN)
    except IOError, e:
        # An I/O error occurred; print the error message and exit.
        print 'I/O Error: ',e.strerror
        sys.exit()

    # Pattern which matches text like '66.9 F'. The last
    # argument ('re.S') is a flag, which effectively causes
    # newlines to be treated as ordinary characters.
    match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)

    # Print out the matched text and a descriptive message;
    # if there is no match, print an error message.
    if match is None:
        print 'No temperature reading at URL:',url
    else:
        print 'In New York, it is now',match.group(1),'degrees.'

First, we limit the size of the web page that we are prepared to read (the MAX_PAGE_LEN variable). This is a security precaution, as described in item 2 on the list in the previous subsection.
Next, we make sure to catch any I/O errors from the network operation. The urllib module raises an exception in such situations. In comparison to a local hard drive, for example, network I/O operations fail much more frequently. While the lower levels of networking software on your machine will (in cooperation with remote systems) try to effect a recovery, this is not always possible. You must therefore be prepared to deal with unrecoverable errors yourself. In this case, we simply print an error message and exit.
Finally, we add an explicit check of whether the regular expression pattern has matched. If there is no match, the temperature reading is not available, and an error message is printed instead. A pattern match failure is also a clear indication that something unexpected has been received — it is therefore important that your code deals with these faults explicitly.
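The client in this section is written in Python 2 syntax. As a rough sketch of how the same three checks might look in Python 3, the version below splits the work into two hypothetical helper functions, read_page and find_temperature, so that the pattern check can be exercised without a network connection. The function names and the choice to return None on failure are assumptions made for this illustration.

```python
import re
import urllib.request

# The maximum amount of data we are prepared to read.
MAX_PAGE_LEN = 20000

# Pattern which matches text like '66.9 F'.
TEMPERATURE = re.compile(r'(-?\d+(?:\.\d+)?) F')


def read_page(url):
    """Return at most MAX_PAGE_LEN bytes of the page as text, or None."""
    try:
        with urllib.request.urlopen(url) as response:
            data = response.read(MAX_PAGE_LEN)
    except OSError as e:
        # urllib's URLError and low-level socket errors are OSError subclasses.
        print('I/O Error:', e)
        return None
    # Undecodable bytes are replaced rather than raising an exception.
    return data.decode('utf-8', errors='replace')


def find_temperature(webpage):
    """Return the matched temperature text, or None if nothing matches."""
    match = TEMPERATURE.search(webpage)
    if match is None:
        return None
    return match.group(1)
```

A caller would first check the result of read_page for None, then pass the text to find_temperature and check that result as well, printing an error message in either failure case.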
One of the most challenging — but fascinating — aspects of network I/O is its unpredictability. As mentioned in the previous subsection, such operations are not reliable; it is not unexpected for a network request to simply fail. Sometimes, however, it does not fail in a clean, readily apparent fashion. Instead, data transmission might start only after a lengthy delay (high latency), proceed very slowly (low bandwidth), or both.
Many factors, in myriad combinations, can cause such unpredictable behavior. On the Internet, data routinely travels over very long distances — even across continents — as it hops from one system to another towards its final destination. Anywhere along the route, hardware failures, software crashes, excessive network traffic, misconfigured systems, electromagnetic interference, and many other causes can disrupt the orderly flow of data.
Unpredictability of network operations becomes a central concern for servers. It is rarely acceptable to make all clients wait because one particular connection is having trouble. If you write a more complex client, such as a spider that gathers information from multiple web sites, you will also run into this problem.
Waiting for each query to complete fully before starting the next is very time-consuming. In addition, failing to complete a crawl of a thousand sites just because the connection to number five on your list is "hanging" is unlikely to be the desired behavior. Fortunately, a modern computer is physically capable of handling hundreds or thousands of network operations in parallel. Consequently, many useful strategies for concurrent network I/O have been developed, researched, and deployed in actual systems. This will be the topic of the next article.
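To illustrate why one "hanging" connection need not stall the rest, here is a small simulation, again assuming Python 3 and the standard asyncio module rather than any particular framework. No real network traffic is involved: fake_fetch is a stand-in that merely sleeps, and the site names, delays, and timeout are invented. Each query gets its own deadline, so the crawl completes even though one site never answers in time.

```python
import asyncio


async def fake_fetch(site, delay):
    # Stand-in for a real network request: wait, then return some text.
    await asyncio.sleep(delay)
    return 'page from ' + site


async def fetch_or_give_up(site, delay, timeout):
    # Give each query its own deadline; report None instead of waiting on.
    try:
        page = await asyncio.wait_for(fake_fetch(site, delay), timeout)
    except asyncio.TimeoutError:
        page = None
    return site, page


async def crawl(sites, timeout):
    # Start every query at once, then collect one result per site.
    results = await asyncio.gather(
        *(fetch_or_give_up(site, delay, timeout) for site, delay in sites))
    return dict(results)


if __name__ == '__main__':
    # 'site5' is the hanging connection: it would take a full minute.
    sites = [('site1', 0.05), ('site5', 60.0), ('site9', 0.1)]
    for site, page in asyncio.run(crawl(sites, timeout=0.5)).items():
        print(site, '->', page)
```

The run finishes after roughly the timeout, with a result for every site; the hanging one is simply recorded as None and can be retried or reported later.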