This series of two articles will show you how to participate in the global Internet. The architects of this greatest advance in communications since the printing press have strived to make such participation possible. The Internet is — by design — a simple network, engineered only to route data from one location to another. All of the valuable services (such as the World Wide Web and email) are implemented at the endpoints, not inside of the Internet itself (see Further Reading for more information).
Because everybody is allowed to add value at the endpoints of the Internet, you can do it too! Today, this is easier than ever before. You do not need to be proficient in a difficult programming language, such as C or C++. For example, you will see that a useful web client can be about as complicated as a shopping list.
This article describes the basics of networked applications, providing information and sample code to get you started immediately. The second article will delve deeper into various techniques for network I/O, including important, practical results that help you choose the best method.
This article focuses on Internet clients. Clients — like your web browser — request information from servers (like the one from which you accessed this page). Typically, the client then presents the information to a person, although there are clients that talk to other computer programs instead. The next article will present ideas that are also applicable to developing servers and peer-to-peer systems.
With just a basic Internet connection, you can create your own clients (for personal or internal company use) in a leisurely afternoon. The following discussion presents the core techniques, illustrated with complete, working web clients. It is very important that you do not skip the last section of this article. The discussion there covers several simple things that you can do to avoid mystifying errors and security breaches in your network application.
Here is a simple web client. It displays the current outdoor temperature in New York City and then exits.
Example 1. A simple client
import urllib  # Library for retrieving files using a URL.
import re      # Library for finding patterns in text.

# The NOAA web page, showing current conditions in New York.
url = 'http://weather.noaa.gov/weather/current/KNYC.html'

# Open and read the web page.
webpage = urllib.urlopen(url).read()

# Pattern which matches text like '66.9 F'. The last
# argument ('re.S') is a flag, which effectively causes
# newlines to be treated as ordinary characters.
match = re.search(r'(-?\d+(?:\.\d+)?) F', webpage, re.S)

# Print out the matched text and a descriptive message.
print 'In New York, it is now', match.group(1), 'degrees.'
Here is the output produced by the client.
Example 2. Output from the simple client
In New York, it is now 52.0 degrees.
The client is written in Python, probably the easiest fully featured mainstream programming language available today. Python is used by such organizations as Google, Yahoo, NASA, and Lawrence Livermore National Laboratory. The language is open source (so you can use it freely) and cross-platform. Note that any text in the example that starts with a # is a descriptive comment for human readers of the code; Python ignores these comments.
If you are really unfamiliar with Python, the Further Reading appendix provides several helpful links. C++, Perl, Java, and Visual Basic programmers should find Python very easy. It is probably the ideal choice for beginners, as well.
To run the examples in this article, it is sufficient to save the code into an ordinary text file and then issue the command python, followed by the name of that file, from the shell (the DOS command prompt, for Windows users). See the Installation Notes for easy directions on installing Python.
Now we are ready for a more detailed discussion of the example. The first two lines import libraries to use in our program. The Python distribution includes a library to download files based on a URL (e.g., from a web server). There is no need to do any low-level socket programming — the library does this work for you.
The program opens and reads a web page from the National Oceanic and Atmospheric Administration (NOAA) web server. This server hosts pages with current weather information from all over the world. The specific page retrieved by our example shows the conditions in New York City. The information is saved as one long string in the webpage variable.
Next, a regular expression is applied to the string. This is a quick, effective way to extract information from text without implementing complex parsing logic yourself. Applications range from input validation for security (e.g., in CGI scripts) to bioinformatics (e.g., searching for DNA sequences in a genome).
While regular expressions can be a difficult topic, you do not need to become an absolute expert to make good use of this technology in many situations. The Further Reading appendix lists several resources for learning about regular expressions.
The final line in the example prints out the temperature reading, captured as part of the regular expression match. The following discussion gives the details of the matching process.
Here is, once again, the regular expression pattern from the simple client example.
Example 3. Temperature-reading pattern
r'(-?\d+(?:\.\d+)?) F'
The leading r specifies a raw Python string. This prevents the interpreter from processing special characters (such as the backslash). In a raw string, what you see is what you get: r'\n' is two characters (a backslash followed by the letter n), whereas an ordinary '\n' would be interpreted as a linefeed. If you are interested in the subtle details (not usually necessary in day-to-day programming), read the string literals subsection of the Python Reference Manual.
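The difference between raw and ordinary string literals is easy to check in the interpreter. The following is a minimal sketch; the variable names raw and cooked are chosen here for illustration only.

```python
# A raw string keeps the backslash as an ordinary character.
raw = r'\n'
# An ordinary string turns the same two keystrokes into one linefeed.
cooked = '\n'

print(len(raw))      # 2: a backslash followed by the letter n
print(len(cooked))   # 1: a single linefeed character

# The raw string is the same as an ordinary string with an escaped backslash.
print(raw == '\\n')  # True
```

Regular expression patterns contain many backslashes, which is why they are usually written as raw strings.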
The first character in the pattern string is the open bracket, (. The bracket starts a grouping; text that matches the pattern inside of the brackets will be saved, so you can retrieve it for later use. This is the first grouping in the pattern; thus, the last line of the example retrieves the text captured by the grouping with match.group(1).
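To see what a grouping saves, compare the whole match with the first group. This is a small sketch that uses the article's pattern on a made-up line of text (the sample string is an assumption, not actual NOAA output).

```python
import re

# Hypothetical sample text; the real page is much longer.
text = 'Temperature: 66.9 F (19.4 C)'

match = re.search(r'(-?\d+(?:\.\d+)?) F', text)

whole = match.group(0)    # everything the pattern matched
reading = match.group(1)  # only the text inside the first grouping

print(whole)    # 66.9 F
print(reading)  # 66.9
```

Group 0 always refers to the entire match, while groups numbered from 1 upward refer to the saved groupings in the order their opening brackets appear.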
The temperature reading might be negative, so the minus sign is next. The question mark is a special flag that indicates that the preceding character is optional (in this case, the temperature reading may or may not contain a minus sign). If you need to match an actual question mark in the text, escape it with a backslash like this: \? (similarly, the backslash itself is matched by \\).
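The optional minus sign and the two escapes can be tried out on short strings. The sample strings below are invented for this sketch.

```python
import re

# '-?' makes the minus sign optional, so both readings are found.
negative = re.search(r'-?\d', 'low of -5').group(0)
positive = re.search(r'-?\d', 'high of 7').group(0)
print(negative)  # -5
print(positive)  # 7

# An escaped question mark matches a literal '?' in the text.
has_question = re.search(r'\?', 'Cold today?') is not None
print(has_question)  # True

# A doubled backslash in the pattern matches one literal backslash.
has_backslash = re.search(r'\\', r'C:\temp') is not None
print(has_backslash)  # True
```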
Next, we expect to find some digits. A single digit is matched by \d. Note that d is a normal character, which usually matches itself (i.e., the literal letter d in the text). Here, the backslash is used to turn on a special meaning for d: that of matching a single digit. This is a general technique in regular expressions; some ordinary characters (like d) have a special meaning when prefixed with a backslash, while special characters (such as ?) become ordinary characters (which match themselves) when the backslash is used.
The \d is followed by the special character +, which specifies that at least one, but possibly more, of the preceding characters should be matched. Thus, \d+ matches any number of digits, but there must be at least one digit, or the match will fail. If you want to allow zero or more digits, use * instead of +, like this: \d*.
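The practical difference between + and * shows up when the text contains no digits at all. A brief sketch, using invented input strings:

```python
import re

# '+' needs at least one digit, so there is no match here.
plus = re.search(r'\d+', 'abc')
print(plus is None)   # True

# '*' accepts zero digits, so it matches the empty string at the start.
star = re.search(r'\d*', 'abc')
print(repr(star.group(0)))  # ''

# When digits are present, '+' takes as many as it can.
digits = re.search(r'\d+', 'abc123').group(0)
print(digits)  # 123
```

A pattern that can match the empty string succeeds almost everywhere, which is rarely what you want when extracting data.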
After the first string of digits, the temperature might contain a decimal point, with more digits following. This whole sequence, however, is optional, so a little more work is required to specify the pattern. We begin another grouping, nested inside of our original one. This time, however, the grouping is opened with (?: instead of just an opening bracket. The ?: sequence just inside of the bracket indicates that the grouping should not be saved, only matched. We are already saving the entire match in our first grouping, and having a copy of just the fractional part of the reading is not needed for this example.
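The effect of ?: can be observed by listing the saved groups for two versions of the pattern. In this sketch, the capturing variant is a variation written for comparison; only the non-capturing one appears in the article's client.

```python
import re

text = '66.9 F'

# Variation: the inner grouping is an ordinary (saving) one.
capturing = re.search(r'(-?\d+(\.\d+)?) F', text)

# The article's pattern: the inner grouping is matched but not saved.
noncapturing = re.search(r'(-?\d+(?:\.\d+)?) F', text)

print(capturing.groups())     # ('66.9', '.9')
print(noncapturing.groups())  # ('66.9',)
```

Both patterns match the same text; they differ only in how many groups are available afterwards.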
Inside of the new grouping, we have \.\d+. The \d+ is already familiar: it matches one or more digits. The \. matches the decimal point. The backslash escape is required because . by itself matches almost any single character. This is why, for example, .* is often used to "match anything."
After the closing bracket of the nested grouping, there is a single question
mark. This question mark applies to the entire nested grouping. Thus,
we have an optional fractional part — the decimal point and at least one
digit — to the temperature reading. There are easier ways to specify this
fractional part, but they may allow malformed constructs (such as
65. with no digits after the decimal point) to slip through.
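To illustrate the point about malformed readings, the sketch below compares the number part of the article's pattern with a looser alternative. The looser pattern is a hypothetical example of an "easier way", not something taken from the article, and the trailing $ anchors are added here so that each test string must match in full.

```python
import re

# Number part of the article's pattern, anchored at the end.
strict = re.compile(r'-?\d+(?:\.\d+)?$')

# Hypothetical looser variant: optional point, then zero or more digits.
loose = re.compile(r'-?\d+\.?\d*$')

for sample in ['65', '65.0', '-3.5', '65.']:
    # re.match anchors at the start; '$' anchors at the end.
    print(sample, strict.match(sample) is not None,
          loose.match(sample) is not None)

strict_rejects = strict.match('65.') is None
loose_accepts = loose.match('65.') is not None
```

Only the last sample, 65., is treated differently: the strict pattern rejects it, while the looser one lets it through.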
Next comes the closing bracket of our top-level grouping. After that, the last part of the pattern will match an ordinary space followed by the F character (which stands for "Fahrenheit"). While only the part inside of the outer brackets (the temperature reading itself) will be saved, the single space and the letter F must follow in the text in order for the match to succeed. This is a simple case of including extra information in a pattern, in order to identify just the right data. In the current example, here is the result if the "space F" sequence is left out of the regular expression.
Example 4. Matching the wrong thing
In New York, it is now 3 degrees.
When it is winter in New York, you might actually think that this reading is correct! In fact, it comes from the W3C substring, which occurs at the start of the returned HTML page.
Example 5. Start of an HTML document
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "hmpro6.dtd">
In the given example, just adding the "space F" yields the right result. Depending on the complexity and variability of the data being processed, you may need more identifying information in your pattern, in order to select the correct text. Of course, if you do not control the generation of the data you are processing, it is always possible that its format will be changed without warning. You should always consider this possibility, as well as the potential impact on your application.
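The mismatch is easy to reproduce without a network connection. The sketch below applies both versions of the pattern to a short stand-in page; the DOCTYPE line is the one shown in Example 5, while the body line is invented for illustration.

```python
import re

# Stand-in for the downloaded page (the body text is an assumption).
page = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
        '"hmpro6.dtd">\n'
        '<html><body>Temperature: 52.0 F (11.1 C)</body></html>')

# Without the "space F" suffix, the first number on the page wins.
without_f = re.search(r'(-?\d+(?:\.\d+)?)', page, re.S)
print(without_f.group(1))  # 3

# With the suffix, only a number followed by ' F' is accepted.
with_f = re.search(r'(-?\d+(?:\.\d+)?) F', page, re.S)
print(with_f.group(1))     # 52.0
```

Keeping a few saved pages like this around makes it possible to test a pattern repeatedly without contacting the server.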
Another interesting variation is writing a web client for a site that you control. For example, the client may interact with the CGI scripts on the web site. In this case, you would have effectively defined a network protocol of your own — layered on top of the HTTP protocol.
In the next section, we provide a Graphical User Interface (GUI) for our simple client. This is quite easy to do in Python.
This example uses Tkinter, one of several open source GUI libraries available for Python. Tkinter is the de facto standard Python GUI, and is typically included with the language (see Installation Notes for details). An Introduction to Tkinter is a highly accessible tutorial on using this library in your own programs.
Example 6. A simple GUI client
import urllib   # Library for retrieving files using a URL.
import re       # Library for finding patterns in text.
import sys      # Various system facilities.
import Tkinter  # The Tkinter GUI library.

# The class for our simple GUI.
class SimpleGUI:

    # Class constructor -- note that the first argument, 'self'
    # is a reference to the object being created.
    def __init__(self, master, title, url, pattern, samplerate):

        # Save the URL and the regular expression pattern as part
        # of the object being constructed -- we'll need them later.
        self.__url = url
        self.__pattern = pattern

        # The samplerate is per hour, while Tkinter uses milliseconds
        # to calculate delays. Thus, we convert a per hour rate
        # into a delay value in milliseconds, and save it inside our
        # object for later use.
        self.__delay = int(60*60*1000.0/samplerate)

        # The master is our parent widget, we'll need to make calls
        # to it later, so save the reference inside the object. In
        # this case, our parent widget will actually be the Tkinter
        # root widget itself.
        self.__master = master

        # Set the root window title.
        master.title(title)

        # Create a frame, to hold our controls.
        frame = Tkinter.Frame(master)

        # This makes the frame visible; see "An Introduction to Tkinter"
        # for more details.
        frame.pack()

        # Create a button; its action ('command') is to exit the
        # program. We pack the button in the first available position
        # in the frame, furthest to the right. If you want to know
        # about other layout options for your windows, see
        # "An Introduction to Tkinter".
        Tkinter.Button(frame, text="EXIT",
                       command=frame.quit).pack(side=Tkinter.RIGHT)

        # A label widget will be used to display the output. We save
        # a reference to it in our object, and again pack towards the
        # right of the frame.
        self.__show = Tkinter.Label(frame)
        self.__show.pack(side=Tkinter.RIGHT)

        # We've set everything up -- now run our sampling routine.
        self.__sample()

    # The sampling routine, which is a member function of our class.
    # It is almost the same as the simple command-line client shown
    # previously. Note that this routine operates on an object of
    # our SimpleGUI class -- this object is passed via the 'self'
    # parameter of the routine.
    def __sample(self):

        # Open and read the web page.
        webpage = urllib.urlopen(self.__url).read()

        # Pattern which matches text like '66.9 F'. The last
        # argument ('re.S') is a flag, which effectively causes
        # newlines to be treated as ordinary characters.
        match = re.search(self.__pattern, webpage, re.S)

        # Display the matched text and a descriptive message.
        self.__show.config(text='it is now '+match.group(1)+' degrees.')

        # Tell the Tkinter root to call our sample routine again
        # after the delay we have set earlier. This will update
        # our temperature display periodically.
        self.__master.after(self.__delay, self.__sample)

# --- End of SimpleGUI Class ---

# Create the Tkinter root widget.
tkroot = Tkinter.Tk()

# Create an instance of our SimpleGUI class.
simplegui = SimpleGUI(tkroot,
                      # The window title
                      'In New York ...',
                      # The URL of the page to retrieve.
                      'http://weather.noaa.gov/weather/current/KNYC.html',
                      # The regular expression pattern to apply.
                      r'(-?\d+(?:\.\d+)?) F',
                      # How many times to sample per hour.
                      1)

# Start the Tkinter mainloop. This will allow our application
# to respond to events, such as the 'EXIT' button, and to make
# periodic calls to the sample routine we have defined.
tkroot.mainloop()
In this GUI-based version of a simple web client, the network I/O is performed in the sampling routine. The Tkinter framework's event loop will call this routine after the specified delay, or perhaps later, if the system is heavily loaded with other events. The routine then reschedules itself. Here is a screenshot of the GUI-based simple client.
Figure 1. Screenshot of the GUI-based simple client
With a GUI, it is also possible to show the results in formats other than plain text. For example, temperature readings can be used to animate a picture of a thermometer.
When accessing network resources, it is important to tread lightly. This is why the GUI client example only samples once per hour, which is quite sufficient for checking the weather. On a fast network connection, even a simple client like this is capable of generating multiple requests per second. If you then give the program to others, and they pass it on, the operator of the server you are accessing may soon experience a lot of unnecessary extra traffic.
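The arithmetic that turns a per-hour sampling rate into a delay is worth seeing on its own. The sketch below repeats the conversion used in the GUI client's constructor; the function name delay_ms is introduced here for illustration only.

```python
def delay_ms(samples_per_hour):
    """Convert a per-hour sampling rate into a delay in milliseconds."""
    # 60 minutes * 60 seconds * 1000 milliseconds, divided by the rate.
    return int(60 * 60 * 1000.0 / samples_per_hour)

print(delay_ms(1))  # 3600000 (one request per hour)
print(delay_ms(4))  # 900000  (one request every fifteen minutes)
```

Expressing the rate per hour, rather than per second, makes a polite default the natural choice when configuring the client.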
The Internet, along with the multitude of publicly accessible servers on it, is a shared resource. By writing well-behaved programs that take only what they need, you are using this resource carefully and wisely. Thus, we preserve the free and open Internet for others as well as ourselves.
If you are creating networked applications in Python, there is a powerful, open source tool that can help you. The Twisted framework had its big debut during the 2003 Python Conference in Washington, DC (PyCon DC 2003). Twisted takes care of many networking details, thus reducing the amount of development work required, particularly for complex systems. Here is another implementation of our simple client, this time using Twisted.
Example 7. A simple client with a twist
# Import the Twisted network event monitoring loop.
from twisted.internet import reactor

# Import the Twisted web client function for retrieving
# a page using a URL.
from twisted.web.client import getPage

import re  # Library for finding patterns in text.

# Twisted will call this function to process the
# retrieved web page.
def process_result(webpage):

    # Pattern which matches text like '66.9 F'. The last
    # argument ('re.S') is a flag, which effectively causes
    # newlines to be treated as ordinary characters.
    match = re.search(r'(-?\d+(?:\.\d+)?) F', webpage, re.S)

    # Print out the matched text and a descriptive message.
    print 'In New York, it is now', match.group(1), 'degrees.'

    reactor.stop()  # Stop the Twisted event loop.

# Twisted will call this function to indicate an
# error.
def process_error(error):
    print error
    reactor.stop()  # Stop the Twisted event loop.

# The NOAA web page, showing current conditions in New York.
url = 'http://weather.noaa.gov/weather/current/KNYC.html'

# Tell Twisted to get the above page; also register our
# processing functions, defined previously.
getPage(url).addCallbacks(callback=process_result,
                          errback=process_error)

# Run the Twisted event loop.
reactor.run()
In particular, note the error handling function, which is passed to the Twisted framework. Such "hooks" for error handling are very important, if the system is to be extensible. Dealing with errors is the topic of the next section. Make sure to read it before writing your own applications.
One very notable characteristic of the Twisted example is evident in the last line of the code. Twisted's event loop resembles that of the Tkinter example shown previously. In Tkinter, the event loop drives the GUI, responding to inputs from the user. Twisted processes network events in a similar way; like a waiter in a restaurant, the framework can keep track of several "customers" (connections), responding when each one asks for service (i.e., issues an event). This is called asynchronous I/O, and will be covered in more detail in the next article.
The admonition to handle error cases in your code is heard so often, you may have learned to ignore it. In networked applications, however, error handling is far more critical than what you may be used to. Network-related errors are frequent, even commonplace. In such an environment, a "quick and dirty" utility will not survive at all.
Fortunately, there are simple things that can be done to deal with errors in network applications. We will discuss them in this section. In addition, we will modify our original simple client to handle common error conditions.
Security concerns itself with a very specific kind of error: a deliberate assault on your system by a remote attacker. This is a monumental subject that can easily become the focus of your entire life. Nevertheless, any network application ignores security only at great peril. The following list provides several guidelines that should keep you reasonably safe in simple situations.
1. Run your program with the lowest possible privileges that still allow it to work. Even a client performs actions on your system in response to untrusted information received over the network. The less power it has, the less damage that attackers (or buggy code) can do.
2. Control the sizes of things, even when using a language such as Python. While Python should protect you against the buffer overflow errors so common in C and C++, unexpectedly large input can still cause unpredictable behavior as system resources are used up. For example, it is not reasonable for a person's last name to be ten megabytes long; you should catch such conditions early, truncate the input, or even refuse to continue processing altogether.
3. Make sure things are in the right format. A country name should not contain question marks, for example. Regular expressions are very useful for this sort of input validation. Thus, it is generally important to construct them carefully, so that only correctly formatted input will match.
4. Choosing files to manipulate based on a network request should be avoided. After all, a filename is basically a pointer: it refers to some place in the filesystem, similar to the way a pointer in the C language refers to some place in memory. Reading a file based on a network request can expose private information, while writing a file can overwrite critical data and compromise the system. In contrast, if you locally specify the name of a configuration file to read, or a log file to write to, then using these locally specified files is usually safe.
5. Executing another program with arguments received over the network can lead to very serious security breaches. Consider eliminating such requirements by design, as you would a goto statement in some languages. If you still choose to go ahead, always analyze the input very carefully, in order to avoid feeding malformed data to the other program. Under Unix-like operating systems, for example, these kinds of calls are typically made via the intervening shell program. This is especially dangerous: the shell supports a quite-capable programming language, meaning that insufficient input validation will actually allow an attacker to write and (possibly) execute arbitrary code on your system.
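The guidelines on size and format can be combined in a small validation routine. The following is only a sketch under stated assumptions: the names MAX_NAME_LEN, NAME_PATTERN, and is_valid_country are hypothetical, the 100-character limit is arbitrary, and the pattern accepts only ASCII letters, spaces, periods, apostrophes, and hyphens (a real application would probably need to allow accented characters as well).

```python
import re

# Hypothetical limit: refuse anything unreasonably long.
MAX_NAME_LEN = 100

# Assumed format: starts with a letter, then letters, spaces,
# periods, apostrophes, or hyphens, and nothing else up to the end.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z .'-]*\Z")

def is_valid_country(name):
    """Return True only for short, correctly formatted names."""
    # Check the size first, before doing any further work.
    if len(name) > MAX_NAME_LEN:
        return False
    # Then check the format; match() anchors at the start,
    # and \Z in the pattern anchors at the end.
    return NAME_PATTERN.match(name) is not None

print(is_valid_country('United States'))  # True
print(is_valid_country('Where?'))         # False
print(is_valid_country('A' * 1000))       # False
```

Checking the length before applying the pattern keeps oversized input away from the more expensive matching step.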
The following example shows our simple client, modified to perform several important error checks.
Example 8. A simple client with error checking
import urllib  # Library for retrieving files using a URL.
import re      # Library for finding patterns in text.
import sys     # Library for system-specific functionality.

# The NOAA web page, showing current conditions in New York.
url = 'http://weather.noaa.gov/weather/current/KNYC.html'

# The maximum amount of data we are prepared to read.
MAX_PAGE_LEN = 20000

# Open and read the web page; catch any I/O errors.
try:
    webpage = urllib.urlopen(url).read(MAX_PAGE_LEN)
except IOError, e:
    # An I/O error occurred; print the error message and exit.
    print 'I/O Error: ', e.strerror
    sys.exit()

# Pattern which matches text like '66.9 F'. The last
# argument ('re.S') is a flag, which effectively causes
# newlines to be treated as ordinary characters.
match = re.search(r'(-?\d+(?:\.\d+)?) F', webpage, re.S)

# Print out the matched text and a descriptive message;
# if there is no match, print an error message.
if match is None:
    print 'No temperature reading at URL:', url
else:
    print 'In New York, it is now', match.group(1), 'degrees.'
First, we limit the size of the web page that we are prepared to read (the MAX_PAGE_LEN variable). This is a security precaution, as described in item 2 on the list in the previous subsection.
Next, we make sure to catch any I/O errors from the network operation. The urllib module raises an exception in such situations. In comparison to a local hard drive, for example, network I/O operations fail much more frequently. While the lower levels of networking software on your machine will (in cooperation with remote systems) try to effect a recovery, this is not always possible. You must therefore be prepared to deal with unrecoverable errors yourself. In this case, we simply print an error message and exit.
Finally, we add an explicit check of whether the regular expression pattern has matched. If there is no match, the temperature reading is not available, and an error message is printed instead. A pattern match failure is also a clear indication that something unexpected has been received — it is therefore important that your code deals with these faults explicitly.
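One way to make these checks easy to exercise is to separate the parsing step from the network step. The sketch below is an illustration rather than part of the article's client: the function name extract_temperature and the sample strings are invented here, while the size limit and the pattern are the ones used in Example 8.

```python
import re

# Same limit and pattern as in the error-checking client.
MAX_PAGE_LEN = 20000
PATTERN = re.compile(r'(-?\d+(?:\.\d+)?) F', re.S)

def extract_temperature(page):
    """Return the temperature text, or None if no reading is found."""
    # Look only at the first MAX_PAGE_LEN characters of the input.
    match = PATTERN.search(page[:MAX_PAGE_LEN])
    if match is None:
        return None
    return match.group(1)

print(extract_temperature('It is now 52.0 F in the park'))      # 52.0
print(extract_temperature('<html>Service unavailable</html>'))  # None
```

Because the function takes a plain string, unexpected server responses can be tested with canned text, without sending any requests.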
One of the most challenging — but fascinating — aspects of network I/O is its unpredictability. As mentioned in the previous subsection, such operations are not reliable; it is not unexpected for a network request to simply fail. Sometimes, however, it does not fail in a clean, readily apparent fashion. Instead, data transmission might start only after a lengthy delay (high latency), proceed very slowly (low bandwidth), or both.
Many factors, in myriad combinations, can cause such unpredictable behavior. On the Internet, data routinely travels over very long distances — even across continents — as it hops from one system to another towards its final destination. Anywhere along the route, hardware failures, software crashes, excessive network traffic, misconfigured systems, electromagnetic interference, and many other causes can disrupt the orderly flow of data.
Unpredictability of network operations becomes a central concern for servers. It is rarely acceptable to make all clients wait because one particular connection is having trouble. If you write a more complex client, such as a spider that gathers information from multiple web sites, you will also run into this problem.
Waiting for each query to complete fully before starting the next is very time-consuming. In addition, failing to complete a crawl of a thousand sites just because the connection to number five on your list is "hanging" is unlikely to be the desired behavior. Fortunately, a modern computer is physically capable of handling hundreds or thousands of network operations in parallel. In consequence, many useful strategies for concurrent network I/O have been developed, researched, and deployed in actual systems. This will be the topic of the next article.
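As a rough preview of the idea, the sketch below runs several simulated requests concurrently and gives each one a time limit, so that a single hanging connection cannot hold up the rest. It makes several assumptions: it uses the asyncio module from the modern Python 3 standard library (which postdates the Python releases discussed in this article, and is not the Twisted API), the network is replaced by a sleep, and the names fake_fetch, fetch_with_timeout, and crawl are invented for illustration.

```python
import asyncio

async def fake_fetch(name, delay):
    """Stand-in for a network request; 'delay' simulates latency."""
    await asyncio.sleep(delay)
    return name + ': ok'

async def fetch_with_timeout(name, delay, timeout):
    """Give up on a single request once 'timeout' seconds have passed."""
    try:
        return await asyncio.wait_for(fake_fetch(name, delay), timeout)
    except asyncio.TimeoutError:
        return name + ': timed out'

async def crawl(sites, timeout):
    """Start all requests at once and collect results in input order."""
    tasks = [fetch_with_timeout(name, delay, timeout)
             for name, delay in sites]
    return await asyncio.gather(*tasks)

# Two quick sites and one that "hangs" for five seconds.
sites = [('site1', 0.01), ('site2', 0.02), ('site5', 5.0)]
results = asyncio.run(crawl(sites, timeout=0.5))
print(results)  # ['site1: ok', 'site2: ok', 'site5: timed out']
```

The whole run takes about as long as the timeout, not the sum of the individual delays, because the waiting periods overlap.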
This appendix provides notes on how to install Python and the Twisted framework. In most cases, it is quite easy to do so — even if you have little prior experience. For additional information, see the Python and Twisted home pages.
Most Linux distributions include Python. To find out, try the command
python at the shell prompt. If Python is
installed, the interpreter will show its version, and start in interactive mode
(hold down Ctrl and press D to exit).
If your Python is very old (e.g., version 1.5.2), you probably want to upgrade it. For example, the latest stable version of Twisted is not available for Python variants earlier than 2.2.X. When upgrading Python, do not delete your old version. Different Python variants usually coexist well, and your system might be relying on the older version for some of its functions.
Before upgrading Python, check that you do not already have a later version
installed alongside the earlier one (i.e., as additional commands like
python2.2). If so, you can use the additional
command to explicitly invoke the more recent interpreter.
At the time of this writing, the most recent stable Python release is 2.3.2.
Red Hat Linux users can download the RPMs. Do not forget the Tkinter RPMs if you wish to run the GUI example. Tkinter also requires Tk. The version on the Red Hat install disks should work fine, or search for tk at the bottom of the Red Hat download page.
To install the RPMs, copy them to a separate subdirectory, use the su command at the shell prompt to become the superuser (you will be asked the root password for the system), then run rpm -i subdirname/*.rpm.
Debian GNU/Linux 3.0 ("woody") users can get a slightly older, 2.2.X series Python in just two steps. Become the superuser (enter su at the shell prompt, then enter the root password) and issue the command apt-get install python2.2-tk. In addition, you will need the python2.2-dev package (apt-get install python2.2-dev) if you want to install Twisted later.
On Linux, Python can also be built from source, usually without too much work. See Python on Other Platforms for details.
Python is very easy to set up for Windows users. From the Python download page, click and run the executable installer. The link is to version 2.3.2 of Python, the latest at the time of this writing. Everything you need to use Tkinter is also included in the installation.
Python for Windows provides two variants of the interpreter: python and pythonw. The latter can be used to execute programs that do not require a console, such as the GUI example given earlier.
For Python on the Macintosh, see the MacPython download page.
In general, on Unix and compatible systems (such as Linux), Python can be built using the source code. Here is a brief synopsis of the procedure, quoted from the Python version 2.3.2 download page.
All others should download either Python-2.3.2.tgz or Python-2.3.2.tar.bz2, the source archive. The tar.bz2 is considerably smaller, so get that one if your system has the appropriate tools to deal with it. Unpack it with tar -zxvf Python-2.3.2.tgz (or bzcat Python-2.3.2.tar.bz2 | tar -xf -). Change to the Python-2.3.2 directory and run the ./configure, make, make install commands to compile and install Python.
If you're having trouble building on your system, check the top-level README file for platform-specific tips, or check the Build Bugs section on the Bugs web page.
Note that the last command (make install) needs to be run as the superuser.
At the time of this writing, the latest stable version of Twisted is 1.1.0.
From the download page, get the source archive, for example in the tar.bz2 format (see Python on Other Platforms for a discussion of the difference between the two formats). You may need to scroll down on the page until you find the production version (downloads for the latest alpha release are sometimes listed first). Choose the archive with documentation included, unless disk space is really limited or your Internet connection is excessively slow.
After uncompressing the archive, change to that directory (using a shell), become superuser (root), and issue the command python setup.py install. This will make Twisted available system-wide.
The Twisted download page includes Windows installers. If the alpha version is listed first, scroll down on the page until you get to the stable release (1.1.0 as of this writing). Choose the installer with documentation, unless you are low on disk space or have a slow Internet connection. Make sure that you use the installer that matches the Python version that you installed earlier. Run the installer, and follow the prompts as usual when adding a new application under Windows.
This appendix provides additional notes on the resources referred to in the article. First, some of the classic papers ("The Design Philosophy of the DARPA Internet Protocols" and "End-to-End Arguments in System Design") on the design of the Internet are enlightening and fascinating to read. An understanding of the underlying architecture will help you in developing your own programs.
Regarding the use of regular expressions to analyze text, there are many online resources. Even the discussions that are not specific to Python can be a very useful guide. Two Python-centric resources on regular expressions are the "Regular Expression HOWTO" and the documentation of the re module itself. For an overview of recent developments in regular expressions, as well as how Python's implementation fits into the overall taxonomy, see "What's New with Regular Expressions." The article is written by the author of Mastering Regular Expressions, a book for those who really want to become experts in this area.
Python programming is also well covered in numerous books, articles and online resources. The Python web site, for example, has an excellent documentation section. In particular, the tutorial is a quick but gentle introduction to the language. Python comes with a large, diverse library; the documentation is accessible via the library reference and module index pages. "An Introduction to Tkinter" provides a very helpful tutorial on this Python GUI library.
Papers from PyCon DC 2003 are available online, several on Twisted. The Twisted framework's web site also provides extensive documentation. In particular, the "howto" and example pages offer plenty of easy-to-follow information on the many diverse facilities of this system.
Download examples and other files related to this article: python_nio.zip or python_nio.tar.gz.
George Belotsky is a software architect who has done extensive work on high-performance internet servers, as well as hard real-time and embedded systems.
Copyright © 2009 O'Reilly Media, Inc.