Understanding Network I/O: From Spectator to Participant

by George Belotsky
11/06/2003

This series of two articles will show you how to participate in the global Internet. The architects of this greatest advance in communications since the printing press have striven to make such participation possible. The Internet is, by design, a simple network, engineered only to route data from one location to another. All of the valuable services (such as the World Wide Web and email) are implemented at the endpoints, not inside the Internet itself (see Further Reading for more information).

Because everybody is allowed to add value at the endpoints of the Internet, you can do it too! Today, this is easier than ever before. You do not need to be proficient in a difficult programming language, such as C or C++. For example, you will see that a useful web client can be about as complicated as a shopping list.

This article describes the basics of networked applications, providing information and sample code to get you started immediately. The second article will delve deeper into various techniques for network I/O, including important, practical results that help you choose the best method.

This article focuses on Internet clients. Clients — like your web browser — request information from servers (like the one from which you accessed this page). Typically, the client then presents the information to a person, although there are clients that talk to other computer programs instead. The next article will present ideas that are also applicable to developing servers and peer-to-peer systems.

Example Files

Download examples and other files related to this article:
python_nio.zip or python_nio.tar.gz

With just a basic Internet connection, you can create your own clients (for personal or internal company use) in a leisurely afternoon. The following discussion presents the core techniques, illustrated with complete, working web clients. It is very important that you do not skip the last section of this article. The discussion there covers several simple things that you can do to avoid mystifying errors and security breaches in your network application.

A Simple Client

Here is a simple web client. It displays the current outdoor temperature in New York City and then exits.

Example 1. A simple client

import urllib  # Library for retrieving files using a URL.
import re      # Library for finding patterns in text.

# The NOAA web page, showing current conditions in New York.
url = 'http://weather.noaa.gov/weather/current/KNYC.html'

# Open and read the web page.
webpage = urllib.urlopen(url).read()

# Pattern which matches text like '66.9 F'.  The last
# argument ('re.S') is a flag that lets '.' match newlines
# too; this pattern contains no '.', so it has no effect here.
match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)

# Print out the matched text and a descriptive message.
print 'In New York, it is now',match.group(1),'degrees.'

Here is the output produced by the client.

Example 2. Output from the simple client

In New York, it is now 52.0 degrees.

The client is written in Python, probably the easiest full-featured mainstream programming language available today. Python is used by such organizations as Google, Yahoo, NASA, and Lawrence Livermore National Laboratory. The language is open source (so you can use it freely) and cross-platform. Note that any text in the example that starts with a # is a descriptive comment for human readers of the code; Python ignores these comments.

If you are really unfamiliar with Python, the Further Reading appendix provides several helpful links. C++, Perl, Java, and Visual Basic programmers should find Python very easy. It is probably the ideal choice for beginners, as well.

To run the examples in this article, it is sufficient to save the code into an ordinary text file and then issue the command python myfilename from the shell (the DOS command prompt, for Windows users). See the Installation Notes for easy directions on installing Python.

Now we are ready for a more detailed discussion of the example. The first two lines import libraries to use in our program. The Python distribution includes a library to download files based on a URL (e.g., from a web server). There is no need to do any low-level socket programming — the library does this work for you.

The program opens and reads a web page from the National Oceanic and Atmospheric Administration (NOAA) web server. This server hosts pages with current weather information from all over the world. The specific page retrieved by our example shows the conditions in New York City. The information is saved as one long string in the webpage variable.

Next, a regular expression is applied to the webpage string. This is a quick, effective way to extract information from text without implementing complex parsing logic yourself. Applications range from input validation for security (e.g., in CGI scripts) to bioinformatics (e.g., searching for DNA sequences in a genome).

While regular expressions can be a difficult topic, you do not need to become an absolute expert to make good use of this technology in many situations. The Further Reading appendix lists several resources for learning about regular expressions.

The final line in the example prints out the temperature reading, captured as part of the regular expression match. The following discussion gives the details of the matching process.
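The steps just described can be collected into a short sketch. It is written for Python 3 (where urlopen lives in urllib.request and print is a function) rather than the Python 2 of Example 1. The helper names fetch_page and extract_temperature, the latin-1 decoding, and the sample text are illustrative assumptions, not part of the original client. Unlike Example 1, the sketch checks whether the pattern matched before using the result.

```python
import re
from urllib.request import urlopen  # Python 3 location of urlopen

# Same pattern as in Example 1.
TEMPERATURE = re.compile(r'(-?\d+(?:\.\d+)?) F')

def fetch_page(url):
    """Download a page and return it as text (hypothetical helper)."""
    with urlopen(url) as response:
        # Assumption: latin-1 is good enough here, since only ASCII
        # digits and letters matter to the pattern.
        return response.read().decode('latin-1')

def extract_temperature(text):
    """Return the first Fahrenheit reading in text, or None if absent."""
    match = TEMPERATURE.search(text)
    if match is None:
        return None
    return match.group(1)

# Made-up sample text standing in for the downloaded page.
sample = 'Temperature: 52.0 F (11.1 C)'
reading = extract_temperature(sample)
if reading is None:
    print('No temperature reading found.')
else:
    print('In New York, it is now', reading, 'degrees.')
```

To use a live page instead of the sample, pass the result of fetch_page(url) to extract_temperature.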

Capturing a Temperature Reading with a Regular Expression

Here is, once again, the regular expression pattern from the simple client example.

Example 3. Temperature-reading pattern

r'(-?\d+(?:\.\d+)?) F'

The leading r specifies a raw Python string. This prevents the interpreter from processing backslash escape sequences. In a raw string, what you see is what you get: r'\n' is two characters (a backslash followed by the letter n), whereas the ordinary string '\n' is interpreted as a single linefeed character. If you are interested in the subtle details (not usually necessary in day-to-day programming), read the string literals subsection of the Python Reference Manual.
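The difference between raw and ordinary strings is easy to check for yourself. The following small sketch (Python 3 syntax; the variable names are only illustrative) compares the two forms.

```python
raw = r'\n'     # raw string: a backslash followed by the letter n
cooked = '\n'   # ordinary string: a single linefeed character

print(len(raw))       # 2
print(len(cooked))    # 1
print(raw == '\\n')   # True: the same two characters, written without r
```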

The first character in the pattern string is the open bracket, (. The bracket starts a grouping; text that matches the pattern inside of the brackets will be saved, so you can retrieve it for later use. This is the first grouping in the pattern; thus, the last line of the example retrieves the text captured by the grouping with match.group(1).
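As a small illustration (Python 3 syntax, with a made-up sample sentence), the sketch below shows the difference between the whole match and the text saved by the first grouping.

```python
import re

# Made-up sample text; the pattern is the one from Example 3.
match = re.search(r'(-?\d+(?:\.\d+)?) F', 'It is 66.9 F outside')

whole = match.group(0)   # everything the pattern matched
saved = match.group(1)   # only the text inside the first grouping

print(repr(whole))   # '66.9 F'
print(repr(saved))   # '66.9'
```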

The temperature reading might be negative, so the minus sign is next. The question mark is a special character indicating that the preceding character is optional (in this case, the temperature reading may or may not contain a minus sign). If you need to match an actual question mark in the text, escape it with a backslash like this: \? (similarly, the backslash itself is matched by \\).
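Here is a brief sketch of these rules in Python 3 syntax. The sample strings and variable names are made up for illustration.

```python
import re

# '-?' makes the minus sign optional.
optional_minus = re.compile(r'-?5')
negative = optional_minus.search('-5').group(0)
positive = optional_minus.search('5').group(0)
print(negative, positive)   # -5 5

# '\?' matches a literal question mark.
question = re.search(r'why\?', 'why? because').group(0)
print(question)             # why?

# '\\' in the pattern matches a single literal backslash.
backslash = re.search(r'\\', 'C:\\temp').group(0)
print(len(backslash))       # 1
```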

Next, we expect to find some digits. A single digit is matched by a \d. Note that d is a normal character, which usually matches itself (i.e., the literal letter d in the text). Here, the backslash is used to turn on a special meaning for d — that of matching a single digit. This is a general technique in regular expressions; some ordinary characters (like d) have a special meaning when prefixed with a backslash, while special characters (such as ?) become ordinary characters (which match themselves) when the backslash is prepended.

The \d is followed by the special character +, which specifies that the preceding item (here, \d) must be matched at least once, and possibly more times. Thus, \d+ matches any number of digits, but there must be at least one digit, or the match will fail. If you want to allow zero or more digits, use * instead of +, like this: \d*.
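The difference between + and * shows up when there are no digits at all. In this sketch (Python 3 syntax), the leading x in the patterns and the sample strings are arbitrary choices made for illustration.

```python
import re

# '+' requires at least one digit after the x.
plus_with_digits = re.search(r'x\d+', 'x42')
plus_without_digits = re.search(r'x\d+', 'x')

# '*' also accepts zero digits.
star_without_digits = re.search(r'x\d*', 'x')

print(plus_with_digits.group(0))      # x42
print(plus_without_digits)            # None
print(star_without_digits.group(0))   # x
```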

After the first string of digits, the temperature might contain a decimal point, with more digits following. This whole sequence, however, is optional, so a little more work is required to specify the pattern. We begin another grouping, nested inside of our original one. This time, however, the grouping is slightly different: (?: instead of just an opening bracket. The ?: sequence just inside of the bracket indicates that the grouping should not be saved — only matched. We are already saving the entire match in our first grouping, and having a copy of just the fractional part of the reading is not needed for this example.
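To see what "not saved" means in practice, the following sketch (Python 3 syntax, made-up sample text) compares an ordinary nested grouping with the (?: form. The variant with the ordinary inner grouping is shown only for contrast; it is not the pattern used by the client.

```python
import re

text = '66.9 F'

# Ordinary inner grouping: the fractional part is saved as group 2.
saving = re.search(r'(-?\d+(\.\d+)?) F', text)

# Non-saving inner grouping, as in Example 3: only group 1 exists.
non_saving = re.search(r'(-?\d+(?:\.\d+)?) F', text)

print(saving.groups())       # ('66.9', '.9')
print(non_saving.groups())   # ('66.9',)
```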

Inside of the new grouping, we have \.\d+. The \d+ is already familiar: it matches one or more digits. The \. matches the decimal point. The backslash escape is required because . by itself matches almost any single character. This is why, for example, .* is often used to "match anything."
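The sketch below (Python 3 syntax, made-up sample strings) shows the difference between an escaped and an unescaped dot. It also shows the one character that a plain dot does not match, the newline, and how the re.S flag used in Example 1 changes that.

```python
import re

# An unescaped dot matches any character, so '6x9' is accepted.
unescaped = re.search(r'6.9', '6x9') is not None
# An escaped dot matches only a literal decimal point.
escaped_wrong = re.search(r'6\.9', '6x9') is not None
escaped_right = re.search(r'6\.9', '6.9') is not None
print(unescaped, escaped_wrong, escaped_right)   # True False True

# By default the dot does not match a newline; with re.S it does.
text = 'one\ntwo'
dot_default = re.search(r'one.two', text) is not None
dot_with_flag = re.search(r'one.two', text, re.S) is not None
print(dot_default, dot_with_flag)                # False True
```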

After the closing bracket of the nested grouping, there is a single question mark. This question mark applies to the entire nested grouping. Thus, we have an optional fractional part — the decimal point and at least one digit — to the temperature reading. There are easier ways to specify this fractional part, but they may allow malformed constructs (such as 65. with no digits after the decimal point) to slip through.
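As an illustration, here is one such easier pattern next to the one from Example 3 (Python 3 syntax). The looser pattern is a hypothetical alternative chosen for this sketch, not something taken from the client.

```python
import re

strict = re.compile(r'(-?\d+(?:\.\d+)?) F')   # pattern from Example 3
loose = re.compile(r'(-?\d+\.?\d*) F')        # hypothetical simpler pattern

malformed = '65. F'   # no digits after the decimal point

strict_result = strict.search(malformed)
loose_result = loose.search(malformed)

print(strict_result)           # None: the malformed reading is rejected
print(loose_result.group(1))   # 65.

# Both patterns agree on a well-formed reading.
print(strict.search('65.5 F').group(1))   # 65.5
print(loose.search('65.5 F').group(1))    # 65.5
```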

Next comes the closing bracket of our top-level grouping. After that, the last part of the pattern will match an ordinary space followed by the F character (which stands for "Fahrenheit"). While only the part inside of the outer brackets (the temperature reading itself) will be saved, the single space and the letter F must follow in the text in order for the match to succeed. This is a simple case of including extra information in a pattern, in order to identify just the right data. In the current example, here is the result if the "space F" sequence is left out of the regular expression.

Example 4. Matching the wrong thing

In New York, it is now 3 degrees.

When it is winter in New York, you might think that this reading is correct! In fact, it comes from the W3C substring, which occurs at the start of the returned HTML page.

Example 5. Start of an HTML document

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "hmpro6.dtd">

In the given example, just adding the "space F" yields the right result. Depending on the complexity and variability of the data being processed, you may need more identifying information in your pattern, in order to select the correct text. Of course, if you do not control the generation of the data you are processing, it is always possible that its format will be changed without warning. You should always consider this possibility, as well as the potential impact on your application.
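The effect of adding identifying information can be sketched as follows (Python 3 syntax). The page text begins with the lines from Example 5, but the "Dew Point" and "Temperature" lines are made up for illustration, as is the label used in the third pattern; the real page may be laid out differently.

```python
import re

# Start of the page as in Example 5, followed by made-up body text.
page = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
        '    "hmpro6.dtd">\n'
        'Dew Point: 40.1 F\n'
        'Temperature: 52.0 F\n')

number = r'(-?\d+(?:\.\d+)?)'

bare = re.search(number, page)                 # no "space F"
with_unit = re.search(number + r' F', page)    # pattern from Example 3
labeled = re.search(r'Temperature:\s*' + number + r' F', page)

print(bare.group(1))        # 3 (from the W3C substring)
print(with_unit.group(1))   # 40.1 (first Fahrenheit value in this sample)
print(labeled.group(1))     # 52.0
```

In this made-up sample the "space F" alone is not enough, because another Fahrenheit value appears first; the label narrows the match to the intended reading.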

Another interesting variation is writing a web client for a site that you control. For example, the client may interact with the CGI scripts on the web site. In this case, you would have effectively defined a network protocol of your own, layered on top of HTTP.

In the next section, we provide a Graphical User Interface (GUI) for our simple client. This is quite easy to do in Python.
