Web client programming is a powerful technique for querying the Web. A web client is any program that retrieves data from a web server using the Hypertext Transfer Protocol (the http in your URLs). A web browser is a client; so are web crawlers, programs that traverse the Web automatically to gather information. You can also use web clients to take advantage of services offered by others on the Web and to add dynamic features to your own web site.
Web client programming belongs in any developer's toolbox. Perl aficionados have employed it for years. In Python, the process reaches even higher levels of convenience and flexibility. Three modules provide most of the functionality you will need: HTTPLIB, URLLIB, and a newer addition, XMLRPCLIB. In true Pythonesque fashion, each module builds upon its predecessor, providing a solid, well-designed base for your applications. We will cover the first two modules in this article, saving XMLRPCLIB for a later time.
For our examples, we will use Meerkat. If you are like me, you invest time tracking trends and developments in the open source community to give you a competitive edge. Meerkat is a tool that makes that task much easier. It is an open wire service that collects and collates an enormous amount of information on open source computing. Although its browser interface is flexible and customizable, using web client programming we can scan, extract, and even store this information off-line for later use. We will first access Meerkat using HTTPLIB interactively, and then move on to accessing Meerkat's Open API via URLLIB to create a customizable information-collecting tool.
HTTPLIB is a lightweight wrapper around the socket module. Of the three libraries I have mentioned, HTTPLIB provides the most control when accessing a web site. That control, however, comes at the cost of more work to accomplish your task. HTTP is "stateless," so the server doesn't remember anything about your previous requests. You must construct a new HTTPLIB object to connect to the web site for each request. The requests form a conversation with the web server, mimicking a web browser. Let's connect to Meerkat interactively using Rael Dornfest's Open API and see what results we get. The conversation begins by building up a series of statements that first state what action you want to take, and then identify you to the web server:
>>> import httplib
>>> host = 'www.oreillynet.com'
>>> h = httplib.HTTP(host)
>>> h.putrequest('GET', '/meerkat/?_fl=minimal')
>>> h.putheader('Host', host)
>>> h.putheader('User-agent', 'python-httplib')
>>> h.endheaders()
>>>
The GET request tells the server which page you want to receive. The Host
header tells it the domain name you are querying. Modern servers using
HTTP 1.1 can host several domains at the same address. If you don't tell
the server which domain name you want, you will get a '302' redirection
response as your return code. The User-agent header tells the server what
kind of client you are so it knows what it can and cannot send you. This is
all the information the web server needs to process your request. Next you
ask for the server's reply:
>>> returncode, returnmsg, headers = h.getreply()
>>> if returncode == 200:  # OK
...     f = h.getfile()
...     print f.read()
...
This will print out the current Meerkat page in the minimal flavor. The
response header and content are returned separately, which aids in both
troubleshooting and parsing any returned data. If you want to see the
response headers, print the headers object returned by getreply().
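For readers on Python 3, where the httplib module is named http.client, the same conversation can be sketched roughly as follows. This is a minimal sketch rather than the article's own listing: the function name fetch, its port and user_agent parameters, and the ten-second timeout are assumptions made for illustration.

```python
import http.client


def fetch(host, path, port=80, user_agent="python-http.client"):
    """Send one GET request and return (status, reason, headers, body)."""
    # HTTP is stateless, so each call opens its own connection.
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        # putrequest() adds the Host header itself, so it is not repeated here.
        conn.putrequest("GET", path)
        conn.putheader("User-Agent", user_agent)
        conn.endheaders()

        # The status line, the headers, and the body are all available
        # separately on the response object.
        response = conn.getresponse()
        body = response.read()
        return response.status, response.reason, response.getheaders(), body
    finally:
        conn.close()
```

Calling fetch('www.oreillynet.com', '/meerkat/?_fl=minimal') would repeat the request shown above, assuming that address still answers; the status value plays the role of returncode and the headers list the role of headers.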
HTTPLIB hides the mechanics of socket programming, and its use of a file
object for buffering lets you use a familiar approach to manipulating the
data. It is, however, best suited as a building block for more powerful web
client applications, or for interactive conversations with a troubled web site.
To aid in both areas, HTTPLIB has a useful debug capability. You access it by
calling the method h.set_debuglevel(1) at any point after object
initialization (the line h = httplib.HTTP(host) in our example). With the
debug level set to 1, the module will echo requests and the results of any
calls to getreply() to the screen.
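The same debug switch exists in Python 3's http.client, the renamed httplib. The following is a small sketch under that assumption; the helper name debug_connection and the timeout value are made up for illustration.

```python
import http.client


def debug_connection(host, level=1):
    """Return an HTTPConnection that echoes its traffic to standard output."""
    # Creating the object does not open a socket; that happens on the
    # first request.
    conn = http.client.HTTPConnection(host, timeout=10)
    # With a level above 0, the request sent and the reply and header
    # lines received are printed as the conversation proceeds.
    conn.set_debuglevel(level)
    return conn
```

A connection built with debug_connection('www.oreillynet.com') can then be used with putrequest(), putheader(), and endheaders() exactly as an ordinary one.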
The interactive nature of Python makes analyzing web sites using HTTPLIB a joy. Familiarize yourself with this module and you will have a powerful, flexible tool for diagnosing web site problems. Take time to look at the source for HTTPLIB as well. With fewer than 200 lines of code, HTTPLIB is a quick and easy introduction to socket programming using Python.
URLLIB provides a sophisticated interface to the functionality found in HTTPLIB. It is best used for getting at the data itself, rather than analyzing the web site. Here's the same interaction as above, using URLLIB:
>>> import urllib
>>> u = urllib.urlopen('http://www.oreillynet.com/meerkat/?_fl=minimal')
That's all there is to it! With one line you've accessed Meerkat, obtained the data, and placed it in a temporary cache. To access the header information:
>>> print u.headers
And to view the entire file:
>>> print u.read()
But that's not all. In addition to HTTP, URLLIB can also access FTP, Gopher, and even local files in the same manner. The module also contains many utility functions, including those for parsing URLs, encoding strings into a URL-safe format, and providing progress indication during lengthy data transfers.
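A few of these utilities can be tried without touching the network. The sketch below uses the Python 3 layout, where urllib's functions are split between urllib.request and urllib.parse; the helper names describe_url and read_local_file are assumptions for illustration, not part of the library.

```python
from pathlib import Path
from urllib.parse import quote, urlencode, urlsplit
from urllib.request import urlopen


def describe_url(url):
    """Split a URL into its host, path, and query string."""
    parts = urlsplit(url)
    return parts.netloc, parts.path, parts.query


def read_local_file(path):
    """Read a local file through urlopen(), the same call used for HTTP."""
    file_url = Path(path).resolve().as_uri()  # e.g. file:///tmp/notes.txt
    with urlopen(file_url) as u:
        return u.read()


# quote() makes a single string safe to place inside a URL, while
# urlencode() builds a whole query string from key/value pairs.
safe_text = quote("open source & linux")
query = urlencode({"_fl": "minimal"})
```

Here describe_url('http://www.oreillynet.com/meerkat/?_fl=minimal') separates the host, the path, and the _fl=minimal query used earlier.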
Imagine that you have a group of clients who expect you to keep them informed by mail of the latest happenings regarding Linux. We can write a short script using URLLIB to get this information from Meerkat, build a listing of links, and store those links in a file for later transmission. The author of Meerkat, Rael Dornfest, has already done most of the work for us through the Meerkat API. All that is left is to construct the request, parse out the links, and store the results for later transmission.
Why do this rather than just have the users head to Meerkat? Providing this "passive" service allows users to view the information at their leisure, and gives them the ability to selectively store the information in a familiar format (e.g., e-mail). With the news waiting in their mailbox on Monday morning, they won't miss information that "scrolls by" over the weekend.
Since Meerkat's minimal flavor is limited to 15 stories, we will run the script every hour (e.g., as a Unix cron job or using NT's AT command) to lessen the chances of missing any data. Here is the URL we will use.
This will pull in all Linux stories (profile=5) from the last hour, presenting the data in minimal flavor, with no descriptions, no category info, no channel info, and no date info. We will also use the regular expression module to help us extract the link information and redirect our output to a file object opened in append mode.
View the complete script here.
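The overall shape of such a script can be sketched as follows, in Python 3 terms. This is not the author's script: the function names, the regular expression, and the output format are assumptions, and because the exact query URL is not reproduced in this text, the caller has to supply it. Only the _fl=minimal parameter is taken from the earlier examples.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed pattern: capture the double-quoted href value of each anchor tag.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)


def build_query_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return base_url + "?" + urlencode(params)


def extract_links(html):
    """Return every link target found in the page text, in order."""
    return LINK_RE.findall(html)


def append_links(links, out_path):
    """Store one link per line; append mode keeps earlier hourly runs."""
    with open(out_path, "a", encoding="utf-8") as out:
        for link in links:
            out.write(link + "\n")


def collect(url, out_path):
    """Fetch one page, pull out its links, and append them to out_path."""
    with urlopen(url, timeout=30) as u:
        html = u.read().decode("utf-8", errors="replace")
    links = extract_links(html)
    append_links(links, out_path)
    return links
```

An hourly job would call collect() once with the query URL and the name of the file that is later mailed out. A regular expression is a fragile way to read HTML, so this pattern suits only simple, predictable output such as the minimal flavor.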
We've only scratched the surface of these modules, and there are many other network programming modules available for Python that can be used for web client tasks. Web client programming is especially useful when processing large amounts of tabular data. Using web client programming in a recent Electronic Data Interchange project, we bypassed an unwieldy, proprietary software package. We took the updated price information we needed directly from the web and put it into our database. It saved us a lot of time and frustration.
Web client programming can also be useful for testing the structure and consistency of web sites. The most common procedure is checking for dead links. The standard Python distribution comes with a complete example of this, based upon URLLIB. Webchecker, along with a Tk-based front end, can be found under the tools subdirectory of the distribution. Another Python tool, Linbot, improves on this. It provides everything you need for web-site troubleshooting. As web sites become more and more complex, other web client applications will become necessary to ensure your web site's quality.
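The core of a dead-link check is small. The sketch below is not Webchecker or Linbot, only a minimal illustration in Python 3 terms; the function names and the default timeout are assumptions.

```python
from urllib.error import URLError
from urllib.request import urlopen


def is_alive(url, timeout=10):
    """Return True if the URL can be opened, False otherwise."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, ValueError, OSError):
        # URLError covers HTTP errors such as 404 as well as unreachable
        # hosts; ValueError covers strings that are not valid URLs.
        return False


def find_dead_links(urls):
    """Return the URLs from the given list that could not be opened."""
    return [url for url in urls if not is_alive(url)]
```

A real checker would also follow the links found on each page and avoid requesting the same URL twice.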
There is a pitfall to web client programming. Your programs are often susceptible to small changes in the way a page is formatted. How a site displays its data today may not be how it displays it tomorrow. When the format changes, so must your programs. This is one reason XML is so exciting: With data on the web tagged for meaning, format is less important. As XML standards evolve and become universally accepted, processing XML data will be even easier and more robust.
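To see why tagged data is less brittle, consider the sketch below. The sample document and its element names (stories, story, title, link) are invented for illustration and are not Meerkat's actual XML format. The parser finds each link by its tag name, so the order of the elements and the surrounding whitespace do not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: note that the two stories order their children
# differently, which would trip up a position-based pattern.
SAMPLE = """<stories>
  <story><title>Example story</title><link>http://example.com/a</link></story>
  <story><link>http://example.com/b</link><title>Another story</title></story>
</stories>"""


def story_links(xml_text):
    """Return the text of the <link> child of every <story> element."""
    root = ET.fromstring(xml_text)
    return [story.findtext("link") for story in root.iter("story")]
```

Calling story_links(SAMPLE) returns both links regardless of where each link element sits inside its story.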
There are also some limits to the tools we covered here. Although they are excellent for client-based tasks, the HTTPLIB and URLLIB modules shouldn't be used to build a production HTTP server, since they can only handle one request at a time. To provide asynchronous processing, Sam Rushing has built an impressive set of tools, including asyncore.py, which comes with the standard Python distribution. The most powerful example of this approach is ZOPE, an application server that includes a fast HTTP server built using Sam Rushing's Medusa engine.
In a future article I will show you how you can combine XML and web-client programming with the XMLRPCLIB module. You can use XML to squeeze even more functionality out of the Meerkat API.
Dave Warner is a senior programmer and DBA at Federal Data Corporation. He specializes in accessing relational databases with languages that begin with the letter P: Python, Perl, and PowerBuilder.
Copyright © 2009 O'Reilly Media, Inc.