ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Top Ten Data Crunching Tips and Tricks

by Greg Wilson, author of Data Crunching: Solve Everyday Problems Using Java, Python, and More
06/09/2005

Every day, all over the world, programmers have to recycle legacy data, translate from one vendor's proprietary format into another's, check configuration files, and yank data out of web server logs. This kind of programming is usually called data crunching, and while it's not glamorous, knowing how to do it with the least amount of effort can make the difference between meeting a deadline and making another pot of coffee. These ten tips will take the headache out of crunching data.

1. Master the Classic Tools

In an age of 3D window managers and web-based everything, it's easy to forget that the Unix command-line model of pipes, filters, and redirection is still the best way to solve most data crunching problems. Almost everything you'll ever want to do to text data has been done before, so it pays to master the more powerful tools like cut, grep, sed, find, and xargs.

Similarly, you should learn how to use a programmable cross-platform editor, such as Emacs or Vim, since many simple crunching tasks can be solved simply by recording macros and playing them back. If you're working in a pure-Microsoft environment, learn enough Visual Basic to script the Visual Studio editor. (The easiest way to do this is to record a few simple operations, then look at the code Visual Studio has written for you.)

One thing I don't think it's worth learning any longer is advanced shell programming. When you have serious crunching to do, you're probably going to want to do at least part of it in Python, Ruby, or some other freely available scripting language, so you might as well start there in the first place.

2. Separate Input, Processing, and Output

You'll occasionally be tempted to solve simple data crunching problems with one-off scripts that transform each record as it's read. Don't give in---experience shows that "one-offs" usually hang around longer than their authors expect, and that simple transformations usually become more complex over time. Instead, always write data crunchers in three parts: one that reads the input into memory, a second that transforms the data, and a third that writes out the data structures. Programs structured this way are a lot simpler to debug, and their parts are easier to modify and reuse.

And while we're on the subject of reuse: every software project's version control repository should have a misc or tools directory, and every little cruncher you write should be saved there. If nothing else, the odds and ends you put there can serve as starting points for the next script you have to write.

3. Store Format Information in the Data Itself

Programs come and go, but data lives forever. Any time you're working with data that isn't in a well-documented standard format, you should therefore try to include format information in the data itself, so that it can't ever be lost. If you're working with binary data, store the format template for each record at the start of the data; for example, if you're archiving a database table whose rows contain an 80-character string, two integers, and a variable-length string, you might store this:

680c2iS
    

before the first record. The 6 tells whoever's reading the data that the format specifier is six characters long; those six characters tell the reader to look for 80 characters, two integers, and a null-terminated block of text.

Related Reading

Data Crunching
Solve Everyday Problems Using Java, Python, and more.
By Greg Wilson

If you're making up a new text-based format, allow for headers (like the meta tags in an HTML page's head section), or put format information in specially structured comments, so that whoever has to read your legacy files ten years from now will know how to parse them.

4. Understand International Character Sets

Do you speak Thai? Russian? Arabic? Thanks to the internet, the odds are good that at least a few of your users do, but if you mangle their data, they may not be your users for long. It's therefore essential for your data crunchers to handle international character sets correctly. In particular:

5. Use Reluctant Matches and Graphical Tools

While we're on the topic of regular expressions, suppose you're trying to extract currency amounts from an email archive. Your first attempt uses the following regular expression:

^(.*)(\d+\.\d{2})(.*)$

which matches:

The problem is, when you apply this pattern to the text:

Charge $229.95, and get a receipt.

the middle group only matches 9.95. The reason is that regular expressions are greedy: each part of the pattern tries to match as much text as it can, leaving as little as possible for subsequent pattern elements.

Most regular expression (RE) packages now support reluctant (also called generous) patterns, which match as little text as possible. Each of the operators ?, *, and + has a reluctant counterpart: ??, *?, and +?. Making the first group in the pattern reluctant:

^(.*?)(\d+\.\d{2})(.*)$

solves our problem.

As well as illustrating reluctant matches, this example also shows why regular expressions can be so hard to debug: adding one character completely changes the way the RE behaves. This is why I'm very fond of graphical tools for building and debugging REs, such as Edi Weitz's Regex Coach, JRegexpTester, and the Rx Toolkit in ActiveState's Komodo IDE.

6. Nest and Subtract to Negate

Suppose you have a database that records which programmers are assigned to which projects, and you want a list of everyone who isn't working on version 2.1 of Frib. You might try this:

SELECT Assigned.EmployeeId
FROM   Assigned, ProjectInfo
WHERE  Assigned.ProjectId = ProjectInfo.ProjectId
AND    ProjectInfo.ProjectName != "Frib 2.1";

However, if Alan Turing is working on both Frib 2.1 and Snab 3.0, his name will appear in this query's results.

The right way to tackle this problem is to use a nested query to select the records you don't want, and then subtract them from the set of all possible records. It feels clumsy the first few times you do it, but---oh, who am I kidding? It always feels clumsy, but it's what you have to do.

Here's the complete query:

SELECT Assigned.EmployeeId
FROM   Assigned
WHERE  Assigned.EmployeeId NOT IN
      (SELECT Assigned.EmployeeId
       FROM   Assigned, ProjectInfo
       WHERE  Assigned.ProjectId = ProjectInfo.ProjectId
       AND    ProjectInfo.ProjectName = "Frib 2.1");

The nested query finds all employees who are working on Frib 2.1; the outer query then removes those IDs from the set of all employee IDs.

Depending on which database manager you're using, it may help to turn the nested query into a view. A view acts like a stored SELECT statement whose result can be treated as a temporary table. In this case, we could use:

CREATE VIEW WorkingOnFrob AS
SELECT Assigned.EmployeeId AS EmployeeId
FROM   Assigned, ProjectInfo
WHERE  Assigned.ProjectId = ProjectInfo.ProjectId
AND    ProjectInfo.ProjectName = "Frib 2.1";

SELECT Assigned.EmployeeId
FROM   Assigned, WorkingOnFrob
WHERE  Assigned.EmployeeId NOT IN WorkingOnFrob.EmployeeId;

7. Check for Holes

In the real world, data is never complete: there are always people who don't have a fax number or whose blood type is unknown. As a result, the data you're crunching may have holes. If you're lucky, these will be marked explicitly---many relational databases, for example, use NULL to indicate missing values. If you're unlucky, you'll have to figure out when an empty string or -1 means what it says, and when it means that whoever created the data didn't know what else to put.

It's particularly important to keep holes in mind when combining data. For example, if you've been asked to calculate the average sales price for a set of books, and some of them were given away as part of a promotional deal, you'll have to decide whether to count them as $0.00, or leave them out entirely.

The SQL standard tries to help programmers handle missing data by specifying that any operation involving NULL returns NULL as a result; i.e., 5 + NULL is NULL, as is (X < 0) OR (Y > 100) if either X or Y is NULL. Unfortunately, not all databases implement this correctly: in some, TRUE OR NULL is TRUE, while FALSE AND NULL is FALSE. Having been bitten by this a couple of times, I now check for records containing NULL before doing any other crunching.

8. Use XSLT (with Caveats)

XSLT is an easy way to specify simple transformations, but only when your input is highly structured (that is, where you don't need any information about context in order to interpret particular elements). However, the less structured your documents are--that is, the easier they are for human beings to type in and read---the more complex your transformation rules have to be, and experience shows that the difficulty of debugging XSLT transformations grows very quickly.

For example, transforming this:

<b1><t>XSLT is a simple way to solve simple problems</t>
  <b2><t>But complex scripts are fiendishly difficult to debug</t></b2>
</b1>
<b1><t>Options:</t>
  <b2><t>Put more structural information in your input.</t>
    <b3><t>But this makes it harder for people to create and read.</t></b3>
  </b2>
  <b2><t>Use a different tool.</t></b2>
</b1>

is almost trivial, since the text for each bullet point is wrapped in <t>...</t>. Take that away, so that whoever's creating the data has less typing to do:

<b1>XSLT is a simple way to solve simple problems
  <b2>But complex scripts are fiendishly difficult to debug</b2>
</b1>
<b1>Options:
  <b2>Put more structural information in your input.
    <b3>But this makes it harder for people to create and read.</b3>
  </b2>
  <b2>Use a different tool.</b2>
</b1>

and the XSLT transformation nearly triples in complexity.

9. Use XPath Without Using XSLT

XPath is a companion standard to XSLT, so you can use it without XSLT. XPath lets you identity parts of an XML document using a notation similar to that used for paths and filenames in the shell. For example, /project/* means, "All children of project elements," while ticket[@status='open']/author means, "Authors of tickets whose status attribute has the value 'open'." Several modern XML libraries, such as Fredrik Lundh's ElementTree package for Python, include XPath; data crunchers that use it are a lot easier to read (and write) than ones that walk trees themselves, and can sometimes be faster, too.

10. Check Your Work

If you don't know how to check your work, it's almost certainly wrong. Like everything you write, data crunchers should be tested. Sometimes, this is as simple as running it on the data you actually want to transform, and eyeballing the results; in more complex situations, you may want to select some subsets of your actual data, and make sure that each is handled correctly.

The most important thing, however, is figuring out how to tell the right answer from a wrong one. If you don't know enough about your input and output to be able to look at a specific case and say, "Wait a second, that's not right," then the odds are good that you don't know enough about your problem to write the transformation correctly. And please--when you do figure it out, write it down somewhere, so that the next time someone has to wrestle with the problem, it'll take them less time to figure it out.

Greg Wilson has been crunching data for more than 20 years. He is an independent programming consultant, an adjunct professor at the University of Toronto, and a contributing editor with Doctor Dobb's Journal. He is the author of Data Crunching and Practical Parallel Programming.


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.