ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Building Decision Trees in Python

by Christopher Roach
02/09/2006

You have a great idea to start selling the most marvelous widget ever known. You're so sure of your enterprising idea that you decide to go into business for yourself and begin manufacturing said widgets. A few years pass and you're a success. However, lately you've noticed a slump in sales, and you decide that you need a better way of focusing your marketing dollars toward your target audience. How do you do this?

This article introduces a popular and easy-to-use datamining tool called a decision tree that should help you solve your marketing dilemma.

Decision trees are a topic of artificial intelligence. More specifically, they belong to the subfield of machine learning. This is due to their ability to learn--through example--to classify individual records in a data set.

This article will give you a closer look at what decision trees are, and how they can help you in your endeavors to more effectively target your marketing campaign toward your core customers. In so doing, I'll go over a few different uses for decision trees (aside from the aforementioned marketing scenario), and I'll discuss one popular heuristic that can be used during the learning process to create the most effective decision tree. Finally, I'll implement a simple decision tree program using the Python programming language.

If you've ever had any interest in machine learning or artificial intelligence; if you find the idea of writing a program that has the ability to learn simply fascinating; or if you just happen to manufacture the world's best widgets, then this article is for you.

Decision Tree ... What's That?

Decision trees fall under the subfield of machine learning within the larger field of artificial intelligence. Decision trees are mainly used for classification purposes, but they are also helpful in uncovering features of data that were previously unrecognizable to the eye. Thus, they can be very useful in datamining activities as well as data classification. They work well in many different areas, from typical business scenarios (such as the widget example described previously) to airplane autopilots and medical diagnoses.

A decision tree is essentially a series of if-then statements, that, when applied to a record in a data set, results in the classification of that record. Therefore, once you've created your decision tree, you will be able to run a data set through the program and get a classification for each individual record within the data set. What this means to you, as a manufacturer of quality widgets, is that the program you create from this article will be able to predict the likelihood of each user, within a data set, purchasing your finely crafted product.

Though the classification of data is the driving force behind creating a decision tree program, it is not what makes a decision tree special. The beauty of decision trees lies in their ability to learn. When finished, you will be able to feed your program a test set of data, and it essentially will learn how to classify future sets of data from the examples.

Hopefully, if I've done my job well enough, you're champing at the bit to start coding. However, before you go any further with your code monkey endeavors, you need a good idea of what a decision tree looks like. It's one of those data structures that is easiest to understand with a good visual representation. Figure 1 contains a graphical depiction of a decision tree.

a decision tree
Figure 1. A decision tree

Notice that it actually is a true tree structure. Because of this, you can use recursive techniques to both create and traverse the decision tree. For that reason, you could use just about any tree representation that you remember from your Data Structures course to represent the decision tree. In this article, though, I'm going to keep everything very simple and create my decision tree out of only Python's built-in dictionary object.

Fleshing Out the Scenario

If you recall the example scenario, you are the manufacturer of the world's finest widgets and are currently looking for a way to focus your marketing efforts to more closely entice your target demographic. To do this, you need a set of test data to use to train the decision tree program.

I assume that, being the concerned entrepreneur that you undoubtedly are, you've already gathered plenty of demographic information through, perhaps, anonymous email surveys. Now, what you need to do is organize all this data into a large set of user records. Here is a table containing a sampling of the information you collected during your email survey:

Age Education Income Marital Status Purchase?
36-55 master's high single will buy
18-35 high school low single won't buy
36-55 master's low single will buy
18-35 bachelor's high single won't buy
< 18 high school low single will buy
18-35 bachelor's high married won't buy
36-55 bachelor's low married won't buy
> 55 bachelor's high single will buy
36-55 master's low married won't buy
> 55 master's low married will buy
36-55 master's high single will buy
> 55 master's high single will buy
< 18 high school high single won't buy
36-55 master's low single will buy
36-55 high school low single will buy
< 18 high school low married will buy
18-35 bachelor's high married won't buy
> 55 high school high married will buy
> 55 bachelor's low single will buy
36-55 high school high married won't buy

Given the decision tree in Figure 1 and the set of data, it should be somewhat easy to see just how a decision tree can classify records in a data set. Starting with the top node (Age), check the value of the first record in the field matching that of the top node (in this case, 36-55). Then follow the link to the next node in the tree (Marital Status) and repeat the process until you finally reach a leaf node (a node with no children). This leaf node holds the answer to the question of whether the user will buy your product. (In this example, the user will buy, because his marital status is single). It's also quite easy to see that this type of operation lends itself to a recursive process (not necessarily the most efficient way to program, I know--but a very elegant way).

The decision tree in the figure is just one of many decision tree structures you could create to solve the marketing problem. The task of finding the optimal decision tree is an intractable problem. For those of you who have taken an analysis of algorithms course, you no doubt recognize this term. For those of you who haven't had this pleasure (he says, gritting his teeth), essentially what this means is that as the amount of test data used to train the decision tree grows, the amount of time it takes to do so grows as well--exponentially. While it may be nearly impossible to find the smallest (or more fittingly, the shallowest) decision tree in a respectable amount of time, it is possible to find a decision tree that is "small enough" using special heuristics. It is the job of the heuristic you choose to accomplish this task by choosing the "next best" attribute by which to divide the data set according to some predefined criteria. There are many such heuristics (C4.5, C5.0, gain ratio, GINI, and others). However, for this article I've used one of the more popular heuristics for choosing "next best" attributes based on some of the ideas found in information theory. The ID3 (information theoretic) heuristic uses the concept of entropy to calculate which attribute is best to use for dividing the data into subgroups.

The next section quickly covers the basic idea behind how this heuristic works. Don't worry; it's not too much math. Following this discussion, you'll finally get a chance to get your hands dirty by writing the code that will create the decision tree and classify the users in your data set as a "will buy" or "won't buy," thereby making your company instantly more profitable.

The ID3 Heuristic

Physics uses the term entropy to describe the amount of disorder inherent within a system. In information theory, this term has a similar meaning--it is the measure of the disorder in a set of data. The ID3 heuristic uses this concept to come up with the "next best" attribute in the data set to use as a node, or decision criteria, in the decision tree. Thus, the idea behind the ID3 heuristic is to find the attribute that most lowers the entropy for the data set, thereby reducing the amount of information needed to completely describe each piece of data. Thus, by following this heuristic you will essentially be finding the best attribute to classify the records (according to a reduction in the amount of information needed to describe the remaining data division) in the data set.

Information theory uses the log function with a base of 2 to determine the number of bits necessary to represent a piece of information. If you remember from early math education, the log function finds the exponent in an equation such as 2x = 8. In this equation, x is equal to 3. The exponent in that equation is easy enough to see, but what about a more difficult example, such as 2x = 8,388,608? By using the logarithm function with a base of 2--log2 8,388,608--you can find that the exponent x is equal to 23. Thus you need 23 bits of information to properly represent 8,388,608 different numbers. This is the basic idea behind the entropy measurement in the ID3 algorithm. In other words, you are trying to find the attribute that best reduces the amount of information you need to classify your data.

The first step in this process is getting the "next best" attribute from the set of available attributes. The call to choose_attribute takes care of this step. The choose_attribute function uses the heuristic you've chosen for selecting "next best" attributes--in this case, the ID3 heuristic. In fact, the fitness_func parameter you see in the call to choose_attribute is a pointer to the gain function from the ID3 algorithm described in the next section. By passing in a pointer to the gain function, you've effectively separated the code for choosing the next attribute in the decision tree from the code for assembling the decision tree. This makes it possible, and extremely easy, to switch out the ID3 heuristic and exchange it with other heuristics that you may prefer with only the most minimal amount of change to the code.

The next step is to create a new decision tree containing the attribute returned from the choose_attribute function as its root node. The only task left after this is to create the subtrees for each of the values in the best attribute. The get_values function cycles through each of the records in the data set and returns a list containing the unique values for the chosen attribute.

Finally, the code loops through each of these unique values and creates a subtree for them by making a recursive call to the create_decision_tree function. The call to get_examples just returns a list of all the records in the data set that have the value val for the attribute defined by the best variable. This list of examples is passed into the create_decision_tree function along with the list of remaining attributes (minus the currently selected "next best" attribute). The call to create_decision_tree returns the subtree for the remaining list of attributes and the subset of data passed into it. All that's left is to add each of these subtrees to the current decision tree and return it.

The next step in finding the entropy for the data set is to find the number of bits needed to represent each of the probabilities we calculated in the previous step. This is where you use the logarithm function. For the example above, the number of bits needed to represent the probability of each value occurring in the target attribute is log2 0.6 = -0.736 for "will buy" and log2 0.4 = -1.321 for "won't buy."

Now that you have the number of bits needed to represent the probability of each value occurring in the data set, all that's left to do is sum this up and, voilá, you have the entropy for the data set! Right? Not exactly. Before you do this, there is one more step. You need to go through and weight each of these numbers before summing them. To do so, multiply each amount that you found in the previous step by the probability of that value occurring, and then multiply the outcome by -1 to make the number positive. Once you've done this, the summation should look like (-0.6 * -0.736) + (-0.4 * -1.321) = 0.97. Thus, 0.97 is the entropy for the data set in the table above.

That's all there is to finding the entropy for a set of data. You use the same equation to calculate the entropy for each subset of data in the gain equation, but it is essentially the same process. The only difference is that you will be using a smaller subset of the records within the data set, and you'll also be using an attribute other than the target attribute to calculate the entropy.

The next step in the ID3 heuristic is to calculate the information gain that each attribute affords if it is the next decision criteria in the decision tree. If you understood the first step on calculating the entropy, then this step should be a breeze. Essentially, all you need to do to find the gain for a specific attribute is find the entropy measurement for that attribute using the process described in the last few paragraphs (find the entropy for the subset of data for each value in the chosen attribute and sum them all), and subtract this value from the entropy for the entire data set. The decision tree algorithm follows this process for each attribute in the data set, and the attribute with the highest gain will be the one chosen as the next node in the decision tree.

That's the prose explanation. For those of you a bit more mathematically minded, the equations in Figure 2 and Figure 3 are the entropy and information gain for the data set.

entropy equation
Figure 2. The entropy equation

information gain equation
Figure 3. The information gain equation

Entropy and gain are the only two methods you need in the ID3 module. If you understood the concepts of entropy and information gain, then you understand the final pieces of the puzzle.

Just as a quick note: if you didn't totally understand the section on the ID3 heuristic, don't worry--several good web sites go over the ID3 heuristic in more detail. (One very good site in particular is decisiontrees.net, created by Michael Nashvili.) Also, keep in mind that it's just one of several heuristics that you can use to decide the "next best" node in the decision tree. The most important thing is to understand the inner workings of the decision tree algorithm. In the end, if you don't understand ID3, you can always just plug in another heuristic or create your own.

The Decision Tree Learning Algorithm

With most of the preliminary information out of the way, you can now look at the actual decision tree algorithm. The following code listing is the main function used to create your decision tree:

def create_decision_tree(data, attributes, target_attr, fitness_func):
    """
    Returns a new decision tree based on the examples given.
    """
    data    = data[:]
    vals    = [record[target_attr] for record in data]
    default = majority_value(data, target_attr)

    # If the dataset is empty or the attributes list is empty, return the
    # default value. When checking the attributes list for emptiness, we
    # need to subtract 1 to account for the target attribute.
    if not data or (len(attributes) - 1) <= 0:
        return default
    # If all the records in the dataset have the same classification,
    # return that classification.
    elif vals.count(vals[0]) == len(vals):
        return vals[0]
    else:
        # Choose the next best attribute to best classify our data
        best = choose_attribute(data, attributes, target_attr,
                                fitness_func)

        # Create a new decision tree/node with the best attribute and an empty
        # dictionary object--we'll fill that up next.
        tree = {best:{}}

        # Create a new decision tree/sub-node for each of the values in the
        # best attribute field
        for val in get_values(data, best):
            # Create a subtree for the current value under the "best" field
            subtree = create_decision_tree(
                get_examples(data, best, val),
                [attr for attr in attributes if attr != best],
                target_attr,
                fitness_func)

            # Add the new subtree to the empty dictionary object in our new
            # tree/node we just created.
            tree[best][val] = subtree

    return tree

The create_decision_tree function starts off by declaring three variables: data, vals, and default. The first, data, is just a copy of the data list being passed into the function. The reason I do this is because Python passes all mutable data types, such as dictionaries and lists, by reference. It's a good rule of thumb to make a copy of any of these in order to keep from accidentally altering the original data. vals is a list of all the values in the target attribute for each record in the data set, and default holds the default value that is returned from the function when the data set is empty. That is simply the value in the target attribute with the highest frequency, and thus, the best guess for when the decision tree is unable to classify a record.

The next lines are the real nitty-gritty of the algorithm. The algorithm makes use of recursion to create the decision tree, and as such it needs a base case (or, in this case, two base cases) to prevent it from entering an infinite recursive loop. What are the base cases for this algorithm? For starters, if either the data or attributes list is empty, then the algorithm has reached a stopping point. The first if-then statement takes care of this case. If either list is empty, then the algorithm returns a default value. (Actually, for the attributes list, check to see whether it has only one attribute in it, because the attributes list also contains the target attribute, which the decision tree never uses; that is what the tree should predict.) It returns the value with the highest frequency in the data set for the target attribute. The only other case to worry about is when the remaining records in the data list all have the same value for the target attribute, in which case the algorithm returns that value.

Those are the base cases. What about the recursive case? Well, when everything else is normal (that is, the data and attributes lists are not empty and the records in the list of data still have multiple values for the target attribute), the algorithm needs to choose the "next best" attribute for classifying the test data and add it to the decision tree. The choose_attribute function is responsible for picking the "next best" attribute for classifying the records in the test data set. After this, the code creates a new decision tree containing only the newly selected "best" attribute. Then the recursion takes place. In other words, each of the subtrees is created by making a recursive call to the create_decision_tree function and adding the returned tree to the newly created tree in the last step.

The first step in this process is getting the "next best" attribute from the set of available attributes. The call to choose_attribute takes care of this step. The next step is to create a new decision tree containing the chosen attribute as the root node. All that remains to do after this is to create the subtrees for each of the values in the best attribute. The get_values function cycles through each of the records in the data set and returns a list containing the unique values for the chosen attribute. Next, the code loops through each of these unique values and creates a subtree for them by making a recursive call to the create_decision_tree function. The call to get_examples just returns a list of all the records in the data set that have the value val for the attribute defined by the best variable. This list of examples passes to the create_decision_tree function along with the list of remaining attributes (minus the currently selected "next best" attribute). The call to create_decision_tree will return the subtree for the remaining list of attributes and the subset of data passed into it. All that's left is to add each of these subtrees to the current decision tree and return it.

If you're not used to recursion, this process can seem a bit strange. Take some time to look over the code and make sure that you understand what is happening here. Create a little script to run the function and print out the tree (or, just alter test.py to do so), so you can get a better idea of how it's functioning. It's really a good idea to take your time and make sure you understand what's happening, because many programming problems lend themselves to a recursive solution--you just may be adding a very important tool to your programming arsenal.

That's about all there is to the algorithm; everything else is just helper functions to the main algorithm. Most of the functions should be fairly self-explanatory, with the exception of the ID3 heuristic.

Implementing the ID3 Heuristic

The ID3 heuristic uses the concept of entropy to formulate the gain in information received by choosing a particular attribute to be the next node in the decision tree. Here's the entropy function:

def entropy(data, target_attr):
    """
    Calculates the entropy of the given data set for the target attribute.
    """
    val_freq     = {}
    data_entropy = 0.0

    # Calculate the frequency of each of the values in the target attr
    for record in data:
        if (val_freq.has_key(record[target_attr])):
            val_freq[record[target_attr]] += 1.0
        else:
            val_freq[record[target_attr]]  = 1.0

    # Calculate the entropy of the data for the target attribute
    for freq in val_freq.values():
        data_entropy += (-freq/len(data)) * math.log(freq/len(data), 2) 
        
    return data_entropy

Just like the create_decision_tree function, the first thing the entropy function does is create the variables it uses throughout the algorithm. The first is a dictionary object called val_freq to hold all the values found in the data set passed into this function and the frequency at which each value appears in the data set. The other variable is data_entropy, which holds the ongoing calculation of the data's entropy value.

The next section of code adds each of the values in the data set to the val_freq dictionary and calculates the corresponding frequency for each value. It does so by looping through each of the records in the data set and checking the val_freq dictionary object to see if the current value already resides within it. If it does, it increments the frequency for the current value, otherwise, it adds the current value to the dictionary object and initializes its frequency to 1. The final portion of the code is responsible for actually calculating the entropy measurement (using the equation in Figure 1) with the frequencies stored in the val_freq dictionary object.

That was easy, wasn't it? That's only the first half of the ID3 heuristic. Now that you know how to calculate the amount of disorder in a set of data, you need to take those calculations and use them to find the amount of information gain you will get by using an attribute in the decision tree. The information gain function is very similar to the entropy function. Here's the code that calculates this measurement:

def gain(data, attr, target_attr):
    """
    Calculates the information gain (reduction in entropy) that would
    result by splitting the data on the chosen attribute (attr).
    """
    val_freq       = {}
    subset_entropy = 0.0

    # Calculate the frequency of each of the values in the target attribute
    for record in data:
        if (val_freq.has_key(record[attr])):
            val_freq[record[attr]] += 1.0
        else:
            val_freq[record[attr]]  = 1.0

    # Calculate the sum of the entropy for each subset of records weighted
    # by their probability of occuring in the training set.
    for val in val_freq.keys():
        val_prob        = val_freq[val] / sum(val_freq.values())
        data_subset     = [record for record in data if record[attr] == val]
        subset_entropy += val_prob * entropy(data_subset, target_attr)

    # Subtract the entropy of the chosen attribute from the entropy of the
    # whole data set with respect to the target attribute (and return it)
    return (entropy(data, target_attr) - subset_entropy)

Once again, the code starts by calculating the frequency of each of the values in the data set. Following this, it calculates the entropy for the data set with the new division of data derived by using the chosen attribute, attr, to classify the records in the data set. Subtracting that from the original entropy for the current subset of data set finds the gain in information (or, reduction in disorder, if you prefer to think those terms) that you get by choosing that attribute as the next node in the decision tree.

That is essentially all there is to it. You still need some code that cycles through each attribute and calculates its information gain measure and chooses the best one, but that part of the code should be somewhat obvious; it's just a matter of repeatedly calling the gain function on each attribute and keeping track of the attribute with the best score. That said, I leave it as a challenge for you to look over the rest of the helper functions in the accompanying source code and figure out each one.

Datamining with Decision Trees

Aside from classifying data, decision trees are also useful for discerning patterns in data, commonly referred to as datamining. Just by glancing over the decision tree figure at the beginning of this article, you can quickly pull out a few significant trends in the data you've collected.

The most obvious trend in the data is that young adults (ages 18 to 35) and seniors (55 and older) seem not to buy your widget at all. This could lead you to target only youths (18 and younger) and the middle-aged (36 to 55) with your marketing campaign in an effort to save money, because only consumers from these two groups tend to purchase the widget. On the other hand, it could also lead you to change your marketing strategy altogether to try to convince those reticent customers to buy your product, in the hopes of opening up new sources of revenue.

By looking a little closer at the decision tree, you'll also notice that with youths, income is the largest determinant in their decision to buy your widget. As youths from low-income households tend to buy the widget and those from high-income families tend not to, it may be possible that your product is not trendy enough for higher-income kids. You may want to push the widget into higher-priced, trendier retail stores in an effort to popularize it with higher-income kids and capture a portion of that market. With middle-aged consumers, marital status seems to be the discriminant factor. Because single consumers are more likely to buy your product, perhaps it would be a good idea to stress its utility to both sexes and maybe even point out its usefulness in a family setting, if adding more consumers from the married crowd is important to you.

The main point here is that looking over the data set, even with only 20 records, none of these patterns are easy to find with the naked eye. With hundreds, thousands, or even millions of records in a data set, spotting these trends becomes absolutely impossible. By using the decision tree algorithm, you can not only predict the likelihood of a person buying your product, you can also spot significant patterns in your collected test data that can help you to better mold your marketing practices, and thus infuse your revenue stream with plenty of new customers.

Conclusion

As I stated earlier, the rest of the code is basically just helper functions for the decision tree algorithm. I am hoping they will be fairly self-explanatory. Download the decision tree source code to see the rest of the functions that help create the decision tree.

The tarball contains three separate source code files. If you want to try out the algorithm and see how it works, just uncompress the source and run the test.py file. All it does is create a set of test data (the data you saw earlier in this article) that it uses to create a decision tree. Then it creates another set of sample data whose records it classifies using the decision tree it created with the test data in the first step.

The other two source files are the code for the decision tree algorithm and the ID3 heuristic. The first file, d_tree.py, contains the create_decision_tree function and all the helper functions associated with it. The second file contains all the code that implements the ID3 heuristic, called, appropriately enough, id3.py. The reason for this division is that the decision tree learning algorithm is a well-established algorithm with little need for change. However, there exist many heuristics that can be used in the choosing of the "next best" attribute, and by placing this code into its own file, you are able to try out other heuristics by just adding another file and including it in place of id3.py in the file that makes the call to the create_decision_tree function. (In this case, that file is test.py.)

I've had fun running through this first foray into the world of artificial intelligence with you. I hope you've enjoyed this tutorial and had plenty of success in getting your decision tree up and running. If so, and you find yourself thirsting for more AI related topics, such as genetic algorithms, neural networks, and swarm intelligence, then keep your eyes peeled for my next installment in this series on Python AI programming.

Until next time ... I wish you all the best in your programming endeavors.

Christopher Roach recently graduated with a master's in computer science and currently works in Florida as a software engineer at a government communications corporation.

Python Pocket Reference

Related Reading

Python Pocket Reference
By Mark Lutz

Return to the Python DevCenter.

Copyright © 2009 O'Reilly Media, Inc.