Datamining with Decision Trees
Aside from classifying data, decision trees are also useful for discerning patterns in data, commonly referred to as datamining. Just by glancing over the decision tree figure at the beginning of this article, you can quickly pull out a few significant trends in the data you've collected.
The most obvious trend in the data is that young adults (ages 18 to 35) and seniors (older than 55) seem not to buy your widget at all. This could lead you to target only youths (younger than 18) and the middle-aged (36 to 55) with your marketing campaign in an effort to save money, because only consumers from these two groups tend to purchase the widget. On the other hand, it could also lead you to change your marketing strategy altogether to try to convince those reluctant customers to buy your product, in the hopes of opening up new sources of revenue.
By looking a little closer at the decision tree, you'll also notice that for youths, income is the largest determinant in the decision to buy your widget. Since youths from low-income households tend to buy the widget and those from high-income families tend not to, it may be that your product is not trendy enough for higher-income kids. You may want to push the widget into higher-priced, trendier retail stores in an effort to popularize it with higher-income kids and capture a portion of that market. With middle-aged consumers, marital status seems to be the deciding factor. Because single consumers are more likely to buy your product, it might be a good idea to stress its utility to both sexes, and perhaps even point out its usefulness in a family setting, if winning over more of the married crowd is important to you.
The main point here is that none of these patterns is easy to find with the naked eye, even in a data set of only 20 records. With hundreds, thousands, or even millions of records, spotting these trends by inspection becomes practically impossible. By using the decision tree algorithm, you can not only predict the likelihood of a person buying your product, but also spot significant patterns in your collected data that can help you better shape your marketing practices, and thus bring plenty of new customers into your revenue stream.
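To see concretely how a tree surfaces such trends, here is a minimal sketch of the entropy and information-gain calculations that drive an ID3-style split. The records below are made-up stand-ins, not the article's actual 20-record data set, and the helper names (`entropy`, `information_gain`) are my own rather than functions from the downloadable source:

```python
import math
from collections import Counter

def entropy(records, target):
    """Shannon entropy of the target attribute over a list of dict records."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(records, attr, target):
    """Reduction in entropy achieved by splitting the records on attr."""
    base = entropy(records, target)
    total = len(records)
    remainder = 0.0
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return base - remainder

# Toy records echoing the article's story: age separates buyers from
# non-buyers cleanly, while income only matters within one age group.
records = [
    {"age": "youth",       "income": "low",  "buys": "yes"},
    {"age": "youth",       "income": "high", "buys": "no"},
    {"age": "young adult", "income": "low",  "buys": "no"},
    {"age": "young adult", "income": "high", "buys": "no"},
    {"age": "middle-aged", "income": "low",  "buys": "yes"},
    {"age": "middle-aged", "income": "high", "buys": "yes"},
    {"age": "senior",      "income": "low",  "buys": "no"},
    {"age": "senior",      "income": "high", "buys": "no"},
]

for attr in ("age", "income"):
    print(attr, round(information_gain(records, attr, "buys"), 3))
```

On this toy data, "age" yields a much higher information gain than "income", which is exactly why an ID3-built tree would put age at the root and why the age-based trends jump out when you read the tree top-down.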
As I stated earlier, the rest of the code is basically just helper functions for the decision tree algorithm. I am hoping they will be fairly self-explanatory. Download the decision tree source code to see the rest of the functions that help create the decision tree.
The tarball contains three separate source code files. If you want to try out the algorithm and see how it works, just uncompress the source and run the test.py file. All it does is create a set of test data (the data you saw earlier in this article) that it uses to create a decision tree. Then it creates another set of sample data whose records it classifies using the decision tree it created with the test data in the first step.
The other two source files contain the code for the decision tree algorithm and the ID3 heuristic. The first file, d_tree.py, contains the create_decision_tree function and all the helper functions associated with it. The second file, called, appropriately enough, id3.py, contains all the code that implements the ID3 heuristic. The reason for this division is that the decision tree learning algorithm itself is well established and rarely needs to change, whereas many different heuristics exist for choosing the "next best" attribute. By placing the heuristic code in its own file, you can try out other heuristics simply by adding another file and importing it in place of id3.py in the file that calls the create_decision_tree function. (In this case, that file is test.py.)
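The pluggable-heuristic idea can be sketched as below. Note that this is a simplified stand-in, not the code from d_tree.py: the signature of the real create_decision_tree may differ, and choose_attribute here is a deliberately dumb placeholder that an ID3 information-gain function (or any other heuristic) could replace without touching the tree-building code:

```python
def choose_attribute(records, attributes, target):
    # Placeholder heuristic: just take the first remaining attribute.
    # An ID3 version would instead return the attribute with the
    # highest information gain over these records.
    return attributes[0]

def create_decision_tree(records, attributes, target, heuristic=choose_attribute):
    """Recursively build a nested-dict decision tree from dict records."""
    values = [r[target] for r in records]
    # Base case: all records agree on the target, so this is a leaf.
    if len(set(values)) == 1:
        return values[0]
    # Base case: no attributes left to split on; return the majority class.
    if not attributes:
        return max(set(values), key=values.count)
    # The heuristic -- the only ID3-specific piece -- picks the next split.
    best = heuristic(records, attributes, target)
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        tree[best][value] = create_decision_tree(subset, remaining, target, heuristic)
    return tree

data = [
    {"age": "youth",  "buys": "yes"},
    {"age": "senior", "buys": "no"},
]
print(create_decision_tree(data, ["age"], "buys"))
```

Because the heuristic arrives as an ordinary function argument (or, in the article's layout, as a module imported in place of id3.py), trying a different attribute-selection strategy means swapping one name at the call site and nothing else.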
I've had fun running through this first foray into the world of artificial intelligence with you. I hope you've enjoyed this tutorial and had plenty of success in getting your decision tree up and running. If so, and you find yourself thirsting for more AI-related topics, such as genetic algorithms, neural networks, and swarm intelligence, then keep your eyes peeled for my next installment in this series on Python AI programming.
Until next time ... I wish you all the best in your programming endeavors.
Christopher Roach recently graduated with a master's in computer science and currently works in Florida as a software engineer at a government communications corporation.