The ID3 Heuristic
Physics uses the term entropy to describe the amount of disorder inherent within a system. In information theory, this term has a similar meaning--it is the measure of the disorder in a set of data. The ID3 heuristic uses this concept to come up with the "next best" attribute in the data set to use as a node, or decision criteria, in the decision tree. Thus, the idea behind the ID3 heuristic is to find the attribute that most lowers the entropy for the data set, thereby reducing the amount of information needed to completely describe each piece of data. Thus, by following this heuristic you will essentially be finding the best attribute to classify the records (according to a reduction in the amount of information needed to describe the remaining data division) in the data set.
Information theory uses the log function with a base of 2 to determine the number of bits necessary to represent a piece of information. If you remember from early math education, the log function finds the exponent in an equation such as
2x = 8. In this equation,
x is equal to
3. The exponent in that equation is easy enough to see, but what about a more difficult example, such as
2x = 8,388,608? By using the logarithm function with a base of 2--
log2 8,388,608--you can find that the exponent
x is equal to 23. Thus you need 23 bits of information to properly represent 8,388,608 different numbers. This is the basic idea behind the entropy measurement in the ID3 algorithm. In other words, you are trying to find the attribute that best reduces the amount of information you need to classify your data.
The first step in this process is getting the "next best" attribute from the set of available attributes. The call to
choose_attribute takes care of this step. The
choose_attribute function uses the heuristic you've chosen for selecting "next best" attributes--in this case, the ID3 heuristic. In fact, the
fitness_func parameter you see in the call to
choose_attribute is a pointer to the
gain function from the ID3 algorithm described in the next section. By passing in a pointer to the
gain function, you've effectively separated the code for choosing the next attribute in the decision tree from the code for assembling the decision tree. This makes it possible, and extremely easy, to switch out the ID3 heuristic and exchange it with other heuristics that you may prefer with only the most minimal amount of change to the code.
The next step is to create a new decision tree containing the attribute returned from the
choose_attribute function as its root node. The only task left after this is to create the subtrees for each of the values in the
best attribute. The
get_values function cycles through each of the records in the data set and returns a list containing the unique values for the chosen attribute.
Finally, the code loops through each of these unique values and creates a subtree for them by making a recursive call to the
create_decision_tree function. The call to
get_examples just returns a list of all the records in the data set that have the value
val for the attribute defined by the
best variable. This list of examples is passed into the
create_decision_tree function along with the list of remaining attributes (minus the currently selected "next best" attribute). The call to
create_decision_tree returns the subtree for the remaining list of attributes and the subset of data passed into it. All that's left is to add each of these subtrees to the current decision tree and return it.
The next step in finding the entropy for the data set is to find the number of bits needed to represent each of the probabilities we calculated in the previous step. This is where you use the logarithm function. For the example above, the number of bits needed to represent the probability of each value occurring in the target attribute is
log2 0.6 = -0.736 for "will buy" and
log2 0.4 = -1.321 for "won't buy."
Now that you have the number of bits needed to represent the probability of each value occurring in the data set, all that's left to do is sum this up and, voilá, you have the entropy for the data set! Right? Not exactly. Before you do this, there is one more step. You need to go through and weight each of these numbers before summing them. To do so, multiply each amount that you found in the previous step by the probability of that value occurring, and then multiply the outcome by -1 to make the number positive. Once you've done this, the summation should look like
(-0.6 * -0.736) + (-0.4 * -1.321) = 0.97. Thus, 0.97 is the entropy for the data set in the table above.
That's all there is to finding the entropy for a set of data. You use the same equation to calculate the entropy for each subset of data in the gain equation, but it is essentially the same process. The only difference is that you will be using a smaller subset of the records within the data set, and you'll also be using an attribute other than the target attribute to calculate the entropy.
The next step in the ID3 heuristic is to calculate the information gain that each attribute affords if it is the next decision criteria in the decision tree. If you understood the first step on calculating the entropy, then this step should be a breeze. Essentially, all you need to do to find the gain for a specific attribute is find the entropy measurement for that attribute using the process described in the last few paragraphs (find the entropy for the subset of data for each value in the chosen attribute and sum them all), and subtract this value from the entropy for the entire data set. The decision tree algorithm follows this process for each attribute in the data set, and the attribute with the highest gain will be the one chosen as the next node in the decision tree.
That's the prose explanation. For those of you a bit more mathematically minded, the equations in Figure 2 and Figure 3 are the entropy and information gain for the data set.
Figure 2. The entropy equation
Figure 3. The information gain equation
Entropy and gain are the only two methods you need in the ID3 module. If you understood the concepts of entropy and information gain, then you understand the final pieces of the puzzle.
Just as a quick note: if you didn't totally understand the section on the ID3 heuristic, don't worry--several good web sites go over the ID3 heuristic in more detail. (One very good site in particular is decisiontrees.net, created by Michael Nashvili.) Also, keep in mind that it's just one of several heuristics that you can use to decide the "next best" node in the decision tree. The most important thing is to understand the inner workings of the decision tree algorithm. In the end, if you don't understand ID3, you can always just plug in another heuristic or create your own.