**Calculating Entropy for Data Miners**


### Information Gain

What I have called the mutual information score, Han and Kamber refer to as
the *Gain* score. While the formulas for computing the two scores appear different,
the mutual information score (`I(age; buys_computer) = 0.246`) is
the same value as their gain score (`Gain(age) = 0.246`).

In the datamining literature, the term *gain* often appears instead of *mutual information*, probably because the standard formulas differ and the
gain score is sometimes "corrected" to make it more useful for datamining. Be
careful, however, not to correct the computation too much, or else you will lose the
powerful consistency properties of the information theory calculus.

The term *gain* is more suggestive of why you might want to compute the
mutual information score in a datamining context, and I will use it from here
on in. According to Han and Kamber, "Gain(A) is the expected reduction in
entropy caused by knowing the value of attribute A." Below is some code that
computes the gain score for each of the columns in the
`AllElectronics.Customers` table to determine which column most
reduces the entropy in the `buys_computer` column.

```
<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";
/**
* Example of how to compute gain scores for several columns.
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->analyze();
$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConsequent("buys_computer");
$columns = array("age", "income", "student", "credit_rating");
foreach ($columns as $column) {
    $ce->setAntecedent($column);
    $ce->analyze();
    $gain = $e->bits - $ce->bits;
    echo "gain($column): $gain<br />";
}
?>
```

The `setAntecedent()` and `setConsequent()` methods in
the `ConditionalEntropy.php` class exist to handle situations where
you want to compute the conditional entropy score for a target column in terms
of several other columns. The script outputs the following set of gain
scores:

```
gain(age): 0.246749819774
gain(income): 0.029222565659
gain(student): 0.151835501362
gain(credit_rating): 0.0481270304083
```

The report shows that the `age` column reduces the entropy in the
`buys_computer` column the most. You might consider developing a
script that ranks the gains from highest to lowest for use in
constructing decision tree classifiers, discussed in the next section.
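
A minimal sketch of such a ranking script, assuming the gain scores have already been computed into an array keyed by column name (the values below are simply copied from the report above):

```php
<?php
// Hypothetical gain scores, keyed by column name, as reported above.
$gains = array(
    "age"           => 0.246749819774,
    "income"        => 0.029222565659,
    "student"       => 0.151835501362,
    "credit_rating" => 0.0481270304083,
);

// Sort by gain score, highest first, preserving the column-name keys.
arsort($gains);

// Print a ranked report.
$rank = 1;
foreach ($gains as $column => $gain) {
    echo $rank++ . ". gain($column): $gain<br />";
}
?>
```

Because `arsort()` preserves keys, the first key after sorting is the best candidate for the root attribute of a decision tree.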

### Datamining Applications

Three common datamining applications of entropic analysis are data
description, classifier construction, and data discretization. You can easily
incorporate the entropy scores calculated previously into web-based tools that
report the entropy, joint entropy, conditional entropy, and mutual
information/gain scores between the various columns in the database. Use this
information much as you would use means, standard deviations, and correlation
coefficients to provide a useful descriptive summary of your data sets. Dorian
Pyle's book *Data Preparation for Data Mining* provides examples of such
entropic analysis reports, along with a useful discussion of how to use the
reported entropy information in the early "data survey" stage of data
mining.


The previous article also developed several such reports; however, they reported only univariate entropy scores. With the new entropy classes and sample code discussed in this article, you should be able to develop a variety of new entropy-based reports to characterize your data columns.

A more advanced and very common application of entropic analysis to data mining is in the construction of classifiers in the form of decision trees. A decision tree might be useful for determining whether an All Electronics customer is likely to buy a computer. This decision tree would involve first asking the age of the customer. If the age were younger than 30 years old, it might then ask whether the customer is a student. After that, it might ask his income range. Eventually it would reach a terminal node in the decision tree that classifies the instance as to whether the customer is likely to buy a computer.
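
The decision path just described might be sketched as a nested conditional. The particular splits, thresholds, and leaf labels here are illustrative assumptions; a real tree would be induced from the data by recursive gain scoring, not hard-coded:

```php
<?php
// Hand-coded sketch of the hypothetical buys_computer decision tree
// described above. Splits and leaves are illustrative only.
function classify($customer) {
    // Root test: age, the attribute with the highest gain score.
    if ($customer["age"] < 30) {
        // Second-level test for young customers: student status.
        if ($customer["student"]) {
            return "buys_computer";
        }
        // Third-level test: income range.
        return ($customer["income"] == "high") ? "buys_computer" : "does_not_buy";
    }
    // Terminal node for the remaining age ranges.
    return "buys_computer";
}

echo classify(array("age" => 25, "student" => true, "income" => "low"));
?>
```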

It is beyond the scope of this article to go too deeply into the theory of
constructing decision trees, except to note that constructing such trees uses
the recursive computation of gain scores extensively. The `age`
attribute forms the root of the decision tree because it has the highest
information gain score. The tree-building procedure then takes the partitioned customers (in the age ranges
`<30`, `31..40`, and `>40`) and computes the gain score again using the
remaining attributes. The remaining attribute producing the highest gain score
becomes the second-level branch in the decision tree. This process repeats
until there is very little gain to be had by adding more attribute tests to the
decision tree. See Quinlan's book and the CART literature for more details on
how to use gain scores to construct decision trees.

In addition to data description and classifier construction, you can also use entropic analysis to discretize your data. For example, if customer ages were recorded in years, you might find appropriate bins by first ordering the age data from smallest to largest, selecting a number and size of bins to try, and then computing the entropy score of the data under that binning. Compare this with other bin numbers and sizes to find the binning strategy that produces the lowest entropy score. A more sophisticated and useful form of discretization would incorporate information about the target class into the selection of bins, perhaps selecting bins so as to minimize a joint entropy, conditional entropy, or mutual information score.
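
A minimal sketch of that comparison in plain PHP (no IT Package classes), using a hypothetical sample of ages and fixed-width candidate binnings:

```php
<?php
// Shannon entropy (in bits) of a frequency table.
function entropy(array $counts) {
    $n = array_sum($counts);
    $h = 0.0;
    foreach ($counts as $count) {
        $p = $count / $n;
        $h -= $p * log($p, 2);
    }
    return $h;
}

// Assign each age to a fixed-width bin and count bin memberships.
function bin_counts(array $ages, $width) {
    $counts = array();
    foreach ($ages as $age) {
        $bin = (int) floor($age / $width);
        $counts[$bin] = isset($counts[$bin]) ? $counts[$bin] + 1 : 1;
    }
    return $counts;
}

// A hypothetical sample of customer ages, already sorted.
$ages = array(23, 25, 29, 31, 34, 38, 42, 45, 51, 58, 63, 64);

// Compare candidate bin widths by the entropy each binning induces.
foreach (array(10, 20, 30) as $width) {
    printf("width %d: %.4f bits<br />", $width, entropy(bin_counts($ages, $width)));
}
?>
```

Note that coarser binnings always drive the entropy score down, so in practice you would trade entropy off against the number of bins rather than minimize it blindly.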

In summary, you can use entropic analysis in datamining to prepare your
data (converting continuous quantities into discrete quantities), to describe
your data (producing entropic analysis reports), and to make inferences about
new sample data (applying your `buys_computer` decision tree to a
new customer). You can apply entropic analysis to many aspects and stages of
the datamining process.

### Consistency Proof

There are many ways to structure the code base of an information theory
package. In this series I have proposed a class structure that reserves a class
file for each basic logical element of information theory:
`Entropy.php`, `JointEntropy.php`, and
`ConditionalEntropy.php`.

By preserving the logical structure of the area in the code base class structure, it's possible to construct proofs that the logical house is in order:

```
<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";
/**
* Proof that H(Y,X) = H(Y|X) + H(X)
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();
echo "H(age) = ". $e->bits ."<br />";
$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConditional("buys_computer|age");
$ce->analyze();
echo "H(buys_computer|age) = ". $ce->bits ."<br />";
$joint_entropy = $ce->bits + $e->bits;
echo "H(buys_computer|age) + H(age) = $joint_entropy<br />";
$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns(array("age","buys_computer"));
$je->analyze();
echo "H(age, buys_computer) = ". $je->bits ."<br />";
echo "Therefore, H(buys_computer, age) = H(buys_computer|age) + H(age) <br />";
echo "Or, more generally, H(Y,X) = H(Y|X) + H(X)";
?>
```

Point your browser at the `consistency_proof.php` script to see
the following result:

```
H(age) = 1.57740628285
H(buys_computer|age) = 0.693536138896
H(buys_computer|age) + H(age) = 2.27094242175
H(age, buys_computer) = 2.27094242175
Therefore, H(buys_computer, age) = H(buys_computer|age) + H(age)
Or, more generally, H(Y,X) = H(Y|X) + H(X)
```

In addition to using the IT Package to engage in datamining, you might also want to use it to explore and confirm the mathematical structure of information theory.

### Future Technical Work

I spent quite a bit of time designing the set of column setting methods used by the IT Package classes:

- `Entropy::setColumn`
- `JointEntropy::setColumns`
- `ConditionalEntropy::setAntecedent`
- `ConditionalEntropy::setConsequent`
- `ConditionalEntropy::setConditional`

I designed them with a view toward their use in multicolumn contexts.

For example, when computing the joint entropy between columns, you might specify more than two columns to use:

`$je->setColumns("age,buys_computer,student");`

Similarly, when computing a conditional entropy score, you might specify more than one column in the antecedent or consequent part of the conditional expression:

`$ce->setConditional('buys_computer|age,student');`

Future work on the IT Package might involve improving the joint and conditional entropy classes so that they can evaluate multicolumn queries in a useful manner. One way to evaluate a three-column mutual information query is by arithmetic over the outcomes generated by entropy, joint entropy, and conditional entropy objects.

`I(A;B,C) = {H(A) - H(A|B)} + {H(A|B) - H(A|B,C)}`

The last term in the three-column mutual information query requires implementing a conditional entropy object with three-column support. There are likely more efficient ways to compute queries with four or more columns.
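
To make the arithmetic concrete, here is a sketch of that telescoping computation in plain PHP, using a small, entirely hypothetical data set in place of the `AllElectronics.Customers` table and simple helper functions in place of the IT Package classes:

```php
<?php
// Entropy (bits) of a discrete distribution given raw counts.
function H(array $counts) {
    $n = array_sum($counts);
    $h = 0.0;
    foreach ($counts as $c) {
        if ($c > 0) {
            $p = $c / $n;
            $h -= $p * log($p, 2);
        }
    }
    return $h;
}

// Conditional entropy H(target | given columns): the weighted average
// of the entropy of the target within each group of "given" values.
function condH(array $rows, $target, array $given) {
    $groups = array();
    foreach ($rows as $row) {
        $parts = array();
        foreach ($given as $g) { $parts[] = $row[$g]; }
        $groups[implode("|", $parts)][] = $row[$target];
    }
    $n = count($rows);
    $h = 0.0;
    foreach ($groups as $values) {
        $h += (count($values) / $n) * H(array_count_values($values));
    }
    return $h;
}

// A tiny hypothetical sample (not the AllElectronics data).
$rows = array(
    array("buys" => "yes", "age" => "<30",    "student" => "yes"),
    array("buys" => "no",  "age" => "<30",    "student" => "no"),
    array("buys" => "yes", "age" => ">40",    "student" => "no"),
    array("buys" => "yes", "age" => "31..40", "student" => "yes"),
    array("buys" => "no",  "age" => ">40",    "student" => "no"),
    array("buys" => "yes", "age" => "31..40", "student" => "no"),
);

$hA   = H(array_count_values(array_column($rows, "buys")));
$hAB  = condH($rows, "buys", array("age"));
$hABC = condH($rows, "buys", array("age", "student"));

// I(A;B,C) = {H(A) - H(A|B)} + {H(A|B) - H(A|B,C)}
$mi = ($hA - $hAB) + ($hAB - $hABC);
echo "I(buys; age, student) = $mi<br />";
?>
```

The two `H(A|B)` terms cancel, so the sum collapses to `H(A) - H(A|B,C)`; the bracketed form simply shows the incremental reduction contributed by each additional column.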

Another useful class to add to the package is one that computes a relative
entropy score (perhaps `RelativeEntropy.php`

), which is useful for
model-fitting applications. The relative entropy score is also known as the
Kullback-Leibler distance. The formula for computing this score looks
like:

`D(p || q) = Σ_{i=1..n} p(x_i) * log(p(x_i)/q(x_i))`
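
A sketch of what such a class might wrap (the class itself does not yet exist) is only a few lines of plain PHP, here taking two parallel arrays of probabilities:

```php
<?php
// Kullback-Leibler distance D(p||q), in bits, between two discrete
// distributions given as parallel arrays of probabilities.
function kl_divergence(array $p, array $q) {
    $d = 0.0;
    foreach ($p as $i => $pi) {
        if ($pi > 0) {
            $d += $pi * log($pi / $q[$i], 2);
        }
    }
    return $d;
}

// An empirical distribution p measured against a uniform model q.
$p = array(0.5, 0.25, 0.25);
$q = array(1/3, 1/3, 1/3);
echo "D(p || q) = " . kl_divergence($p, $q) . " bits<br />";
?>
```

The score is zero exactly when the two distributions match and grows as the model `q` diverges from the data `p`, which is what makes it useful for model fitting. Note that, unlike a true distance, it is not symmetric in `p` and `q`.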

Finally, I would like to extend these information theory concepts to handle continuous random variables. This involves incorporating a probability distribution library into the calculations. Because I have already coded a PDL Package that supplies such a library, this is a feasible technical direction to move in as well.

### Resources

- Download the code here. PHPmath.com will host further updates to the IT Package.
- C. E. Shannon (1948). A mathematical theory of communication. *Bell System Technical Journal*, vol. 27, pp. 379-423 and 623-656.
- Jiawei Han and Micheline Kamber (2000). *Data Mining: Concepts and Techniques*. San Francisco: Morgan Kaufmann.
- Dorian Pyle (1999). *Data Preparation for Data Mining*. San Francisco: Morgan Kaufmann.
- Roberto Togneri and Christopher J. deSilva (2003). *Fundamentals of Information Theory and Coding Design*. Chapman & Hall/CRC.
- Thomas M. Cover and Joy A. Thomas (1991). *Elements of Information Theory*. John Wiley and Sons.
- Paul DuBois (2002). *MySQL Cookbook*. O'Reilly.
- J. Ross Quinlan (1993). *C4.5: Programs for Machine Learning*. San Francisco: Morgan Kaufmann.

*Paul Meagher is a cognitive scientist whose graduate studies focused on mathematical problem solving.*

Return to ONLamp.com.