Calculating Entropy for Data Miners
by Paul Meagher03/24/2005
Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package:
Entropy.phpJointEntropy.phpConditionalEntropy.php
Calculating
Entropy for Datamining discussed the Entropy.php class, which
provides web database programmers with quick access to an entropic analysis of
a single database column. This article discusses the concepts of joint and
conditional entropy through developing the corresponding classes--JointEntropy.php, ConditionalEntropy--and illustrating how
these classes allow web database programmers quick access to an entropic
analysis of the relationship between two or more database columns considered at
a time.
The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.
AllElectronics Database
In the acclaimed book Data Mining: Concepts and
Techniques, Jiawei Han and Micheline Kamber illustrate many datamining
concepts using the fictitious AllElectronics database. A version
of the AllElectronics database also appeared in Ross Quinlan's
datamining research. This database ably illustrates the information theoretic
concepts that follow. Here is a Customers table derived from that
database:
| id | age | income | student | credit_rating | buys_computer |
|---|---|---|---|---|---|
| 1 | <=30 | high | no | fair | no |
| 2 | <=30 | high | no | excellent | no |
| 3 | 31..40 | high | no | fair | yes |
| 4 | >40 | medium | no | fair | yes |
| 5 | >40 | low | yes | fair | yes |
| 6 | >40 | low | yes | excellent | no |
| 7 | 31..40 | low | yes | excellent | yes |
| 8 | <=30 | medium | no | fair | no |
| 9 | <=30 | low | yes | fair | yes |
| 10 | >40 | medium | yes | fair | yes |
| 11 | <=30 | medium | yes | excellent | yes |
| 12 | 31..40 | medium | no | excellent | yes |
| 13 | 31..40 | high | yes | fair | yes |
| 14 | >40 | medium | no | excellent | no |
I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.
Joint Entropy
The formula for computing the joint entropy between two database columns looks like this:
H(X,Y) = -Σi:nΣj:n p(xi,yj) * log(p(xi,yj))
This formula is similar in many respects to the formula for computing the entropy of a single database column.
H(X) = -Σi:n p(xi) * log(p(xi))
The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.
To see what a joint probability distribution looks like, use the
JointEntropy.php class from the information theory package (IT
Package) that the last article began to assemble. Below is a PHP script that
applies the JointEntropy.php class to two columns (age,
buys_computer) in the AllElectronics.Customers table.
It outputs a joint frequency table, a joint probability table, and the summary
joint entropy score.
<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";
/**
* Example of how to use JointEntropy class.
*/
$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>
<i>Joint frequency table.</i>
<?php
$je->showJointFrequency("%d");
?>
<br />
<i>Joint probability table.</i>
<?php
$je->showJointProbability("%.5f");
?>
<br />
<i>Joint entropy in bits is <?php
printf("%01.3f", $je->bits) ?>.</i>
The output produced by this script looks like:
| buys_computer | ||||
| no | yes | Σi+ | ||
|---|---|---|---|---|
| age | <=30 | 3 | 2 | 5 |
| 31..40 | 0 | 4 | 4 | |
| >40 | 2 | 3 | 5 | |
| Σ+j | 5 | 9 | 14 | |
Joint frequency table
| buys_computer | ||||
| no | yes | Σi+ | ||
|---|---|---|---|---|
| age | <=30 | 0.21429 | 0.14286 | 0.35714 |
| 31..40 | 0.00000 | 0.28571 | 0.28571 | |
| >40 | 0.14286 | 0.21429 | 0.35714 | |
| Σ+j | 0.35714 | 0.64286 | 1 | |
Joint probability table
Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data:
H(X,Y) = -1 * [ (0.21429 * log(0.21429)) +
(0.14286 * log(0.14286)) +
(0.28571 * log(0.28571)) +
(0.14286 * log(0.14286)) +
(0.21429 * log(0.21429)) ]
H(X,Y) = -1 * [ -0.476226947429 +
-0.401050703151 +
-0.516387120588 +
-0.476226947429 +
-0.401050703151 ]
H(X,Y) = -1 * -2.270942421748
H(X,Y) = 2.271
Joint entropy in bits is 2.271.
The calculation of the joint entropy score is the same as the calculation of
the univariate entropy score. In both cases, it takes a vector of probabilities
summing to 1 and applies the Shannon formula to the probability vector. The
interpretation of what the score means is a bit different insofar as what the
probabilities encode is not a univariate distribution of signals. That is,
P(buys_computer=no), P(buys_computer=yes), but a
bivariate distribution of signals; for example, P(age=<30,
buys_computer=no), P(age=<30, buys_computer=yes), and so on. The joint entropy score thus signifies the minimum number of questions
you need to ask, on average, to identify a signal coming from this bivariate
distribution of signals.



