Calculating Entropy for Data Miners
by Paul Meagher
Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package.
Entropy for Datamining discussed the Entropy.php class, which provides web database programmers with quick access to an entropic analysis of a single database column. This article discusses the concepts of joint and conditional entropy by developing the corresponding classes--JointEntropy and ConditionalEntropy--and illustrating how these classes give web database programmers quick access to an entropic analysis of the relationship between two or more database columns considered at the same time.
The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.
In the acclaimed book Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber illustrate many datamining concepts using the fictitious AllElectronics database. A version of the AllElectronics database also appeared in Ross Quinlan's datamining research. This database ably illustrates the information theoretic concepts that follow. The examples below use a Customers table derived from that database, with columns that include age and buys_computer.
I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.
The formula for computing the joint entropy between two database columns looks like this:
H(X,Y) = -Σi:n Σj:m p(xi,yj) * log(p(xi,yj))
This formula is similar in many respects to the formula for computing the entropy of a single database column.
H(X) = -Σi:n p(xi) * log(p(xi))
The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.
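To make the double summation concrete, here is a minimal standalone sketch (not part of the IT Package; the joint_entropy function name is my own) that applies the formula to a two-dimensional array of joint probabilities, using base-2 logarithms so the result comes out in bits. It is seeded with the age x buys_computer probabilities analyzed later in this article.

<?php

/**
 * Compute the joint entropy, in bits, of a joint probability
 * distribution given as a 2D array of cell probabilities.
 * Cells with zero probability contribute nothing to the sum.
 */
function joint_entropy($probs) {
    $h = 0.0;
    foreach ($probs as $row) {            // outer summation (over i)
        foreach ($row as $p) {            // inner summation (over j)
            if ($p > 0) {
                $h -= $p * log($p, 2);    // -p(xi,yj) * log2(p(xi,yj))
            }
        }
    }
    return $h;
}

// Joint probabilities for age x buys_computer (14 customer records).
$probs = array(
    array(3/14, 2/14),   // age <= 30:  no, yes
    array(0/14, 4/14),   // age 31..40: no, yes
    array(2/14, 3/14),   // age > 40:   no, yes
);

printf("%.3f bits\n", joint_entropy($probs));   // prints "2.271 bits"

The IT Package's JointEntropy class derives these cell probabilities from a database table instead of taking them as a hard-coded array, but the double loop mirrors the two summation signs in the formula exactly.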
To see what a joint probability distribution looks like, use the JointEntropy.php class from the information theory package (IT Package) that the last article began to assemble. Below is a PHP script that applies the JointEntropy.php class to two columns (age and buys_computer) in the Customers table. It outputs a joint frequency table, a joint probability table, and the summary joint entropy score.
<?php
/**
 * @package IT
 */
require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";

/**
 * Example of how to use JointEntropy class.
 */
$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>

<i>Joint frequency table.</i>
<?php $je->showJointFrequency("%d"); ?>
<br />
<i>Joint probability table.</i>
<?php $je->showJointProbability("%.5f"); ?>
<br />
<i>Joint entropy in bits is <?php printf("%01.3f", $je->bits) ?>.</i>
The output produced by this script looks like:
Joint frequency table:

              buys_computer
  age         no     yes
  <=30         3       2
  31..40       0       4
  >40          2       3

Joint probability table:

              buys_computer
  age         no         yes
  <=30        0.21429    0.14286
  31..40      0.00000    0.28571
  >40         0.14286    0.21429
Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data:
H(X,Y) = -1 * [ (0.21429 * log(0.21429)) + (0.14286 * log(0.14286))
              + (0.28571 * log(0.28571)) + (0.14286 * log(0.14286))
              + (0.21429 * log(0.21429)) ]

H(X,Y) = -1 * [ -0.476226947429 + -0.401050703151 + -0.516387120588
              + -0.401050703151 + -0.476226947429 ]

H(X,Y) = -1 * -2.270942421748

H(X,Y) = 2.271
Joint entropy in bits is 2.271.
The calculation of the joint entropy score is the same as the calculation of
the univariate entropy score. In both cases, the procedure takes a vector of probabilities
summing to 1 and applies the Shannon formula to that vector. The
interpretation of the score is a bit different insofar as the
probabilities encode not a univariate distribution of signals, such as
P(buys_computer=yes), but a
bivariate distribution of signals, such as
P(age<=30, buys_computer=yes), and so on. The joint entropy score thus signifies the minimum number of questions
you need to ask, on average, to identify a signal coming from this bivariate
distribution of signals.
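One further check on this interpretation is to compare the joint entropy to the sum of the two univariate entropies. For independent columns, H(X,Y) would equal H(X) + H(Y); when the joint entropy falls short of that sum, the columns carry information about each other, which is the relationship the ConditionalEntropy class builds on. Here is a hedged sketch (the entropy helper is illustrative, not the IT Package's Entropy.php class) using the marginal probabilities implied by the tables above:

<?php

/** Shannon entropy, in bits, of a probability vector summing to 1. */
function entropy($probs) {
    $h = 0.0;
    foreach ($probs as $p) {
        if ($p > 0) {
            $h -= $p * log($p, 2);
        }
    }
    return $h;
}

// Marginal distributions from the 14 Customers records.
$h_age  = entropy(array(5/14, 4/14, 5/14));  // age: <=30, 31..40, >40
$h_buys = entropy(array(5/14, 9/14));        // buys_computer: no, yes
$h_joint = 2.271;                            // joint entropy computed above

printf("H(age) = %.3f bits\n", $h_age);                  // prints "H(age) = 1.577 bits"
printf("H(buys_computer) = %.3f bits\n", $h_buys);       // prints "H(buys_computer) = 0.940 bits"
printf("Sum of marginals = %.3f bits\n", $h_age + $h_buys);

Since 1.577 + 0.940 = 2.518 bits exceeds the joint entropy of 2.271 bits, age and buys_computer are not independent: knowing a customer's age reduces the uncertainty about whether that customer buys a computer.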