Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package:
Entropy.phpJointEntropy.phpConditionalEntropy.phpCalculating
Entropy for Datamining discussed the Entropy.php class, which
provides web database programmers with quick access to an entropic analysis of
a single database column. This article discusses the concepts of joint and
conditional entropy through developing the corresponding classes--JointEntropy.php, ConditionalEntropy--and illustrating how
these classes allow web database programmers quick access to an entropic
analysis of the relationship between two or more database columns considered at
a time.
The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.
AllElectronics DatabaseIn the acclaimed book Data Mining: Concepts and
Techniques, Jiawei Han and Micheline Kamber illustrate many datamining
concepts using the fictitious AllElectronics database. A version
of the AllElectronics database also appeared in Ross Quinlan's
datamining research. This database ably illustrates the information theoretic
concepts that follow. Here is a Customers table derived from that
database:
| id | age | income | student | credit_rating | buys_computer |
|---|---|---|---|---|---|
| 1 | <=30 | high | no | fair | no |
| 2 | <=30 | high | no | excellent | no |
| 3 | 31..40 | high | no | fair | yes |
| 4 | >40 | medium | no | fair | yes |
| 5 | >40 | low | yes | fair | yes |
| 6 | >40 | low | yes | excellent | no |
| 7 | 31..40 | low | yes | excellent | yes |
| 8 | <=30 | medium | no | fair | no |
| 9 | <=30 | low | yes | fair | yes |
| 10 | >40 | medium | yes | fair | yes |
| 11 | <=30 | medium | yes | excellent | yes |
| 12 | 31..40 | medium | no | excellent | yes |
| 13 | 31..40 | high | yes | fair | yes |
| 14 | >40 | medium | no | excellent | no |
I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.
The formula for computing the joint entropy between two database columns looks like this:
H(X,Y) = -Σi:nΣj:n p(xi,yj) * log(p(xi,yj))
This formula is similar in many respects to the formula for computing the entropy of a single database column.
H(X) = -Σi:n p(xi) * log(p(xi))
The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.
To see what a joint probability distribution looks like, use the
JointEntropy.php class from the information theory package (IT
Package) that the last article began to assemble. Below is a PHP script that
applies the JointEntropy.php class to two columns (age,
buys_computer) in the AllElectronics.Customers table.
It outputs a joint frequency table, a joint probability table, and the summary
joint entropy score.
<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";
/**
* Example of how to use JointEntropy class.
*/
$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>
<i>Joint frequency table.</i>
<?php
$je->showJointFrequency("%d");
?>
<br />
<i>Joint probability table.</i>
<?php
$je->showJointProbability("%.5f");
?>
<br />
<i>Joint entropy in bits is <?php
printf("%01.3f", $je->bits) ?>.</i>
The output produced by this script looks like:
| buys_computer | ||||
| no | yes | Σi+ | ||
|---|---|---|---|---|
| age | <=30 | 3 | 2 | 5 |
| 31..40 | 0 | 4 | 4 | |
| >40 | 2 | 3 | 5 | |
| Σ+j | 5 | 9 | 14 | |
Joint frequency table
| buys_computer | ||||
| no | yes | Σi+ | ||
|---|---|---|---|---|
| age | <=30 | 0.21429 | 0.14286 | 0.35714 |
| 31..40 | 0.00000 | 0.28571 | 0.28571 | |
| >40 | 0.14286 | 0.21429 | 0.35714 | |
| Σ+j | 0.35714 | 0.64286 | 1 | |
Joint probability table
Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data:
H(X,Y) = -1 * [ (0.21429 * log(0.21429)) +
(0.14286 * log(0.14286)) +
(0.28571 * log(0.28571)) +
(0.14286 * log(0.14286)) +
(0.21429 * log(0.21429)) ]
H(X,Y) = -1 * [ -0.476226947429 +
-0.401050703151 +
-0.516387120588 +
-0.476226947429 +
-0.401050703151 ]
H(X,Y) = -1 * -2.270942421748
H(X,Y) = 2.271
Joint entropy in bits is 2.271.
The calculation of the joint entropy score is the same as the calculation of
the univariate entropy score. In both cases, it takes a vector of probabilities
summing to 1 and applies the Shannon formula to the probability vector. The
interpretation of what the score means is a bit different insofar as what the
probabilities encode is not a univariate distribution of signals. That is,
P(buys_computer=no), P(buys_computer=yes), but a
bivariate distribution of signals; for example, P(age=<30,
buys_computer=no), P(age=<30, buys_computer=yes), and so on. The joint entropy score thus signifies the minimum number of questions
you need to ask, on average, to identify a signal coming from this bivariate
distribution of signals.
|
The JointEntropy.php class below extends the
Output.php class. The ConditionalEntropy.php class
also extends the Output.php class. The Output.php
class currently contains methods including showJointFrequency(),
showJointProbability(), and showConditionalProbability()
to render joint and conditional entropy object data in the form of HTML tables.
Feel free to overwrite or extend the Output.php class with other
tabular or graphical methods. The JointEntropy.php class contains only the "business logic" of the computation. The analyze()
method creates a joint entropy object that the
Output.php class accesses for rendering purposes.
<?php
/**
* @package IT
*/
require_once "Output.php";
/**
* Computes the joint entropy between two columns.
*/
class JointEntropy extends Output {
var $n = 0;
var $columns = array();
var $row_freqs = array();
var $col_freqs = array();
var $row_probs = array();
var $col_probs = array();
var $row_labels = array();
var $col_labels = array();
var $joint_freqs = array();
var $joint_probs = array();
var $bits = 0;
var $data = array();
var $table = "";
var $select = "";
var $where = "";
/* Methods for handling database table input */
function setTable($table) {
$this->table = $table;
}
function setSelect($sql) {
$this->select = $sql;
}
function setWhere($sql) {
$this->where = " WHERE ".$sql;
}
function getSQL() {
if (empty($this->select)) {
$sql = " SELECT ".$this->columns[0].",".$this->columns[1];
$sql .= " FROM $this->table ";
if (empty($this->where))
return $sql;
else
return $sql . $this->where;
} else
return $this->select;
}
function getFrequenciesFromTable() {
global $db;
$sql = $this->getSQL();
$result = $db->query($sql);
if (DB::isError($result))
die($result->getMessage());
else {
$n = 0;
while($row = $result->fetchRow()) {
$a = $row[$this->columns[0]];
$b = $row[$this->columns[1]];
$this->joint_freqs[$a][$b]++;
$this->row_freqs[$a]++; // aka row marginals
$this->col_freqs[$b]++; // aka col marginals
$n++;
}
$this->n = $n;
}
return true;
}
/* Methods for handling array input */
function setArray($data) {
$this->data = $data;
}
function getFrequenciesFromArray() {
$this->n = count($this->data);
for ($i=0; $i < $this->n; $i++) {
$a = $this->data[$i][0];
$b = $this->data[$i][1];
$this->joint_freqs[$a][$b]++;
$this->row_freqs[$a]++; // aka row marginals
$this->col_freqs[$b]++; // aka col marginals
}
}
/* Shared methods */
function setColumns($columns) {
$parts = explode(",",$columns);
$this->columns[0] = trim($parts[0]);
$this->columns[1] = trim($parts[1]);
}
function clear() {
$this->n = 0;
$this->row_freqs = array();
$this->col_freqs = array();
$this->row_probs = array();
$this->col_probs = array();
$this->row_labels = array();
$this->col_labels = array();
$this->joint_freqs = array();
$this->joint_probs = array();
$this->bits = 0;
}
function analyze() {
$this->clear();
if (empty($this->table))
$this->getFrequenciesFromArray();
else
$this->getFrequenciesFromTable();
$this->row_labels = array_keys($this->row_freqs);
$this->col_labels = array_keys($this->col_freqs);
$this->getProbabilities();
$this->getJointEntropyScore();
}
function getProbabilities() {
foreach($this->joint_freqs AS $key1=>$array) {
foreach($array AS $key2=>$val2) {
$this->joint_probs[$key1][$key2] =
$this->joint_freqs[$key1][$key2] / $this->n;
$this->row_probs[$key1] += $this->joint_probs[$key1][$key2];
$this->col_probs[$key2] += $this->joint_probs[$key1][$key2];
}
}
}
function getJointEntropyScore() {
foreach($this->joint_probs AS $key1=>$array)
foreach($array AS $key2=>$val2)
$this->bits -= $this->joint_probs[$key1][$key2] *
log($this->joint_probs[$key1][$key2], 2);
}
}
?>
There are two forms of accepted input for this script:
From a database table:
getFrequenciesFromTable()From a passed-in two-dimensional array:
getFrequenciesFromArray()
|
Beyond knowing the minimal number of questions needed to identify a signal
from our bivariate distribution of signals, is there another use for the joint
entropy score? Another use is to compare the joint entropy score
H(X,Y) with the sum of the marginal entropies H(X) +
H(Y) in the probability distribution table to determine the degree of
independence between two random variables; for example, age and
buys_computer. If H(X,Y) is approximately equal to
H(X) + H(Y), then you can conclude that the two variables are
independent of each other--knowing the value of one variable does not
enlighten you about the value of the other variable. Here's a script to see whether
age and buys_computer are independent.
<?php
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();
$age = $e->bits;
$e->setColumn("buys_computer");
$e->analyze();
$buy = $e->bits;
echo "H(age) = $age<br />";
echo "H(buys_computer) = $buy<br />";
echo "H(age) + H(buys_computer)= ".($age + $buy)."<br />";
$je = new JointEntropy;
$je->setTable('Customers');
$je->setColumns(array('age','buys_computer'));
$je->analyze();
echo "H(age, buys_computer) = ". $je->bits;
?>
The output of this script is:
H(age) = 1.57740628285
H(buys_computer) = 0.940285958671
H(age) + H(buys_computer)= 2.51769224152
H(age, buys_computer) = 2.27094242175
From these results, you conclude that H(age, buys_computer) <
H(age) + H(buys_computer), which means that age and buys_computer are not
totally independent variables (although the dependence does not seem too strong
either). In general:
H(X,Y) <= H(X) + H(Y) with equality only if X and Y are
independent.
One of the most important reasons for being concerned about whether two variables are independent is that data reduction is a critical aspect of any datamining analysis. One important way to reduce the data to manageable size is to eliminate variables that are independent of the output variable about which you want to reduce your uncertainty. If you find that your joint probability scores are nearly equal to the additive sum of your marginal probabilities, you can conclude that our variables are independent and should consider eliminating the independent variable from your analysis.
The next formula to discuss and implement is the conditional entropy formula. Before discussing this formula, however, you must first understand how to compute a conditional probability from a joint probability table. The concrete formula for computing a conditional probability looks like this.
P(Y=y | X=x) = P(Y=y, X=x) / P(X=x)
An instantiated formula for computing the conditional probability that customers will buy a computer given that they are under 30 years old looks like this:
P(buys_computer = yes | age = <30) = P(buys_computer = yes,
age = <30) / P(age = < 30)
To compute P(buys_computer = yes, age = <30), look up the joint
probability cell with these specific row and column settings (that is,
0.14286). To compute the P(age = <30), look up the row
marginal where age = <30 (0.35714). In a nutshell,
computing a conditional probability involves dividing a joint probability by a
marginal probability (0.14286/0.35714 = 0.4).
The first step in computing the overall conditional entropy is to compute the specific conditional entropies using this formula:
H(X | Y = y) = -Σi:n P(X = x | Y = y) * log(P(X = x | Y = y))
Plug the specific conditional entropy formula H(X | Y = y) into the
conditional entropy formula below to compute the amount of uncertainty
remaining about X after Y has been observed:
H(X | Y) = -Σi:n P(Y = y) * H(X | Y =
y)
The specific conditional entropy formula computes the amount of uncertainty remaining after performing conditioning on one value of the signal distribution, whereas the conditional entropy is the amount of uncertainty remaining after summing the products of specific signal probabilities and specific conditional entropies.
Now onto some code for computing the conditional entropy. The code below
conditions buying on age ($ce->setConditional('buys_computer |
age')) and outputs a joint probability and conditional entropy
table.
<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";
/**
* Example of how to use the ConditionalEntropy class.
*/
$ce = new ConditionalEntropy;
$ce->setTable('Customers');
$ce->setConditional('buys_computer|age');
$ce->analyze();
?>
<i>Joint probability table.</i>
<?php
$ce->showJointProbability("%.5f");
?>
<br />
<i>Conditional entropy table.</i>
<?php
$ce->showConditionalEntropy();
?>
This code outputs the following tables:
| buys_computer | ||||
| no | yes | Σi+ | ||
|---|---|---|---|---|
| age | <=30 | 0.21429 | 0.14286 | 0.35714 |
| 31..40 | 0.00000 | 0.28571 | 0.28571 | |
| >40 | 0.14286 | 0.21429 | 0.35714 | |
| Σ+j | 0.35714 | 0.64286 | 1 | |
Joint probability table
| P(B | A = ai) | |||||
|---|---|---|---|---|---|
| Ai | P(A = ai) | no | yes | H(B | A = ai) | P(A = ai) * H(B | A = ai) |
| <=30 | 0.357142857143 | 0.6 | 0.4 | 0.970950594455 | 0.346768069448 |
| 31..40 | 0.285714285714 | 0 | 1 | 0 | 0 |
| >40 | 0.357142857143 | 0.4 | 0.6 | 0.970950594455 | 0.346768069448 |
| Σi=1 ... 3 P(A = ai) * H(B | A = ai) | 0.693536138896 | ||||
Conditional entropy table
The first table appears again so that you can see how the second,
third, and fourth columns in the second table were derived from it. The second
column simply reproduces the row marginal from the first table. The third and
fourth columns are conditional probabilities calculated by dividing a joint
probability by a row marginal (for instance, 0.14286/0.35714 = 0.4). The fifth column
calculates the specific conditional entropy for each age range. For
example:
H(buys_computer | age = <30) = -1 * [ 0.6 * log(0.6) + 0.4 * log(0.4) ] =
0.970950594455
The specific conditional entropy column can be useful to examine in some detail, because low values are telling you that there is an uncertainty-reducing relationship between levels of your variables. In traditional statistical analysis, such minute relationships may not be theoretically interesting; however, in datamining contexts you might find it interesting to know that 30- to 40-year-old customers tend to purchase computers at your store. Of course, there is not enough data in this data set to draw any firm conclusions.
The specific entropy value in the fifth column is then multiplied by the corresponding probability in the second column to obtain the values in the sixth column. The values in the sixth column are summed to give the overall conditional entropy score reported in the bottom-right cell.
|
The source code for the conditional entropy class is:
<?php
/**
* @package IT
*/
require_once "Output.php";
/**
* Computes the conditional entropy in bits between
* two data columns and retains intermediate values
* that are useful for generating output reports.
*
* @author Paul Meagher, Datavore Productions
* @version 0.5
* @modified 22/02/2005
*/
class ConditionalEntropy extends Output {
var $n = 0;
var $columns = array();
var $row_freqs = array();
var $col_freqs = array();
var $row_probs = array();
var $col_probs = array();
var $row_labels = array();
var $col_labels = array();
var $joint_freqs = array();
var $joint_probs = array();
var $cond_probs = array();
var $bits = 0;
var $data = array();
var $table = "";
var $select = "";
var $where = "";
/* Methods for handling database table input */
function setTable($table) {
$this->table = $table;
}
function setSelect($sql) {
$this->select = $sql;
}
function setWhere($sql) {
$this->where = " WHERE ".$sql;
}
function getSQL() {
if (empty($this->select)) {
$sql = " SELECT ".$this->columns[0].",".$this->columns[1];
$sql .= " FROM $this->table ";
if (empty($this->where))
return $sql;
else
return $sql . $this->where;
} else
return $this->select;
}
function getJointAndMarginalFrequenciesFromTable() {
global $db;
$sql = $this->getSQL();
$result = $db->query($sql);
if (DB::isError($result))
die($result->getMessage());
else {
$n = 0;
while($row = $result->fetchRow()) {
$a = $row[$this->columns[0]];
$b = $row[$this->columns[1]];
$this->joint_freqs[$a][$b]++;
$this->row_freqs[$a]++; // aka row marginals
$this->col_freqs[$b]++; // aka col marginals
$n++;
}
$this->n = $n;
}
return true;
}
/* Methods for handling array input */
function setArray($data) {
$this->data = $data;
}
function getJointAndMarginalFrequenciesFromArray() {
$this->n = count($this->data);
for ($i=0; $i < $this->n; $i++) {
$a = $this->data[$i][0];
$b = $this->data[$i][1];
$this->joint_freqs[$a][$b]++;
$this->row_freqs[$a]++; // aka row marginals
$this->col_freqs[$b]++; // aka col marginals
}
}
/* Shared methods */
function setAntecedent($column, $label="") {
$this->columns[0] = $column;
$this->labels[0] = $label;
}
function setConsequent($column, $label="") {
$this->columns[1] = $column;
$this->labels[1] = $label;
}
function setConditional($conditional, $labels="") {
$cond_pieces = explode("|", $conditional);
$this->columns[0] = trim($cond_pieces[1]);
$this->columns[1] = trim($cond_pieces[0]);
if ($labels != "") {
$label_pieces = explode("|", $labels);
$this->labels[0] = trim($label_pieces[1]);
$this->labels[1] = trim($label_pieces[0]);
} else {
$this->labels[0] = $this->columns[0]{0};
$this->labels[1] = $this->columns[1]{0};
}
}
function clear() {
$this->n = 0;
$this->row_freqs = array();
$this->col_freqs = array();
$this->row_probs = array();
$this->col_probs = array();
$this->row_labels = array();
$this->col_labels = array();
$this->joint_freqs = array();
$this->joint_probs = array();
$this->cond_probs = array();
$this->bits = 0;
}
function analyze() {
$this->clear();
if (empty($this->table))
$this->getJointAndMarginalFrequenciesFromArray();
else
$this->getJointAndMarginalFrequenciesFromTable();
$this->row_labels = array_keys($this->row_freqs);
$this->col_labels = array_keys($this->col_freqs);
$this->getJointAndMarginalProbabilities();
$this->getConditionalProbabilities();
$this->getConditionalEntropyScore();
}
function getJointAndMarginalProbabilities() {
foreach($this->joint_freqs AS $key1=>$array) {
foreach($array AS $key2=>$val2) {
$this->joint_probs[$key1][$key2] = $this->joint_freqs[$key1][$key2] /
$this->n;
$this->row_probs[$key1] += $this->joint_probs[$key1][$key2];
$this->col_probs[$key2] += $this->joint_probs[$key1][$key2];
}
}
}
function getConditionalProbabilities() {
foreach($this->joint_probs AS $key1=>$array)
foreach($array AS $key2=>$val2)
$this->cond_probs[$key1][$key2] = $this->joint_probs[$key1][$key2] /
$this->row_probs[$key1];
}
function loadJointProbabilities($joint_probs) {
$this->clear();
$this->joint_probs = $joint_probs;
foreach($joint_probs AS $key1=>$array) {
foreach($array AS $key2=>$val2) {
$this->row_probs[$key1] += $joint_probs[$key1][$key2];
$this->col_probs[$key2] += $joint_probs[$key1][$key2];
}
}
$this->getConditionalProbabilities();
$this->getConditionalEntropy();
}
function getConditionalEntropyScore() {
foreach($this->joint_freqs AS $key1=>$array)
foreach($array AS $key2=>$val2)
$this->bits -= $this->joint_probs[$key1][$key2] *
log($this->cond_probs[$key1][$key2], 2);
}
}
?>
The formula for computing mutual information uses formulas encountered earlier:
I(X;Y) = H(X) - H(X|Y)
The following PHP script computes the mutual information between
age and buys_computer:
<?php
/**
* @package IT
*
* Example of how to compute mutual info.
*/
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
include_once PHPMATH."/IT/ConditionalEntropy.php";
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->analyze();
echo "H(buys_computer) = ". $e->bits ."<br />";
$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConditional("buys_computer|age");
$ce->analyze();
echo "H(buys_computer | age) = ". $ce->bits ."<br />";
$mutual_info = $e->bits - $ce->bits;
echo "I(age;buys_computer) = H(buys_computer) - H(buys_computer | age) =
$mutual_info <br />";
?>
The output of this script looks like this:
H(buys_computer) = 0.940285958671
H(buys_computer | age) = 0.693536138896
I(age;buys_computer) = H(buys_computer) - H(buys_computer | age) = 0.246749819774
If the conditional entropy score is very low (that is, close to 0), this
will have the effect of producing a relatively high mutual information score. A
high mutual information score indicates that you can use one variable to reduce
your uncertainty about the other variable. Conversely, a low mutual information
score shows that you cannot use one variable to reduce your uncertainty about
the values of the other random variable. A signal transferred from the
age column is not well preserved in the buys_computer
column. In signal theory terms, the "channel" between them is full of
distortion.
The term mutual information derives from the fact that you will obtain the
same mutual information score no matter which of the two variables you
use as the conditioning variable in the equation. Instead of conditioning
buys_computer on age, you can condition
age on buys_computer and get the same result:
H(age) = 1.57740628285
H(age | buys_computer) = 1.33065646308
I(age;buys_computer) = H(age) - H(age | buys_computer) = 0.246749819774
A few other fundamental relationships are worth pointing out:
H(A, B) = H(A) + H(B) - I(A;B)
H(A, B) = H(A) + H(B|A)
H(A, B) = H(B) + H(A|B)
|
What I have called the mutual information score, Han and Kamber refer to as
the Gain score. While the formulas for computing this score appear different,
the mutual information score (I(age; buys_computer) = 0.246) is
the same value as their gain score (Gain(age) = 0.246).
In the datamining literature, the term gain often appears instead of mutual information, probably because the standard formulas differ and the gain score is sometimes "corrected" to make it more useful for datamining. Be careful, however, in correcting the computation too much, or else you will lose the poweful consistency properties of the information theory calculus.
The term gain is more suggestive of why you might want to compute the
mutual information score in a datamining context, and I will use it from here
on in. According Han and Kamber, "Gain(A) is the expected reduction in
entropy cause by knowing the value of attribute A." Below is some code that
computes the gain score for each of the columns in the
AllElectronics.Customers table to determine which column most
reduces the entropy in the buys_computer column.
<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";
/**
* Example of how to compute gain scores for several columns.
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->analyze();
$ce = new ConditionalEntropy;
$ce->setTable('Customers');
$ce->setConsequent('buys_computer');
$columns = array('age','income','student','credit_rating');
foreach($columns as $column) {
$ce->setAntecedent($column);
$ce->analyze();
$gain = $e->bits - $ce->bits;
echo "gain($column): $gain<br />";
}
?>
The setAntecedent() and setConsequent() methods in
the ConditionalEntropy.php class exist to handle situations where
you want to compute the conditional entropy score for a target column in terms
of several other columns. The script outputs the following set of gain
scores:
gain(age): 0.246749819774
gain(income): 0.029222565659
gain(student): 0.151835501362
gain(credit_rating): 0.0481270304083
The report shows that the age column reduces the entropy in the
buys_computer column the most. You might consider developing a
script that ranks the gains from highest to lowest for use in
constructing decision tree classifiers, discussed in the next section.
Three common datamining applications of entropic analysis are data description, classifier construction, and data discretization. You can easily incorporate the entropy scores calculated previously into web-based tools that report the entropy, joint entropy, conditional entropy, and mutual information/gain scores between the various columns in the database. Use this information much as you would use means, standard deviations, and correlation coefficients to provide a useful descriptive summary of your data sets. Dorian Pyle's book Data Preparation for Data Mining provides examples of such entropic analysis reports along with a useful discussion of how to use such reported entropy information used in the early "data survey" stage of data mining.
|
Related Reading
|
The previous article also developed several such reports; however, they reported only univariate entropy scores. With the new entropy classes and sample code discussed in this article, you should be able to develop a variety of new entropy-based reports to characterize your data columns.
A more advanced and very common application of entropic analysis to data mining is in the construction of classifiers in the form of decision trees. A decision tree might be useful for determining whether an All Electronics customer is likely to buy a computer. This decision tree would involve first asking the age of the customer. If the age were younger than 30 years old, it might then ask whether the customer is a student. After that, it might ask his income range. Eventually it would reach a terminal node in the decision tree that classifies the instance as to whether the customer is likely to buy a computer.
It is beyond the scope of this article to go too deeply into the theory of
constructing decision trees, except to note that constructing such trees uses
the recursive computation of gain scores extensively. The age
attribute forms the root of the decision tree because it has the highest
information gain score. It then takes the partitioned customers (in the ranges
of <30, 31..40, >40) and computes the gain score again using the
remaining attributes. The remaining attribute producing the highest gain score
becomes the second-level branch in the decision tree. This process repeats
until there is very little gain to be had by adding more attribute tests to the
decision tree. See Quinlan's book and the CART literature for more details on
how to use gain scores to construct decision trees.
In addition to data description and classifier construction, you can also use entropic analysis to discretize your data. For example, if customer ages were recorded in years, you might want to find appropriate bins to use by first ordering the age data from smallest to largest, selecting the number of bins and sizes to use, and then computing the entropy score for the data using this number and size of bins. Then compare this with other bin numbers and sizes to find the binning strategy that produces the lowest entropy score. A more sophisticated and useful form of discretization would incorporate information about the target class into the selection of bins to use--perhaps selecting bins so as to minimize a joint entropy, conditional entropy, or mutual information score.
In summary, you can use entropic analysis in datamining to prepare your
data (converting continuous quantities into discrete quantities), to describe
your data (producing entropic analysis reports), and to make inferences about
new sample data (applying your buys_computer decision tree to a
new customer). You can apply entropic analysis to many aspects and stages of
the datamining process.
There are many ways to structure the code base of an information theory
package. In this series I have proposed a class structure that reserves a class
file for each basic logical element of information theory:
Entropy.php, JointEntropy.php, and
ConditionalEntropy.php.
By preserving the logical structure of the area in the code base class structure, it's possible to construct proofs that the logical house is in order:
<?php
/**
* @package IT
*/
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";
include_once PHPMATH."/IT/ConditionalEntropy.php";
/**
* Proof that H(Y,X) = H(Y|X) + H(X)
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();
echo "H(age) = ". $e->bits ."<br />";
$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConditional("buys_computer|age");
$ce->analyze();
echo "H(buys_computer|age) = ". $ce->bits ."<br />";
$joint_entropy = $ce->bits + $e->bits;
echo "H(buys_computer|age) + H(age) =
$joint_entropy<br />";
$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns(array("age","buys_computer"));
$je->analyze();
echo "H(age; buys_computer) = ". $je->bits ."<br />";
echo "Therefore, H(buys_computer;age) = H(buys_computer|age) + H(age) <br />";
echo "Or, more generally, H(Y;X) = H(Y|X) + H(X)";
?>
Point your browser at the consistency_proof.php script to see
the following result:
H(age) = 1.57740628285
H(buys_computer|age) = 0.693536138896
H(buys_computer|age) + H(age) = 2.27094242175
H(age, buys_computer) = 2.27094242175
Therefore, H(buys_computer, age) = H(buys_computer|age) + H(age)
Or, more generally, H(Y,X) = H(Y|X) + H(X)
In addition to using the IT Package to engage in datamining, you might also want to use it to explore and confirm the mathematical structure of information theory.
I spent quite a bit of time designing the set of column setting methods used by the IT Package classes:
Entropy::setColumnJointEntropy::setColumnsConditionalEntropy::setAntecedentConditionalEntropy::setConsequentConditionalEntropy::setConditionalI designed them with a view toward their use in multicolumn contexts.
For example, when computing the joint entropy between columns, you might specify more than two columns to use:
$je->setColumns("age,buys_computer,student");
Similarly, when computing a conditional entropy score, you might specify more than one column in the antecedent or consequent part of the conditional expression:
$ce->setConditional('buys_computer|age,student');
Future work on the IT Package might involve improving the joint and conditional entropy classes so that they can evaluate multicolumn queries in a useful manner. One way to evaluate a three-column mutual information query is by arithmetic over the outcomes generated by entropy, joint entropy, and conditional entropy objects.
I(A;B,C) = {H(A) - H(A|B)} + {H(A|B) - H(A|B,C)}
The last term in the three-column mutual information query requires implementing a conditional entropy object with three-column support. There are likely more efficient ways to compute queries with four or more columns.
Another useful class to add to the package is one that computes a relative
entropy score (perhaps RelativeEntropy.php), which is useful for
model-fitting applications. The relative entropy score is also known as the
Kullback-Leibler distance. The formula for computing this score looks
like:
D(p || q) = Σi:n p(xi) * log (p(xi)/q(xi))
Finally, I would like to extend these information theory concepts to handling continuous random variables. This involves incorporating a probability distributions library into the calculations. Because I have already coded a PDL Package that supplies such a library, this is a feasible technical direction to move in as well.
Paul Meagher is a cognitive scientist whose graduate studies focused on mathematical problem solving.
Return to ONLamp.com.
Copyright © 2009 O'Reilly Media, Inc.