

Calculating Entropy for Data Miners

by Paul Meagher
03/24/2005

Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package.

Calculating Entropy for Datamining discussed the Entropy.php class, which gives web database programmers quick access to an entropic analysis of a single database column. This article develops the concepts of joint and conditional entropy, implements the corresponding classes--JointEntropy.php and ConditionalEntropy.php--and shows how these classes give web database programmers quick access to an entropic analysis of the relationship between two or more database columns considered at a time.

The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.

AllElectronics Database

In the acclaimed book Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber illustrate many datamining concepts using the fictitious AllElectronics database. A version of the AllElectronics database also appeared in Ross Quinlan's datamining research. This database ably illustrates the information theoretic concepts that follow. Here is a Customers table derived from that database:

id  age     income  student  credit_rating  buys_computer
1   <=30    high    no       fair           no
2   <=30    high    no       excellent      no
3   31..40  high    no       fair           yes
4   >40     medium  no       fair           yes
5   >40     low     yes      fair           yes
6   >40     low     yes      excellent      no
7   31..40  low     yes      excellent      yes
8   <=30    medium  no       fair           no
9   <=30    low     yes      fair           yes
10  >40     medium  yes      fair           yes
11  <=30    medium  yes      excellent      yes
12  31..40  medium  no       excellent      yes
13  31..40  high    yes      fair           yes
14  >40     medium  no       excellent      no

I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.

Joint Entropy

The formula for computing the joint entropy between two database columns looks like this (here, as throughout this article, the logarithm is base 2, so entropy is measured in bits):

H(X,Y) = -Σ_i:n Σ_j:m p(x_i, y_j) * log(p(x_i, y_j))

where n and m are the number of distinct values in the X and Y columns.

This formula is similar in many respects to the formula for computing the entropy of a single database column.

H(X) = -Σ_i:n p(x_i) * log(p(x_i))

The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.
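Before turning to the class-based implementation, here is a minimal standalone sketch (not part of the IT Package) of the double summation, applied to joint probabilities obtained by counting rows in the Customers table above:

<?php
/**
* Minimal standalone sketch (not part of the IT Package): joint entropy
* of a two-dimensional probability array, showing the double summation.
*/
function joint_entropy($joint_probs) {
  $bits = 0;
  foreach ($joint_probs as $row) {    // outer sum over the values of X
    foreach ($row as $p) {            // inner sum over the values of Y
      if ($p > 0)                     // p * log(p) is taken to be 0 when p = 0
        $bits -= $p * log($p, 2);
    }
  }
  return $bits;
}

// Joint probabilities for (age, buys_computer), counted from the Customers table.
$probs = array(
  '<=30'   => array('no' => 3/14, 'yes' => 2/14),
  '31..40' => array('no' => 0,    'yes' => 4/14),
  '>40'    => array('no' => 2/14, 'yes' => 3/14),
);

printf("%01.3f", joint_entropy($probs)); // prints 2.271
?>

The class-based version developed below produces the same value directly from the database.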

To see what a joint probability distribution looks like, use the JointEntropy.php class from the information theory package (IT Package) that the last article began to assemble. Below is a PHP script that applies the JointEntropy.php class to two columns (age, buys_computer) in the AllElectronics.Customers table. It outputs a joint frequency table, a joint probability table, and the summary joint entropy score.

<?php
/**
* @package IT
*/

require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";

/**
* Example of how to use JointEntropy class.   
*/

$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>

 <i>Joint frequency table.</i>

<?php
$je->showJointFrequency("%d");
?>

<br />

<i>Joint probability table.</i>

<?php
$je->showJointProbability("%.5f");
?>
<br />

<i>Joint entropy in bits is <?php
printf("%01.3f", $je->bits) ?>.</i>

The output produced by this script looks like:

                    buys_computer
                    no    yes   Σi+
  age   <=30         3     2     5
        31..40       0     4     4
        >40          2     3     5
        Σ+j          5     9    14

Joint frequency table

                    buys_computer
                    no        yes       Σi+
  age   <=30        0.21429   0.14286   0.35714
        31..40      0.00000   0.28571   0.28571
        >40         0.14286   0.21429   0.35714
        Σ+j         0.35714   0.64286   1.00000

Joint probability table

Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data. Note that the cell with probability 0 (age = 31..40, buys_computer = no) contributes nothing, because p * log(p) is taken to be 0 when p is 0, which is why only five terms appear below:

H(X,Y) = -1 * [ (0.21429 * log(0.21429)) + 
                (0.14286 * log(0.14286)) + 
                (0.28571 * log(0.28571)) + 
                (0.14286 * log(0.14286)) + 
                (0.21429 * log(0.21429)) ]

H(X,Y) = -1 * [ -0.476226947429 + 
                -0.401050703151 + 
                -0.516387120588 + 
                -0.476226947429 + 
                -0.401050703151 ]

H(X,Y) = -1 * -2.270942421748

H(X,Y) = 2.271

Joint entropy in bits is 2.271.

The calculation of the joint entropy score is the same as the calculation of the univariate entropy score. In both cases, it takes a vector of probabilities summing to 1 and applies the Shannon formula to that probability vector. The interpretation of the score is a bit different, however, because the probabilities no longer encode a univariate distribution of signals--that is, P(buys_computer=no), P(buys_computer=yes)--but a bivariate distribution of signals--for example, P(age=<=30, buys_computer=no), P(age=<=30, buys_computer=yes), and so on. The joint entropy score thus signifies the minimum number of questions you need to ask, on average, to identify a signal coming from this bivariate distribution of signals.

Joint Entropy Code

The JointEntropy.php class below extends the Output.php class, as does the ConditionalEntropy.php class presented later. The Output.php class currently contains methods--including showJointFrequency(), showJointProbability(), and showConditionalProbability()--that render joint and conditional entropy object data as HTML tables. Feel free to override or extend the Output.php class with other tabular or graphical methods. The JointEntropy.php class contains only the "business logic" of the computation: the analyze() method populates a joint entropy object that the Output.php class accesses for rendering purposes.

<?php
/**
* @package IT
*/
require_once "Output.php";
/**
* Computes the joint entropy between two columns.
*/
class JointEntropy extends Output {

  var $n = 0;

  var $columns = array();

  var $row_freqs = array();
  var $col_freqs = array();

  var $row_probs = array();
  var $col_probs = array();
 
  var $row_labels = array();
  var $col_labels = array();
 
  var $joint_freqs = array();
  var $joint_probs = array();

  var $bits = 0;

  var $data = array();

  var $table  = "";
  var $select = "";
  var $where  = "";

  /* Methods for handling database table input */

  function setTable($table) {
    $this->table = $table;
  }

  function setSelect($sql) {
    $this->select = $sql;
  }

  function setWhere($sql) {
    $this->where = " WHERE ".$sql;
  }

  function getSQL() {
    if (empty($this->select)) {
      $sql = " SELECT ".$this->columns[0].",".$this->columns[1];
      $sql .= " FROM $this->table ";
      if (empty($this->where))
        return $sql;
      else
        return $sql . $this->where;
    } else
      return $this->select;
  }

  function getFrequenciesFromTable() {
    global $db;
    $sql    = $this->getSQL();
    $result = $db->query($sql);
    if (DB::isError($result))
      die($result->getMessage());
    else {
      $n = 0;
      while($row = $result->fetchRow()) {
        $a = $row[$this->columns[0]];
        $b = $row[$this->columns[1]];
        $this->joint_freqs[$a][$b]++;
        $this->row_freqs[$a]++; // aka row marginals
        $this->col_freqs[$b]++; // aka col marginals
        $n++;
      }
      $this->n = $n;
    }
    return true;
  }

  /* Methods for handling array input */

  function setArray($data) {
    $this->data = $data;
  }

  function getFrequenciesFromArray() {
    $this->n = count($this->data);
    for ($i=0; $i < $this->n; $i++) {
      $a = $this->data[$i][0];
      $b = $this->data[$i][1];
      $this->joint_freqs[$a][$b]++;
      $this->row_freqs[$a]++; // aka row marginals
      $this->col_freqs[$b]++; // aka col marginals
    }
  }

  /* Shared methods */

  function setColumns($columns) {
    $parts = explode(",",$columns);
    $this->columns[0] = trim($parts[0]);
    $this->columns[1] = trim($parts[1]);
  }

  function clear() {
    $this->n           = 0;
    $this->row_freqs   = array();
    $this->col_freqs   = array();
    $this->row_probs   = array();
    $this->col_probs   = array();
    $this->row_labels  = array();
    $this->col_labels  = array();
    $this->joint_freqs = array();
    $this->joint_probs = array();
    $this->bits        = 0;
  }

  function analyze() {
    $this->clear();
    if (empty($this->table))
      $this->getFrequenciesFromArray();
    else
      $this->getFrequenciesFromTable();

    $this->row_labels = array_keys($this->row_freqs);
    $this->col_labels = array_keys($this->col_freqs);
    $this->getProbabilities();
    $this->getJointEntropyScore();
  }

  function getProbabilities() {
    foreach($this->joint_freqs AS $key1=>$array) {
      foreach($array AS $key2=>$val2) {
        $this->joint_probs[$key1][$key2] =
		    $this->joint_freqs[$key1][$key2] / $this->n;
        $this->row_probs[$key1] += $this->joint_probs[$key1][$key2];
        $this->col_probs[$key2] += $this->joint_probs[$key1][$key2];
      }
    }
  }

  function getJointEntropyScore() {
    foreach($this->joint_probs AS $key1=>$array)
      foreach($array AS $key2=>$val2)
        $this->bits -= $this->joint_probs[$key1][$key2] *
		               log($this->joint_probs[$key1][$key2], 2);
  }

}
?>

There are two accepted forms of input for this class (a usage sketch for the second form follows this list):

  1. From a database table:

    getFrequenciesFromTable()
  2. From a passed-in two-dimensional array:

    getFrequenciesFromArray()
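For the second form, a usage sketch might look like the following (the data pairs are hard-coded here purely for illustration; in practice they could come from any PHP data source):

<?php
require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";

// Each element is an (age, buys_computer) pair, one per customer.
$pairs = array(
  array('<=30', 'no'),    array('<=30', 'no'),    array('31..40', 'yes'),
  array('>40', 'yes'),    array('>40', 'yes'),    array('>40', 'no'),
  array('31..40', 'yes'), array('<=30', 'no'),    array('<=30', 'yes'),
  array('>40', 'yes'),    array('<=30', 'yes'),   array('31..40', 'yes'),
  array('31..40', 'yes'), array('>40', 'no'),
);

$je = new JointEntropy;
$je->setArray($pairs);  // no setTable() call, so analyze() reads the array
$je->analyze();

printf("Joint entropy in bits is %01.3f", $je->bits); // 2.271
?>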

Independence

Beyond knowing the minimal number of questions needed to identify a signal from our bivariate distribution of signals, is there another use for the joint entropy score? One is to compare the joint entropy score H(X,Y) with the sum of the marginal entropies H(X) + H(Y), computed from the marginals of the joint probability table, to determine the degree of independence between two random variables; for example, age and buys_computer. If H(X,Y) is approximately equal to H(X) + H(Y), you can conclude that the two variables are close to independent of each other--knowing the value of one variable does not enlighten you about the value of the other. Here's a script to see whether age and buys_computer are independent.

<?php
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";

$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();

$age = $e->bits;
$e->setColumn("buys_computer");
$e->analyze();
$buy = $e->bits;

echo "H(age) = $age<br />";
echo "H(buys_computer) = $buy<br />";
echo "H(age) + H(buys_computer)= ".($age + $buy)."<br />";

$je = new JointEntropy;
$je->setTable('Customers');
$je->setColumns("age, buys_computer");
$je->analyze();

echo "H(age, buys_computer) = ". $je->bits;
?>

The output of this script is:

H(age) = 1.57740628285
H(buys_computer) = 0.940285958671
H(age) + H(buys_computer)= 2.51769224152
H(age, buys_computer) = 2.27094242175

From these results, you conclude that H(age, buys_computer) < H(age) + H(buys_computer), which means that age and buys_computer are not totally independent variables (although the dependence does not seem too strong either). In general:

H(X,Y) <= H(X) + H(Y), with equality if and only if X and Y are independent.

One of the most important reasons to care whether two variables are independent is that data reduction is a critical aspect of any datamining analysis. One important way to reduce the data to a manageable size is to eliminate input variables that are independent of the output variable about which you want to reduce your uncertainty. If you find that a joint entropy score is nearly equal to the sum of the corresponding marginal entropies, you can conclude that the two variables are nearly independent and should consider eliminating the input variable from your analysis.
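As a rough sketch of such a screening step (the loop and variable names here are hypothetical, but the class calls are those shown above), you could compare H(X) + H(Y) against H(X,Y) for each candidate column paired with the target column:

<?php
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";

$target = "buys_computer";

$e = new Entropy;
$e->setTable("Customers");
$e->setColumn($target);
$e->analyze();
$h_target = $e->bits;

foreach (array('age', 'income', 'student', 'credit_rating') as $column) {
  $e->setColumn($column);
  $e->analyze();
  $h_column = $e->bits;

  $je = new JointEntropy;
  $je->setTable("Customers");
  $je->setColumns("$column, $target");
  $je->analyze();

  // The gap H(X) + H(Y) - H(X,Y) is 0 for independent columns.
  $gap = ($h_column + $h_target) - $je->bits;
  printf("%s: H(X)+H(Y) = %.4f, H(X,Y) = %.4f, gap = %.4f<br />",
         $column, $h_column + $h_target, $je->bits, $gap);
}
?>

Columns whose gap is close to 0 carry little or no information about buys_computer and are candidates for elimination.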

Conditional Probability

The next formula to discuss and implement is the conditional entropy formula. Before discussing this formula, however, you must first understand how to compute a conditional probability from a joint probability table. The concrete formula for computing a conditional probability looks like this.

P(Y=y | X=x) = P(Y=y, X=x) / P(X=x)

An instantiated formula for computing the conditional probability that customers will buy a computer given that they are 30 or younger looks like this:

P(buys_computer = yes | age = <=30) = P(buys_computer = yes, age = <=30) / P(age = <=30)

To compute P(buys_computer = yes, age = <=30), look up the joint probability cell with these specific row and column settings (that is, 0.14286). To compute P(age = <=30), look up the row marginal where age = <=30 (0.35714). In a nutshell, computing a conditional probability means dividing a joint probability by a marginal probability (0.14286/0.35714 = 0.4).

Conditional Entropy

The first step in computing the overall conditional entropy is to compute the specific conditional entropies using this formula:

H(X | Y = y) = -Σ_i:n P(X = x_i | Y = y) * log(P(X = x_i | Y = y))

Plug the specific conditional entropy formula H(X | Y = y) into the conditional entropy formula below to compute the amount of uncertainty remaining about X after Y has been observed:

H(X | Y) = Σ_j:m P(Y = y_j) * H(X | Y = y_j)

(There is no leading minus sign here; the minus sign is already built into each specific conditional entropy term.)

The specific conditional entropy formula computes the amount of uncertainty remaining after conditioning on one particular value of the signal distribution, whereas the overall conditional entropy is the weighted average of these specific conditional entropies: the sum, over all conditioning values, of each value's probability times its specific conditional entropy.
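To make the two-step computation concrete before looking at the class, here is a minimal standalone sketch (not part of the IT Package) that starts from the joint probabilities for (age, buys_computer) and applies both formulas:

<?php
// Joint probabilities P(age, buys_computer), as in the joint probability table.
$joint = array(
  '<=30'   => array('no' => 3/14, 'yes' => 2/14),
  '31..40' => array('no' => 0,    'yes' => 4/14),
  '>40'    => array('no' => 2/14, 'yes' => 3/14),
);

$h_cond = 0;
foreach ($joint as $age => $row) {
  $p_age      = array_sum($row);      // marginal P(age = a)
  $h_specific = 0;                    // specific conditional entropy H(B | A = a)
  foreach ($row as $p_joint) {
    if ($p_joint > 0) {
      $p_cond      = $p_joint / $p_age;   // P(b | a) = P(a, b) / P(a)
      $h_specific -= $p_cond * log($p_cond, 2);
    }
  }
  printf("H(buys_computer | age = %s) = %.4f<br />", $age, $h_specific);
  $h_cond += $p_age * $h_specific;    // weight by P(age = a) and accumulate
}
printf("H(buys_computer | age) = %.4f", $h_cond); // 0.6935
?>

The next section shows how the ConditionalEntropy.php class produces the same tables and summary score directly from the database.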

Calculating Conditional Entropy

Now on to some code for computing the conditional entropy. The code below conditions buying on age ($ce->setConditional('buys_computer|age')) and outputs a joint probability table and a conditional entropy table.

<?php
/**
* @package IT
*/
require_once "../config.php";

require_once PHPMATH."/IT/ConditionalEntropy.php";

/**
* Example of how to use the ConditionalEntropy class.   
*/
$ce = new ConditionalEntropy;
$ce->setTable('Customers');
$ce->setConditional('buys_computer|age');

$ce->analyze();
?>

<i>Joint probability table.</i>

<?php
$ce->showJointProbability("%.5f");
?>

<br />

<i>Conditional entropy table.</i>

<?php
$ce->showConditionalEntropy();

?>

This code outputs the following tables:

                    buys_computer
                    no        yes       Σi+
  age   <=30        0.21429   0.14286   0.35714
        31..40      0.00000   0.28571   0.28571
        >40         0.14286   0.21429   0.35714
        Σ+j         0.35714   0.64286   1.00000

Joint probability table

                            P(B | A = a_i)
  A_i     P(A = a_i)        no     yes    H(B | A = a_i)    P(A = a_i) * H(B | A = a_i)
  <=30    0.357142857143    0.6    0.4    0.970950594455    0.346768069448
  31..40  0.285714285714    0      1      0                 0
  >40     0.357142857143    0.4    0.6    0.970950594455    0.346768069448

  Σ_i=1..3 P(A = a_i) * H(B | A = a_i) = 0.693536138896

Conditional entropy table

The first table appears again so that you can see how the second, third, and fourth columns in the second table were derived from it. The second column simply reproduces the row marginal from the first table. The third and fourth columns are conditional probabilities calculated by dividing a joint probability by a row marginal (for instance, 0.14286/0.35714 = 0.4). The fifth column calculates the specific conditional entropy for each age range. For example:

H(buys_computer | age = <=30) = -1 * [ 0.6 * log(0.6) + 0.4 * log(0.4) ]
                              = 0.970950594455

The specific conditional entropy column can be worth examining in some detail, because low values tell you that there is an uncertainty-reducing relationship between particular levels of your variables. In traditional statistical analysis, such minute relationships may not be theoretically interesting; in datamining contexts, however, you might find it useful to know that 31- to 40-year-old customers tend to purchase computers at your store. Of course, there is not enough data in this data set to draw any firm conclusions.

The specific entropy value in the fifth column is then multiplied by the corresponding probability in the second column to obtain the values in the sixth column. The values in the sixth column are summed to give the overall conditional entropy score reported in the bottom-right cell.

Conditional Entropy Code

The source code for the conditional entropy class is:


<?php
/**
* @package IT
*/

require_once "Output.php";

/**
* Computes the conditional entropy in bits between
* two data columns and retains intermediate values
* that are useful for generating output reports.
*
* @author Paul Meagher, Datavore Productions
* @version 0.5
* @modified 22/02/2005
*/

class ConditionalEntropy extends Output {

  var $n = 0;

  var $columns = array();
  var $labels  = array(); // display labels for the antecedent and consequent columns
    
  var $row_freqs = array();
  var $col_freqs = array();
  
  var $row_probs = array();
  var $col_probs = array();
  
  var $row_labels = array();
  var $col_labels = array();   
  
  var $joint_freqs = array();
  var $joint_probs = array();

  var $cond_probs  = array();    

  var $bits = 0;

  var $data = array();

  var $table  = "";
  var $select = "";     
  var $where  = "";   

  /* Methods for handling database table input */
          
  function setTable($table) {
    $this->table = $table;
  }

  function setSelect($sql) {
    $this->select = $sql;
  }

  function setWhere($sql) {
    $this->where = " WHERE ".$sql;
  }

  function getSQL() {
    if (empty($this->select)) {
      $sql  = " SELECT ".$this->columns[0].",".$this->columns[1];
      $sql .= " FROM $this->table ";
      if (empty($this->where))
        return $sql;
      else
        return $sql . $this->where;        
    } else
      return $this->select;      
  }
     
  function getJointAndMarginalFrequenciesFromTable() {
    global $db;    
    $sql    = $this->getSQL();
    $result = $db->query($sql);    
    if (DB::isError($result))
      die($result->getMessage());
    else {      
      $n = 0;
      while($row = $result->fetchRow()) {        
        $a = $row[$this->columns[0]];
        $b = $row[$this->columns[1]];
        $this->joint_freqs[$a][$b]++;
        $this->row_freqs[$a]++; // aka row marginals
        $this->col_freqs[$b]++; // aka col marginals
        $n++;
      }
      $this->n = $n;
    }  
    return true;
  }

  /* Methods for handling array input */
      
  function setArray($data) {
    $this->data = $data;
  }

  function getJointAndMarginalFrequenciesFromArray() {
    $this->n = count($this->data);        
    for ($i=0; $i < $this->n; $i++) {
      $a = $this->data[$i][0];
      $b = $this->data[$i][1];
      $this->joint_freqs[$a][$b]++;
      $this->row_freqs[$a]++; // aka row marginals
      $this->col_freqs[$b]++; // aka col marginals
    }  
  }

  /* Shared methods */

  function setAntecedent($column, $label="") {
    $this->columns[0] = $column;
    $this->labels[0]  = $label;    
  }

  function setConsequent($column, $label="") {
    $this->columns[1] = $column;
    $this->labels[1] = $label;    
  }

  function setConditional($conditional, $labels="") {
    $cond_pieces          = explode("|", $conditional);          

    $this->columns[0]  = trim($cond_pieces[1]);
    $this->columns[1]  = trim($cond_pieces[0]);    
    if ($labels != "") {
      $label_pieces       = explode("|", $labels);          
      $this->labels[0] = trim($label_pieces[1]);
      $this->labels[1] = trim($label_pieces[0]);          
    } else {
      // Default labels: the first character of each column name (e.g., "a" for age).
      $this->labels[0] = $this->columns[0]{0};
      $this->labels[1] = $this->columns[1]{0};      
    }
  }

  function clear() {   
    $this->n           = 0;    
    $this->row_freqs   = array();
    $this->col_freqs   = array();
    $this->row_probs   = array();
    $this->col_probs   = array();
    $this->row_labels  = array();
    $this->col_labels  = array();   
    $this->joint_freqs = array();
    $this->joint_probs = array();
    $this->cond_probs  = array();  
    $this->bits        = 0;        
  }

  function analyze() {
    $this->clear();
    if (empty($this->table))
      $this->getJointAndMarginalFrequenciesFromArray();
    else
      $this->getJointAndMarginalFrequenciesFromTable();     
    $this->row_labels = array_keys($this->row_freqs);
    $this->col_labels = array_keys($this->col_freqs);    
    $this->getJointAndMarginalProbabilities();
    $this->getConditionalProbabilities();      
    $this->getConditionalEntropyScore();    
  }       

  function getJointAndMarginalProbabilities() {
    foreach($this->joint_freqs AS $key1=>$array) {
      foreach($array AS $key2=>$val2) {
        $this->joint_probs[$key1][$key2] = $this->joint_freqs[$key1][$key2] / 
		   $this->n;
        $this->row_probs[$key1] += $this->joint_probs[$key1][$key2];
        $this->col_probs[$key2] += $this->joint_probs[$key1][$key2];
      }
    } 
  }
  
  function getConditionalProbabilities() {
    foreach($this->joint_probs AS $key1=>$array)
      foreach($array AS $key2=>$val2)         
        $this->cond_probs[$key1][$key2] = $this->joint_probs[$key1][$key2] /
                                             $this->row_probs[$key1];
  }

  function loadJointProbabilities($joint_probs) {
    $this->clear();
    $this->joint_probs = $joint_probs;
    foreach($joint_probs AS $key1=>$array) {
      foreach($array AS $key2=>$val2) {
        $this->row_probs[$key1] += $joint_probs[$key1][$key2];
        $this->col_probs[$key2] += $joint_probs[$key1][$key2];        
      }
    }
    $this->getConditionalProbabilities();      
    $this->getConditionalEntropyScore();
  }

  function getConditionalEntropyScore() {
    // Iterate over the joint probabilities (rather than the frequencies) so the
    // method also works when probabilities are loaded via loadJointProbabilities().
    foreach($this->joint_probs AS $key1=>$array)
      foreach($array AS $key2=>$val2)
        if ($this->joint_probs[$key1][$key2] > 0)
          $this->bits -= $this->joint_probs[$key1][$key2] *
                            log($this->cond_probs[$key1][$key2], 2);
  }
}
?>

Mutual Information

The formula for computing mutual information uses formulas encountered earlier:

I(X;Y) = H(X) - H(X|Y)

The following PHP script computes the mutual information between age and buys_computer:


<?php
/**
* @package IT
*
* Example of how to compute mutual info.
*/
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
include_once PHPMATH."/IT/ConditionalEntropy.php";

$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->analyze();
echo "H(buys_computer) = ". $e->bits ."<br />";

$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConditional("buys_computer|age");
$ce->analyze();
echo "H(buys_computer | age) = ". $ce->bits ."<br />";

$mutual_info = $e->bits - $ce->bits;
echo "I(age;buys_computer) = H(buys_computer) - H(buys_computer | age) = 
   $mutual_info <br />";
?>

The output of this script looks like this:

H(buys_computer) = 0.940285958671
H(buys_computer | age) = 0.693536138896
I(age;buys_computer) = H(buys_computer) - H(buys_computer | age) = 0.246749819774

If the conditional entropy score is very low (that is, close to 0), the effect is a relatively high mutual information score. A high mutual information score indicates that you can use one variable to reduce your uncertainty about the other variable. Conversely, a low mutual information score indicates that you cannot use one variable to reduce your uncertainty about the values of the other random variable. Here, the score of about 0.25 bits is modest compared with H(buys_computer) = 0.94 bits: a signal transferred from the age column is not well preserved in the buys_computer column. In signal theory terms, the "channel" between them is full of distortion.

The term mutual information derives from the fact that you will obtain the same mutual information score no matter which of the two variables you use as the conditioning variable in the equation. Instead of conditioning buys_computer on age, you can condition age on buys_computer and get the same result:

H(age) = 1.57740628285
H(age | buys_computer) = 1.33065646308
I(age;buys_computer) = H(age) - H(age | buys_computer) = 0.246749819774

A few other fundamental relationships are worth pointing out:

H(A, B) = H(A) + H(B) - I(A;B)
H(A, B) = H(A) + H(B|A)
H(A, B) = H(B) + H(A|B)
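You can check the first of these against the numbers already reported:

H(age) + H(buys_computer) - I(age;buys_computer) = 1.57740628285 + 0.940285958671 - 0.246749819774 = 2.270942421747

which matches the joint entropy H(age, buys_computer) = 2.27094242175 computed earlier, up to rounding in the last digit.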

Information Gain

What I have called the mutual information score, Han and Kamber refer to as the Gain score. While the formulas for computing this score appear different, the mutual information score (I(age; buys_computer) = 0.246) is the same value as their gain score (Gain(age) = 0.246).

In the datamining literature, the term gain often appears instead of mutual information, probably because the standard formulas differ and the gain score is sometimes "corrected" to make it more useful for datamining. Be careful, however, not to correct the computation too much, or you will lose the powerful consistency properties of the information theory calculus.

The term gain is more suggestive of why you might want to compute the mutual information score in a datamining context, and I will use it from here on in. According to Han and Kamber, "Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A." Below is some code that computes the gain score for each of the columns in the AllElectronics.Customers table to determine which column most reduces the entropy in the buys_computer column.

<?php
/**
* @package IT
*/
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";

/**
* Example of how to compute gain scores for several columns.
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->analyze();

$ce = new ConditionalEntropy;
$ce->setTable('Customers');
$ce->setConsequent('buys_computer');

$columns = array('age','income','student','credit_rating');

foreach($columns as $column) {
  $ce->setAntecedent($column);
  $ce->analyze();
  $gain = $e->bits - $ce->bits;
  echo "gain($column): $gain<br />";
}
?>

The setAntecedent() and setConsequent() methods in the ConditionalEntropy.php class exist to handle situations where you want to compute the conditional entropy score for a target column in terms of several other columns. The script outputs the following set of gain scores:

gain(age): 0.246749819774
gain(income): 0.029222565659
gain(student): 0.151835501362
gain(credit_rating): 0.0481270304083

The report shows that the age column reduces the entropy in the buys_computer column the most. You might consider developing a script that ranks the gains from highest to lowest for use in constructing decision tree classifiers, discussed in the next section.
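For example, a small variation on the loop above (hypothetical helper code using PHP's built-in arsort()) could collect the gains into an array keyed by column name and sort it in descending order:

<?php
// Assumes $e and $ce have been set up as in the previous script.
$gains = array();
foreach (array('age', 'income', 'student', 'credit_rating') as $column) {
  $ce->setAntecedent($column);
  $ce->analyze();
  $gains[$column] = $e->bits - $ce->bits;
}

arsort($gains); // highest gain first, preserving the column names as keys

foreach ($gains as $column => $gain) {
  echo "gain($column): $gain<br />";
}
?>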

Datamining Applications

Three common datamining applications of entropic analysis are data description, classifier construction, and data discretization. You can easily incorporate the entropy scores calculated previously into web-based tools that report the entropy, joint entropy, conditional entropy, and mutual information/gain scores between the various columns in the database. Use this information much as you would use means, standard deviations, and correlation coefficients: as a useful descriptive summary of your data sets. Dorian Pyle's book Data Preparation for Data Mining provides examples of such entropic analysis reports, along with a useful discussion of how to use this reported entropy information in the early "data survey" stage of data mining.


The previous article also developed several such reports; however, they reported only univariate entropy scores. With the new entropy classes and sample code discussed in this article, you should be able to develop a variety of new entropy-based reports to characterize your data columns.

A more advanced and very common application of entropic analysis to data mining is the construction of classifiers in the form of decision trees. A decision tree might be useful for determining whether an AllElectronics customer is likely to buy a computer. Such a tree would first ask the age of the customer. If the age were 30 or younger, it might then ask whether the customer is a student, and after that, his or her income range. Eventually it would reach a terminal node in the decision tree that classifies the instance according to whether the customer is likely to buy a computer.

It is beyond the scope of this article to go deeply into the theory of constructing decision trees, except to note that constructing such trees relies heavily on the recursive computation of gain scores. The age attribute forms the root of the decision tree because it has the highest information gain score. The procedure then takes the partitioned customers (in the ranges <=30, 31..40, and >40) and computes the gain scores again using the remaining attributes; within each partition, the remaining attribute with the highest gain score becomes the next test in the decision tree. This process repeats until there is very little gain to be had by adding more attribute tests. See Quinlan's book and the CART literature for more details on how to use gain scores to construct decision trees.
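To sketch a single step of that recursion, you could restrict both objects to the age = '<=30' partition with a WHERE clause and recompute the gains over the remaining attributes. (This assumes that the Entropy.php class from the previous article also offers a setWhere() method, as the two classes developed in this article do; if it does not, the entropy of the partition would have to be computed another way.)

<?php
require_once "../config.php";
require_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/ConditionalEntropy.php";

// Entropy of buys_computer within the age = '<=30' partition.
// Assumes Entropy.php supports setWhere(), like its sibling classes.
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("buys_computer");
$e->setWhere("age = '<=30'");
$e->analyze();

// Gain of each remaining attribute within the same partition.
$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConsequent("buys_computer");
$ce->setWhere("age = '<=30'");

foreach (array('income', 'student', 'credit_rating') as $column) {
  $ce->setAntecedent($column);
  $ce->analyze();
  $gain = $e->bits - $ce->bits;
  echo "gain($column | age = '<=30'): $gain<br />";
}
?>

On this data, student should come out with the highest gain within the <=30 partition, which is consistent with the decision tree sketched above that asks about student status after age.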

In addition to data description and classifier construction, you can also use entropic analysis to discretize your data. For example, if customer ages were recorded in years, you might want to find appropriate bins by first ordering the age data from smallest to largest, selecting a number and size of bins, and computing the entropy score of the data under that binning. You would then compare this with other bin numbers and sizes to find the binning strategy that produces the lowest entropy score. A more sophisticated and useful form of discretization would incorporate information about the target class into the selection of bins--perhaps selecting bins so as to minimize a joint entropy, conditional entropy, or mutual information score.
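Here is a minimal standalone sketch of the simpler procedure (the ages and helper functions are hypothetical, not part of the IT Package), comparing the entropy scores produced by different numbers of equal-width bins:

<?php
/**
* Standalone sketch (hypothetical ages and helper functions, not part of the
* IT Package): compare entropy scores for different numbers of equal-width bins.
*/
function entropy_of_counts($counts, $n) {
  $bits = 0;
  foreach ($counts as $count) {
    if ($count > 0) {
      $p = $count / $n;
      $bits -= $p * log($p, 2);
    }
  }
  return $bits;
}

function bin_entropy($values, $num_bins) {
  sort($values);
  $min    = $values[0];
  $width  = ($values[count($values) - 1] - $min) / $num_bins;
  $counts = array_fill(0, $num_bins, 0);
  foreach ($values as $v) {
    $bin = ($width > 0) ? min($num_bins - 1, (int) floor(($v - $min) / $width)) : 0;
    $counts[$bin]++;
  }
  return entropy_of_counts($counts, count($values));
}

// Hypothetical customer ages recorded in years.
$ages = array(23, 25, 27, 29, 29, 33, 35, 36, 38, 41, 42, 45, 48, 52);

foreach (array(2, 3, 4, 5) as $num_bins) {
  printf("%d bins: %.4f bits<br />", $num_bins, bin_entropy($ages, $num_bins));
}
?>

You would then keep the binning with the lowest entropy score, or, as suggested above, extend the criterion to a joint or conditional entropy score involving the target class.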

In summary, you can use entropic analysis in datamining to prepare your data (converting continuous quantities into discrete quantities), to describe your data (producing entropic analysis reports), and to make inferences about new sample data (applying your buys_computer decision tree to a new customer). You can apply entropic analysis to many aspects and stages of the datamining process.

Consistency Proof

There are many ways to structure the code base of an information theory package. In this series I have proposed a class structure that reserves a class file for each basic logical element of information theory: Entropy.php, JointEntropy.php, and ConditionalEntropy.php.

By preserving the logical structure of the area in the code base class structure, it's possible to construct proofs that the logical house is in order:


<?php
/**
* @package IT
*/
include_once "../config.php";
include_once PHPMATH."/IT/Entropy.php";
require_once PHPMATH."/IT/JointEntropy.php";
include_once PHPMATH."/IT/ConditionalEntropy.php";

/**
* Proof that H(Y,X) = H(Y|X) + H(X)
*/
$e = new Entropy;
$e->setTable("Customers");
$e->setColumn("age");
$e->analyze();
echo "H(age) = ". $e->bits ."<br />";

$ce = new ConditionalEntropy;
$ce->setTable("Customers");
$ce->setConditional("buys_computer|age");
$ce->analyze();
echo "H(buys_computer|age) = ". $ce->bits ."<br />";

$joint_entropy = $ce->bits + $e->bits;
echo "H(buys_computer|age) + H(age) =
$joint_entropy<br />";

$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns(array("age","buys_computer"));
$je->analyze();
echo "H(age; buys_computer) = ". $je->bits ."<br />";

echo "Therefore, H(buys_computer;age) = H(buys_computer|age) + H(age) <br />";
echo "Or, more generally, H(Y;X) = H(Y|X) + H(X)";
?>

Point your browser at the consistency_proof.php script to see the following result:

H(age) = 1.57740628285
H(buys_computer|age) = 0.693536138896
H(buys_computer|age) + H(age) = 2.27094242175
H(age, buys_computer) = 2.27094242175
Therefore, H(buys_computer, age) = H(buys_computer|age) + H(age) 
Or, more generally, H(Y,X) = H(Y|X) + H(X)

In addition to using the IT Package to engage in datamining, you might also want to use it to explore and confirm the mathematical structure of information theory.

Future Technical Work

I spent quite a bit of time designing the set of column-setting methods used by the IT Package classes--setColumn(), setColumns(), setConditional(), setAntecedent(), and setConsequent()--with a view toward their use in multicolumn contexts.

For example, when computing the joint entropy between columns, you might specify more than two columns to use:

$je->setColumns("age,buys_computer,student");

Similarly, when computing a conditional entropy score, you might specify more than one column in the antecedent or consequent part of the conditional expression:

$ce->setConditional('buys_computer|age,student');

Future work on the IT Package might involve improving the joint and conditional entropy classes so that they can evaluate multicolumn queries in a useful manner. One way to evaluate a three-column mutual information query is by arithmetic over the outcomes generated by entropy, joint entropy, and conditional entropy objects.

I(A;B,C) = {H(A) - H(A|B)} + {H(A|B) - H(A|B,C)}

The last term in the three-column mutual information query requires implementing a conditional entropy object with three-column support. There are likely more efficient ways to compute queries with four or more columns.

Another useful class to add to the package is one that computes a relative entropy score (perhaps RelativeEntropy.php), which is useful for model-fitting applications. The relative entropy score is also known as the Kullback-Leibler distance. The formula for computing this score looks like:

D(p || q) = Σ_i:n p(x_i) * log(p(x_i) / q(x_i))
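A standalone function sketch of this score (a hypothetical helper, not the proposed RelativeEntropy.php class) might look like:

<?php
/**
* Standalone sketch (a hypothetical helper, not the proposed RelativeEntropy.php
* class): Kullback-Leibler distance D(p || q) in bits between two probability
* arrays indexed by the same outcome labels.
*/
function relative_entropy($p, $q) {
  $bits = 0;
  foreach ($p as $outcome => $p_i) {
    if ($p_i > 0) // terms with p(x_i) = 0 contribute nothing
      $bits += $p_i * log($p_i / $q[$outcome], 2);
  }
  return $bits;
}

// Observed distribution of buys_computer versus an assumed 50/50 model.
$p = array('no' => 5/14, 'yes' => 9/14);
$q = array('no' => 0.5,  'yes' => 0.5);

echo relative_entropy($p, $q); // 0 only when the two distributions are identical
?>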

Finally, I would like to extend these information theory concepts to handling continuous random variables. This involves incorporating a probability distributions library into the calculations. Because I have already coded a PDL Package that supplies such a library, this is a feasible technical direction to move in as well.

Paul Meagher is a cognitive scientist whose graduate studies focused on mathematical problem solving.

