AddThis Social Bookmark Button

Listen Print Discuss

Calculating Entropy for Data Miners

by Paul Meagher
03/24/2005

Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package:

  • Entropy.php
  • JointEntropy.php
  • ConditionalEntropy.php

Calculating Entropy for Datamining discussed the Entropy.php class, which provides web database programmers with quick access to an entropic analysis of a single database column. This article discusses the concepts of joint and conditional entropy through developing the corresponding classes--JointEntropy.php, ConditionalEntropy--and illustrating how these classes allow web database programmers quick access to an entropic analysis of the relationship between two or more database columns considered at a time.

The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.

AllElectronics Database

In the acclaimed book Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber illustrate many datamining concepts using the fictitious AllElectronics database. A version of the AllElectronics database also appeared in Ross Quinlan's datamining research. This database ably illustrates the information theoretic concepts that follow. Here is a Customers table derived from that database:

id age income student credit_rating buys_computer
1 <=30 high no fair no
2 <=30 high no excellent no
3 31..40 high no fair yes
4 >40 medium no fair yes
5 >40 low yes fair yes
6 >40 low yes excellent no
7 31..40 low yes excellent yes
8 <=30 medium no fair no
9 <=30 low yes fair yes
10 >40 medium yes fair yes
11 <=30 medium yes excellent yes
12 31..40 medium no excellent yes
13 31..40 high yes fair yes
14 >40 medium no excellent no

I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.

Joint Entropy

The formula for computing the joint entropy between two database columns looks like this:

H(X,Y) = -Σi:nΣj:n p(xi,yj) * log(p(xi,yj))

This formula is similar in many respects to the formula for computing the entropy of a single database column.

H(X) = -Σi:n p(xi) * log(p(xi))

The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.

To see what a joint probability distribution looks like, use the JointEntropy.php class from the information theory package (IT Package) that the last article began to assemble. Below is a PHP script that applies the JointEntropy.php class to two columns (age, buys_computer) in the AllElectronics.Customers table. It outputs a joint frequency table, a joint probability table, and the summary joint entropy score.

<?php
/**
* @package IT
*/

require_once "../config.php";
require_once PHPMATH."/IT/JointEntropy.php";

/**
* Example of how to use JointEntropy class.   
*/

$je = new JointEntropy;
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>

 <i>Joint frequency table.</i>

<?php
$je->showJointFrequency("%d");
?>

<br />

<i>Joint probability table.</i>

<?php
$je->showJointProbability("%.5f");
?>
<br />

<i>Joint entropy in bits is <?php
printf("%01.3f", $je->bits) ?>.</i>

The output produced by this script looks like:

  buys_computer  
  no yes Σi+
age <=30 3 2 5
31..40 0 4 4
>40 2 3 5
  Σ+j 5 9 14

Joint frequency table

  buys_computer  
  no yes Σi+
age <=30 0.21429 0.14286 0.35714
31..40 0.00000 0.28571 0.28571
>40 0.14286 0.21429 0.35714
  Σ+j 0.35714 0.64286 1

Joint probability table

Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data:

H(X,Y) = -1 * [ (0.21429 * log(0.21429)) + 
                (0.14286 * log(0.14286)) + 
                (0.28571 * log(0.28571)) + 
                (0.14286 * log(0.14286)) + 
                (0.21429 * log(0.21429)) ]

H(X,Y) = -1 * [ -0.476226947429 + 
                -0.401050703151 + 
                -0.516387120588 + 
                -0.476226947429 + 
                -0.401050703151 ]

H(X,Y) = -1 * -2.270942421748

H(X,Y) = 2.271

Joint entropy in bits is 2.271.

The calculation of the joint entropy score is the same as the calculation of the univariate entropy score. In both cases, it takes a vector of probabilities summing to 1 and applies the Shannon formula to the probability vector. The interpretation of what the score means is a bit different insofar as what the probabilities encode is not a univariate distribution of signals. That is, P(buys_computer=no), P(buys_computer=yes), but a bivariate distribution of signals; for example, P(age=<30, buys_computer=no), P(age=<30, buys_computer=yes), and so on. The joint entropy score thus signifies the minimum number of questions you need to ask, on average, to identify a signal coming from this bivariate distribution of signals.

Pages: 1, 2, 3, 4, 5

Next Pagearrow