## Calculating Entropy for Data Miners

by Paul Meagher, 03/24/2005

Information theory has a powerful logical structure that I hope to preserve in the corresponding class structure of a PHP-based IT Package:

- `Entropy.php`
- `JointEntropy.php`
- `ConditionalEntropy.php`

The previous article, Calculating Entropy for Datamining, discussed the `Entropy.php` class, which gives web database programmers quick access to an entropic analysis of a single database column. This article discusses the concepts of joint and conditional entropy by developing the corresponding classes, `JointEntropy.php` and `ConditionalEntropy.php`, and illustrates how these classes give web database programmers quick access to an entropic analysis of the relationship between two or more database columns considered at a time.

The article concludes with a brief discussion of how to use entropic analysis to solve three specific datamining problems: data description, classifier construction, and data discretization.

### The `AllElectronics` Database

In the acclaimed book *Data Mining: Concepts and Techniques*, Jiawei Han and Micheline Kamber illustrate many datamining concepts using the fictitious `AllElectronics` database. A version of the `AllElectronics` database also appeared in Ross Quinlan's datamining research. This database ably illustrates the information theoretic concepts that follow. Here is a `Customers` table derived from that database:

| id | age | income | student | credit_rating | buys_computer |
|----|--------|--------|---------|---------------|---------------|
| 1  | <=30   | high   | no      | fair          | no            |
| 2  | <=30   | high   | no      | excellent     | no            |
| 3  | 31..40 | high   | no      | fair          | yes           |
| 4  | >40    | medium | no      | fair          | yes           |
| 5  | >40    | low    | yes     | fair          | yes           |
| 6  | >40    | low    | yes     | excellent     | no            |
| 7  | 31..40 | low    | yes     | excellent     | yes           |
| 8  | <=30   | medium | no      | fair          | no            |
| 9  | <=30   | low    | yes     | fair          | yes           |
| 10 | >40    | medium | yes     | fair          | yes           |
| 11 | <=30   | medium | yes     | excellent     | yes           |
| 12 | 31..40 | medium | no      | excellent     | yes           |
| 13 | 31..40 | high   | yes     | fair          | yes           |
| 14 | >40    | medium | no      | excellent     | no            |

I selected this example data set because it is both simple and realistic. Also, using this table makes it possible to compare the information gain scores generated by the information theory classes with Han and Kamber's hand-calculated information gain scores.

### Joint Entropy

The formula for computing the joint entropy between two database columns looks like this:

`H(X,Y) = -Σ_{i:n} Σ_{j:n} p(x_i, y_j) * log(p(x_i, y_j))`

This formula is similar in many respects to the formula for computing the entropy of a single database column.

`H(X) = -Σ_{i:n} p(x_i) * log(p(x_i))`

The main difference is that the joint entropy formula has two summation signs instead of one, and the probability distribution being summed over is a joint probability distribution instead of a univariate probability distribution.
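To make this concrete, here is a minimal Python sketch, independent of the PHP classes (the `entropy` function and the hard-coded cell counts are illustrative, transcribed from the `Customers` data), showing that the joint entropy formula is simply the univariate Shannon formula applied to the flattened cells of the joint probability table:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector summing to 1."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Univariate case: distribution of buys_computer over the 14 rows.
h_x = entropy([5/14, 9/14])

# Joint case: the same formula applied to the flattened joint
# distribution of (age, buys_computer); the double summation just
# iterates over every cell of the joint probability table.
joint = [3/14, 2/14,   # age <= 30
         0/14, 4/14,   # age 31..40
         2/14, 3/14]   # age > 40
h_xy = entropy(joint)

print(round(h_x, 3))   # univariate entropy in bits
print(round(h_xy, 3))  # joint entropy in bits
```

Note that cells with zero probability are skipped, following the standard convention that `0 * log(0) = 0`.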

To see what a joint probability distribution looks like, use the
`JointEntropy.php`

class from the information theory package (IT
Package) that the last article began to assemble. Below is a PHP script that
applies the `JointEntropy.php`

class to two columns (`age`

,
`buys_computer`

) in the `AllElectronics.Customers`

table.
It outputs a joint frequency table, a joint probability table, and the summary
joint entropy score.

```php
<?php
/**
 * @package IT
 */
require_once "../config.php";
require_once PHPMATH . "/IT/JointEntropy.php";

/**
 * Example of how to use the JointEntropy class.
 */
$je = new JointEntropy();
$je->setTable("Customers");
$je->setColumns("age, buys_computer");
$je->analyze();
?>
<i>Joint frequency table.</i>
<?php $je->showJointFrequency("%d"); ?>
<br />
<i>Joint probability table.</i>
<?php $je->showJointProbability("%.5f"); ?>
<br />
<i>Joint entropy in bits is <?php printf("%01.3f", $je->bits); ?>.</i>
```

The output produced by this script looks like:

| age \ buys_computer | no | yes | Σ_{i+} |
|---------------------|----|-----|--------|
| <=30                | 3  | 2   | 5      |
| 31..40              | 0  | 4   | 4      |
| >40                 | 2  | 3   | 5      |
| Σ_{+j}              | 5  | 9   | 14     |

*Joint frequency table*

| age \ buys_computer | no      | yes     | Σ_{i+}  |
|---------------------|---------|---------|---------|
| <=30                | 0.21429 | 0.14286 | 0.35714 |
| 31..40              | 0.00000 | 0.28571 | 0.28571 |
| >40                 | 0.14286 | 0.21429 | 0.35714 |
| Σ_{+j}              | 0.35714 | 0.64286 | 1       |

*Joint probability table*
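The derivation of the joint probability table is straightforward: each cell of the joint frequency table is divided by the total row count. Here is a short Python sketch of that step, independent of the PHP classes (the `counts` dictionary is just the joint frequency table transcribed by hand):

```python
# Joint frequency counts for (age, buys_computer) in Customers.
counts = {
    ("<=30",   "no"): 3, ("<=30",   "yes"): 2,
    ("31..40", "no"): 0, ("31..40", "yes"): 4,
    (">40",    "no"): 2, (">40",    "yes"): 3,
}
n = sum(counts.values())  # 14 rows in the Customers table

# Each joint probability is the cell count divided by the total.
probs = {cell: c / n for cell, c in counts.items()}
print(round(probs[("<=30", "no")], 5))  # 0.21429

# Marginals (the table's sum row and column) are row/column sums.
p_age_le30 = sum(p for (age, _), p in probs.items() if age == "<=30")
print(round(p_age_le30, 5))  # 0.35714
```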

Using the joint probability table above, it's possible to see how the joint entropy score was calculated from this data:

```
H(X,Y) = -1 * [ (0.21429 * log(0.21429)) +
(0.14286 * log(0.14286)) +
(0.28571 * log(0.28571)) +
(0.14286 * log(0.14286)) +
(0.21429 * log(0.21429)) ]
H(X,Y) = -1 * [ -0.476226947429 +
-0.401050703151 +
-0.516387120588 +
-0.476226947429 +
-0.401050703151 ]
H(X,Y) = -1 * -2.270942421748
H(X,Y) = 2.271
```

*Joint entropy in bits is 2.271.*
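The same arithmetic can be reproduced in a few lines of Python (a hedged sketch, not part of the IT Package), with the zero-probability cell excluded by the convention that `0 * log(0) = 0`:

```python
import math

# The six cells of the joint probability table; the 0.0 cell for
# (age=31..40, buys_computer=no) contributes nothing to the sum,
# so only five terms appear in the hand calculation above.
cells = [3/14, 2/14, 0/14, 4/14, 2/14, 3/14]

h_xy = -sum(p * math.log2(p) for p in cells if p > 0)
print(round(h_xy, 3))  # 2.271
```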

The calculation of the joint entropy score is the same as the calculation of the univariate entropy score: in both cases, a vector of probabilities summing to 1 is run through the Shannon formula. (The zero-probability cell contributes nothing to the sum, by the convention that 0 * log(0) = 0, which is why only five terms appear in the calculation above.) The interpretation of the score is a bit different, insofar as the probabilities no longer encode a univariate distribution of signals, such as `P(buys_computer=no)` and `P(buys_computer=yes)`, but a bivariate distribution of signals, such as `P(age<=30, buys_computer=no)`, `P(age<=30, buys_computer=yes)`, and so on. The joint entropy score thus signifies the minimum number of questions you need to ask, on average, to identify a signal coming from this bivariate distribution of signals.