Calculating Entropy for Data Mining

by Paul Meagher
01/06/2005

Information theory (IT) is a foundational subject for computer scientists, engineers, statisticians, data miners, biologists, and cognitive scientists. Unfortunately, PHP currently lacks software tools that would make it easy to explore and/or use IT concepts and methods to solve data analytic problems. This two-part series aims to remedy this situation by:

  1. Introducing you to foundational information theory concepts.
  2. Implementing these foundational concepts as classes using PHP and SQL.
  3. Using these classes to mine web data.

This introduction will focus on forging theoretical and practical connections between information theory and database theory. An appreciation of these linkages opens up the possibility of using information theory concepts as a foundation for the design of data mining tools. We will take the first steps down that path.

Univariate and Bivariate Entropy

The central concept of this series is entropy.

The goal of this article is to explore the descriptive uses of entropy in the context of summarizing web access log data. You will learn how to compute entropy for a single database column of values (i.e., univariate entropy) and what the resulting entropy score means. The goal is to obtain a practical appreciation for what entropy measures in order to tackle inference problems that arise in more complex bivariate and multivariate contexts.

Discrete Random Variables

A typical database provides an abundance of discrete random variables:

  • A column of first names appearing in a Members table.
  • A column of product IDs appearing in a Transactions table.
  • A column of folder IDs appearing in a Files table.
  • A column of IP addresses appearing in a Webstats table.
  • A column of web page IDs appearing in a Webstats table.

It's useful to regard the categorical, rank-ordered, and small-integer data that appears in database columns as values arising from a collection of discrete random variables. Why treat these columns as discrete random variables? Because the distribution of values in each column can be modeled with a discrete random number generator (e.g., binomial, Poisson, multinomial) parameterized with the appropriate probability for each type of outcome.
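
As a rough illustration of this idea, here is a minimal sketch (not part of the article's toolkit) that simulates a column of page IDs by sampling from a multinomial-style distribution; the sample_outcome() function and the $probs values are hypothetical, chosen only for demonstration:

<?php
// Hypothetical sketch: simulate a column of page IDs by sampling
// from a discrete (multinomial-style) probability distribution.
function sample_outcome($probs) {
  $r   = mt_rand() / mt_getrandmax();  // uniform random number in [0, 1]
  $cum = 0.0;
  foreach ($probs as $outcome => $p) {
    $cum += $p;
    if ($r <= $cum)
      return $outcome;
  }
  return $outcome;  // guard against floating point rounding
}

// Assumed probabilities for each page ID (must sum to 1).
$probs = array("HomePage"     => 0.4,
               "NewsPage"     => 0.2,
               "ProductsPage" => 0.2,
               "ServicesPage" => 0.13,
               "AboutPage"    => 0.07);

for ($i = 0; $i < 10; $i++)
  echo sample_outcome($probs) . "\n";
?>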

Mathematical Definition of Information

Information theorists often equate the concept of a discrete random variable with the concept of a noisy signal. Given this equivalence, it's possible to regard a database column as a noisy signal source that emits discrete signal states.

The mathematical definition of the information content of a signal state is the negative logarithm of the signal state probability:

I(p(s)) = -log(p(s))

Recall that probability values range from 0 to 1. Taking the logarithm of a number in this range results in a negative value (log2(0.5) = -1). Multiplying the log value by -1 ultimately yields only positive values for our information measure. Here's some PHP code to study the behavior of the information function:

<?php
// Compute and display I(p) = -log2(p) for p = 0.0, 0.1, ..., 1.0.
for ($p = 0.0; $p <= 1.0; $p += 0.1) {
  $i = -log($p, 2);
  echo "$p\t$i\n";
}
?>

With some HTML embellishment to format the $p and $i values, this code produces the following table:

p(s)    I(p(s))
0       1.#INF
0.1     3.32192809489
0.2     2.32192809489
0.3     1.73696559417
0.4     1.32192809489
0.5     1
0.6     0.736965594166
0.7     0.51457317283
0.8     0.321928094887
0.9     0.152003093445
1       1.60171325191E-016

The amount of information associated with a p(s)=0 signal is infinite. Conversely, the amount of information associated with a p(s)=1 signal is 0 (although PHP does not represent it exactly this way). These values make sense when you regard information as measuring the unexpectedness of a signal: unexpected signals (those with a low probability of occurrence) have a higher information content than expected signals. The information formula thus captures a key property of the concept of information; namely, its inverse relationship to the probability, or expectedness, of the message.

Note that the input to the information function is a signal probability (p(heads)=.5) and not a signal value (heads). The information function, like the entropy function, outputs a score that is defined over the probability distribution of signals.

Entropy Formula

The information function allows us to compute the amount of information (measured in bits) conveyed by a particular signal state. Often, it's more interesting to compute the amount of information contained in the complete distribution of signal states (for example, a database column of values). Claude Shannon showed how to compute this overall information score by:

  1. Multiplying the information scores for each signal state by a weighting term equal to the probability of the signal (Ii = p(si) * log(p(si))).
  2. Summing these information scores (H = Σi:n Ii).
  3. Multiplying the result by -1 to make the overall value positive (H = -1 * H).

Combining these operations into one formula produces the formula for computing the entropy (H) of a discrete random variable:

H(P) = -Σi:n p(si) * log(p(si))

The entropy formula takes as input the probability distribution of the signals (denoted with a capital P) and returns an entropy score. The entropy score gives a lower bound on the average number of bits required to represent each value in the data stream (here, each database column value).
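
Before wrapping this formula in a class, it may help to see it as a standalone function. The following is a minimal sketch, assuming the probability distribution is already available as a plain PHP array; the entropy_of_distribution() name is ours, not part of the class developed below:

<?php
// Minimal sketch: compute H(P) = -sum p(s) * log2(p(s)) for a
// probability distribution supplied as an array of probabilities.
function entropy_of_distribution($probs) {
  $h = 0.0;
  foreach ($probs as $p) {
    if ($p > 0)                    // skip p = 0 terms (0 * log 0 = 0)
      $h += $p * log($p, 2);
  }
  return -1.0 * $h;
}

// A fair coin has 1 bit of entropy; a biased coin has less.
echo entropy_of_distribution(array(0.5, 0.5)) . "\n";   // 1
echo entropy_of_distribution(array(0.9, 0.1)) . "\n";   // ~0.469
?>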

Reducing Uncertainty

You can also think of the entropy score as reflecting the average number of binary questions you need to ask in order to identify any signal in your signal distribution. Consider how many yes/no questions you would need to ask, on average, to identify a letter arising from a stream of equiprobable letters. If you count a space as a "letter," the answer is log2(27), or about 4.75 bits per letter. In practice, this means it takes roughly five binary questions to guess the identity of a letter. The more binary questions you need to ask, the greater your uncertainty about what the next signal state will be.
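
A quick check of this figure in PHP (the second line recomputes it via the entropy formula with 27 equiprobable signals):

<?php
// 26 letters plus the space character: 27 equiprobable signals.
echo log(27, 2) . "\n";                    // ~4.75 bits

// Same figure via the entropy formula with p(s) = 1/27 for all signals.
echo -27 * (1/27) * log(1/27, 2) . "\n";   // ~4.75 bits
?>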

Assuming that each outcome is equiprobable puts an upper bound on the possible entropy of a stream of outcomes (see maximum entropy). The actual entropy of well-formed English text passages will be lower than 4.75 bits because there is structure in the signal distribution, as indicated by the unequal signal probabilities for real English text streams. It is common in natural language processing to compute bigram and trigram probability distributions for text streams and feed these distributions into the entropy function to see if there are sequential dependencies in a text stream. Such an exercise would also be useful in the context of visitor clickstream analysis, where there is a strong reason to believe sequential dependencies exist.

An entropic analysis of clickstream data might involve concatenating visitor clickstreams together and then summarizing the resulting data in terms of three different probability distributions:

  1. Single-page distribution: H(p(si)).
  2. Two-page or bigram distribution: H(p(si, sj)).
  3. Three-page or trigram distribution: H(p(si, sj, sk)).

We could feed these three probability distributions into our entropy function and observe the resulting entropy scores. Our goal would be to see if we can reduce our uncertainty about the clickstream distribution by finding model-based ways to lower entropy scores. Markov process models are particularly useful in this regard.
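
As a rough sketch of what the bigram case might look like, the following code (our own illustration, reusing the clickstream tokens from the example later in this article and the Entropy class developed in the next section) builds two-page tokens from adjacent page visits and compares per-symbol entropies:

<?php
// Hypothetical sketch: compare single-page and two-page (bigram)
// entropy for a concatenated clickstream.
require_once "Entropy.php";

$clickstream = array("h","n","p","s","h","n","h","a","h","p","h","s","h","n","p");

// Build bigram tokens such as "h:n", "n:p", ... from adjacent pages.
$bigrams = array();
for ($i = 0; $i < count($clickstream) - 1; $i++)
  $bigrams[] = $clickstream[$i] . ":" . $clickstream[$i+1];

$e1 = new Entropy($clickstream);
$e2 = new Entropy($bigrams);

echo "Single-page entropy: " . $e1->bits . " bits\n";
echo "Bigram entropy:      " . $e2->bits . " bits\n";
echo "Bigram entropy per page: " . ($e2->bits / 2) . " bits\n";
?>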

Implementing Entropy

We can implement the entropy formula as a PHP class. This initial class stores intermediate results that a production version would not need to keep, but they are useful for demonstration: reporting them (see the next section) is an aid to understanding the computational nature of the entropy calculation.

<?php

/**
* Computes the entropy of an array of tokens (aka symbols).
*/
 class Entropy {

  var $tokens      = array();
  var $num_events  = 0;
  var $token_freqs = array();
  var $token_probs = array();   
  var $num_tokens  = 0;
  var $bits        = 0.0;
  var $maxent      = 0.0;   
  var $ratio       = 0.0;     

  function Entropy($tokens) {
    $this->tokens      = $tokens;
    $this->num_events  = count($this->tokens);
    $this->token_freqs = $this->getTokenFrequencies();
    $this->num_tokens  = count($this->token_freqs);

    $entropy = 0.0;
    foreach ($this->token_freqs as $token => $freq) {
      $this->token_probs[$token]  = $freq / $this->num_events;
      $entropy += $this->token_probs[$token] * log($this->token_probs[$token], 2);
    }

    $this->bits   = -1.0 * $entropy;
    $this->maxent = log($this->num_tokens, 2);     
    $this->ratio  = $this->bits / $this->maxent;
  }

  function getTokenFrequencies() {

    $token_freqs = array();
    for ($i = 0; $i < $this->num_events; $i++) {
      $token = $this->tokens[$i];
      if (isset($token_freqs[$token]))
        $token_freqs[$token]++;
      else
        $token_freqs[$token] = 1;
    }

    return $token_freqs;
  }

}
?>

Web Page Entropy

We can illustrate the usage of the Entropy class by measuring the entropy of the page identifier column (self) of a typical Webstats database table:

<?php
/**
* Measures the entropy for a few hypothetical values from a page_id column
* of a typical Webstats table.  Uses the following abbreviations:
* h=HomePage, n=NewsPage, p=ProductsPage, s=ServicesPage, a=AboutPage.
*/
require_once "../Entropy.php";

$tokens = array("h","n","p","s","h","n","h","a","h","p","h","s","h","n","p");
$e      = new Entropy($tokens);
?>

<pre>
<?php print_r($e)?>
</pre>

<p>
The Entropy of the source measured in bits is: <?php echo $e->bits?>
</p>

This code outputs a human-readable depiction of the instantiated entropy object, along with the computed entropy of the supplied clickstream data. Note in particular the need to compute the token frequencies in order to calculate the token probabilities required for the entropy formula. In order to make this code scalable, we will look for efficiencies in how we compute these intermediate token frequencies.

entropy Object
(
    [tokens] => Array
        (
            [0] => h
            [1] => n
            [2] => p
            [3] => s
            [4] => h
            [5] => n
            [6] => h
            [7] => a
            [8] => h
            [9] => p
            [10] => h
            [11] => s
            [12] => h
            [13] => n
            [14] => p
        )

    [num_events] => 15
    [token_freqs] => Array
        (
            [h] => 6
            [n] => 3
            [p] => 3
            [s] => 2
            [a] => 1
        )

    [token_probs] => Array
        (
            [h] => 0.4
            [n] => 0.2
            [p] => 0.2
            [s] => 0.13333333333333
            [a] => 0.066666666666667
        )

    [num_tokens] => 5
    [bits] => 2.1055872616983
    [maxent] => 2.3219280948874
)

The entropy of the source measured in bits is 2.1055872616983.

An entropy score of 2.1 bits tells us that, on average, we will require 2.1 bits per page identifier to optimally encode the clickstream. It is also the lower bound on the average number of binary questions we need to ask to identify a given signal.
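
As a preview of one such efficiency (recall the earlier remark about computing the intermediate token frequencies), the counting can be pushed into the database itself. The following is a hedged sketch, assuming a hypothetical webstats table with a page_id column and an already-open MySQL connection:

<?php
// Hypothetical sketch: let the database compute token frequencies
// with GROUP BY, then derive probabilities and entropy in PHP.
// Table name (webstats) and column name (page_id) are assumptions;
// connection setup is omitted.
$sql    = "SELECT page_id, COUNT(*) AS freq FROM webstats GROUP BY page_id";
$result = mysql_query($sql);

$token_freqs = array();
$num_events  = 0;
while ($row = mysql_fetch_assoc($result)) {
  $token_freqs[$row["page_id"]] = $row["freq"];
  $num_events += $row["freq"];
}

$entropy = 0.0;
foreach ($token_freqs as $token => $freq) {
  $p        = $freq / $num_events;
  $entropy += $p * log($p, 2);
}
$bits = -1.0 * $entropy;

echo "Entropy of the page_id column: $bits bits\n";
?>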
