

Calculating Entropy for Data Mining

by Paul Meagher
01/06/2005

Information theory (IT) is a foundational subject for computer scientists, engineers, statisticians, data miners, biologists, and cognitive scientists. Unfortunately, PHP currently lacks software tools that would make it easy to explore and/or use IT concepts and methods to solve data analytic problems. This two-part series aims to remedy this situation by:

  1. Introducing you to foundational information theory concepts.
  2. Implementing these foundational concepts as classes using PHP and SQL.
  3. Using these classes to mine web data.

This introduction will focus on forging theoretical and practical connections between information theory and database theory. An appreciation of these linkages opens up the possibility of using information theory concepts as a foundation for the design of data mining tools. We will take the first steps down that path.

Univariate and Bivariate Entropy

The central concept of this series is entropy.

The goal of this article is to explore the descriptive uses of entropy in the context of summarizing web access log data. You will learn how to compute entropy for a single database column of values (i.e., univariate entropy) and what the resulting entropy score means. The goal is to obtain a practical appreciation for what entropy measures in order to tackle inference problems that arise in more complex bivariate and multivariate contexts.

Discrete Random Variables

A typical database provides an abundance of discrete random variables.

It's useful to regard the categorical, rank-ordered, and small-integer data that appears in database columns as values arising from a collection of discrete random variables. Why regard them as discrete random variables? Because it then becomes possible to model the distribution of values in each column using a discrete random number generator (e.g., binomial, Poisson, multinomial) parameterized with the appropriate probability for each type of outcome.
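
To make this concrete, here is a minimal sketch of the idea: it simulates a column of categorical values by drawing from a parameterized discrete (categorical/multinomial) distribution. The page names and probabilities are hypothetical.

<?php
// Sketch: simulate a "page_id" column as draws from a categorical
// (single-trial multinomial) distribution. Page names and
// probabilities are hypothetical.
$probs = array("HomePage" => 0.4, "NewsPage" => 0.2,
               "ProductsPage" => 0.2, "ServicesPage" => 0.13,
               "AboutPage" => 0.07);

function sample_category($probs) {
  $r = mt_rand() / mt_getrandmax();   // uniform draw in [0, 1]
  $cumulative = 0.0;
  foreach ($probs as $category => $p) {
    $cumulative += $p;
    if ($r <= $cumulative)
      return $category;
  }
  return $category;  // guard against floating-point round-off
}

// Print ten simulated column values.
for ($i = 0; $i < 10; $i++)
  echo sample_category($probs) . "\n";
?>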

Mathematical Definition of Information

Information theorists often equate the concept of a discrete random variable with the concept of a noisy signal. Given this equivalence, it's possible to regard a database column as a noisy signal source that emits discrete signal states.

The mathematical definition of the information content of a signal state is the negative logarithm of the signal state probability:

I(p(s)) = -log(p(s))

Recall that probability values range from 0 to 1. Taking the logarithm of a number in this range results in a negative value (log2(0.5) = -1). Multiplying the log value by -1 ultimately yields only positive values for our information measure. Here's some PHP code to study the behavior of the information function:

<?php
// Information content I(p(s)) = -log2(p(s)) for probabilities 0 through 1.
for ($p = 0.0; $p <= 1.0; $p += 0.1) {
  $i = -log($p, 2);
  echo "$p\t$i\n";
}
?>

Embellishing this code with HTML table formatting to display the $p and $i values produces the following table:

p(s)    I(p(s))
0       1.#INF
0.1     3.32192809489
0.2     2.32192809489
0.3     1.73696559417
0.4     1.32192809489
0.5     1
0.6     0.736965594166
0.7     0.51457317283
0.8     0.321928094887
0.9     0.152003093445
1       1.60171325191E-016

The amount of information associated with a p(s)=0 signal is infinite. Conversely, the amount of information associated with a p(s)=1 signal is 0 (although PHP does not represent it exactly that way). These values make sense when regarding information as measuring the unexpectedness of a signal. Unexpected signals (those with a low probability of occurrence) have a higher information content than expected signals. The information formula thus captures a key property of the concept of information: its inverse relationship to the probability, or expectedness, of the message.

Note that the input to the information function is a signal probability (p(heads)=.5) and not a signal value (heads). The information function, like the entropy function introduced next, operates on signal probabilities rather than on the signal values themselves.

Entropy Formula

The information function allows us to compute the amount of information (measured in bits) conveyed by a particular signal state. Often, it's more interesting to compute the amount of information contained in the complete distribution of signal states (for example, a database column of values). Claude Shannon showed how to compute this overall information score by:

  1. Multiplying the information scores for each signal state by a weighting term equal to the probability of the signal (Ii = p(si) * log(p(si))).
  2. Summing these information scores (H = Σi:n Ii).
  3. Multiplying the result by -1 to make the overall value positive (H = -1 * H).

Combining these operations into one formula produces the formula for computing the entropy (H) of a discrete random variable:

H(P) = -Σi:n p(si) * log(p(si))

The entropy formula takes as input the probability distribution of the signals (denoted with a capital P) as the basis for computing an entropy score. The entropy score denotes the lower bound on the number of bits required to represent each value in the data stream (here, each database column value).
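
Before wrapping the formula in a class, a small worked example helps. The three-state probability distribution below is purely illustrative:

<?php
// Entropy of an illustrative three-state signal distribution.
$P = array(0.5, 0.25, 0.25);

$H = 0.0;
foreach ($P as $p)
  $H += $p * log($p, 2);   // accumulate p(s) * log2(p(s))
$H = -1.0 * $H;            // flip the sign so the score is positive

echo $H;   // prints 1.5
?>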

Reducing Uncertainty

You can also think of the entropy score as reflecting the average number of binary questions you need to ask in order to identify any signal in your signal distribution. Consider how many yes/no questions you would need to ask on average in order to identify a letter arising from a stream of equiprobable letters. If you count a space as a "letter," the answer is log2(27), or 4.76 bits per letter. This result means that, on average, it takes just under five binary questions to guess the identity of a letter. The more binary questions you need to ask, the greater your uncertainty about what the next signal state will be.

Assuming that each outcome is equiprobable puts an upper bound on the possible entropy of a stream of outcomes (see maximum entropy). The actual entropy of well-formed English text passages will be lower than 4.76 bits because there is structure in the signal distribution, as indicated by the unequal signal probabilities for real English text streams. It is common in natural language processing to compute bigram and trigram probability distributions for text streams and feed these distributions into the entropy function to see if there are sequential dependencies in a text stream. Such an exercise would also be useful in the context of visitor clickstream analysis, where there is a strong reason to believe sequential dependencies exist.

An entropic analysis of clickstream data might involve concatenating visitor clickstreams together and then summarizing the resulting data in terms of three different probability distributions:

  1. Single-page distribution: H(p(si)).
  2. Two-page or bigram distribution: H(p(si, sj)).
  3. Three-page or trigram distribution: H(p(si, sj, sk)).

We could feed these three probability distributions into our entropy function and observe the resulting entropy scores. Our goal would be to see if we can reduce our uncertainty about the clickstream distribution by finding model-based ways to lower entropy scores. Markov process models are particularly useful in this regard.
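
As a rough sketch of how the single-page and bigram cases might be compared, the snippet below builds bigram tokens from a made-up clickstream and feeds both distributions into the Entropy class developed in the next section (dividing the bigram score by two gives an approximate per-page rate):

<?php
// Sketch: compare single-page entropy to bigram entropy for a
// made-up clickstream, using the Entropy class developed below.
require_once "Entropy.php";

$clickstream = array("h","n","p","s","h","n","h","a","h","p");

// Build overlapping two-page (bigram) tokens, e.g. "hn", "np", ...
$bigrams = array();
for ($i = 0; $i < count($clickstream) - 1; $i++)
  $bigrams[] = $clickstream[$i] . $clickstream[$i + 1];

$e1 = new Entropy($clickstream);
$e2 = new Entropy($bigrams);

echo "Single-page entropy:       " . $e1->bits . "\n";
echo "Bigram entropy (per page): " . ($e2->bits / 2) . "\n";
?>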

Implementing Entropy

We can implement the entropy formula as a PHP class. This initial class stores intermediate results that are not strictly necessary, but they are useful for demonstration: reporting them (see the next section) is an aid to understanding the computational nature of the entropy calculation.

<?php

/**
* Computes the entropy of an array of tokens (aka symbols).
*/
 class Entropy {

  var $tokens      = array();
  var $num_events  = 0;
  var $token_freqs = array();
  var $token_probs = array();   
  var $num_tokens  = 0;
  var $bits        = 0.0;
  var $maxent      = 0.0;   
  var $ratio       = 0.0;     

  function Entropy($tokens) {
    $this->tokens      = $tokens;
    $this->num_events  = count($this->tokens);
    $this->token_freqs = $this->getTokenFrequencies();
    $this->num_tokens  = count($this->token_freqs);

    // Accumulate p(s) * log2(p(s)) across the distinct tokens.
    $entropy = 0.0;
    foreach ($this->token_freqs as $token => $freq) {
      $this->token_probs[$token]  = $freq / $this->num_events;
      $entropy += $this->token_probs[$token] * log($this->token_probs[$token], 2);
    }

    $this->bits   = -1.0 * $entropy;
    $this->maxent = log($this->num_tokens, 2);     
    $this->ratio  = $this->bits / $this->maxent;
  }

  function getTokenFrequencies() {
    $token_freqs = array();

    // Tally how often each distinct token occurs.
    for ($i = 0; $i < $this->num_events; $i++) {
      if (!isset($token_freqs[$this->tokens[$i]]))
        $token_freqs[$this->tokens[$i]] = 0;
      $token_freqs[$this->tokens[$i]]++;
    }

    return $token_freqs;
  }

}
?>

Web Page Entropy

We can illustrate the usage of the Entropy class by measuring the entropy of the page identifier column (self) of a typical Webstats database table:

<?php
/**
* Measures the entropy for a few hypothetical values from a page_id column
* of a typical Webstats table.  Uses the following abbreviations:
* h=HomePage, n=NewsPage, p=ProductsPage, s=ServicesPage, a=AboutPage.
*/
require_once "../Entropy.php";

$tokens = array("h","n","p","s","h","n","h","a","h","p","h","s","h","n","p");
$e      = new Entropy($tokens);
?>

<pre>
<?php print_r($e)?>
</pre>

<p>
The Entropy of the source measured in bits is: <?php echo $e->bits?>
</p>

This code outputs a human-readable depiction of the instantiated entropy object, along with the computed entropy of the supplied clickstream data. Note in particular the need to compute the token frequencies in order to calculate the token probabilities required for the entropy formula. In order to make this code scalable, we will look for efficiencies in how we compute these intermediate token frequencies.

entropy Object
(
    [tokens] => Array
        (
            [0] => h
            [1] => n
            [2] => p
            [3] => s
            [4] => h
            [5] => n
            [6] => h
            [7] => a
            [8] => h
            [9] => p
            [10] => h
            [11] => s
            [12] => h
            [13] => n
            [14] => p
        )

    [num_events] => 15
    [token_freqs] => Array
        (
            [h] => 6
            [n] => 3
            [p] => 3
            [s] => 2
            [a] => 1

        )

    [token_probs] => Array
        (
            [h] => 0.4
            [n] => 0.2
            [p] => 0.2
            [s] => 0.13333333333333
            [a] => 0.066666666666667
        )

    [num_tokens] => 5
    [bits] => 2.1055872616983
    [maxent] => 2.3219280948874
)

The entropy of the source measured in bits is 2.1055872616983.

An entropy score of 2.1 bits tells us that, on average, we will require 2.1 bits to encode information optimally about the sequence of page identifiers making up the clickstream. It is also the lower bound on the average number of binary questions we need to ask to identify a given signal.

Entropy and Optimal Encoding

To demonstrate the link between entropy and optimal encoding (optimal compression), we will use Huffman coding to develop a binary encoding for our five page states: HomePage (h), NewsPage (n), ProductsPage (p), ServicesPage (s), and AboutPage (a). The Huffman algorithm assigns the shortest codes to the most frequently occurring items. The Huffman encodings also have the property of being prefix codes: you can send them in a continuous stream of bits, and any recipient supplied with a dictionary (the character code table below) can decode them uniquely.

The string to compress is hnpshnhahphshnp. The character frequency table is:

Character h p n s a
Frequency 6 3 3 2 1

The Huffman compression algorithm generates the following character code table:

Character h p n s a
Code 0 10 111 1101 1100

Using the character frequency table and character code table, we can compute the average number of bits used to encode the sequence of 15 characters (for example, an individual user clickstream):

E(L) = ((6x1)+(3x2)+(3x3)+(2x4)+(1x4))/15
     = (6+6+9+8+4)/15
     = 33/15
     = 2.2 bits per symbol

The expected length of the symbol encodings using the Huffman procedure is greater than the Shannon entropy, which represents the theoretical lower bound on what is achievable. While Shannon entropy gives us a lower bound on the average number of bits required to encode each symbol, we might also want to calculate a maximum entropy score that assumes that the probability of each signal state is the same.
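
To double-check this arithmetic, here is a short sketch that recomputes the expected code length from the frequency and code tables above and compares it to the entropy score reported earlier:

<?php
// Expected Huffman code length from the frequency and code tables above.
$freqs = array("h" => 6, "p" => 3, "n" => 3, "s" => 2, "a" => 1);
$codes = array("h" => "0", "p" => "10", "n" => "111",
               "s" => "1101", "a" => "1100");

$total_bits    = 0;
$total_symbols = array_sum($freqs);

foreach ($freqs as $symbol => $freq)
  $total_bits += $freq * strlen($codes[$symbol]);

echo $total_bits / $total_symbols;   // 2.2 bits per symbol, versus an entropy of 2.1056
?>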

Maximum Entropy

Comparing the Shannon entropy score to the maximum entropy score allows us to quickly gauge whether some states are more probable than others. If these two scores are close, this tells us that each state is represented equally often in the distribution. If they are far apart, then some signal states are represented much more often than other signal states.

Fortunately, the maximum entropy score is extremely easy to calculate:

Hmax(P) = -N * 1/N * log(1/N) = log(N)

The maximum entropy score is simply the log of the number of possible signal states, N. If our database column has 100 values but only ten of these values are unique, then we would take the log of these ten possible signal states to compute the maximum entropy value. Using ten instead of 100 to represent N makes sense when you consider that entropy is a measure defined over a probability distribution and not the total number of sample values.
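
A minimal sketch of the maximum entropy calculation, using the figures from the earlier clickstream example (five distinct pages, entropy of roughly 2.106 bits):

<?php
// Maximum entropy and entropy/maxent ratio for the earlier example.
$num_tokens = 5;                  // distinct signal states (pages)
$bits       = 2.1055872616983;    // entropy computed earlier

$maxent = log($num_tokens, 2);    // log2(5) = 2.3219...
$ratio  = $bits / $maxent;        // roughly 0.91

echo "maxent: $maxent\n";
echo "ratio:  $ratio\n";
?>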

Scaling Up

The problem with our current version of the entropy class is that it will not scale. Try feeding this class 100,000 data points and watch it blow up. We can make a more scalable version of this class by making it database-aware and using the database engine to compute intermediate results (notably the token frequencies). We can also make our class more scalable by adding code that will allow us to limit the result set by month or by some other criterion that reduces the amount of data used in the calculation of the entropy. The TableEntropy.php class below incorporates these features. Notice in particular the SQL used to compute token frequencies.

<?php
/**
* Computes the entropy of a database column.
*/
 class TableEntropy {

  var $table       = "";
  var $column      = "";
  var $select      = "";     
  var $where       = "";   
  var $group_by    = "";     

  var $num_events  = 0;
  var $token_freqs = array();
  var $token_probs = array();   
  var $num_tokens  = 0;
  var $bits        = 0.0;
  var $maxent      = 0.0;   
  var $ratio       = 0.0;     

  function setTable($table) {
    $this->table = $table;
  }

  function setColumn($column) {
    $this->column = $column;
  }

  function setSelect($sql) {
    $this->select = $sql;
  }

  function setWhere($sql) {
    $this->where = $sql;
  }

  function setGroupBy($sql) {
    $this->group_by = $sql;
  }

  function getTokenFrequencies() {
    global $db;            

    $sql  = " SELECT $this->column, count($this->column) as freq ";
    $sql .= " FROM ". $this->table;    

    if ($this->where != "")
      $sql .= " WHERE ". $this->where;    

    if ($this->group_by != "")       
      $sql .= " GROUP BY ". $this->group_by;        
    else
      $sql .= " GROUP BY ". $this->column;            

    $this->token_freqs = $db->getCol($sql, "freq");
  }

  function getEntropy() {
    global $db;     

    $this->getTokenFrequencies();

    $this->num_events  = array_sum($this->token_freqs);
    $this->num_tokens  = count($this->token_freqs);    

    // Accumulate p(s) * log2(p(s)) across the distinct tokens.
    $entropy = 0.0;
    foreach ($this->token_freqs as $token => $freq) {
      $this->token_probs[$token]  = $freq / $this->num_events;
      $entropy += $this->token_probs[$token] * log($this->token_probs[$token], 2);
    }

    $this->bits   = -1.0 * $entropy;
    $this->maxent = log($this->num_tokens, 2);    
    $this->ratio  = $this->bits / $this->maxent;        
  }

}
?>

Monthly IP Traffic Report

Here is some code that computes the total number of visitors and hits for the www.phpmath.com site, broken down by month. These totals come from an analysis of the IP address column in a MySQL database table that logs standard access log information.

<?php
/**
* Script that reports monthly IP traffic.
*/
require_once "config.php";
require_once PHPMATH . "/IT/TableEntropy.php";
require_once PHPMATH . "/IT/ArrayMath.php";

$e = new TableEntropy;

$e->setTable("Webstats");
$e->setColumn("ip");
?>
<html>
  <head>
    <title>Monthly IP Traffic Report</title>
  </head>
  <body>

    <i>Monthly IP Traffic Report: Hits, Visitors, Entropy</i>

    <table border='1' cellpadding='5' cellspacing='0'>
      <tr bgcolor='ffffcc'>
        <td><b>Period</b></td><td><b>Hits</b></td>
        <td><b>Visitors</b></td><td><b>Entropy</b></td>
        <td><b>MaxEnt</b></td><td><b>Ratio</b></td>
      </tr>

      <?php

      for($year=2004; $year<=2004; $year++) {
        for($month=1; $month<=12; $month++) {  
          $start  =  date("Y-m-d", mktime(0,0,0,$month,1,$year));
          $end    =  date("Y-m-d", mktime(0,0,0,$month+1,0,$year));
          $clause = " received >= '$start' AND received <= '$end'";

          $e->setWhere($clause);
          $e->getEntropy();

          echo "<tr>";
          echo "  <td><i>$start to $end</i></td>";
          echo "  <td align='right'>". $e->num_events ."</td>";
          echo "  <td align='right'>". $e->num_tokens ."</td>";
          echo "  <td align='right'>". sprintf("%01.2f", $e->bits) ."</td>";
          echo "  <td align='right'>". sprintf("%01.2f", $e->maxent) ."</td>";     
          echo "  <td align='right'>". sprintf("%01.2f", $e->ratio) ."</td>";
          echo "</tr>";

          if($e->num_events != 0) {
            $hits[]    = $e->num_events;
            $visits[]  = $e->num_tokens;
            $entropy[] = $e->bits;                                  
          }

          if($year == date("Y"))
            if($month == date("m"))
              break(2);

        }
      }
      ?>
    </table>

    <?php
    $am = new ArrayMath;
    echo "correlation(hits, visits): ".$am->correlation($hits,$visits)."<br />";
    echo "correlation(hits, entropy): ".$am->correlation($hits,$entropy)."<br />";
    echo "correlation(visits, entropy): ".$am->correlation($visits,$entropy);
    ?>

  </body>
</html>

The output generated by this code looks like this:

Monthly IP Traffic Report: Hits, Visitors, Entropy
Period Hits Visitors Entropy MaxEnt Ratio
2004-01-01 to 2004-01-31 4663 735 7.20 9.52 0.76
2004-02-01 to 2004-02-29 3785 631 6.99 9.30 0.75
2004-03-01 to 2004-03-31 7293 1291 8.62 10.33 0.83
2004-04-01 to 2004-04-30 7910 966 6.47 9.92 0.65
2004-05-01 to 2004-05-31 6502 1276 8.34 10.32 0.81
2004-06-01 to 2004-06-30 4597 837 7.57 9.71 0.78
2004-07-01 to 2004-07-31 4025 837 8.04 9.71 0.83
2004-08-01 to 2004-08-31 3400 871 8.23 9.77 0.84
2004-09-01 to 2004-09-30 4994 923 7.63 9.85 0.77
2004-10-01 to 2004-10-31 5583 984 8.02 9.94 0.81
2004-11-01 to 2004-11-30 4389 739 7.11 9.53 0.75
correlation(hits, visits): 0.74270326185869
correlation(hits, entropy): 0.0050558270393892
correlation(visits, entropy): 0.6465336491674

The entropy scores vary between seven and eight bits. Seven bits can represent 2^7, or 128, states. Eight bits can represent 2^8, or 256, states. Equating bit scores to the number of representable states is a good way to acquire a sense of the measurement scale properties of the entropy score. The difference between one and two bits of information is not the same as the difference between seven and eight bits.

Monthly Page Traffic Report

With minor changes to the monthly_ip_traffic.php script, we can create a monthly_page_traffic.php script that reports monthly page traffic:

Monthly Page Traffic Report: Hits, Pages, Entropy
Period Hits Pages Entropy MaxEnt Ratio
2004-01-01 to 2004-01-31 4663 15 2.09 3.91 0.54
2004-02-01 to 2004-02-29 3785 13 2.11 3.70 0.57
2004-03-01 to 2004-03-31 7293 14 1.98 3.81 0.52
2004-04-01 to 2004-04-30 7910 14 1.80 3.81 0.47
2004-05-01 to 2004-05-31 6502 16 1.82 4.00 0.46
2004-06-01 to 2004-06-30 4597 14 1.46 3.81 0.38
2004-07-01 to 2004-07-31 4025 15 1.74 3.91 0.44
2004-08-01 to 2004-08-31 3400 13 1.76 3.70 0.47
2004-09-01 to 2004-09-30 4994 13 1.54 3.70 0.42
2004-10-01 to 2004-10-31 5583 14 1.78 3.81 0.47
2004-11-01 to 2004-11-30 4390 14 1.74 3.81 0.46
correlation(hits, pages): 0.31172637208172
correlation(hits, entropy): 0.085063761760323
correlation(pages, entropy): 0.14280487789004

The entropy of the page column is quite a bit lower than the entropy of the ip column, which makes sense when you consider that there are many more possible site visitor IPs than site pages (about 15), and that each visitor IP has a relatively low probability of occurrence compared to the probability of occurrence of a site page. Entropy is low when there are only a few possible states the system can be in and when some of those states have a high individual probability of occurrence (for example, p(home)).

Is Entropy a Measure of Variance?

Statisticians can make the case that the entropy of a discrete random variable is a measure of the discrete variable's variance:

In short, H is a "variance-like" index that can be computed for any discrete distribution, even though the event classes are purely qualitative ... their main contribution to statistical methods lies, however, in the possibility of extending the familiar notions having to do with variance and factors accounting for variance to qualitative situations. ~ Winkler & Hays (1975) Statistics: Probability, Inference and Decision. Holt, Rinehart and Winston, p. 840.

While it is useful to think of entropy as a "variance-like" index, be careful to note that entropy does differ significantly from the quantitative notion of variance:

Despite the fact that both the entropy and the variance of a probability distribution measure in some sense the uncertainty of the value of a random variable having that distribution, the entropy has an interpretation different from that of the variance. The entropy is defined only by the probabilities of the possible values of the random variable and not by the values themselves. On the other hand, the variance depends on these values. Thus if Y1 takes the values 1 and 2 each with probability 0.5 and Y2 takes the values 1 and 1000 each with probability 0.5, the distributions of Y1 and Y2 have equal entropies but quite different variances. ~ Ewens & Grant (2001) Statistical Methods in Bioinformatics. Springer, p. 44.

With these thoughts in mind, let's examine a report that computes the variance and entropy of the ip column on a per-month basis so that we can compare them. I computed the variance of the ip column by first counting the daily number of hits (count(ip)) and visitors (count(DISTINCT ip)), and then computing the monthly variance of those daily totals. I also fed the same series of IPs into the entropy class in per-month blocks, so that the variance and entropy scores are based on the same input data. Instead of reporting the raw variance of hits and visits, I report their standard deviations, because standard deviations are easier to interpret. Web traffic reports often omit variability measures, so this script is useful if you want to start understanding the extent of variability in your web traffic data and correlating these scores with other concurrent events that might account for them, such as the publication dates of articles.
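
The daily counts behind those standard deviations can be computed with a query and a small helper along these lines. This is only a sketch: the table and column names follow the earlier Webstats examples, DATE() is MySQL-specific, and $db is assumed to be the same PEAR-DB-style object used by TableEntropy.php.

<?php
// Sketch: standard deviation of daily hit counts for one month.
// Swap count(ip) for count(DISTINCT ip) to get daily visitor counts.
$sql  = " SELECT count(ip) AS hits ";
$sql .= " FROM Webstats ";
$sql .= " WHERE received >= '$start' AND received <= '$end' ";
$sql .= " GROUP BY DATE(received) ";

$daily_hits = $db->getCol($sql);   // one hit count per day

// Sample standard deviation of an array of daily counts.
function std_dev($values) {
  $n    = count($values);
  $mean = array_sum($values) / $n;
  $ss   = 0.0;
  foreach ($values as $v)
    $ss += ($v - $mean) * ($v - $mean);
  return sqrt($ss / ($n - 1));
}

echo std_dev($daily_hits);
?>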

Monthly Variance and Entropy
Period Hits Visitors Entropy MaxEnt Ratio
2004-01-01 to 2004-01-31 77.20 10.08 7.20 9.52 0.76
2004-02-01 to 2004-02-29 98.30 8.39 6.99 9.30 0.75
2004-03-01 to 2004-03-31 169.29 30.83 8.62 10.33 0.83
2004-04-01 to 2004-04-30 205.14 12.58 6.47 9.92 0.65
2004-05-01 to 2004-05-31 138.74 24.59 8.34 10.32 0.81
2004-06-01 to 2004-06-30 82.71 9.52 7.57 9.71 0.78
2004-07-01 to 2004-07-31 58.47 9.25 8.04 9.71 0.83
2004-08-01 to 2004-08-31 56.95 10.86 8.23 9.77 0.84
2004-09-01 to 2004-09-30 85.40 18.01 7.63 9.85 0.77
2004-10-01 to 2004-10-31 136.34 15.63 8.02 9.94 0.81
2004-11-01 to 2004-11-30 168.13 14.33 7.55 10.07 0.75
correlation(hits, visits): 0.50421542981268
correlation(hits, entropy): -0.18684271641288
correlation(visits, entropy): 0.60620764775012

The correlation between the Visitors column and the Entropy column is approximately 0.6, which is a decent correlation, but not so large as to tempt me to equate the measurement of IP variance with the measurement of IP entropy. From this result you might (again) conclude that variance and entropy are similar, but not identical, constructs.

When I created this report, I initially included the current month in the calculation of the monthly variance in hits and visits as well as the monthly entropy, and I noticed that including it changed the reported correlations quite a bit. I advise against including the current month's statistics, because that month has less data than the others, and because I haven't fully explored how much the entropy measure might vary as new data comes into it. I encourage you to explore the small-sample behavior of the entropy score further, using signal probability distributions with shapes ranging from highly skewed to flat (uniform).
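
As a starting point for that exploration, here is a sketch that feeds a highly skewed and a flat token distribution (both made up) into the Entropy class from earlier and compares the resulting scores:

<?php
// Sketch: entropy of a skewed versus a flat token distribution of
// the same size, using the Entropy class from earlier.
require_once "Entropy.php";

// Skewed: one token dominates (16 of 20 events).
$skewed = array_merge(array_fill(0, 16, "h"),
                      array("n", "p", "s", "a"));

// Flat: four tokens, each equally frequent (5 of 20 events).
$flat = array_merge(array_fill(0, 5, "h"), array_fill(0, 5, "n"),
                    array_fill(0, 5, "p"), array_fill(0, 5, "s"));

$e_skewed = new Entropy($skewed);
$e_flat   = new Entropy($flat);

echo "Skewed: " . $e_skewed->bits . " of " . $e_skewed->maxent . " bits\n";
echo "Flat:   " . $e_flat->bits   . " of " . $e_flat->maxent   . " bits\n";
?>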

Data Preparation

The analyses reported here use data points that arguably should have been eliminated or tagged before conducting the entropic analysis of site traffic patterns. In particular, it would have been desirable to remove or tag spider traffic and site administrator traffic before running these analyses.

We can supply WHERE clauses to the TableEntropy.php class that filter out records by IP address:

<?php
// snippet showing how to filter spider and administrator traffic
...snip...

$time_constraint = " received >= '$start' AND received <= '$end'";
$ip_constraint   = " ip NOT IN $banned_ips";
$clause          = $time_constraint ." AND ". $ip_constraint;

$e->setWhere($clause);
$e->getEntropy();

...snip...
?>

The IT Package does not address how you compute the list of banned IPs because it involves many context-specific details about how you log and prepare your web access log data for analysis. You should, however, consider adding your own method to the TableEntropy.php class that computes the $banned_ips to use in the $ip_constraint of your WHERE clause (for example, $banned_ips = $e->getBannedIPs()).
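
One possible shape for such a method, assuming a hypothetical BannedIPs table that you populate yourself with spider and administrator addresses:

<?php
// Sketch of a getBannedIPs() method for TableEntropy.php. The
// BannedIPs table is hypothetical; maintain it however suits your
// logging setup.
function getBannedIPs() {
  global $db;

  $ips = $db->getCol("SELECT ip FROM BannedIPs");

  if (count($ips) == 0)
    return "('')";   // harmless placeholder when nothing is banned

  // Produce the list used by the " ip NOT IN (...)" constraint.
  return "('" . implode("','", $ips) . "')";
}
?>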

Conclusions

This article demonstrated that information entropy is not an esoteric concept, and indeed might play a practical role in summarizing the amount of structure contained in database columns.

Our exploration of entropy began at the most basic level: learning how to compute the entropy score for one discrete random variable. Having mastered univariate entropy, we can move on to bivariate entropy. Bivariate entropy involves computing the joint and conditional entropy scores between two random variables, such as the ip and self columns in the Webstats table.

As with all entropic analysis, one of the main goals in computing bivariate entropy scores will be to find ways of reducing uncertainty. Instead of looking within a sequence of database column values for ways of reducing entropy (computing bigram and trigram distributions for clickstream sequences), we will look across columns to see how one data column might reduce our uncertainty about another, and vice versa.

Paul Meagher is a cognitive scientist whose graduate studies focused on mathematical problem solving.

