ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


ANOVA Statistical Programming with PHP

by Paul Meagher
07/22/2004

The Analysis of Variance (ANOVA) technique is the most popular statistical technique in behavioral research. The ANOVA technique also comes up often in agricultural, pharmaceutical, and quality control contexts. This article will introduce you to six major steps involved in using this technique by implementing them with a combination of PHP, MySQL, and JpGraph. The result is code and knowledge that you can use to think about and potentially solve your own data-mining problems.

The scope of the ANOVA technique can narrow to include only the formal mathematics required to partition the total variance of a data matrix into between-group and within-group variance estimates. Within this narrow construal, one might also include the machinery to test whether the ratio of the between to within-group variance is significant (i.e., whether there is a treatment effect).

In this article, we will construe the ANOVA technique more broadly to consist of multiple data-analysis steps to take when conducting an ANOVA analysis. The ANOVA technique here is a methodical approach to analyzing data that issues from a particular type of data-generating process. The data-generating process will ideally arise from a blocked and randomized experimental design.

Test Anxiety Study

Related Reading

Learning PHP 5
By David Sklar

The prototypical Single Factor ANOVA experiment is a simple comparative experiment. These are experiments that involve applying a treatment to homogeneous experimental units and measuring the responses that occur when administering different levels of the treatment to these experimental units.

The hypothetical study we will discuss in this article examines the effect of anxiety (i.e., the treatment) on test performance (i.e., the response). The study randomly assigned 30 subjects to a low-anxiety, moderate-anxiety, or high-anxiety treatment condition. The experimenter recorded a test score measurement for each subject. The empirical issue of concern is whether there is an effect of Anxiety Level on Test Score.

The idea for this hypothetical study and the data to analyze originally appeared in the popular textbook by Gene V. Glass & Kenneth D. Hopkins (1995) Statistical Methods in Education and Psychology. The reported results agree with their results. You will also find it useful to consult this textbook for its excellent and comprehensive treatment of the ANOVA technique.

Analysis Source Code

You can use the single factor ANOVA technique to determine whether anxiety significantly influences test scores. The following PHP script implements six major steps in the single-factor ANOVA technique used to analyze data from our hypothetical test anxiety study. After you have examined the overall flow of the script, proceed to the rest of the article where we examine the tabular and graphical output that each step in this script generates.

<?php
/**
* @package SFA
*
* Script performs single factor ANOVA analysis on test anxiety
* data stored in a database.
*
* @author Paul Meagher, Datavore Productions
* @license PHP v3.0
* @version 0.7
*
* The config.php file defines paths to the root of the PHPMATH
* and JPGRAPH libraries and sets up a global database connection.
*/

require_once "config.php";
require_once PHPMATH ."/SFA/SingleFactorANOVA.php";

$sfa = new SingleFactorANOVA;

// Step 1: Specify and analyze data
$sfa->setTable("TestScores");
$sfa->setTreatment("anxiety");
$sfa->setResponse("score");
$sfa->analyze();

// Step 2: Show raw data
$sfa->showRawData();

// Step 3: Show box plot
$params["figureTitle"] = "Anxiety Study";
$params["xTitle"]      = "Anxiety Level";
$params["yTitle"]      = "Test Score";
$params["yMin"]        = 0;
$params["yMax"]        = 100;
$params["yTicks"]      = 10;

$sfa->showBoxPlot($params);

// Step 4: Show descriptive statistics
$sfa->showDescriptiveStatistics();

// Step 5: Show single factor ANOVA source table
$sfa->showSourceTable();

// Step 6: Show mean differences.
$sfa->showMeanDifferences();

?>

Step 1: Specify and Analyze Data

After we instantiate the SingleFactorAnova class, we start by specifying 1) what data table to use, 2) what table field name to use as the treatment column, and 3) what table field name to use as the response column:

<?php
// Step 1: Specify and analyze data
$sfa->setTable("TestScores");
$sfa->setTreatment("anxiety");
$sfa->setResponse("score");
$sfa->analyze();
?>

The culmination of the first step is the invocation of the $this->analyze() method. This method is the centerpiece of this SingleFactorANOVA class and is reproduced below. Note that I am using the PEAR:DB API to interact with a MySQL database.

<?php
/**
* Compute single factor ANOVA statistics.
*/
function analyze() {
  global $db;
  $sql  = " SELECT $this->treatment, sum($this->response), ";
  $sql .= " sum($this->response * $this->response), "
  $sql .= " count($this->response) ";
  $sql .= " FROM $this->table ";
  $sql .= " GROUP BY $this->treatment ";

  $result = $db->query($sql);

  if (DB::isError($result)) {
    die($result->getMessage());
  } else {
    while ($row = $result->fetchRow()) {
      $level                  = $row[0];

      $this->levels[]         = $row[0];
      $this->sums[$level]     = $row[1];
      $this->n[$level]        = $row[3];        

      $this->means[$level]    = 
          $this->sums[$level] / $this->n[$level];
      $this->ss[$level]       = 
          $row[2] - $this->n[$level] * pow($this->means[$level], 2);
      $this->variance[$level] = 
          $this->ss[$level] / ($this->n[$level] - 1);         
    }    

    $this->sums["total"]  = array_sum($this->sums);
    $this->n["total"]     = array_sum($this->n);
    $this->means["grand"] = $this->sums["total"] / $this->n["total"];
    $this->ss["within"]   = array_sum($this->ss);   

    foreach($this->levels as $level) {
      $this->effects[$level] = 
          $this->means[$level] - $this->means["grand"];
      $this->ss["between"]  += 
          $this->n[$level] * pow($this->effects[$level], 2);
    }

    $this->num_levels = count($this->levels);

    $this->df["between"] = $this->num_levels - 1;
    $this->df["within"]  = $this->n["total"] - $this->num_levels;  
    $this->ms["between"] = $this->ss["between"] / $this->df["between"];
    $this->ms["within"]  = $this->ss["within"] / $this->df["within"];

    $this->f    = $this->ms["between"] / $this->ms["within"];
    $F          = new FDistribution($this->df["between"], 
                                    $this->df["within"]);
    $this->p    = 1 - $F->CDF($this->f);
    $this->crit = $F->inverseCDF(1 - $this->alpha);
  }            
}
?>

We could have passed a data matrix into the analysis method. Instead I assume that the data resides in a database and use SQL to extract and sort the records that feed into the subsequent analysis code. I made this storage assumption because of my interest in developing a scalable data-analysis solution.

The bulk of the code involves calculating the value of various instance variables to use in subsequent reporting steps. Most of these instance variables are associative arrays with indices such as total, between, and within. This is because the ANOVA procedure involves computing the total variance (in our test scores) and partitioning it into between-group (i.e., between treatment levels) and within-group (i.e., within a treatment level) variance estimates.

At the end of the analyze method we evaluate the probability of the observed F score by first instantiating an FDistribution class with our degrees of freedom parameters:

$F = new FDistribution($this->df["between"], $this->df["within"]);

To obtain the probability of the obtained F score we subtract 1 minus the value returned by the cumulative distribution function applied to the obtained F score:

$this->p = 1 - $F->CDF($this->f);

Finally, we invoke the inverse cumulative distribution function using 1 minus our alpha setting (i.e., 1 - 0.05) in order set a critical F value that defines the decision criterion we will use to reject the null hypothesis which states that there is no difference between treatment-level means.

$this-crit = $F->inverseCDF(1 - $this->alpha);

If our observed F score is visibly greater than the critical F score, we can conclude that at least one of the means differs significantly from the others. A p value (i.e., $this->p) value less than 0.05 (or whatever your null rejection setting is) would also lead you to reject the null hypothesis.

The formula for decomposing the total sum of squares (first term) into a between-groups component (second term) and a within-group component (third term) appears in Figure 1.

\sum_{t=1}^{k}\sum_{i=1}^{n_t}(y_{ti} - \bar{y})^2 = \sum_{t=1}^{k}n_t(\bar{y}_t - \bar{y})^2 + \sum_{t=1}^{k}\sum_{i=1}^{n_t}(y_{ti} - \bar{y}_t)^2
Figure 1. Formula for decomposing the sum of squares.

The symbol y overbar stands for the grand mean and the symbol yt overbar stands for the treatment mean.

Step 2: Show Raw Data

It is always good to begin your analysis by making sure that you've properly loaded your data. We can call the showRawData() method to dump our test-anxiety data table to a web browser.

<?php
/*
* Output contents of database table.
*/  
function showRawData() {
  global $db;
  $data        = $db->tableInfo($this->table, DB_TABLEINFO_ORDER);
  $columns     = array_keys($data["order"]);
  $num_columns = count($columns);

  ?>

  <table cellspacing='0' cellpadding='0'>
    <tr>
      <td>
        <table border='1' cellspacing='0' cellpadding='3'>
        <?php
          print "<tr bgcolor='ffffcc'>";

          for ($i=0; $i < $num_columns; $i++) {
            print "<td align='center'><b>".$columns[$i]."</b></td>";
          }

          print "</tr>";

          $fields = implode(",", $columns); 
          $sql    = " SELECT $fields FROM $this->table ";
          $result = $db->query($sql);

          if (DB::isError($result)) {
            die( $result->getMessage());
          } else {
            while ($row = $result->fetchRow()) { 
              print "<tr>";

              foreach( $row as $key=>$value) {
                print "<td>$value</td>";
              }

              print "</tr>";
            }
          }
          ?>
        </table>
      </td>
    </tr>
  </table>
  <?php
}  
?>

This code generates as output the table below:

Table 1. Show Raw Data

idanxietyscore
1low26
2low34
3low46
4low48
5low42
6low49
7low74
8low61
9low51
10low53
11moderate51
12moderate50
13moderate33
14moderate28
15moderate47
16moderate50
17moderate48
18moderate60
19moderate71
20moderate42
21high52
22high64
23high39
24high54
25high58
26high53
27high77
28high56
29high63
30high59

A tip for data miners: Maybe you already have some data in your databases to which you can adapt this code. Look for situations where you have an enum data type to act as your treatment-level field and a corresponding integer or float column that measures some response associated with that treatment-level.

Step 3: Show Box Plot

Early in your analysis you should graph your data so that you are being guided by proper overall intuitions about your data. A commonly recommended way to visualize ANOVA data is to use side-by-side box plots for each treatment level. To generate these box plots we need to compute a five-number summary for each treatment level, consisting of the minimum value, first quartile, median, third quartile, and maximum value. The showBoxPlot($params) method computes these summary values and uses them to generate the treatment-level box plots.

Test Score by Anxiety Level
Figure 2 — box plot of test anxiety data

The showBoxPlot($params) method is actually a wrapper around JpGraph library methods. The $params argument allows us to supply parameter values needed to fine tune JpGraph's output as you can see if you examine the method source code:

<?php
/**
* The showBoxPlot method is a wrapper for JPGRAPH methods.  
* The JpGraph constant defining to the root of the JpGraph
* library is specified in the config.php file.  
*
* I only do very basic $params array handling in this method.  There
* is a lot of room for improvement in making parameter handling more
* extensive (i.e., so you have more control over aspects of your plot)
* and more intelligent.
*/
function showBoxPlot($params=array()) {
  include_once JPGRAPH . "/src/JpGraph.php";
  include_once JPGRAPH . "/src/jpgraph_stock.php";

  $summary = $this->getFiveNumberSummary();
  $yData   = array();

  foreach($this->levels AS $level) {
    // Data must be in the format : q1, q3, min, max, median.
    $plot_parts = array("q1","q3","min","max","median");

    foreach($plot_parts AS $part) {
      $yData[] = $summary[$level][$part];
    }
  }

  $figureTitle = $params["figureTitle"];
  $xTitle      = $params["xTitle"];
  $yTitle      = $params["yTitle"];

  if(!isset($figureTitle))
    $figureTitle = "Figure 1";

  if(!isset($xTitle))
    $xTitle = "x-level";    

  if(!isset($yTitle)) 
    $yTitle = "y-level";        

  $plotWidth  = 400;
  $plotHeight = 300;
  $yMin       = $params["yMin"];
  $yMax       = $params["yMax"];
  $yTicks     = $params["yTicks"];
  $yMargin    = 35;
  $xMargin    = 15;
  $xLabels    = $this->levels;

  $graph = new Graph($plotWidth, $plotHeight);

  $graph->SetFrame(false);
  $graph->SetMarginColor('white');
  $graph->title->Set($figureTitle);
  $graph->SetScale("textlin", $yMin, $yMax);
  $graph->yscale->ticks->Set($yTicks);
  $graph->xaxis->SetTickLabels($xLabels);
  $graph->xaxis->SetTitle($xTitle,"center");
  $graph->xaxis->SetTitleMargin($xMargin);
  $graph->yaxis->SetTitle($yTitle, "center");
  $graph->yaxis->SetTitleMargin($yMargin);

  // Create a new box plot.
  $bp = new BoxPlot($yData);

  // Indent bars so they donít start and end at the edge of the plot area.
  $bp->SetCenter();

  // Width of the bars in pixels.
  $bp->SetWidth(25);

  // Set bar colors
  $bp->SetColor('black','white','black','white');

  // Set median colors
  $bp->SetMedianColor('black','black');
  $graph->Add($bp);

  if (isset($params["outputFile"])) {
    $outputFile = $params["outputFile"];
    $graph_name = "temp/$outputfile";      
  } else {
    $graph_name = "temp/boxplot.png";
  }

  $graph->Stroke($graph_name);

  echo "<img src='$graph_name' vspace='15' alt='$figureTitle'>";      
}
?>

Step 4: Show Descriptive Statistics

Another report that we will want to see early in our analysis is a descriptive statistics report. The descriptive statistics table comes directly from the showDescriptiveStatistics() method.

Table 2. Show Descriptive Statistics

Anxiety Levels
lowmoderatehigh
mean48.4048.0057.50
stdev12.6411.639.29
n101010

Step 5: Show the Single-factor ANOVA Source Table

If we have carefully studied our box plots and descriptive statistics then the results of our formal analysis of whether a significant mean difference exists should come as no surprise. Invoking the showSourceTable() method generated the ANOVA source table below. It reports the amount of variance attributable to the effect of our treatment (see "Between" row) versus the amount of variance to chalk up to experimental error (see "Within" row).

Table 3. Show Source Table

ANOVA Source Table
SourceSSdfMSFp
Between577.402288.702.040.15
Within3812.9027141.22  
Critical F(0.05, 2, 27) is 3.35.

The obtained F value comes from dividing the mean square error attributable to the treatment ($ms["between"]) by the mean square error attributable to experimental error ($ms["within"]). If this ratio is sufficiently large then we can reject the null hypothesis that there is no treatment effect (i.e., H0: u1 = u2 = u3). In the example above, the probability p of the observed F value is 0.15 — higher than the conventional 0.05 cutoff for declaring statistical significant. Our critical Fcrit = 3.35 is also larger than the obtained F. Both of these facts tell us that we cannot reject the null hypothesis. This could be because there is in fact no effect of anxiety on test scores. A null finding could occur if we had a poor experimental design that had so much experimental error that it washed out our treatment effects.

Perhaps we need to use a repeated-measures design instead of an independent-samples design to try to remove some individual differences in responding to anxiety.

Step 6: Show Mean Differences

If our F test tells us that a significant treatment effect exists, then we would begin performing multiple comparisons among the treatment-level means to isolate the specific, significant-mean differences. Because our obtained F was not significant, there is no need to proceed to the multiple comparison stage. It is nevertheless worthwhile examining the size of our effects by calling the showMeanDifferences() method. This report arranges treatment-level means from lowest to highest and labels the rows and columns accordingly.

Table 4. Show Mean Differences

Mean Differences
 Moderate [48.00]Low [48.40]High [57.50]
Moderate 0.49.5
Low  9.1
High   

Many people engage in data mining without much theoretical motivation for making particular treatment comparisons. In such cases, I recommend obtaining a significant F before performing post-hoc comparisons. This approach helps keep the Type-I error rate low. When there is a theoretical motivation for believing that a significant difference exists between particular treatment means, then we can bypass the F test and immediately engage in a priori (or planned) approach, such as Dunn or Planned Orthogonal Contrasts methods. These methods can yield significant results (i.e., comparing the high-anxiety group to the combined low- and medium-anxiety groups) even when our F ratio is not significant. In general, there is no harm done in always checking whether the F test is significant before engaging in post-hoc or a priori multiple comparisons.

Concluding Remarks

In this article, we interpreted the single-factor ANOVA procedure broadly to consist of several steps to undertake when conducting a single-factor ANOVA analysis. We illustrated these steps using a hypothetical test-anxiety study in which we carried out six steps of the procedure in a recommended order — the order commonly prescribed in undergraduate statistics textbooks. We have not exhausted all the required steps in this article. Because our treatment effect was not significant, we did not proceed to further stages of the analysis procedure.

Had our result come out significant, we would have engaged in the multiple-comparison step where we statistically analyze (using multiple T tests) the significant particular-mean differences and contrasts. We would also have run several diagnostic tests to determine whether the data met the assumptions of our statistical tests. Finally, we would have begun to construct a theoretical model of how our treatment levels might exert a causal influence on our response variable. The work reported in this article starts us toward a full-featured single factor ANOVA package but there is more implementation work to do.

In addition to teaching the basics of the ANOVA procedure, another purpose of this article was to demonstrate that PHP is viable platform for web-based statistical computing, especially when combined with MySQL and JpGraph. The code distribution for this article contains a benchmark.php script that you can use to verify that the critical analyze() method is very fast — easily crunching 100,000 records in under a half second (0.448 seconds) on a very modest hardware (1000 MHz processor with 256 RAM). A recent American Statistician article repeated the standard advice that new statistics graduates should be proficient in C-based languages, Fortran, and Java. You might add PHP to this list, especially if you intend to work in a Web medium and see a need for online, database-driven analysis packages.

Resources

Acknowledgements

Paul Meagher is a cognitive scientist whose graduate studies focused on mathematical problem solving.


Return to the PHP DevCenter.

Copyright © 2009 O'Reilly Media, Inc.