ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


O'Reilly Book Excerpts: MySQL Cookbook

Cooking with MySQL

Related Reading

MySQL Cookbook
By Paul DuBois

by Paul DuBois

Editor's note: Paul DuBois has selected sample recipes from the hundreds you'll find in his book, MySQL Cookbook. In this last excerpt in a three-part series showcasing these recipes, learn how to compute team standings, how to calculate the differences between successive rows, and how to find cumulative sums and running averages. (All excerpts are from Chapter 12, "Using Multiple Tables.")

Computing Team Standings

Problem

You want to compute team standings from their win-loss records, including the games-behind (GB) values.

Solution

Determine which team is in first place, then join that result to the original records.

Discussion

Standings for sports teams that compete against each other typically are ranked according to who has the best win-loss record, and the teams not in first place are assigned a "games-behind" value indicating how many games out of first place they are. This section shows how to calculate those values. The first example uses a table containing a single set of team records, to illustrate the logic of the calculations. The second example uses a table containing several sets of records; in this case, it's necessary to use a join to perform the calculations independently for each group of teams.

Consider the following table, standings1, which contains a single set of baseball team records (they represent the final standings for the Northern League in the year 1902):

mysql> SELECT team, wins, losses FROM standings1
    -> ORDER BY wins-losses DESC;
+-------------+------+--------+
| team        | wins | losses |
+-------------+------+--------+
| Winnipeg    |   37 |     20 |
| Crookston   |   31 |     25 |
| Fargo       |   30 |     26 |
| Grand Forks |   28 |     26 |
| Devils Lake |   19 |     31 |
| Cavalier    |   15 |     32 |
+-------------+------+--------+

In This Series

Cooking with MySQL
Paul DuBois has selected sample recipes from the hundreds you'll find in his book, MySQL Cookbook. In this second article in a three-part series showcasing these recipes, find out how to manage simultaneous AUTO_INCREMENT values, and how to use AUTO_INCREMENT values and related tables.

Cooking with MySQL
Paul DuBois has selected sample recipes from the hundreds you'll find in his book, MySQL Cookbook. In this first in a three-part series showcasing these sample recipes, find out how to interpret results from summaries and NULL values and how to identify duplicates in a table or record.

The records are sorted by the win-loss differential, which is how to place teams in order from first place to last place. But displays of team standings typically include each team's winning percentage and a figure indicating how many games behind the leader all the other teams are. So let's add that information to the output. Calculating the percentage is easy. It's the ratio of wins to total games played and can be determined using this expression:

wins / (wins + losses)

If you want to perform standings calculations under conditions when a team may not have played any games yet, that expression evaluates to NULL because it involves a division by zero. For simplicity, I'll assume a nonzero number of games, but if you want to handle this condition by mapping NULL to zero, generalize the expression as follows:

IFNULL(wins / (wins + losses),0)

or as:

wins / IF(wins=0,1,wins + losses)

Determining the games-behind value is a little trickier. It's based on the relationship of the win-loss records for two teams, calculated as the average of two values:

For example, suppose two teams A and B have the following win-loss records:

+------+------+--------+
| team | wins | losses |
+------+------+--------+
| A    |   17 |     11 |
| B    |   14 |     12 |
+------+------+--------+

Here, team B has to win three more games and team A has to lose one more game for the teams to be even. The average of three and one is two, thus B is two games behind A. Mathematically, the games-behind calculation for the two teams can be expressed like this:

((winsA - winsB) + (lossesB - lossesA)) / 2

With a little rearrangement of terms, the expression becomes:

((winsA - lossesA) - (winsB - lossesB)) / 2

The second expression is equivalent to the first, but it has each factor written as a single team's win-loss differential, rather than as a comparison between teams. That makes it easier to work with, because each factor can be determined independently from a single team record. The first factor represents the first place team's win-loss differential, so if we calculate that value first, all the other team GB values can be determined in relation to it.

The first place team is the one with the largest win-loss differential. To find that value and save it in a variable, use this query:

mysql> SELECT @wl_diff := MAX(wins-losses) FROM standings1;
+------------------------------+
| @wl_diff := MAX(wins-losses) |
+------------------------------+
|                           17 |
+------------------------------+

Then use the differential as follows to produce team standings that include winning percentage and GB values:

mysql> SELECT team, wins AS W, losses AS L,
    -> wins/(wins+losses) AS PCT,
    -> (@wl_diff - (wins-losses)) / 2 AS GB
    -> FROM standings1
    -> ORDER BY wins-losses DESC, PCT DESC;
+-------------+------+------+------+------+
| team        | W    | L    | PCT  | GB   |
+-------------+------+------+------+------+
| Winnipeg    |   37 |   20 | 0.65 |    0 |
| Crookston   |   31 |   25 | 0.55 |  5.5 |
| Fargo       |   30 |   26 | 0.54 |  6.5 |
| Grand Forks |   28 |   26 | 0.52 |  7.5 |
| Devils Lake |   19 |   31 | 0.38 | 14.5 |
| Cavalier    |   15 |   32 | 0.32 |   17 |
+-------------+------+------+------+------+

There are a couple of minor formatting issues that can be addressed at this point. Percentages in standings generally are displayed to three decimals, and the GB value for the first place team is displayed as - rather than as 0. To display three decimals, TRUNCATE(expr,3) can be used. To display the GB value for the first place team appropriately, put the expression that calculates the GB column within a call to IF( ) that maps 0 to a dash:

mysql> SELECT team, wins AS W, losses AS L,
    -> TRUNCATE(wins/(wins+losses),3) AS PCT,
    -> IF((@wl_diff - (wins-losses)) = 0,'-',(@wl_diff - (wins-losses))/2) AS GB
    -> FROM standings1
    -> ORDER BY wins-losses DESC, PCT DESC;
+-------------+------+------+-------+------+
| team        | W    | L    | PCT   | GB   |
+-------------+------+------+-------+------+
| Winnipeg    |   37 |   20 | 0.649 | -    |
| Crookston   |   31 |   25 | 0.553 | 5.5  |
| Fargo       |   30 |   26 | 0.535 | 6.5  |
| Grand Forks |   28 |   26 | 0.518 | 7.5  |
| Devils Lake |   19 |   31 | 0.380 | 14.5 |
| Cavalier    |   15 |   32 | 0.319 | 17   |
+-------------+------+------+-------+------+

These queries order the teams by win-loss differential, using winning percentage as a tie-breaker in case there are teams with the same differential value. It would be simpler just to sort by percentage, of course, but then you wouldn't always get the correct ordering. It's a curious fact that a team with a lower winning percentage can actually be higher in the standings than a team with a higher percentage. (This generally occurs early in the season, when teams may have played highly disparate numbers of games, relatively speaking.) Consider the case where two teams A and B have the following records:

+------+------+--------+
| team | wins | losses |
+------+------+--------+
| A    |    4 |      1 |
| B    |    2 |      0 |
+------+------+--------+

Applying the GB and percentage calculations to these team records yields the following result, where the first place team actually has a lower winning percentage than the second place team:

+------+------+------+-------+------+
| team | W    | L    | PCT   | GB   |
+------+------+------+-------+------+
| A    |    4 |    1 | 0.800 | -    |
| B    |    2 |    0 | 1.000 | 0.5  |
+------+------+------+-------+------+

The standings calculations shown thus far can be done without a join. They involve only a single set of team records, so the first place team's win-loss differential can be stored in a variable. A more complex situation occurs when a dataset includes several sets of team records. For example, the 1997 Northern League had two divisions (Eastern and Western). In addition, separate standings were maintained for the first and second halves of the season, because season-half winners in each division played each other for the right to compete in the league championship. The following table, standings2, shows what these records look like, ordered by season half, division, and win-loss differential:

mysql> SELECT half, div, team, wins, losses FROM standings2
    -> ORDER BY half, div, wins-losses DESC;
+------+---------+-----------------+------+--------+
| half | div     | team            | wins | losses |
+------+---------+-----------------+------+--------+
|    1 | Eastern | St. Paul        |   24 |     18 |
|    1 | Eastern | Thunder Bay     |   18 |     24 |
|    1 | Eastern | Duluth-Superior |   17 |     24 |
|    1 | Eastern | Madison         |   15 |     27 |
|    1 | Western | Winnipeg        |   29 |     12 |
|    1 | Western | Sioux City      |   28 |     14 |
|    1 | Western | Fargo-Moorhead  |   21 |     21 |
|    1 | Western | Sioux Falls     |   15 |     27 |
|    2 | Eastern | Duluth-Superior |   22 |     20 |
|    2 | Eastern | St. Paul        |   21 |     21 |
|    2 | Eastern | Madison         |   19 |     23 |
|    2 | Eastern | Thunder Bay     |   18 |     24 |
|    2 | Western | Fargo-Moorhead  |   26 |     16 |
|    2 | Western | Winnipeg        |   24 |     18 |
|    2 | Western | Sioux City      |   22 |     20 |
|    2 | Western | Sioux Falls     |   16 |     26 |
+------+---------+-----------------+------+--------+

Generating the standings for these records requires computing the GB values separately for each of the four combinations of season half and division. Begin by calculating the win-loss differential for the first place team in each group and saving the values into a separate firstplace table:

mysql> CREATE TABLE firstplace
    -> SELECT half, div, MAX(wins-losses) AS wl_diff
    -> FROM standings2
    -> GROUP BY half, div;

Then join the firstplace table to the original standings, associating each team record with the proper win-loss differential to compute its GB value:

mysql> SELECT wl.half, wl.div, wl.team, wl.wins AS W, wl.losses AS L,
    -> TRUNCATE(wl.wins/(wl.wins+wl.losses),3) AS PCT,
    -> IF((fp.wl_diff - (wl.wins-wl.losses)) = 0,
    -> '-', (fp.wl_diff - (wl.wins-wl.losses)) / 2) AS GB
    -> FROM standings2 AS wl, firstplace AS fp
    -> WHERE wl.half = fp.half AND wl.div = fp.div
    -> ORDER BY wl.half, wl.div, wl.wins-wl.losses DESC, PCT DESC;
+------+---------+-----------------+------+------+-------+-------+
| half | div     | team            | W    | L    | PCT   | GB    |
+------+---------+-----------------+------+------+-------+-------+
|    1 | Eastern | St. Paul        |   24 |   18 | 0.571 | -     |
|    1 | Eastern | Thunder Bay     |   18 |   24 | 0.428 | 6.00  |
|    1 | Eastern | Duluth-Superior |   17 |   24 | 0.414 | 6.50  |
|    1 | Eastern | Madison         |   15 |   27 | 0.357 | 9.00  |
|    1 | Western | Winnipeg        |   29 |   12 | 0.707 | -     |
|    1 | Western | Sioux City      |   28 |   14 | 0.666 | 1.50  |
|    1 | Western | Fargo-Moorhead  |   21 |   21 | 0.500 | 8.50  |
|    1 | Western | Sioux Falls     |   15 |   27 | 0.357 | 14.50 |
|    2 | Eastern | Duluth-Superior |   22 |   20 | 0.523 | -     |
|    2 | Eastern | St. Paul        |   21 |   21 | 0.500 | 1.00  |
|    2 | Eastern | Madison         |   19 |   23 | 0.452 | 3.00  |
|    2 | Eastern | Thunder Bay     |   18 |   24 | 0.428 | 4.00  |
|    2 | Western | Fargo-Moorhead  |   26 |   16 | 0.619 | -     |
|    2 | Western | Winnipeg        |   24 |   18 | 0.571 | 2.00  |
|    2 | Western | Sioux City      |   22 |   20 | 0.523 | 4.00  |
|    2 | Western | Sioux Falls     |   16 |   26 | 0.380 | 10.00 |
+------+---------+-----------------+------+------+-------+-------+

That output is somewhat difficult to read, however. To make it easier to understand, you'd likely execute the query from within a program and reformat its results to display each set of team records separately. Here's some Perl code that does that by beginning a new output group each time it encounters a new group of standings. The code assumes that the join query has just been executed and that its results are available through the statement handle $sth:

my ($cur_half, $cur_div) = ("", "");
while (my ($half, $div, $team, $wins, $losses, $pct, $gb)
            = $sth->fetchrow_array ( ))
{
    if ($cur_half ne $half || $cur_div ne $div) # new group of standings?
    {
        # print standings header and remember new half/division values
        print "\n$div Division, season half $half\n";
        printf "%-20s  %3s  %3s  %5s  %s\n", "Team", "W", "L", "PCT", "GB";
        $cur_half = $half;
        $cur_div = $div;
    }
    printf "%-20s  %3d  %3d  %5s  %s\n", $team, $wins, $losses, $pct, $gb;
}

The reformatted output looks like this:

Eastern Division, season half 1
Team                    W    L   PCT  GB
St. Paul               24   18  0.57  -
Thunder Bay            18   24  0.43  6.00
Duluth-Superior        17   24  0.41  6.50
Madison                15   27  0.36  9.00

Western Division, season half 1
Team                    W    L   PCT  GB
Winnipeg               29   12  0.71  -
Sioux City             28   14  0.67  1.50
Fargo-Moorhead         21   21  0.50  8.50
Sioux Falls            15   27  0.36  14.50

Eastern Division, season half 2
Team                    W    L   PCT  GB
Duluth-Superior        22   20  0.52  -
St. Paul               21   21  0.50  1.00
Madison                19   23  0.45  3.00
Thunder Bay            18   24  0.43  4.00

Western Division, season half 2
Team                    W    L   PCT  GB
Fargo-Moorhead         26   16  0.62  -
Winnipeg               24   18  0.57  2.00
Sioux City             22   20  0.52  4.00
Sioux Falls            16   26  0.38  10.00

The code just shown that produces plain text output comes from the script calc_standings.pl in the joins directory of the recipes distribution. That directory also contains a PHP script, calc_standings.php, that takes the alternative approach of producing output in the form of HTML tables, which you might prefer for generating standings in a web environment.

Calculating Differences Between Successive Rows

Problem

You have a table containing successive cumulative values in its rows and you want to compute the differences between pairs of successive rows.

Solution

Use a self-join that matches up pairs of adjacent rows and calculates the differences between members of each pair.

Discussion

Self-joins are useful when you have a set of absolute (or cumulative) values that you want to convert to relative values representing the differences between successive pairs of rows. For example, if you take an automobile trip and write down the total miles traveled at each stopping point, you can compute the difference between successive points to determine the distance from one stop to the next. Here is such a table that shows the stops for a trip from San Antonio, Texas to Madison, Wisconsin. Each row shows the total miles driven as of each stop:

mysql> SELECT seq, city, miles FROM trip_log ORDER BY seq;
+-----+------------------+-------+
| seq | city             | miles |
+-----+------------------+-------+
|   1 | San Antonio, TX  |     0 |
|   2 | Dallas, TX       |   263 |
|   3 | Benton, AR       |   566 |
|   4 | Memphis, TN      |   745 |
|   5 | Portageville, MO |   878 |
|   6 | Champaign, IL    |  1164 |
|   7 | Madison, WI      |  1412 |
+-----+------------------+-------+

A self-join can convert these cumulative values to successive differences that represent the distances from each city to the next. The following query shows how to use the sequence numbers in the records to match up pairs of successive rows and compute the differences between each pair of mileage values:

mysql> SELECT t1.seq AS seq1,  t2.seq AS seq2,
    -> t1.city AS city1, t2.city AS city2,
    -> t1.miles AS miles1, t2.miles AS miles2,
    -> t2.miles-t1.miles AS dist
    -> FROM trip_log AS t1, trip_log AS t2
    -> WHERE t1.seq+1 = t2.seq
    -> ORDER BY t1.seq;
+------+------+------------------+------------------+--------+--------+------+
| seq1 | seq2 | city1            | city2            | miles1 | miles2 | dist |
+------+------+------------------+------------------+--------+--------+------+
|    1 |    2 | San Antonio, TX  | Dallas, TX       |      0 |    263 |  263 |
|    2 |    3 | Dallas, TX       | Benton, AR       |    263 |    566 |  303 |
|    3 |    4 | Benton, AR       | Memphis, TN      |    566 |    745 |  179 |
|    4 |    5 | Memphis, TN      | Portageville, MO |    745 |    878 |  133 |
|    5 |    6 | Portageville, MO | Champaign, IL    |    878 |   1164 |  286 |
|    6 |    7 | Champaign, IL    | Madison, WI      |   1164 |   1412 |  248 |
+------+------+------------------+------------------+--------+--------+------+

The presence of the seq column in the trip_log table is important for calculating successive difference values. It's needed for establishing which row precedes another and matching each row n with row n+1. The implication is that a table should include a sequence column that has no gaps if you want to perform relative-difference calculations from absolute or cumulative values. If the table contains a sequence column but there are gaps, renumber it. If the table contains no such column, add one. and describe how to perform these operations.

A somewhat more complex situation occurs when you compute successive differences for more than one column and use the results in a calculation. The following table, player_stats, shows some cumulative numbers for a baseball player at the end of each month of his season. ab indicates the total at-bats and h the total hits the player has had as of a given date. (The first record indicates the starting point of the player's season, which is why the ab and h values are zero.)

mysql> SELECT id, date, ab, h, TRUNCATE(IFNULL(h/ab,0),3) AS ba
    -> FROM player_stats ORDER BY id;
+----+------------+-----+----+-------+
| id | date       | ab  | h  | ba    |
+----+------------+-----+----+-------+
|  1 | 2001-04-30 |   0 |  0 | 0.000 |
|  2 | 2001-05-31 |  38 | 13 | 0.342 |
|  3 | 2001-06-30 | 109 | 31 | 0.284 |
|  4 | 2001-07-31 | 196 | 49 | 0.250 |
|  5 | 2001-08-31 | 304 | 98 | 0.322 |
+----+------------+-----+----+-------+

The last column of the query result also shows the player's batting average as of each date. This column is not stored in the table, but is easily computed as the ratio of hits to at-bats. The result provides a general idea of how the player's hitting performance changed over the course of the season, but it doesn't give a very informative picture of how the player did during each individual month. To determine that, it's necessary to calculate relative differences between pairs of rows. This is easily done with a self-join that matches each row n with row n+1, to calculate differences between pairs of at-bats and hits values. These differences allow batting average during each month to be computed:

mysql> SELECT
    -> t1.id AS id1, t2.id AS id2,
    -> t2.date,
    -> t1.ab AS ab1, t2.ab AS ab2,
    -> t1.h AS h1, t2.h AS h2,
    -> t2.ab-t1.ab AS abdiff,
    -> t2.h-t1.h AS hdiff,
    -> TRUNCATE(IFNULL((t2.h-t1.h)/(t2.ab-t1.ab),0),3) AS ba
    -> FROM player_stats AS t1, player_stats AS t2
    -> WHERE t1.id+1 = t2.id
    -> ORDER BY t1.id;
+-----+-----+------------+-----+-----+----+----+--------+-------+-------+
| id1 | id2 | date       | ab1 | ab2 | h1 | h2 | abdiff | hdiff | ba    |
+-----+-----+------------+-----+-----+----+----+--------+-------+-------+
|   1 |   2 | 2001-05-31 |   0 |  38 |  0 | 13 |     38 |    13 | 0.342 |
|   2 |   3 | 2001-06-30 |  38 | 109 | 13 | 31 |     71 |    18 | 0.253 |
|   3 |   4 | 2001-07-31 | 109 | 196 | 31 | 49 |     87 |    18 | 0.206 |
|   4 |   5 | 2001-08-31 | 196 | 304 | 49 | 98 |    108 |    49 | 0.453 |
+-----+-----+------------+-----+-----+----+----+--------+-------+-------+

These results show much more clearly than the original table does that the player started off well, but had a slump in the middle of the season, particularly in July. They also indicate just how strong his performance was in August.

Finding Cumulative Sums and Running Averages

Problem

You have a set of observations measured over time and want to compute the cumulative sum of the observations at each measurement point. Or you want to compute a running average at each point.

Solution

Use a self-join to produce the sets of successive observations at each measurement point, then apply aggregate functions to each set of values to compute its sum or average.

Discussion

"Calculating Differences Between Successive Rows" illustrates how a self-join can produce relative values from absolute values. A self-join can do the opposite as well, producing cumulative values at each successive stage of a set of observations. The following table shows a set of rainfall measurements taken over a series of days. The values in each row show the observation date and the amount of precipitation in inches:

mysql> SELECT date, precip FROM rainfall ORDER BY date;
+------------+--------+
| date       | precip |
+------------+--------+
| 2002-06-01 |   1.50 |
| 2002-06-02 |   0.00 |
| 2002-06-03 |   0.50 |
| 2002-06-04 |   0.00 |
| 2002-06-05 |   1.00 |
+------------+--------+

To calculate cumulative rainfall for a given day, sum that day's precipitation value with the values for all the previous days. For example, the cumulative rainfall as of 2002-06-03 is determined like this:

mysql> SELECT SUM(precip) FROM rainfall WHERE date <= '2002-06-03';
+-------------+
| SUM(precip) |
+-------------+
|        2.00 |
+-------------+

If you want the cumulative figures for all days that are represented in the table, it would be tedious to compute the value for each of them separately. A self-join can do this for all days with a single query. Use one instance of the rainfall table as a reference, and determine for the date in each row the sum of the precip values in all rows occurring up through that date in another instance of the table. The following query shows the daily and cumulative precipitation for each day:

mysql> SELECT t1.date, t1.precip AS 'daily precip',
    -> SUM(t2.precip) AS 'cum. precip'
    -> FROM rainfall AS t1, rainfall AS t2
    -> WHERE t1.date >= t2.date
    -> GROUP BY t1.date;
+------------+--------------+-------------+
| date       | daily precip | cum. precip |
+------------+--------------+-------------+
| 2002-06-01 |         1.50 |        1.50 |
| 2002-06-02 |         0.00 |        1.50 |
| 2002-06-03 |         0.50 |        2.00 |
| 2002-06-04 |         0.00 |        2.00 |
| 2002-06-05 |         1.00 |        3.00 |
+------------+--------------+-------------+

The self-join can be extended to display the number of days elapsed at each date, as well as the running averages for amount of precipitation each day:

mysql> SELECT t1.date, t1.precip AS 'daily precip',
    -> SUM(t2.precip) AS 'cum. precip',
    -> COUNT(t2.precip) AS days,
    -> AVG(t2.precip) AS 'avg. precip'
    -> FROM rainfall AS t1, rainfall AS t2
    -> WHERE t1.date >= t2.date
    -> GROUP BY t1.date;
+------------+--------------+-------------+------+-------------+
| date       | daily precip | cum. precip | days | avg. precip |
+------------+--------------+-------------+------+-------------+
| 2002-06-01 |         1.50 |        1.50 |    1 |    1.500000 |
| 2002-06-02 |         0.00 |        1.50 |    2 |    0.750000 |
| 2002-06-03 |         0.50 |        2.00 |    3 |    0.666667 |
| 2002-06-04 |         0.00 |        2.00 |    4 |    0.500000 |
| 2002-06-05 |         1.00 |        3.00 |    5 |    0.600000 |
+------------+--------------+-------------+------+-------------+

In the preceding query, the number of days elapsed and the precipitation running averages can be computed easily using COUNT( ) and AVG( ) because there are no missing days in the table. If missing days are allowed, the calculation becomes more complicated, because the number of days elapsed for each calculation no longer will be the same as the number of records. You can see this by deleting the records for the days that had no precipitation to produce a couple of "holes" in the table:

mysql> DELETE FROM rainfall WHERE precip = 0;
mysql> SELECT date, precip FROM rainfall ORDER BY date;
+------------+--------+
| date       | precip |
+------------+--------+
| 2002-06-01 |   1.50 |
| 2002-06-03 |   0.50 |
| 2002-06-05 |   1.00 |
+------------+--------+

Deleting those records doesn't change the cumulative sum or running average for the dates that remain, but does change how they must be calculated. If you try the self-join again, it yields incorrect results for the days-elapsed and average precipitation columns:

mysql> SELECT t1.date, t1.precip AS 'daily precip',
    -> SUM(t2.precip) AS 'cum. precip',
    -> COUNT(t2.precip) AS days,
    -> AVG(t2.precip) AS 'avg. precip'
    -> FROM rainfall AS t1, rainfall AS t2
    -> WHERE t1.date >= t2.date
    -> GROUP BY t1.date;
+------------+--------------+-------------+------+-------------+
| date       | daily precip | cum. precip | days | avg. precip |
+------------+--------------+-------------+------+-------------+
| 2002-06-01 |         1.50 |        1.50 |    1 |    1.500000 |
| 2002-06-03 |         0.50 |        2.00 |    2 |    1.000000 |
| 2002-06-05 |         1.00 |        3.00 |    3 |    1.000000 |
+------------+--------------+-------------+------+-------------+

To fix the problem, it's necessary to determine the number of days elapsed a different way. Take the minimum and maximum date involved in each sum and calculate a days-elapsed value from them using the following expression:

TO_DAYS(MAX(t2.date)) - TO_DAYS(MIN(t2.date)) + 1

That value must be used for the days-elapsed column and for computing the running averages. The resulting query is as follows:

mysql> SELECT t1.date, t1.precip AS 'daily precip',
    -> SUM(t2.precip) AS 'cum. precip',
    -> TO_DAYS(MAX(t2.date)) - TO_DAYS(MIN(t2.date)) + 1 AS days,
    -> SUM(t2.precip) / (TO_DAYS(MAX(t2.date)) - TO_DAYS(MIN(t2.date)) + 1)
    -> AS 'avg. precip'
    -> FROM rainfall AS t1, rainfall AS t2
    -> WHERE t1.date >= t2.date
    -> GROUP BY t1.date;
+------------+--------------+-------------+------+-------------+
| date       | daily precip | cum. precip | days | avg. precip |
+------------+--------------+-------------+------+-------------+
| 2002-06-01 |         1.50 |        1.50 |    1 |      1.5000 |
| 2002-06-03 |         0.50 |        2.00 |    3 |      0.6667 |
| 2002-06-05 |         1.00 |        3.00 |    5 |      0.6000 |
+------------+--------------+-------------+------+-------------+

As this example illustrates, calculation of cumulative values from relative values requires only a column that allows rows to be placed into the proper order. (For the rainfall table, that's the date column.) Values in the column need not be sequential, or even numeric. This differs from calculations that produce difference values from cumulative values ("Calculating Differences Between Successive Rows"), which require that a table have a column that contains an unbroken sequence.

The running averages in the rainfall examples are based on dividing cumulative precipitation sums by number of days elapsed as of each day. When the table has no gaps, the number of days is the same as the number of values summed, making it easy to find successive averages. When records are missing, the calculations become more complex. What this demonstrates is that it's necessary to consider the nature of your data and calculate averages appropriately. The next example is conceptually similar to the previous ones in that it calculates cumulative sums and running averages, but it performs the computations yet another way.

The following table shows a marathon runner's performance at each stage of a 26-kilometer run. The values in each row show the length of each stage in kilometers and how long the runner took to complete the stage. In other words, the values pertain to intervals within the marathon and thus are relative to the whole:

mysql> SELECT stage, km, t FROM marathon ORDER BY stage;
+-------+----+----------+
| stage | km | t        |
+-------+----+----------+
|     1 |  5 | 00:15:00 |
|     2 |  7 | 00:19:30 |
|     3 |  9 | 00:29:20 |
|     4 |  5 | 00:17:50 |
+-------+----+----------+

To calculate cumulative distance in kilometers at each stage, use a self-join that looks like this:

mysql> SELECT t1.stage, t1.km, SUM(t2.km) AS 'cum. km'
    -> FROM marathon AS t1, marathon AS t2
    -> WHERE t1.stage >= t2.stage
    -> GROUP BY t1.stage;
+-------+----+---------+
| stage | km | cum. km |
+-------+----+---------+
|     1 |  5 |       5 |
|     2 |  7 |      12 |
|     3 |  9 |      21 |
|     4 |  5 |      26 |
+-------+----+---------+

Cumulative distances are easy to compute because they can be summed directly. The calculation for accumulating time values is a little more involved. It's necessary to convert times to seconds, sum the resulting values, and convert the sum back to a time value. To compute the runner's average speed at the end of each stage, take the ratio of cumulative distance over cumulative time. Putting all this together yields the following query:

mysql> SELECT t1.stage, t1.km, t1.t,
    -> SUM(t2.km) AS 'cum. km',
    -> SEC_TO_TIME(SUM(TIME_TO_SEC(t2.t))) AS 'cum. t',
    -> SUM(t2.km)/(SUM(TIME_TO_SEC(t2.t))/(60*60)) AS 'avg. km/hour'
    -> FROM marathon AS t1, marathon AS t2
    -> WHERE t1.stage >= t2.stage
    -> GROUP BY t1.stage;
+-------+----+----------+---------+----------+--------------+
| stage | km | t        | cum. km | cum. t   | avg. km/hour |
+-------+----+----------+---------+----------+--------------+
|     1 |  5 | 00:15:00 |       5 | 00:15:00 |      20.0000 |
|     2 |  7 | 00:19:30 |      12 | 00:34:30 |      20.8696 |
|     3 |  9 | 00:29:20 |      21 | 01:03:50 |      19.7389 |
|     4 |  5 | 00:17:50 |      26 | 01:21:40 |      19.1020 |
+-------+----+----------+---------+----------+--------------+

We can see from this that the runner's average pace increased a little during the second stage of the race, but then (presumably as a result of fatigue) decreased thereafter.

Paul DuBois is one of the primary contributors to the MySQL Reference Manual. He is also the author of Using csh & tcsh and Software Portability with imake by O'Reilly, as well as MySQL and MySQL and Perl for the Web by New Riders.


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.