As I have discussed more than once, the previously outlined emulation of analytic functions is fine as long as we wisely restrain ourselves to using only the first (or last) n elements in what would have been our partition clause. This is a factor only when n is greater than 1; if n is 1, we don't need anything more sophisticated than the regular usage of
MIN(). There are cases, however, when we may need to hold more than the first 100 values or so in our list. Suppose, for instance, that we want to compute the median salary for each of our departments. We would need big ordered lists to be able to locate the middle value. Raising the value of the
group_concat_max_len server variable and hoping for the best is not a responsible option.
Fortunately, there is another possibility, and once again we will use techniques that are not available in the Oracle world: using variables on which we perform calculations in queries. Numbering rows as they are returned and emulating Oracle's rownum virtual column is a well-known technique. Mark Malakanov gave a particularly elegant example on his blog in which the variable is initialized inside the query:
SELECT @rownum:=@rownum+1 rownum, t.* FROM (SELECT @rownum:=0) r, mytable t;
Actually, this approach isn't strictly identical to the Oracle result, which can be checked by adding an
order by clause. With Oracle, rownum is incremented as rows are returned from the database and before any sort is performed. If you add an
order by clause, your rownum values get mixed up. With MySQL, the value is incremented as it is displayed and therefore computed after any sort. From a practical point of view, this is rather more convenient.
Assuming that our query is suitably sorted by department number and decreasing salary, if we want to rank by department, we need to reinitialize the counter each time we switch department in our result set. We therefore need two variables—one to compute the number (in the following example we use a simple row number and ignore ties, for the sake of simplification), and one to hold the previous department number value so we can compare it to the value in the current row. We might be very tempted to use something such as:
select if (@dept = DEPTNO,
           @rn := @rn + 1,
           @rn := 1) rownum,
       (@dept := DEPTNO) DEPTNO,
       LASTNAME,
       FIRSTNAME,
       SAL
from EMPLOYEES,
     (select (@dept := 0)) as x
order by DEPTNO, SAL desc;
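The counter-reset logic in that query is easier to see in procedural form. Here is a Python sketch of the same idea; the sample rows are made up and only the reset behavior matters:

```python
# Per-partition row numbering with a counter reset, mirroring the
# @dept/@rn variables in the MySQL query above. Sample data is made up.
rows = [  # (DEPTNO, LASTNAME, SAL), already sorted by DEPTNO, SAL desc
    (10, "WOLFORD", 7997.34),
    (10, "HARRIS",  7993.74),
    (20, "SMITH",   6500.00),
    (20, "JONES",   6400.00),
    (20, "BROWN",   6400.00),
]

prev_dept = None   # plays the role of @dept
rn = 0             # plays the role of @rn
numbered = []
for deptno, lastname, sal in rows:
    if deptno == prev_dept:
        rn += 1             # same department: keep counting
    else:
        rn = 1              # new department: restart at 1
        prev_dept = deptno  # remember it for the next row
    numbered.append((rn, deptno, lastname, sal))
```

As in the SQL version, ties on SAL are ignored: rows with equal salaries simply get consecutive numbers.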
There is a catch, however. The MySQL documentation states clearly that there is no guarantee about the order of evaluation of the various variables in the different (pseudo) columns. This is legitimate, because in relational theory, the order of columns is irrelevant. However, the previous query implicitly assumes that we compare the current department number to the value stored in
@dept before assigning it to the same variable. If assignments are performed the other way around, it no longer works.
The only solution available to force comparisons and assignments to be performed in the order we want is to have all of them done inside a single element in our select list.
An elegant way to compute a variable and hide it from public view is to assign it in an expression that discards part of its result. Baron Schwartz gives an excellent example in a recent ONLamp article that also discusses ranking: the idea is to use the
least() function, which only returns the smallest of its parameters but has to evaluate all of them. Since department numbers are positive values, we know that
least(0, @dept := DEPTNO) will assign the value of
DEPTNO to the
@dept variable, then return
0, which we can safely add to any numerical value (the rank, for instance). A proper way to rewrite the previous, potentially wrong query might therefore be something such as:
select if (@dept = DEPTNO,
           @rn := @rn + 1,
           @rn := 1 + least(0, @dept := DEPTNO)) rownum,
       DEPTNO,
       LASTNAME,
       FIRSTNAME,
       SAL
from EMPLOYEES,
     (select (@dept := 0)) as x
order by DEPTNO, SAL desc;
In that case, the department number is first compared to the contents of the variable; if they are equal, we simply have to increment the row number. If they aren't, we initialize the row number to 1 and surreptitiously add a 0, the only purpose of which is to set
@dept to the new
DEPTNO value. Neat. And we control the order in which the various operations are performed.
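The same evaluate-all-arguments guarantee can be mimicked in any language. Here is a small Python analog of the least() trick; the state dict and the set_dept() helper are illustrative only, not part of any query:

```python
# Python analog of the least() trick: min() must evaluate all of its
# arguments, so an assignment tucked into one of them runs as a side
# effect and contributes a neutral 0 to the surrounding arithmetic.
state = {"dept": 0}   # stands in for the @dept user variable

def set_dept(deptno):
    state["dept"] = deptno   # the hidden assignment, like @dept := DEPTNO
    return 0                 # neutral element for the addition

# "rank := 1 + least(0, @dept := DEPTNO)" becomes:
rank = 1 + min(0, set_dept(30))
```

After this expression runs, rank is 1 and the "variable" has been silently set to 30, just as least(0, @dept := DEPTNO) leaves the rank untouched while updating @dept.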
Now that you understand the principle, we can refine and transform our row number into a true rank. If we want to emulate
rank(), we need to hold not only the previous value of
DEPTNO but also the previous value of
SAL, to check whether we must increase our rank. Moreover, we must check how many identical values we have in order to decide by how much we must increase the rank value next (
dense_rank() is easier to deal with).
1: if (@dept = DEPTNO,
2:     if (@sal = SAL,
3:         @rnk := @rnk + least(0, @inc := @inc + 1),
4:         @rnk := @rnk + greatest(@inc, @inc := 1)
5:                      + least(0, @sal := SAL)),
6:     @rnk := 1 + least(0, @dept := DEPTNO)
7:              + least(0, @sal := SAL)
8:              + least(0, @inc := 1)) rank
A few explanations are probably welcome:
When the department changes, lines 6 through 8 set the rank to 1 and use the least() trick to initialize three variables—one to store the current department number as with the row number computation, one to store the current salary, and one to store by how much I will have to increase the rank next (1 by default).
When the salary is identical to the previous one, line 3 adds 0 to the @rnk variable and uses the least() trick to increment, behind the scenes, the value by which the rank will change the next time the salary changes. I may have a tie—for instance, a rank 23 followed by another rank 23—but if afterwards the salary is smaller, then I want a rank 25.
If the salary is different from (which, given the descending sort, means less than) the previous one, then MySQL executes what we find on lines 4 and 5. We add the increment and reset it to 1 in a single operation by using a trick very similar to what we have done with
least(). This particular step is somewhat bold, because although we know that both expressions will be evaluated, I have found no guarantee anywhere that they will be evaluated from left to right. The left-to-right evaluation order is, however, "natural" in this context (witness
coalesce()), and it works. Then we use
least() to reset the
@sal variable to the new current value of the salary. The final query becomes:
mysql> select *
    -> from (select if (@dept = DEPTNO,
    ->                  if (@sal = SAL,
    ->                      @rnk := @rnk + least(0, @inc := @inc + 1),
    ->                      @rnk := @rnk + greatest(@inc, @inc := 1)
    ->                                   + least(0, @sal := SAL)),
    ->                  @rnk := 1 + least(0, @dept := DEPTNO)
    ->                            + least(0, @sal := SAL)
    ->                            + least(0, @inc := 1)) rank,
    ->              DEPTNO,
    ->              LASTNAME,
    ->              FIRSTNAME,
    ->              SAL
    ->       from EMPLOYEES,
    ->            (select (@dept := 0)) as x
    ->       order by DEPTNO, SAL desc) as y
    -> where rank <= 5;
+------+--------+------------+-----------+---------+
| rank | DEPTNO | LASTNAME   | FIRSTNAME | SAL     |
+------+--------+------------+-----------+---------+
|    1 |     10 | WOLFORD    | KATHY     | 7997.34 |
|    2 |     10 | HARRIS     | GARLAND   | 7993.74 |
[snip]
|    4 |     60 | BURMEISTER | HATTIE    | 7973.04 |
|    5 |     60 | LADD       | JEFFREY   | 7952.5  |
+------+--------+------------+-----------+---------+
30 rows in set (0.06 sec)
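Stripped of SQL quirks, the rank() bookkeeping reads like this in Python; the sample salaries are made up:

```python
# rank() emulation: ties share a rank, and the rank then jumps by the
# number of tied rows, exactly as @rnk/@sal/@inc do in the query above.
rows = [  # (DEPTNO, SAL), sorted by DEPTNO, SAL desc; data is made up
    (10, 8000.0), (10, 7500.0), (10, 7500.0), (10, 7000.0),
    (20, 9000.0),
]

prev_dept = prev_sal = None
rank = inc = 0
ranked = []
for deptno, sal in rows:
    if deptno != prev_dept:        # new partition: reset everything
        rank, inc = 1, 1
        prev_dept, prev_sal = deptno, sal
    elif sal == prev_sal:          # tie: same rank, remember the gap
        inc += 1
    else:                          # salary dropped: jump by the gap
        rank += inc
        inc = 1
        prev_sal = sal
    ranked.append((deptno, sal, rank))
```

The two tied rows at 7500 both get rank 2, and the next salary down gets rank 4; for dense_rank() we would simply always add 1 instead of inc.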
We are now on equal footing with the Oracle analytic function from a performance standpoint. Admittedly, from a syntactical point of view the analytic functions are much better. The
find_in_set() combination is a good deal better, too, if the restrictions and the performance are acceptable. But the functionality is here and can be easily extended to several analytic functions.
Another commonly used Oracle analytic function is:
LAG(expression, offset) over (partition by ... order by ...)
which returns the offset-th previous value for the current row's expression with regard to what is specified by the
over () expression. The most commonly used offset is 1, and
LAG() is very convenient for computing a difference between two successive measures of the same cumulative counter (it returns NULL when the offset-th previous value doesn't exist). "Memorizing" previous values is what we have just done with user variables, and therefore there is nothing new here.
If we want to compute the average number of days between two successive hires by department, we can write something like this with Oracle:
SQL> select DEPTNO, AVG(HIRE_INTERVAL)
  2  from (select DEPTNO,
  3               HIREDATE - LAG(HIREDATE, 1)
  4                   over (partition by DEPTNO
  5                         order by HIREDATE) HIRE_INTERVAL
  6        from EMPLOYEES)
  7  group by DEPTNO
  8  /
    DEPTNO AVG(HIRE_INTERVAL)
---------- ------------------
        10          6.6842104
        20          3.5556686
        30         3.52857866
        40         3.50300597
        50         3.44334974
        60         7.12665996
6 rows selected.
Elapsed: 00:00:00.06 SQL>
By applying the techniques we have used for ranks, we create this with MySQL:
mysql> select DEPTNO, avg(HIRE_INTERVAL)
    -> from (select DEPTNO,
    ->              if (@dept = DEPTNO,
    ->                  datediff(HIREDATE, @hd) + least(0, @hd := HIREDATE),
    ->                  NULL + least(0, @dept := DEPTNO) + (@hd := NULL))
    ->              HIRE_INTERVAL
    ->       from EMPLOYEES,
    ->            (select (@dept := 0)) as a
    ->       order by DEPTNO, HIREDATE) as b
    -> group by DEPTNO;
+--------+--------------------+
| DEPTNO | avg(HIRE_INTERVAL) |
+--------+--------------------+
|     10 |    6.6858237547893 |
|     20 |    3.5564598168871 |
|     30 |    3.5303643724696 |
|     40 |    3.5002506265664 |
|     50 |    3.4406111384919 |
|     60 |    7.1329243353783 |
+--------+--------------------+
6 rows in set (0.06 sec)
We must be very careful not to forget the
order by clause in the subquery aliased as b, since this is what matches the
order by in the
over () part of the analytic function. You may notice that although the data is the same, the numerical values are not identical in Oracle and in MySQL, though they're very close. This is because the Oracle difference of two dates gives a fractional number of days, while the
datediff() MySQL function ignores the time part. I have been too lazy to adjust one or the other...
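The LAG() emulation boils down to remembering one previous value per department. Here is the same bookkeeping in Python, computing the average hire interval; the dates are made up:

```python
# LAG(HIREDATE, 1) over (partition by DEPTNO order by HIREDATE),
# followed by the average gap per department. Dates are made up.
from datetime import date
from collections import defaultdict

rows = [  # (DEPTNO, HIREDATE), sorted by DEPTNO, HIREDATE
    (10, date(2007, 1, 1)),
    (10, date(2007, 1, 5)),
    (10, date(2007, 1, 11)),
    (20, date(2007, 2, 1)),
    (20, date(2007, 2, 3)),
]

prev = {}                  # last HIREDATE seen per department (the "lag")
gaps = defaultdict(list)
for deptno, hiredate in rows:
    if deptno in prev:     # LAG() returns NULL for the first row: skip it
        gaps[deptno].append((hiredate - prev[deptno]).days)
    prev[deptno] = hiredate

avg_interval = {d: sum(g) / len(g) for d, g in gaps.items()}
```

The prev dictionary plays the role of @hd, and skipping the first row of each department mirrors the NULL that LAG() returns there.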
The opposite of
LAG() is the
LEAD() function, which returns the offset-th next value. It's easy to cumulate or save values we have just fetched from the database. But dealing with values that haven't been displayed yet requires buffering.
LEAD() illustrates this case extremely well—little more is required to get the desired result than inverting the sort order in the
over () clause with Oracle, or in the subquery with MySQL. Nevertheless, there are some cases that require attention. I mentioned when I introduced analytic functions that the
over () clause can also include a range restriction. In all the previous examples, I have applied the function to the full set of rows defined by the
partition by expression. There are several cases, however, when you may wish to further restrict the scope of application, for instance when computing a cumulative amount.
The most common range restrictions are:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, usually shortened to
RANGE UNBOUNDED PRECEDING, which simply means "don't compute over the whole partition—only from the beginning of the partition up to and including the current row" (what we would want for a cumulative amount).
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING, usually shortened to
RANGE UNBOUNDED FOLLOWING, which is the opposite of the previous one and might be useful for computing how much remains to be paid on a loan at various dates, for instance.
ROWS BETWEEN n PRECEDING AND n FOLLOWING, which is the type of expression we would use to compute a moving average.
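The three frame types are easy to picture over a single ordered partition. Here is a Python sketch with made-up salaries:

```python
# The three common window frames over one ordered partition (made-up data):
sal = [100.0, 200.0, 300.0, 400.0, 500.0]

# RANGE UNBOUNDED PRECEDING: running total up to the current row
running = []
total = 0.0
for s in sal:
    total += s
    running.append(total)

# RANGE UNBOUNDED FOLLOWING: what remains from the current row onward
remaining = [sum(sal[i:]) for i in range(len(sal))]

# ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING: centered moving average,
# truncated at the partition edges (fewer than five rows there)
moving = [sum(sal[max(0, i - 2): i + 3]) / len(sal[max(0, i - 2): i + 3])
          for i in range(len(sal))]
```

Note how the centered frame shrinks at both ends of the partition; that edge behavior is precisely what makes the moving average the hard case below.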
RANGE UNBOUNDED PRECEDING presents no challenge; we can best solve it with user variables. As I said when discussing LEAD(),
RANGE UNBOUNDED FOLLOWING is mostly a case of inverting the ordering, at least temporarily before reordering again.
Windows that are centered on the current row are more difficult to handle—at least, more difficult to handle efficiently—because we encounter once again a case of strong correlation with the current row. Let's take, for instance, the computation of the average salary of the current employee and of the two employees who were hired before him in the same department, and of the two employees who were hired after him in the same department.
SQL> select DEPTNO,
  2         EMPNO, LASTNAME, HIREDATE,
  3         SAL,
  4         round(AVG(SAL) over (partition by DEPTNO
  5                              order by HIREDATE
  6                              rows between 2 preceding
  7                                       and 2 following), 2) MOVING_AVERAGE
  8  from EMPLOYEES
  9  order by DEPTNO, HIREDATE, EMPNO;
[snip]
        60       3284 ANDERSON   15-FEB-07    4443.75        3196.43
        60       3336 DUNBAR     16-FEB-07    2167.19        3191.85
        60       4687 LUO        18-FEB-07     6457.5        4594.29
        60       6840 JACKSON    28-FEB-07    1995.17        4599.79
        60      10417 NEILL      28-FEB-07    7907.86        4480.14
        60       6232 CARLSON    01-MAR-07    4471.24        3985.81
        60       9697 LYNCH      02-MAR-07    1568.95        4649.35
10000 rows selected.
Elapsed: 00:00:00.83 SQL>
Finding, for each row, the two rows preceding it and the two rows following it by date isn't obvious (it would be easier if date intervals were regular, but that isn't the case). Our first attempts at computing the rank have convinced you, I hope, that correlated subqueries that fire for every row and search the same table using unindexed criteria are performance suicide; we don't want to scan 10,000 rows 10,000 times.
Can we exploit what we have done so far? We can try, for instance, computing a row number, then self-joining with a condition on the row numbers, but we still end up joining a 10,000-row set to another 10,000-row set on a computed column. This is not likely to be very fast, and definitely unlikely to compare favorably to the sub-second Oracle execution (and display) time.
You must realize that when you have a sliding window (as is the case here, since we have a window of five rows), each row intervenes more than once, even if it is displayed only once. If we consider employee #4687 (LUO) in the Oracle output, her attributes are displayed only once, but her salary actually intervenes not only in the computation of the average displayed on "her" line, but also in the computation of the averages on the two lines preceding and the two lines following it. Actually, with the exception of the very first and very last employees hired in each department, the salary on each row is used five times. Can we find an easy and inexpensive way to see each row appear five times? We can, and here is how.
First, let's use the techniques we have already seen to compute a simple row number. We need to number everything, so we use a user variable.
select if (@dept = DEPTNO,
           @rn := @rn + 1,
           @rn := 1 + least(0, @dept := DEPTNO)) rn,
       e.DEPTNO,
       e.EMPNO,
       e.LASTNAME,
       e.HIREDATE,
       e.SAL
from EMPLOYEES as e,
     (select (@dept := 0)) as z
order by e.DEPTNO, e.HIREDATE
Then, since we need each row five times, we are going to multiply our rows by using a pivot table—that is, simply by performing a Cartesian join (without any join condition) with a five-row table.
select x.rn + p.NUM BATCH,
       p.NUM,
       x.DEPTNO,
       x.EMPNO,
       x.LASTNAME,
       x.HIREDATE,
       x.SAL
from (select if (@dept = DEPTNO,
                 @rn := @rn + 1,
                 @rn := 1 + least(0, @dept := DEPTNO)) rn,
             e.DEPTNO,
             e.EMPNO,
             e.LASTNAME,
             e.HIREDATE,
             e.SAL
      from EMPLOYEES as e,
           (select (@dept := 0)) as z
      order by e.DEPTNO, e.HIREDATE) as x,
     (select 0 NUM
      union all
      select 1 NUM
      union all
      select 2 NUM
      union all
      select 3 NUM
      union all
      select 4 NUM) as p
Although Cartesian joins are usually carefully avoided, multiplying 10,000 rows by as small a number as 5 is very fast. Therefore, we are going to get each row five times, each time associated with a different value of
NUM (0 to 4). Then I compute a value that I call
BATCH, which is the sum of
NUM and of the initial row number
rn. Why this value? Because since
NUM varies, we will have some overlap for the values of BATCH.
Suppose we have row numbers 21, 22, 23, 24, and 25. The first row will yield batch numbers 21, 22, 23, 24, and 25. The second one 22, 23, 24, 25, and 26. And so on. When I want to compute the average of those five rows and associate it with row 23, all I need to do is average values for batch 25, for which I will retrieve one instance of each of the rows. Now I also want the detail for each row, and I don't want a 50,000-row result but a 10,000-row one. I therefore doctor the result so I'll have NULL for the details, except for the "center row" in my batch, and use
MAX() to aggregate and squash five rows into one. And here it is:
select a.DEPTNO,
       max(case a.NUM when 2 then a.EMPNO else NULL end) EMPNO,
       max(case a.NUM when 2 then a.LASTNAME else NULL end) LASTNAME,
       max(case a.NUM when 2 then a.HIREDATE else NULL end) HIREDATE,
       round(max(case a.NUM when 2 then a.SAL else NULL end), 2) SAL,
       round(avg(SAL), 2)
from (select x.rn + p.NUM BATCH,
             p.NUM,
             x.DEPTNO,
             x.EMPNO,
             x.LASTNAME,
             x.HIREDATE,
             x.SAL
      from (select if (@dept = DEPTNO,
                       @rn := @rn + 1,
                       @rn := 1 + least(0, @dept := DEPTNO)) rn,
                   e.DEPTNO,
                   e.EMPNO,
                   e.LASTNAME,
                   e.HIREDATE,
                   e.SAL
            from EMPLOYEES as e,
                 (select (@dept := 0)) as z
            order by e.DEPTNO, e.HIREDATE) as x,
           (select 0 NUM
            union all
            select 1 NUM
            union all
            select 2 NUM
            union all
            select 3 NUM
            union all
            select 4 NUM) as p) as a
group by a.DEPTNO, a.BATCH
having count(*) > 2
order by 1, 4, 2;
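The batch trick is compact enough to reduce to a Python sketch; this uses a single made-up department of ten rows, but the mechanics (five copies per row, grouping by BATCH, keeping only the NUM = 2 details) are the same:

```python
# The "multiply by a pivot table" trick: each row appears once per NUM in
# 0..4, BATCH = rn + NUM groups five consecutive rows together, and only
# the center row (NUM == 2) keeps its details. Data is made up.
from collections import defaultdict

rows = [(rn, 100.0 * rn) for rn in range(1, 11)]   # (rn, SAL), one department

batches = defaultdict(list)                 # BATCH -> [(NUM, rn, SAL), ...]
for rn, sal in rows:
    for num in range(5):                    # the five-row pivot table
        batches[rn + num].append((num, rn, sal))

result = []
for batch, members in sorted(batches.items()):
    if len(members) <= 2:                   # having count(*) > 2
        continue
    center = next(((rn, sal) for num, rn, sal in members if num == 2), None)
    avg = sum(sal for _, _, sal in members) / len(members)
    if center:                              # max(case NUM when 2 ...) keeps it
        result.append((center[0], center[1], round(avg, 2)))
```

Each of the ten rows comes out exactly once, carrying the average of its (edge-truncated) five-row window.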
And now the verdict:
| 60 | 1903 | BARNES | 2006-12-15 | 4760.32 | 3435.31 |
| 60 | 3240 | FARRELL | 2006-12-21 | 4046.99 | 4260.59 |
| 60 | 5030 | POULIN | 2006-12-23 | 3942.44 | 5099.71 |
| 60 | 8345 | PADRON | 2007-01-03 | 5956.80 | 4551.25 |
| 60 | 3218 | LADD | 2007-01-12 | 6791.98 | 3920.99 |
| 60 | 3046 | WELLMAN | 2007-01-18 | 2018.05 | 4021.25 |
| 60 | 2787 | WILLIAMS | 2007-02-05 | 895.66 | 3263.33 |
| 60 | 3284 | ANDERSON | 2007-02-15 | 4443.75 | 3196.43 |
| 60 | 3336 | DUNBAR | 2007-02-16 | 2167.19 | 3191.85 |
| 60 | 4687 | LUO | 2007-02-18 | 6457.50 | 4594.29 |
| 60 | 6840 | JACKSON | 2007-02-28 | 1995.17 | 4599.79 |
| 60 | 10417 | NEILL | 2007-02-28 | 7907.86 | 4480.14 |
| 60 | 6232 | CARLSON | 2007-03-01 | 4471.24 | 3985.81 |
| 60 | 9697 | LYNCH | 2007-03-02 | 1568.95 | 4649.35 |
10000 rows in set (0.44 sec)
Wham, bang. Almost twice as fast as the Oracle analytic function on this example...
I hope this article has given you a good idea of what analytic functions are and how you can get similar results, both in terms of result set and response time, with MySQL. Analytic functions definitely have an advantage in terms of elegance and legibility, particularly when compared to the somewhat laborious usage of variables, or to the particularly wild last example.
One thing is certain: if you want to get good performance, don't restrict yourself to "clean" SQL. Analytic functions are not clean SQL. From a relational point of view, they deserve at least to fry in the sixth circle of hell. Performing the same computations as fast as analytic functions do requires using MySQL peculiarities and a good deal of lateral thinking. As General Grant said, "No war has ever been won by a slavish respect of the rules."
There is another thing that you must keep in mind: analytic functions apply to the result set defined by the query in which they appear. If you emulate the function with several subqueries, any additional restriction must be applied in several places. This prevents you from creating a view on top of the query, for fear of getting wrong results compared to the true analytic query. I am not totally persuaded, however, that this is necessarily a bad thing. I fear that if you create a view
V_EMPLOYEES_WITH_SAL_RANK, it will only be a matter of time before someone writes:
select distinct e1.DEPTNO, e1.SAL
from V_EMPLOYEES_WITH_SAL_RANK e1
where e1.RANK = (select max(e2.RANK)
                 from V_EMPLOYEES_WITH_SAL_RANK e2
                 where e2.DEPTNO = e1.DEPTNO)
which, as you may have realized (but the optimizer won't), is just an inefficient
select DEPTNO, MIN(SAL) FROM EMPLOYEES group by DEPTNO
Stephane Faroult first discovered the relational model and the SQL language in 1983.
Copyright © 2009 O'Reilly Media, Inc.