As I have discussed more than once, the previously outlined emulation of analytic functions is fine as long as we wisely restrain ourselves to using only the first n (or last n) elements in what would have been our partition clause. This is a factor only when n is greater than 1; if n is 1, we don't need anything more sophisticated than the regular usage of MAX() or MIN(). There are cases, however, when we may need to hold more than the first 100 values or so in our list. Suppose, for instance, that we want to compute the median salary for each of our departments. We would need big ordered lists to be able to locate the middle value. Raising the value of the group_concat_max_len system variable and hoping for the best is not a responsible option.
Fortunately, there is another possibility, and once again we will use techniques that are not available in the Oracle world: using variables on which we perform calculations in queries. Numbering rows as they are returned and emulating Oracle's rownum virtual column is a well-known technique. Mark Malakanov gave a particularly elegant example on his blog in which the variable is initialized inside the query:
SELECT @rownum := @rownum + 1 rownum, t.*
FROM (SELECT @rownum := 0) r,
     mytable t;
Actually, this approach isn't strictly identical to the Oracle behavior, as you can check by adding an order by clause. With Oracle, rownum is incremented as rows are returned from the database and before any sort is performed; if you add an order by clause, your rownum values get mixed up. With MySQL, the value is incremented as it is displayed, and is therefore computed after any sort. From a practical point of view, this is rather more convenient.
Assuming that our query is suitably sorted by department number and decreasing salary, if we want to rank by department, we need to reinitialize the counter each time we switch department in our result set. We therefore need two variables—one to compute the number (in the following example we use a simple row number and ignore ties, for the sake of simplification), and one to hold the previous department number value so we can compare it to the value in the current row. We might be very tempted to use something such as:
select if (@dept = DEPTNO, @rn := @rn + 1, @rn := 1) rownum,
       (@dept := DEPTNO) DEPTNO,
       LASTNAME,
       FIRSTNAME,
       SAL
from EMPLOYEES,
     (select (@dept := 0)) as x
order by DEPTNO, SAL desc;
There is a catch, however. The MySQL documentation states clearly that there is no guarantee about the order of evaluation of the various variables in the different (pseudo) columns. This is legitimate, because in relational theory, the order of columns is irrelevant. However, the previous query implicitly assumes that we compare the current department number to the value stored in @dept before assigning it to the same variable. If the assignments are performed the other way around, it no longer works.
The only solution available to force comparisons and assignments to be performed in the order we want is to have all of them done inside a single element in our select list.
An elegant way to compute a variable and hide it from public view is to assign it in an expression that discards part of its result. Baron Schwartz gives an excellent example in a recent ONLamp article that also discusses ranking: the idea is to use the least() function, which returns only the smallest of its parameters but has to evaluate all of them. Since department numbers are positive values, we know that least(0, @dept := DEPTNO) will assign the value of DEPTNO to the @dept variable, then return 0, which we can safely add to any numerical value (the rank, for instance). A proper way to rewrite the previous, potentially wrong query might therefore be something such as:
select if (@dept = DEPTNO,
           @rn := @rn + 1,
           @rn := 1 + least(0, @dept := DEPTNO)) rownum,
       DEPTNO,
       LASTNAME,
       FIRSTNAME,
       SAL
from EMPLOYEES,
     (select (@dept := 0)) as x
order by DEPTNO, SAL desc;
In that case, the department number is first compared to the contents of the variable; if they are equal, we simply have to increment the row number. If they aren't, we initialize the row number to 1 and surreptitiously add a 0, the only purpose of which is to set @dept to the new DEPTNO value. Neat. And we control the order in which the various operations are performed.
Now that you understand the principle, we can refine and transform our row number into a true rank. If we want to emulate rank(), we need to hold not only the previous value of DEPTNO but also the previous value of SAL, to check whether we must increase our rank. Moreover, we must check how many identical values we have in order to decide by how much we must increase the rank value next (dense_rank() is easier to deal with).
1: if (@dept = DEPTNO,
2:     if (@sal = SAL,
3:         @rnk := @rnk + least(0, @inc := @inc + 1),
4:         @rnk := @rnk + greatest(@inc, @inc := 1)
5:              + least(0, @sal := SAL)),
6:     @rnk := 1 + least(0, @dept := DEPTNO)
7:          + least(0, @sal := SAL)
8:          + least(0, @inc := 1)) rank
A few explanations are probably welcome:

- When the department changes (lines 6 through 8), I set the rank to 1 and use the least() trick to initialize three variables—one to store the current department number as with the row number computation, one to store the current salary, and one to store by how much I will have to increase the rank next (1 by default).
- When the department is the same and the salary is identical to the previous one (line 3), I leave the @rnk variable unchanged and use the least() trick to increment, behind the scenes, the value by which the rank will change the next time the salary changes. I may have a tie—for instance, a rank 23 followed by another rank 23—but if afterwards the salary is smaller, then I want a rank 25.
- If the salary is different (which means less) than the previous one, then MySQL executes what we find on lines 4 and 5. We add the increment and reset it to 1 in a single operation by using a trick very similar to what we have done with least(). This particular step is somewhat bold, because although we know that both expressions will be evaluated, I have found nowhere any guarantee that they will be evaluated from left to right. The left-to-right evaluation order is, however, "natural" in this context (witness coalesce()), and it works. Then we use least() to reset the @sal variable to the new current value of the salary.

The final query becomes:
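To check that the three-variable logic really implements rank(), here is the same algorithm transcribed in Python over a handful of invented (DEPTNO, SAL) rows, already sorted by department and decreasing salary:

```python
# @rnk is the current rank, @sal the previous salary, @inc how much the
# rank must jump at the next salary change (it grows by 1 for each tie).
rows = [(10, 500), (10, 500), (10, 300), (20, 400), (20, 200)]
dept = sal = None
rnk = inc = 0
ranks = []
for deptno, s in rows:
    if deptno == dept:
        if s == sal:
            inc += 1                  # a tie: the next distinct salary jumps
        else:
            rnk, inc, sal = rnk + inc, 1, s
    else:                             # new department: reset everything
        rnk, inc, dept, sal = 1, 1, deptno, s
    ranks.append((deptno, s, rnk))
```

Note how the two tied salaries of 500 both get rank 1, and the following salary gets rank 3, as rank() (unlike dense_rank()) requires.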
mysql> select *
    -> from (select if (@dept = DEPTNO,
    ->                  if (@sal = SAL,
    ->                      @rnk := @rnk + least(0, @inc := @inc + 1),
    ->                      @rnk := @rnk + greatest(@inc, @inc := 1)
    ->                           + least(0, @sal := SAL)),
    ->                  @rnk := 1 + least(0, @dept := DEPTNO)
    ->                       + least(0, @sal := SAL)
    ->                       + least(0, @inc := 1)) rank,
    ->              DEPTNO,
    ->              LASTNAME,
    ->              FIRSTNAME,
    ->              SAL
    ->       from EMPLOYEES,
    ->            (select (@dept := 0)) as x
    ->       order by DEPTNO, SAL desc) as y
    -> where rank <= 5;
+------+--------+------------+-----------+---------+
| rank | DEPTNO | LASTNAME   | FIRSTNAME | SAL     |
+------+--------+------------+-----------+---------+
|    1 |     10 | WOLFORD    | KATHY     | 7997.34 |
|    2 |     10 | HARRIS     | GARLAND   | 7993.74 |
[snip]
|    4 |     60 | BURMEISTER | HATTIE    | 7973.04 |
|    5 |     60 | LADD       | JEFFREY   | 7952.5  |
+------+--------+------------+-----------+---------+
30 rows in set (0.06 sec)
mysql>
We are now on equal footing with the Oracle analytic function from a performance standpoint. Admittedly, from a syntactical point of view the analytic functions are much better. The group_concat()/find_in_set() combination is a good deal better, too, if the restrictions and the performance are acceptable. But the functionality is here, and it can easily be extended to several analytic functions.

Another commonly used Oracle analytic function is:
LAG(expression, offset) over (partition by ... order by ...)
which returns the offset-th previous value of the current row's expression with regard to what is specified by the over () clause. The most commonly used offset is 1, and LAG() is very convenient for computing the difference between two successive measures of the same cumulative counter (it returns NULL when the offset-th previous value doesn't exist). "Memorizing" previous values is what we have just done with user variables, and therefore there is nothing new here.
If we want to compute the average number of days between two successive hires by department, we can write something like this with Oracle:
SQL> select DEPTNO, AVG(HIRE_INTERVAL)
  2  from (select DEPTNO,
  3               HIREDATE - LAG(HIREDATE, 1)
  4                   over (partition by DEPTNO
  5                         order by HIREDATE) HIRE_INTERVAL
  6        from EMPLOYEES)
  7  group by DEPTNO
  8  /
    DEPTNO AVG(HIRE_INTERVAL)
---------- ------------------
        10          6.6842104
        20          3.5556686
        30         3.52857866
        40         3.50300597
        50         3.44334974
        60         7.12665996
6 rows selected.
Elapsed: 00:00:00.06
SQL>
By applying the techniques we have used for ranks, we can write this with MySQL:
mysql> select DEPTNO, avg(HIRE_INTERVAL)
    -> from (select DEPTNO,
    ->              if (@dept = DEPTNO,
    ->                  datediff(HIREDATE, @hd) + least(0, @hd := HIREDATE),
    ->                  NULL + least(0, @dept := DEPTNO) + (@hd := NULL))
    ->              HIRE_INTERVAL
    ->       from EMPLOYEES,
    ->            (select (@dept := 0)) as a
    ->       order by DEPTNO, HIREDATE) as b
    -> group by DEPTNO;
+--------+--------------------+
| DEPTNO | avg(HIRE_INTERVAL) |
+--------+--------------------+
|     10 |    6.6858237547893 |
|     20 |    3.5564598168871 |
|     30 |    3.5303643724696 |
|     40 |    3.5002506265664 |
|     50 |    3.4406111384919 |
|     60 |    7.1329243353783 |
+--------+--------------------+
6 rows in set (0.06 sec)
We must be very careful not to forget the order by clause in the subquery aliased as b, since this is what matches the order by in the over () part of the analytic function. You may notice that although the data is the same, the numerical values are not identical in Oracle and in MySQL, though they're very close. This is because the Oracle difference of two dates gives a fractional number of days, while the MySQL datediff() function ignores the time part. I have been too lazy to adjust one or the other...
The opposite of LAG() is the LEAD() function, which returns the offset-th next value. It's easy to cumulate or save values we have just fetched from the database, but dealing with values that haven't been displayed yet requires buffering.
Very often—and LEAD() illustrates this case extremely well—little more is required to get the desired result than inverting the sort order in the over () clause with Oracle, or in the subquery with MySQL. Nevertheless, there are some cases that require attention. I mentioned when I introduced analytic functions that the over () clause can also include a range restriction. In all the previous examples, I have applied the function to the full set of rows defined by the partition by expression. There are several cases, however, when you may wish to further restrict the scope of application, for instance when computing a cumulative amount.
The most common range restrictions are:

- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, usually shortened to RANGE UNBOUNDED PRECEDING, which simply means "don't compute on the whole partition—only from the beginning of the partition up to and including the current row" (what we would want for a cumulated amount).
- RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING, usually shortened to RANGE UNBOUNDED FOLLOWING, which is the opposite of the previous one and might be useful for computing how much remains to be paid on a loan at various dates, for instance.
- ROWS BETWEEN n PRECEDING AND n FOLLOWING, which is the type of expression we would use to compute a moving average.

RANGE UNBOUNDED PRECEDING presents no challenge; we can best solve it with user variables. As I said when comparing LEAD() with LAG(), RANGE UNBOUNDED FOLLOWING is mostly a case of inverting the ordering, at least temporarily before reordering again.
Windows that are centered on the current row are more difficult to handle—at least, more difficult to handle efficiently—because we encounter once again a case of strong correlation with the current row. Let's take, for instance, the computation of the average salary of the current employee, of the two employees who were hired before him in the same department, and of the two employees who were hired after him in the same department.
SQL> select DEPTNO,
  2         EMPNO, LASTNAME, HIREDATE,
  3         SAL,
  4         round(AVG(SAL) over (partition by DEPTNO
  5                              order by HIREDATE
  6                              rows between 2 preceding
  7                                       and 2 following), 2) MOVING_AVERAGE
  8  from EMPLOYEES
  9  order by DEPTNO, HIREDATE, EMPNO;
[snip]
        60       3284 ANDERSON   15-FEB-07    4443.75        3196.43
        60       3336 DUNBAR     16-FEB-07    2167.19        3191.85
        60       4687 LUO        18-FEB-07     6457.5        4594.29
        60       6840 JACKSON    28-FEB-07    1995.17        4599.79
        60      10417 NEILL      28-FEB-07    7907.86        4480.14
        60       6232 CARLSON    01-MAR-07    4471.24        3985.81
        60       9697 LYNCH      02-MAR-07    1568.95        4649.35
10000 rows selected.
Elapsed: 00:00:00.83
SQL>
Finding, for each row, the two rows preceding it and the two rows following it by date isn't obvious (it would be easier if date intervals were regular, but that isn't the case). Our first attempts at computing ranks have convinced you, I hope, that correlated subqueries that fire for every row and search the same table on unindexed criteria are performance suicide; we don't want to scan 10,000 rows 10,000 times.
Can we exploit what we have done so far? We can try, for instance, computing a row number, then self-joining with a condition on the row numbers, but we still end up joining a 10,000-row set to another 10,000-row set on a computed column. This is not likely to be very fast, and definitely unlikely to compare favorably to the subsecond Oracle execution (and display) time.
You must realize that when you have a sliding window (as is the case here, since we have a window of five rows), each row intervenes more than once, even if it is displayed only once. If we consider employee #4687 (LUO) in the Oracle output, her attributes are displayed only once, but her salary actually intervenes not only in the computation of the average displayed on "her" line, but also in the averages of the two lines preceding and the two lines following. Actually, with the exception of the very first and very last employees hired in each department, the salary on each row is used five times. Can we find an easy and inexpensive way to make each row appear five times? We can, and here is how.
First, let's use the techniques we have already seen to compute a simple row number. We need to number everything, so we use a user variable.
select if (@dept = DEPTNO,
           @rn := @rn + 1,
           @rn := 1 + least(0, @dept := DEPTNO)) rn,
       e.DEPTNO,
       e.EMPNO,
       e.LASTNAME,
       e.HIREDATE,
       e.SAL
from EMPLOYEES as e,
     (select (@dept := 0)) as z
order by e.DEPTNO, e.HIREDATE

Then, since we need each row five times, we are going to multiply our rows by using a pivot table—that is, simply by performing a Cartesian join (without any join condition) with a five-row table.
select x.rn + p.NUM BATCH,
       p.NUM,
       x.DEPTNO,
       x.EMPNO,
       x.LASTNAME,
       x.HIREDATE,
       x.SAL
from (select if (@dept = DEPTNO,
                 @rn := @rn + 1,
                 @rn := 1 + least(0, @dept := DEPTNO)) rn,
             e.DEPTNO,
             e.EMPNO,
             e.LASTNAME,
             e.HIREDATE,
             e.SAL
      from EMPLOYEES as e,
           (select (@dept := 0)) as z
      order by e.DEPTNO, e.HIREDATE) as x,
     (select 0 NUM
      union all
      select 1 NUM
      union all
      select 2 NUM
      union all
      select 3 NUM
      union all
      select 4 NUM) as p
Although Cartesian joins are usually carefully avoided, multiplying 10,000 rows by as small a number as 5 is very fast. We are therefore going to get each row five times, each time associated with a different value of NUM (0 to 4). Then I compute a value that I call BATCH, which is the sum of NUM and of the initial row number rn. Why this value? Because, since NUM varies, we will have some overlap between the values of BATCH.
Suppose we have row numbers 21, 22, 23, 24, and 25. The first row will yield batch numbers 21, 22, 23, 24, and 25. The second one, 22, 23, 24, 25, and 26. And so on. When I want to compute the average of those 5 rows and associate it with row 23, all I need to do is average the values for batch 25, for which I will retrieve one instance of each of the rows. But I also want the detail for each row, and I want a 10,000-row result, not a 50,000-row one. I therefore doctor the result so that I'll have NULL for the details, except for the "center row" in my batch, and use MAX() to aggregate and squash five rows into one. And here it is:
select a.DEPTNO,
       max(case a.NUM when 2 then a.EMPNO else NULL end) EMPNO,
       max(case a.NUM when 2 then a.LASTNAME else NULL end) LASTNAME,
       max(case a.NUM when 2 then a.HIREDATE else NULL end) HIREDATE,
       round(max(case a.NUM when 2 then a.SAL else NULL end), 2) SAL,
       round(avg(SAL), 2)
from (select x.rn + p.NUM BATCH,
             p.NUM,
             x.DEPTNO,
             x.EMPNO,
             x.LASTNAME,
             x.HIREDATE,
             x.SAL
      from (select if (@dept = DEPTNO,
                       @rn := @rn + 1,
                       @rn := 1 + least(0, @dept := DEPTNO)) rn,
                   e.DEPTNO,
                   e.EMPNO,
                   e.LASTNAME,
                   e.HIREDATE,
                   e.SAL
            from EMPLOYEES as e,
                 (select (@dept := 0)) as z
            order by e.DEPTNO, e.HIREDATE) as x,
           (select 0 NUM
            union all
            select 1 NUM
            union all
            select 2 NUM
            union all
            select 3 NUM
            union all
            select 4 NUM) as p) as a
group by a.DEPTNO, a.BATCH
having count(*) > 2
order by 1, 4, 2;
Two passing remarks about this code: the incomplete batches at both ends of each department contain no "center" row, and the having clause quietly discards them; and since the group by doesn't preserve the original row order, the outer order by must re-sort the result. And now the verdict:
[snip]
|     60 |  1903 | BARNES   | 2006-12-15 | 4760.32 | 3435.31 |
|     60 |  3240 | FARRELL  | 2006-12-21 | 4046.99 | 4260.59 |
|     60 |  5030 | POULIN   | 2006-12-23 | 3942.44 | 5099.71 |
|     60 |  8345 | PADRON   | 2007-01-03 | 5956.80 | 4551.25 |
|     60 |  3218 | LADD     | 2007-01-12 | 6791.98 | 3920.99 |
|     60 |  3046 | WELLMAN  | 2007-01-18 | 2018.05 | 4021.25 |
|     60 |  2787 | WILLIAMS | 2007-02-05 |  895.66 | 3263.33 |
|     60 |  3284 | ANDERSON | 2007-02-15 | 4443.75 | 3196.43 |
|     60 |  3336 | DUNBAR   | 2007-02-16 | 2167.19 | 3191.85 |
|     60 |  4687 | LUO      | 2007-02-18 | 6457.50 | 4594.29 |
|     60 |  6840 | JACKSON  | 2007-02-28 | 1995.17 | 4599.79 |
|     60 | 10417 | NEILL    | 2007-02-28 | 7907.86 | 4480.14 |
|     60 |  6232 | CARLSON  | 2007-03-01 | 4471.24 | 3985.81 |
|     60 |  9697 | LYNCH    | 2007-03-02 | 1568.95 | 4649.35 |
+--------+-------+----------+------------+---------+---------+
10000 rows in set (0.44 sec)
mysql>
Wham, bang. Almost twice as fast as the Oracle analytic function on this example...
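If you want to convince yourself that the BATCH arithmetic really matches Oracle's rows between 2 preceding and 2 following, here is the whole trick transcribed in Python over one invented department of six ordered salaries; the replication by NUM, the grouping by BATCH, and the having count(*) > 2 filter are all reproduced.

```python
salaries = [100, 200, 300, 400, 500, 600]       # one department, ordered
# Replicate every row five times: rn is the row number, NUM runs 0..4,
# and BATCH = rn + NUM, exactly as in the pivot-table query.
copies = [(rn + num, num, rn, sal)
          for rn, sal in enumerate(salaries, 1)
          for num in range(5)]

averages = {}                                   # row number -> moving average
for batch in {b for b, _, _, _ in copies}:
    group = [(num, rn, sal) for b, num, rn, sal in copies if b == batch]
    if len(group) > 2:                          # the "having count(*) > 2"
        # The "center" row of a batch is the copy carrying NUM = 2.
        center = [rn for num, rn, _ in group if num == 2]
        if center:
            averages[center[0]] = round(sum(s for _, _, s in group)
                                        / len(group), 2)
```

Row 3 gets (100+200+300+400+500)/5 = 300, while the edge rows are averaged over the three or four rows their window actually contains, which is also what the Oracle output shows.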
I hope this article has given you a good idea of what analytic functions are and how you can get similar results, both in terms of result set and response time, with MySQL. Analytic functions definitely have an advantage in terms of elegance and legibility, particularly when compared to the somewhat laborious usage of variables, or to the particularly wild last example.
One thing is certain: if you want to get good performance, don't restrict yourself to "clean" SQL. Analytic functions are not clean SQL. From a relational point of view, they deserve at least to fry in the sixth circle of hell. Performing the same computations as fast as analytic functions do requires using MySQL peculiarities and a good deal of lateral thinking. As General Grant said, "No war has ever been won by a slavish respect of the rules."
There is another thing that you must keep in mind: analytic functions apply to the result set defined by the query in which they appear. If you emulate the function with several subqueries, any additional restriction must be applied in several places. This prevents you from creating a view on top of the query, for fear of getting wrong results compared to the true analytic query. I am not totally persuaded, however, that this is necessarily a bad thing. I fear that if you create a view V_EMPLOYEES_WITH_SAL_RANK, it will only be a matter of time before someone writes:
select distinct e1.DEPTNO, e1.SAL
from V_EMPLOYEES_WITH_SAL_RANK e1
where e1.RANK = (select max(e2.RANK)
                 from V_EMPLOYEES_WITH_SAL_RANK e2
                 where e2.DEPTNO = e1.DEPTNO)
which, as you may have realized (but the optimizer won't), is just an inefficient
select DEPTNO, min(SAL)
from EMPLOYEES
group by DEPTNO
Have fun.
Stephane Faroult first discovered the relational model and the SQL language in 1983.
Copyright © 2009 O'Reilly Media, Inc.