Published on O'Reilly (http://oreilly.com/)


Emulating Analytic (AKA Ranking) Functions with MySQL: Part 2

by Stephane Faroult
04/13/2007

As I have discussed more than once, the emulation of analytic functions outlined previously is fine as long as we wisely restrain ourselves to using only the first n (or last n) elements of what would have been our partition clause. This matters only when n is greater than 1; if n is 1, we need nothing more sophisticated than the regular use of MAX() or MIN(). There are cases, however, when we may need to hold more than the first 100 values or so in our list. Suppose, for instance, that we want to compute the median salary for each of our departments: we would need big ordered lists to be able to locate the middle value. Raising the value of the group_concat_max_len environment variable and hoping for the best is not a responsible option.
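To see why the lists can grow large, here is, in outline, what a per-department median requires: the full sorted list of salaries for each group, not just its extremes. A minimal Python sketch, on invented data:

```python
# Sketch: computing a per-department median needs the whole ordered
# list of salaries for each group, not just MIN() or MAX().
from collections import defaultdict

rows = [  # (deptno, sal) -- invented data for illustration
    (10, 1000.0), (10, 2000.0), (10, 9000.0),
    (20, 1500.0), (20, 2500.0), (20, 3500.0), (20, 8000.0),
]

by_dept = defaultdict(list)
for deptno, sal in rows:
    by_dept[deptno].append(sal)

medians = {}
for deptno, sals in by_dept.items():
    sals.sort()
    n = len(sals)
    mid = n // 2
    # Even-sized groups average the two middle values.
    medians[deptno] = sals[mid] if n % 2 else (sals[mid - 1] + sals[mid]) / 2

print(medians)  # {10: 2000.0, 20: 3000.0}
```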

Fortunately, there is another possibility, and once again we will use techniques that are not available in the Oracle world: using variables on which we perform calculations in queries. Numbering rows as they are returned and emulating Oracle's rownum virtual column is a well-known technique. Mark Malakanov gave a particularly elegant example on his blog in which the variable is initialized inside the query:

SELECT @rownum:=@rownum+1 rownum, t.* FROM (SELECT @rownum:=0) r, mytable t;

Actually, this approach isn't strictly identical to the Oracle behavior, which can be checked by adding an order by clause. With Oracle, rownum is incremented as rows are returned from the database, before any sort is performed; if you add an order by clause, your rownum values get mixed up. With MySQL, the value is incremented as each row is displayed, and is therefore computed after any sort. From a practical point of view, this is rather more convenient.

Assuming that our query is suitably sorted by department number and decreasing salary, if we want to rank by department, we need to reinitialize the counter each time we switch department in our result set. We therefore need two variables—one to compute the number (in the following example we use a simple row number and ignore ties, for the sake of simplification), and one to hold the previous department number value so we can compare it to the value in the current row. We might be very tempted to use something such as:

select if (@dept = DEPTNO, @rn := @rn + 1, @rn := 1) rownum,
       (@dept := DEPTNO) DEPTNO,
       LASTNAME,
       FIRSTNAME,
       SAL
from EMPLOYEES,
     (select (@dept := 0)) as x
order by DEPTNO, SAL desc;

There is a catch, however. The MySQL documentation states clearly that there is no guarantee about the order of evaluation of the various variables in the different (pseudo) columns. This is legitimate, because in relational theory, the order of columns is irrelevant. However, the previous query implicitly assumes that we compare the current department number to the value stored in @dept before assigning it to the same variable. If assignments are performed the other way around, it no longer works.
The only solution available to force comparisons and assignments to be performed in the order we want is to have all of them done inside a single element in our select list.

An elegant way to compute a variable and hide it from public view is to assign it in an expression that discards part of its result. Baron Schwartz gives an excellent example in a recent ONLamp article that also discusses ranking: the idea is to use the least() function, which returns only the smallest of its parameters but has to evaluate all of them. Since department numbers are positive values, we know that least(0, @dept := DEPTNO) will assign the value of DEPTNO to the @dept variable, then return 0, which we can safely add to any numerical value (the rank, for instance). A proper way to rewrite the previous, potentially wrong query might therefore be something such as:

select if (@dept = DEPTNO, @rn := @rn + 1,
           @rn := 1 + least(0, @dept := DEPTNO)) rownum,
       DEPTNO,
       LASTNAME,
       FIRSTNAME,
       SAL
from EMPLOYEES,
     (select (@dept := 0)) as x
order by DEPTNO, SAL desc;


In that case, the department number is first compared to the contents of the variable; if they are equal, we simply have to increment the row number. If they aren't, we initialize the row number to 1 and surreptitiously add a 0, the only purpose of which is to set @dept to the new DEPTNO value. Neat. And we control the order in which the various operations are performed.
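To see the order-of-operations issue clearly, here is the same two-variable logic transcribed into Python (invented rows). In a procedural language the compare-then-assign order is guaranteed by statement sequencing, which is precisely what the SQL version cannot take for granted:

```python
# Sketch of the @dept/@rn logic: walk rows in (DEPTNO, SAL desc) order,
# resetting the counter on each new department. Data is invented.
rows = [
    (10, "WOLFORD", 7997.34),
    (10, "HARRIS", 7993.74),
    (20, "SMITH", 6500.00),
]
rows.sort(key=lambda r: (r[0], -r[2]))  # the ORDER BY DEPTNO, SAL DESC

dept = None   # plays the role of @dept
rn = 0        # plays the role of @rn
numbered = []
for deptno, lastname, sal in rows:
    rn = rn + 1 if deptno == dept else 1  # compare first...
    dept = deptno                         # ...then remember the department
    numbered.append((rn, deptno, lastname))

print(numbered)  # [(1, 10, 'WOLFORD'), (2, 10, 'HARRIS'), (1, 20, 'SMITH')]
```

The SQL emulation has to force this same ordering by hiding the assignments inside a single select-list expression.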

Now that you understand the principle, we can refine and transform our row number into a true rank. If we want to emulate rank(), we need to hold not only the previous value of DEPTNO but also the previous value of SAL, to check whether we must increase our rank. Moreover, we must check how many identical values we have in order to decide by how much we must increase the rank value next (dense_rank() is easier to deal with).

1: if (@dept = DEPTNO,
2:     if (@sal = SAL,
3:         @rnk := @rnk + least(0, @inc := @inc + 1),
4:         @rnk := @rnk + greatest(@inc, @inc := 1)
5:                      + least(0, @sal := SAL)),
6:     @rnk := 1 + least(0, @dept := DEPTNO)
7:              + least(0, @sal := SAL)
8:              + least(0, @inc := 1)) rank

A few explanations are probably welcome. If the department is the same and the salary is equal to the previous one (line 3), the rank stays as it is, and we record one more tie by incrementing @inc inside a least() call that contributes nothing to @rnk. If the salary is different from (which, given the sort order, means less than) the previous one, MySQL executes what we find on lines 4 and 5: we add the increment and reset it to 1 in a single operation, by a trick very similar to what we have done with least(). This particular step is somewhat bold, because although we know that both arguments of greatest() will be evaluated, I have found nowhere any guarantee that they will be evaluated from left to right. The left-to-right evaluation order is, however, "natural" in this context (witness coalesce()), and it works. Then we use least() to reset the @sal variable to the new current value of the salary. Finally, when we switch to a new department (lines 6 to 8), we reset the rank to 1 and all the variables to their starting values. The final query becomes:

mysql> select *
    -> from (select if (@dept = DEPTNO,
    ->                  if (@sal = SAL,
    ->                      @rnk := @rnk + least(0, @inc := @inc + 1),
    ->                      @rnk := @rnk + greatest(@inc, @inc := 1)
    ->                                   + least(0, @sal := SAL)),
    ->                  @rnk := 1 + least(0, @dept := DEPTNO)
    ->                            + least(0, @sal := SAL)
    ->                            + least(0, @inc := 1)) rank,
    ->              DEPTNO,
    ->              LASTNAME,
    ->              FIRSTNAME,
    ->              SAL
    ->       from EMPLOYEES,
    ->            (select (@dept := 0)) as x
    ->       order by DEPTNO, SAL desc) as y
    -> where rank <= 5;
+------+--------+------------+-----------+---------+
| rank | DEPTNO | LASTNAME   | FIRSTNAME | SAL     |
+------+--------+------------+-----------+---------+
|    1 |     10 | WOLFORD    | KATHY     | 7997.34 |
|    2 |     10 | HARRIS     | GARLAND   | 7993.74 |
    [snip]
|    4 |     60 | BURMEISTER | HATTIE    | 7973.04 |
|    5 |     60 | LADD       | JEFFREY   |  7952.5 |
+------+--------+------------+-----------+---------+
30 rows in set (0.06 sec)
mysql>

We are now on an equal footing with the Oracle analytic function from a performance standpoint. Admittedly, from a syntactical point of view the analytic functions are much better; the group_concat()/find_in_set() combination is a good deal better, too, when its restrictions and its performance are acceptable. But the functionality is there, and it can easily be extended to several analytic functions.
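If you want to convince yourself that the @rnk/@sal/@inc bookkeeping really implements rank(), here is the same state machine transcribed into Python (invented salaries; rows assumed already sorted by department and decreasing salary). Note the gap after the tie, which is what distinguishes rank() from dense_rank():

```python
# Sketch of the @dept/@sal/@rnk/@inc bookkeeping behind rank().
# rank() leaves gaps after ties, which is what the @inc counter tracks.
rows = [  # (deptno, sal) -- invented, sorted by deptno, sal desc
    (10, 900.0), (10, 800.0), (10, 800.0), (10, 700.0),
    (20, 500.0),
]

dept = sal = None  # @dept, @sal
rnk = inc = 0      # @rnk, @inc
ranks = []
for deptno, s in rows:
    if deptno != dept:          # new department: everything restarts
        dept, sal, rnk, inc = deptno, s, 1, 1
    elif s == sal:              # tie: same rank, one more row sharing it
        inc += 1
    else:                       # salary dropped: jump by the tie count
        rnk += inc
        sal, inc = s, 1
    ranks.append((deptno, s, rnk))

print(ranks)
```

The tie at 800.0 keeps rank 2 for both rows, and the next salary jumps to rank 4, exactly as rank() would do; replacing the jump by a plain increment would give dense_rank().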

Another Analytic Example

Another commonly used Oracle analytic function is:

LAG(expression, offset) over (partition by ... order by ...)

which returns the value of expression on the row offset rows before the current one, within the partition and ordering specified by the over () clause. The most commonly used offset is 1, and LAG() is very convenient for computing the difference between two successive measures of the same cumulative counter (it returns NULL when there is no row offset rows back). "Memorizing" previous values is what we have just done with user variables, and therefore there is nothing new here.
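The buffering behind LAG() can be sketched outside SQL. The Python below (function name and data are invented for illustration) keeps the last offset values seen in each partition and returns the oldest of them, or None when it doesn't exist yet, mirroring the NULL that LAG() returns:

```python
# Sketch of LAG(expr, offset) over (partition by key order by ...).
from collections import defaultdict, deque

def lag(rows, key, expr, offset=1):
    """rows must already be sorted as the over () clause requires."""
    history = defaultdict(deque)  # per-partition buffer of prior values
    out = []
    for row in rows:
        buf = history[key(row)]
        # Exactly `offset` prior values buffered: the oldest is our answer.
        out.append(buf[0] if len(buf) == offset else None)
        buf.append(expr(row))
        if len(buf) > offset:
            buf.popleft()
    return out

rows = [(10, 5), (10, 7), (10, 9), (20, 4)]  # (dept, value), invented
print(lag(rows, key=lambda r: r[0], expr=lambda r: r[1]))
# [None, 5, 7, None]
```

With offset 1 this reproduces the usual "previous row" behavior; larger offsets only need a longer buffer.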

If we want to compute the average number of days between two successive hires by department, we can write something like this with Oracle:

SQL> select DEPTNO, AVG(HIRE_INTERVAL)
  2  from (select DEPTNO,
  3               HIREDATE - LAG(HIREDATE, 1)
  4                            over (partition by DEPTNO
  5                                  order by HIREDATE) HIRE_INTERVAL
  6        from EMPLOYEES)
  7  group by DEPTNO
  8  /

    DEPTNO AVG(HIRE_INTERVAL)
---------- ------------------
        10          6.6842104
        20          3.5556686
        30         3.52857866
        40         3.50300597
        50         3.44334974
        60         7.12665996

6 rows selected.

Elapsed: 00:00:00.06
SQL>

By applying the techniques we have used for ranks, we create this with MySQL:

mysql> select DEPTNO, avg(HIRE_INTERVAL)
    -> from (select DEPTNO,
    ->              if (@dept = DEPTNO,
    ->                  datediff(HIREDATE, @hd) + least(0, @hd := HIREDATE),
    ->                  NULL + least(0, @dept := DEPTNO) + (@hd := NULL))
    ->                                                     HIRE_INTERVAL
    ->       from EMPLOYEES,
    ->            (select (@dept := 0)) as a
    ->       order by DEPTNO, HIREDATE) as b
    -> group by DEPTNO;
+--------+--------------------+
| DEPTNO | avg(HIRE_INTERVAL) |
+--------+--------------------+
|     10 |    6.6858237547893 |
|     20 |    3.5564598168871 |
|     30 |    3.5303643724696 |
|     40 |    3.5002506265664 |
|     50 |    3.4406111384919 |
|     60 |    7.1329243353783 |
+--------+--------------------+
6 rows in set (0.06 sec)

We must be very careful not to forget the order by clause in the subquery aliased as b, since it plays the part of the order by in the over () clause of the analytic function. You may notice that although the data is the same, the numerical values are not identical in Oracle and in MySQL, though they're very close. One reason is that the Oracle difference of two dates gives a fractional number of days, while the MySQL datediff() function ignores the time part; another is that, because @hd is reset to NULL when the department changes, the emulation loses the interval between the first two hires of each department. I have been too lazy to adjust one or the other...

Foreseeing Values

The opposite of LAG() is the LEAD() function, which returns the value offset rows ahead of the current one. It's easy to accumulate or save values we have just fetched from the database, but dealing with values that haven't been returned yet requires buffering.

Very often, and LEAD() illustrates this case extremely well, little more is required to get the desired result than inverting the sort order in the over () clause with Oracle, or in the subquery with MySQL. Nevertheless, some cases require attention. I mentioned when I introduced analytic functions that the over () clause can also include a range restriction. In all the previous examples, I have applied the function to the full set of rows defined by the partition by expression. There are several cases, however, when you may wish to restrict the scope of application further, for instance when computing a cumulative amount.

The most common range restrictions are RANGE UNBOUNDED PRECEDING, which takes in everything from the start of the partition up to the current row; its mirror image, RANGE UNBOUNDED FOLLOWING; and windows of rows centered on the current row.

RANGE UNBOUNDED PRECEDING presents no challenge; we can solve it best with user variables, since a running total is just one more accumulator to maintain. And as the comparison of LEAD() with LAG() suggested, RANGE UNBOUNDED FOLLOWING is mostly a case of inverting the ordering, at least temporarily, before reordering again.
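The running-total case can be transcribed into Python the same way as the row counter; a minimal sketch on invented data, with one accumulator that resets on each partition change:

```python
# Sketch: RANGE UNBOUNDED PRECEDING is a running accumulator that
# resets on each partition change, just like the @rn counter.
rows = [(10, 100.0), (10, 50.0), (20, 30.0), (20, 20.0)]  # sorted by dept

dept, running = None, 0.0
cumulative = []
for deptno, amount in rows:
    running = running + amount if deptno == dept else amount
    dept = deptno
    cumulative.append((deptno, amount, running))

print(cumulative)
```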

Windows that are centered on the current row are more difficult to handle (at least, more difficult to handle efficiently), because we once again meet a case of strong correlation with the current row. Let's take, for instance, the computation of the average salary of the current employee, of the two employees hired before him in the same department, and of the two employees hired after him.

SQL> select DEPTNO,
  2         EMPNO, LASTNAME, HIREDATE,
  3         SAL,
  4         round(AVG(SAL) over (partition by DEPTNO
  5                              order by HIREDATE
  6                              rows between 2 preceding
  7                                       and 2 following), 2) MOVING_AVERAGE
  8  from EMPLOYEES
  9  order by DEPTNO, HIREDATE, EMPNO;

  [snip]
        60       3284 ANDERSON             15-FEB-07    4443.75        3196.43
        60       3336 DUNBAR               16-FEB-07    2167.19        3191.85
        60       4687 LUO                  18-FEB-07     6457.5        4594.29
        60       6840 JACKSON              28-FEB-07    1995.17        4599.79
        60      10417 NEILL                28-FEB-07    7907.86        4480.14
        60       6232 CARLSON              01-MAR-07    4471.24        3985.81
        60       9697 LYNCH                02-MAR-07    1568.95        4649.35

10000 rows selected.

Elapsed: 00:00:00.83
SQL>

Finding, for each row, the two rows that precede it and the two rows that follow it by date isn't obvious (it would be easier if date intervals were regular, but they aren't). Our first attempts at computing the rank have convinced you, I hope, that correlated subqueries that fire for every row and search the same table on unindexed criteria are performance suicide; we don't want to scan 10,000 rows 10,000 times.

Can we exploit what we have done so far? We can try, for instance, computing a row number and then self-joining with a condition on the row numbers, but we still end up joining a 10,000-row set to another 10,000-row set on a computed column. This is not likely to be very fast, and definitely unlikely to compare favorably to the sub-second Oracle execution (and display) time.

You must realize that with a sliding window (as is the case here, since we have a window of five rows), each row intervenes more than once, even if it is displayed only once. If we consider employee #4687 (LUO) in the Oracle output, her attributes are displayed only once, but her salary intervenes not only in the computation of the average displayed on "her" line, but also in the averages of the two lines preceding it and the two lines following it. In fact, with the exception of the very first and very last employees hired in each department, the salary on each row is used five times. Can we find an easy and inexpensive way to make each row appear five times? We can, and here is how.

First, let's use the techniques we have already seen to compute a simple row number. We need to number everything, so we use a user variable.

select if (@dept = DEPTNO,
           @rn := @rn + 1,
           @rn := 1 + least(0, @dept := DEPTNO)) rn,
       e.DEPTNO,
       e.EMPNO,
       e.LASTNAME,
       e.HIREDATE,
       e.SAL
from EMPLOYEES as e,
     (select (@dept := 0)) as z
order by e.DEPTNO, e.HIREDATE


Then, since we need each row five times, we are going to multiply our rows by using a pivot table—that is, simply by performing a Cartesian join (without any join condition) with a five-row table.

select x.rn + p.NUM BATCH,
       p.NUM,
       x.DEPTNO,
       x.EMPNO,
       x.LASTNAME,
       x.HIREDATE,
       x.SAL
from (select if (@dept = DEPTNO,
                 @rn := @rn + 1,
                 @rn := 1 + least(0, @dept := DEPTNO)) rn,
             e.DEPTNO,
             e.EMPNO,
             e.LASTNAME,
             e.HIREDATE,
             e.SAL
      from EMPLOYEES as e,
           (select (@dept := 0)) as z
      order by e.DEPTNO, e.HIREDATE) as x,
     (select 0 NUM
      union all
      select 1 NUM
      union all
      select 2 NUM
      union all
      select 3 NUM
      union all
      select 4 NUM) as p

Although Cartesian joins are usually carefully avoided, multiplying 10,000 rows by a number as small as 5 is very fast. We therefore get each row five times, each time associated with a different value of NUM (0 to 4). Then I compute a value that I call BATCH, which is the sum of NUM and of the initial row number rn. Why this value? Because since NUM varies, we will have some overlap between the values of BATCH.

Suppose we have row numbers 21, 22, 23, 24, and 25. The first row will yield batch numbers 21, 22, 23, 24, and 25; the second one, 22, 23, 24, 25, and 26; and so on. When I want to compute the average of those five rows and associate it with row 23, all I need to do is average the values for batch 25, for which I will retrieve exactly one instance of each of the five rows. But I also want the detail of each row, and I want a 10,000-row result, not a 50,000-row one. I therefore doctor the result so as to have NULL for the details, except on the "center row" of each batch (the copy with NUM equal to 2), and use MAX() to aggregate and squash the five rows into one. And here it is:

select a.DEPTNO,
       max(case a.NUM
            when 2 then a.EMPNO
            else NULL
           end) EMPNO,
       max(case a.NUM
            when 2 then a.LASTNAME
            else NULL
           end) LASTNAME,
       max(case a.NUM
            when 2 then a.HIREDATE
            else NULL
           end) HIREDATE,
       round(max(case a.NUM
                  when 2 then a.SAL
                  else NULL
                 end), 2) SAL,
       round(avg(SAL), 2)
from (select x.rn + p.NUM BATCH,
             p.NUM,
             x.DEPTNO,
             x.EMPNO,
             x.LASTNAME,
             x.HIREDATE,
             x.SAL
      from (select if (@dept = DEPTNO,
                       @rn := @rn + 1,
                       @rn := 1 + least(0, @dept := DEPTNO)) rn,
                   e.DEPTNO,
                   e.EMPNO,
                   e.LASTNAME,
                   e.HIREDATE,
                   e.SAL
            from EMPLOYEES as e,
                 (select (@dept := 0)) as z
            order by e.DEPTNO, e.HIREDATE) as x,
           (select 0 NUM
            union all
            select 1 NUM
            union all
            select 2 NUM
            union all
            select 3 NUM
            union all
            select 4 NUM) as p) as a
group by a.DEPTNO, a.BATCH
having count(*) > 2
order by 1, 4, 2;

Two passing remarks about this code: the having count(*) > 2 clause discards, at both ends of each department, the batches that contain no center row with NUM equal to 2 (those built only from the first two or last two rows); and because the row number restarts at 1 with each department, grouping by both a.DEPTNO and a.BATCH is what prevents a window from straddling two departments.

And now the verdict:

    [snip]
|     60 |  1903 | BARNES        | 2006-12-15 | 4760.32 |            3435.31 |
|     60 |  3240 | FARRELL       | 2006-12-21 | 4046.99 |            4260.59 |
|     60 |  5030 | POULIN        | 2006-12-23 | 3942.44 |            5099.71 |
|     60 |  8345 | PADRON        | 2007-01-03 | 5956.80 |            4551.25 |
|     60 |  3218 | LADD          | 2007-01-12 | 6791.98 |            3920.99 |
|     60 |  3046 | WELLMAN       | 2007-01-18 | 2018.05 |            4021.25 |
|     60 |  2787 | WILLIAMS      | 2007-02-05 |  895.66 |            3263.33 |
|     60 |  3284 | ANDERSON      | 2007-02-15 | 4443.75 |            3196.43 |
|     60 |  3336 | DUNBAR        | 2007-02-16 | 2167.19 |            3191.85 |
|     60 |  4687 | LUO           | 2007-02-18 | 6457.50 |            4594.29 |
|     60 |  6840 | JACKSON       | 2007-02-28 | 1995.17 |            4599.79 |
|     60 | 10417 | NEILL         | 2007-02-28 | 7907.86 |            4480.14 |
|     60 |  6232 | CARLSON       | 2007-03-01 | 4471.24 |            3985.81 |
|     60 |  9697 | LYNCH         | 2007-03-02 | 1568.95 |            4649.35 |
+--------+-------+---------------+------------+---------+--------------------+
10000 rows in set (0.44 sec)

mysql>

Wham, bang. Almost twice as fast as the Oracle analytic function on this example...
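The whole batch trick can also be checked in miniature. The sketch below (invented salaries, a single department for brevity) replicates each row five times with NUM 0 to 4, groups by rn + NUM, keeps the NUM = 2 copy as the window's center, and reproduces what rows between 2 preceding and 2 following would compute:

```python
# Miniature check of the BATCH technique against a centered moving average.
from collections import defaultdict

sals = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]  # ordered by hire date
rows = list(enumerate(sals, start=1))              # (rn, SAL)

batches = defaultdict(list)                        # BATCH -> list of (NUM, SAL)
for rn, sal in rows:
    for num in range(5):                           # the five-row pivot table
        batches[rn + num].append((num, sal))

result = []
for batch in sorted(batches):
    members = batches[batch]
    if len(members) <= 2:                          # HAVING count(*) > 2:
        continue                                   # drop center-less edge batches
    # MAX(CASE NUM WHEN 2 THEN ... END): the single center copy of the batch.
    center_sal = next(s for num, s in members if num == 2)
    result.append((center_sal, round(sum(s for _, s in members) / len(members), 2)))

print(result)
```

Each surviving batch contains exactly one copy with NUM = 2, whose row provides the displayed attributes, while all of the batch's members feed the average; the edge windows shrink to three or four rows, exactly as Oracle's do.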

Conclusion

I hope this article has given you a good idea of what analytic functions are and how you can get similar results, both in terms of result set and response time, with MySQL. Analytic functions definitely have an advantage in terms of elegance and legibility, particularly when compared to the somewhat laborious usage of variables, or to the particularly wild last example.

One thing is certain: if you want to get good performance, don't restrict yourself to "clean" SQL. Analytic functions are not clean SQL. From a relational point of view, they deserve at least to fry in the sixth circle of hell. Performing the same computations as fast as analytic functions do requires using MySQL peculiarities and a good deal of lateral thinking. As General Grant said, "No war has ever been won by a slavish respect of the rules."

There is another thing that you must keep in mind: analytic functions apply to the result set defined by the query in which they appear. If you emulate a function with several subqueries, any additional restriction must be applied in several places, which prevents you from creating a view over the query without fear of getting results that differ from those of the true analytic query. I am not totally persuaded, however, that this is necessarily a bad thing. I fear that if you create a view V_EMPLOYEES_WITH_SAL_RANK, it will only be a matter of time before someone writes:

select distinct e1.DEPTNO, e1.SAL
from V_EMPLOYEES_WITH_SAL_RANK e1
where e1.RANK = (select max(e2.RANK)
                 from V_EMPLOYEES_WITH_SAL_RANK e2
                 where e2.DEPTNO = e1.DEPTNO)

which, as you may have realized (but the optimizer won't), is just an inefficient

select DEPTNO, MIN(SAL)
from EMPLOYEES
group by DEPTNO

Have fun.

Stephane Faroult first discovered the relational model and the SQL language in 1983.



Copyright © 2009 O'Reilly Media, Inc.