AddThis Social Bookmark Button


Emulating Analytic (AKA Ranking) Functions with MySQL: Part 2

by Stephane Faroult

As I have discussed more than once, the previously outlined emulation of analytic functions is fine as long as we wisely restrain ourselves to using only the n first (or last) elements in what would have been our partition clause. This is a factor only when n is greater than 1; if n is 1, we don't need anything more sophisticated than the regular usage of MAX() or MIN(). There are cases, however, when we may need to hold more than the first 100 values or so in our list. Suppose, for instance, that we want to compute the median salary for each of our departments. We would need big ordered lists to be able to locate the middle value. Raising the value of the group_concat_max_len environment variable and hoping for the best is not a responsible option.

Fortunately, there is another possibility, and once again we will use techniques that are not available in the Oracle world: using variables on which we perform calculations in queries. Numbering rows as they are returned and emulating Oracle's rownum virtual column is a well-known technique. Mark Malakanov gave a particularly elegant example on his blog in which the variable is initialized inside the query:

SELECT @rownum:=@rownum+1 rownum, t.* FROM (SELECT @rownum:=0) r, mytable t;

Actually, approach isn't strictly identical to the Oracle result, and it can be checked by adding an order by clause. With Oracle, rownum is incremented as rows are returned from the database and before any sort is performed. If you add an order by clause, your rownum values get mixed up. With MySQL, the value is incremented as it is displayed and therefore computed after any sort. From a practical point of view, this is rather more convenient.

Assuming that our query is suitably sorted by department number and decreasing salary, if we want to rank by department, we need to reinitialize the counter each time we switch department in our result set. We therefore need two variables—one to compute the number (in the following example we use a simple row number and ignore ties, for the sake of simplification), and one to hold the previous department number value so we can compare it to the value in the current row. We might be very tempted to use something such as:

select  if (@dept = DEPTNO, @rn := @rn + 1, @rn := 1) rownum,
          (@dept := DEPTNO) DEPTNO,
   from  EMPLOYEES,
        (select (@dept := 0)) as x
   order  by DEPTNO, SAL desc;

There is a catch, however. The MySQL documentation states clearly that there is no guarantee about the order of evaluation of the various variables in the different (pseudo) columns. This is legitimate, because in relational theory, the order of columns is irrelevant. However, the previous query implicitly assumes that we compare the current department number to the value stored in @dept before assigning it to the same variable. If assignments are performed the other way around, it no longer works.
The only solution available to force comparisons and assignments to be performed in the order we want is to have all of them done inside a single element in our select list.

An elegant way to compute a variable and hide it from public view is to assign it in an expression that discards part of its result. Baron Schwartz gives an excellent example in a recent ONLamp article that also discusses ranking: the idea is to use the least() function, which only returns the smallest of its parameters but has to evaluate all of them. Since department numbers are positive values, we know that least(0, @dept := DEPTNO) will assign the value of DEPTNO to the @dept variable, then return 0, which we can safely add to any numerical value (the rank, for instance.) A proper way to rewrite the previous, potentially wrong query might therefore be something such as:

  select if (@dept =  DEPTNO, @rn := @rn + 1,
                            @rn := 1 + least(0, @dept  := DEPTNO)) rownum,
      (select (@dept := 0)) as x
 order  by DEPTNO, SAL desc;

In that case, the department number is first compared to the contents of the variable; if they are equal, we simply have to increment the row number. If they aren't, we initialize the row number to 1 and surreptitiously add a 0, the only purpose of which is to set @dept to the new DEPTNO value. Neat. And we control the order in which the various operations are performed.

Now that you understand the principle, we can refine and transform our row number into a true rank. If we want to emulate rank(), we need to hold not only the previous value of DEPTNO but also the previous value of SAL, to check whether we must increase our rank. Moreover, we must check how many identical values we have in order to decide by how much we must increase the rank value next (dense_rank() is easier to deal with).

1: if (@dept =  DEPTNO, 
2:       if (@sal = SAL,
3:              @rnk := @rnk + least(0, @inc :=  @inc + 1),
4:              @rnk := @rnk + greatest(@inc,  @inc := 1)
5:                           + least(0, @sal :=  SAL)),
6:       @rnk := 1 + least(0, @dept := DEPTNO) 
7:                 + least(0, @sal := SAL)
8:                 + least(0, @inc := 1)) rank

A few explanations are probably welcome:

  • On line 1 I test whether we are still dealing with the same department, which is handled by lines 2 to 5.
  • If we have a different department (lines 6 and 7), operations are relatively simple: I set the rank to 1 and use the least() trick to initialize three variables—one to store the current department number as with the row number computation, one to store the current salary, and one to store by how much I wil have to increase the rank next (1 by default).
  • Let's now consider in detail lines 2 through 5, and what happens when we stay within the same department. First of all, is the current salary identical to the previous one? In the case that it is (line 3), I reassign the very same value to the @rnk variable and use the least() trick to increment, behind the scenes, the value by which the rank will change the next time the salary changes. I may have a tie—for instance, a rank 23 followed by another rank 23—but if afterwards the salary is smaller, then I want a rank 25.

If the salary is different (which means less) than the previous one, then MySQL executes what we find on lines 4 and 5. We add the increment and reset it to 1 in a single operation by using a trick very similar to what we have done with least(). This particular step is somewhat bold, because although we know that both expressions will be evaluated, I have found nowhere any guarantee that they will be evaluated from left to right. The left-to-right evaluation order is, however, "natural" in this context (witness coalesce()), and it works. Then we use least() to reset the @sal variable to the new current value of the salary. The final query becomes:

  mysql>  select *
     -> from (select if (@dept = DEPTNO, 
     ->                    if (@sal = SAL,
     ->                        @rnk := @rnk + least(0,  @inc := @inc + 1),
     ->                        @rnk := @rnk +  greatest(@inc, @inc := 1)
     ->                                     + least(0,  @sal := SAL)),
     ->                    @rnk := 1 + least(0, @dept  := DEPTNO) 
     ->                              + least(0, @sal :=  SAL)
     ->                              + least(0, @inc :=  1)) rank,
     ->              DEPTNO,
     ->              LASTNAME,
     ->               FIRSTNAME,
     ->              SAL
     ->       from EMPLOYEES,
     ->            (select (@dept := 0)) as x
     ->       order by DEPTNO, SAL desc) as y
     -> where rank <= 5;
 | rank  | DEPTNO | LASTNAME   | FIRSTNAME |  SAL     |
 |    1 |      10 | WOLFORD    | KATHY     | 7997.34 | 
 |    2 |      10 | HARRIS     | GARLAND   | 7993.74 | 
 |    4 |      60 | BURMEISTER | HATTIE    | 7973.04 | 
 |    5 |      60 | LADD       | JEFFREY   |   7952.5 | 
 30  rows in set (0.06 sec)

We are now on equal footing with the Oracle analytic function from a performance standpoint. Admittedly, from a syntactical point of view the analytic functions are much better. The group_concat()/find_in_set() combination is a good deal better, too, if the restrictions and the performance are acceptable. But the functionality is here and can be easily extended to several analytic functions.

Pages: 1, 2, 3

Next Pagearrow