AddThis Social Bookmark Button

Print

Emulating Analytic (AKA Ranking) Functions with MySQL: Part 2
Pages: 1, 2, 3

Then, since we need each row five times, we are going to multiply our rows by using a pivot table—that is, simply by performing a Cartesian join (without any join condition) with a five-row table.



   select x.rn + p.NUM BATCH,
          p.NUM,
          x.DEPTNO,
          x.EMPNO,
          x.LASTNAME,
          x.HIREDATE,
          x.SAL
   from  (select if (@dept = DEPTNO, 
                       @rn := @rn + 1,
                       @rn := 1 + least(0, @dept  := DEPTNO)) rn, 
                 e.DEPTNO,
                 e.EMPNO,
                 e.LASTNAME,
                 e.HIREDATE,
                 e.SAL
         from EMPLOYEES as e,
               (select (@dept := 0)) as z
         order by e.DEPTNO, e.HIREDATE) as x,
        (select 0 NUM
         union all
         select 1 NUM
         union all
         select 2 NUM
         union all
         select 3 NUM
         union all
         select 4 NUM) as p

Although Cartesian joins are usually carefully avoided, multiplying 10,000 rows by as small a number as 5 is very fast. Therefore, we are going to get each row five times, each time associated to a different value of NUM (0 to 5). Then I compute a value that I call BATCH, which is the sum of NUM and of the initial row number rn. Why this value? Because since NUM varies, we will have some overlap for the values of BATCH.

Suppose we have row numbers 21, 22, 23, 24, an 25. The first row will yield batch numbers 21, 22, 23, 24, and 25. The second one 22, 23, 24, 25, and 26. And so on. When I want to compute the average of those 5 rows and associate it with row 23,  all I need to do is average values for batch 25, for which I will retrieve one instance of each of the rows. Now I also want the detail for each row, and I don't want a 50,000 but a 10,000 row result. I therefore doctor the result so I'll have NULL for the details, except for the "center row" in my batch, and use MAX() to aggregate and squash five rows into one. And here it is:

  select a.DEPTNO,
        max(case a.NUM
             when 2 then a.EMPNO
             else NULL
            end) EMPNO,
        max(case a.NUM
             when 2 then a.LASTNAME
             else NULL
            end) LASTNAME,
        max(case a.NUM
             when 2 then a.HIREDATE
             else NULL
            end) HIREDATE,
        round(max(case a.NUM
                   when 2 then a.SAL
                   else NULL
                  end), 2) SAL,
        round(avg(SAL), 2)
 from  (select x.rn + p.NUM BATCH,
              p.NUM,
              x.DEPTNO,
              x.EMPNO,
              x.LASTNAME,
              x.HIREDATE,
              x.SAL
       from (select if (@dept = DEPTNO, 
                            @rn := @rn + 1,
                            @rn := 1 + least(0,  @dept := DEPTNO)) rn, 
                     e.DEPTNO,
                     e.EMPNO,
                     e.LASTNAME,
                     e.HIREDATE,
                     e.SAL
             from EMPLOYEES as e,
                  (select (@dept := 0)) as z
             order by e.DEPTNO, e.HIREDATE) as  x,
            (select 0 NUM
             union all
             select 1 NUM
             union all
             select 2 NUM
             union all
             select 3 NUM
             union all
             select 4 NUM) as p) as a
 group  by a.DEPTNO, a.BATCH
 having  count(*) > 2
 order  by 1, 4, 2;

Two passing remarks about this code:

  • The rather wild treatment creates some rounding errors. I have had to round not only the average salary, but the salary itself.
  • We have a problem at the boundaries. For the computations to be valid, we need at least three rows in the aggregate. This is the reason for the having clause.

And now the verdict:

    [snip]
|     60 |  1903 | BARNES        | 2006-12-15 | 4760.32 |            3435.31 |
|     60 |  3240 | FARRELL       | 2006-12-21 | 4046.99 |            4260.59 |
|     60 |  5030 | POULIN        | 2006-12-23 | 3942.44 |            5099.71 |
|     60 |  8345 | PADRON        | 2007-01-03 | 5956.80 |            4551.25 |
|     60 |  3218 | LADD          | 2007-01-12 | 6791.98 |            3920.99 |
|     60 |  3046 | WELLMAN       | 2007-01-18 | 2018.05 |            4021.25 |
|     60 |  2787 | WILLIAMS      | 2007-02-05 |  895.66 |            3263.33 |
|     60 |  3284 | ANDERSON      | 2007-02-15 | 4443.75 |            3196.43 |
|     60 |  3336 | DUNBAR        | 2007-02-16 | 2167.19 |            3191.85 |
|     60 |  4687 | LUO           | 2007-02-18 | 6457.50 |            4594.29 |
|     60 |  6840 | JACKSON       | 2007-02-28 | 1995.17 |            4599.79 |
|     60 | 10417 | NEILL         | 2007-02-28 | 7907.86 |            4480.14 |
|     60 |  6232 | CARLSON       | 2007-03-01 | 4471.24 |            3985.81 |
|     60 |  9697 | LYNCH         | 2007-03-02 | 1568.95 |            4649.35 |
+--------+-------+---------------+------------+---------+--------------------+
10000 rows in set (0.44 sec)

mysql>

Wham, bang. Almost twice as fast as the Oracle analytic function on this example...

Conclusion

I hope this article has given you a good idea of what analytic functions are and how you can get similar results, both in terms of result set and response time, with MySQL. Analytic functions definitely have an advantage in terms of elegance and legibility, particularly when compared to the somewhat laborious usage of  variables, or to the particularly wild last example.

One thing is certain: if you want to get good performance, don't restrict yourself to "clean" SQL. Analytic functions are not clean SQL. From a relational point of view, they deserve at least to fry in the sixth circle of hell. Performing the same computations as fast as analytic functions do requires using MySQL peculiarities and a good deal of lateral thinking. As General Grant said, "No war has ever been won by a slavish respect of the rules."

There is another thing that you must keep in mind: analytic functions apply to the result set defined by the query in which they appear. If you emulate the function with several subqueries, any additional restriction must be applied at several places. It prevents creating a view upon the query for fear of getting wrong results, compared to the true analytic query. I am not totally persuaded, however, that this is necessarily a bad thing. I fear that if you create a view V_EMPLOYEES_WITH_SAL_RANK, it will only be a matter of time before someone writes:

       select distinct e1.DEPTNO, e1.SAL
          from V_EMPLOYEES_WITH_SAL_RANK e1
          where e1.RANK = (select max(e2.RANK)
                           from V_EMPLOYEES_WITH_SAL_RANK  e2
                           where e2.DEPTNO =  e1.DEPTNO)

which, as you may have realized (but the optimizer won't), is just an inefficient

 select  DEPTNO, MIN(SAL)
   FROM  EMPLOYEES
   group  by DEPTNO

Have fun.

Stephane Faroult first discovered the relational model and the SQL language in 1983.


Return to ONLamp.com.


-->