Emulating Analytic (AKA Ranking) Functions with MySQL: Part 2
Pages: 1, 2, 3
select x.rn + p.NUM BATCH,
p.NUM,
x.DEPTNO,
x.EMPNO,
x.LASTNAME,
x.HIREDATE,
x.SAL
from (select if (@dept = DEPTNO,
@rn := @rn + 1,
@rn := 1 + least(0, @dept := DEPTNO)) rn,
e.DEPTNO,
e.EMPNO,
e.LASTNAME,
e.HIREDATE,
e.SAL
from EMPLOYEES as e,
(select (@dept := 0)) as z
order by e.DEPTNO, e.HIREDATE) as x,
(select 0 NUM
union all
select 1 NUM
union all
select 2 NUM
union all
select 3 NUM
union all
select 4 NUM) as p
Although Cartesian joins are usually carefully avoided, multiplying 10,000 rows by as small a number as 5 is very fast. Therefore, we are going to get each row five times, each time associated to a different value of NUM (0 to 5). Then I compute a value that I call BATCH, which is the sum of NUM and of the initial row number rn. Why this value? Because since NUM varies, we will have some overlap for the values of BATCH.
Suppose we have row numbers 21, 22, 23, 24, an 25. The first row will yield batch numbers 21, 22, 23, 24, and 25. The second one 22, 23, 24, 25, and 26. And so on. When I want to compute the average of those 5 rows and associate it with row 23, all I need to do is average values for batch 25, for which I will retrieve one instance of each of the rows. Now I also want the detail for each row, and I don't want a 50,000 but a 10,000 row result. I therefore doctor the result so I'll have NULL for the details, except for the "center row" in my batch, and use MAX() to aggregate and squash five rows into one. And here it is:
select a.DEPTNO,
max(case a.NUM
when 2 then a.EMPNO
else NULL
end) EMPNO,
max(case a.NUM
when 2 then a.LASTNAME
else NULL
end) LASTNAME,
max(case a.NUM
when 2 then a.HIREDATE
else NULL
end) HIREDATE,
round(max(case a.NUM
when 2 then a.SAL
else NULL
end), 2) SAL,
round(avg(SAL), 2)
from (select x.rn + p.NUM BATCH,
p.NUM,
x.DEPTNO,
x.EMPNO,
x.LASTNAME,
x.HIREDATE,
x.SAL
from (select if (@dept = DEPTNO,
@rn := @rn + 1,
@rn := 1 + least(0, @dept := DEPTNO)) rn,
e.DEPTNO,
e.EMPNO,
e.LASTNAME,
e.HIREDATE,
e.SAL
from EMPLOYEES as e,
(select (@dept := 0)) as z
order by e.DEPTNO, e.HIREDATE) as x,
(select 0 NUM
union all
select 1 NUM
union all
select 2 NUM
union all
select 3 NUM
union all
select 4 NUM) as p) as a
group by a.DEPTNO, a.BATCH
having count(*) > 2
order by 1, 4, 2;
Two passing remarks about this code:
- The rather wild treatment creates some rounding errors. I have had to round not only the average salary, but the salary itself.
- We have a problem at the boundaries. For the computations to be valid, we need at least three rows in the aggregate. This is the reason for the
havingclause.
And now the verdict:
[snip]
| 60 | 1903 | BARNES | 2006-12-15 | 4760.32 | 3435.31 |
| 60 | 3240 | FARRELL | 2006-12-21 | 4046.99 | 4260.59 |
| 60 | 5030 | POULIN | 2006-12-23 | 3942.44 | 5099.71 |
| 60 | 8345 | PADRON | 2007-01-03 | 5956.80 | 4551.25 |
| 60 | 3218 | LADD | 2007-01-12 | 6791.98 | 3920.99 |
| 60 | 3046 | WELLMAN | 2007-01-18 | 2018.05 | 4021.25 |
| 60 | 2787 | WILLIAMS | 2007-02-05 | 895.66 | 3263.33 |
| 60 | 3284 | ANDERSON | 2007-02-15 | 4443.75 | 3196.43 |
| 60 | 3336 | DUNBAR | 2007-02-16 | 2167.19 | 3191.85 |
| 60 | 4687 | LUO | 2007-02-18 | 6457.50 | 4594.29 |
| 60 | 6840 | JACKSON | 2007-02-28 | 1995.17 | 4599.79 |
| 60 | 10417 | NEILL | 2007-02-28 | 7907.86 | 4480.14 |
| 60 | 6232 | CARLSON | 2007-03-01 | 4471.24 | 3985.81 |
| 60 | 9697 | LYNCH | 2007-03-02 | 1568.95 | 4649.35 |
+--------+-------+---------------+------------+---------+--------------------+
10000 rows in set (0.44 sec)
mysql>
Wham, bang. Almost twice as fast as the Oracle analytic function on this example...
Conclusion
I hope this article has given you a good idea of what analytic functions are and how you can get similar results, both in terms of result set and response time, with MySQL. Analytic functions definitely have an advantage in terms of elegance and legibility, particularly when compared to the somewhat laborious usage of variables, or to the particularly wild last example.
One thing is certain: if you want to get good performance, don't restrict yourself to "clean" SQL. Analytic functions are not clean SQL. From a relational point of view, they deserve at least to fry in the sixth circle of hell. Performing the same computations as fast as analytic functions do requires using MySQL peculiarities and a good deal of lateral thinking. As General Grant said, "No war has ever been won by a slavish respect of the rules."
There is another thing that you must keep in mind: analytic functions apply to the result set defined by the query in which they appear. If you emulate the function with several subqueries, any additional restriction must be applied at several places. It prevents creating a view upon the query for fear of getting wrong results, compared to the true analytic query. I am not totally persuaded, however, that this is necessarily a bad thing. I fear that if you create a view V_EMPLOYEES_WITH_SAL_RANK, it will only be a matter of time before someone writes:
select distinct e1.DEPTNO, e1.SAL
from V_EMPLOYEES_WITH_SAL_RANK e1
where e1.RANK = (select max(e2.RANK)
from V_EMPLOYEES_WITH_SAL_RANK e2
where e2.DEPTNO = e1.DEPTNO)
which, as you may have realized (but the optimizer won't), is just an inefficient
select DEPTNO, MIN(SAL) FROM EMPLOYEES group by DEPTNO
Have fun.
Stephane Faroult first discovered the relational model and the SQL language in 1983.
Return to ONLamp.com.


