AddThis Social Bookmark Button


Emulating Analytic (AKA Ranking) Functions with MySQL: Part 2
Pages: 1, 2, 3

Another Analytic Example

Another commonly used Oracle analytic function is:

LAG(expression, offset) over (partition by ... order by ...)

which returns the offsetth  previous value for the current row's expression with regard to what is specified by the over () expression. The most commonly used offset is 1, and LAG() is very convenient for computing a difference between two successive measures of the same cumulative counter (it returns NULL when the offsetth  previous value doesn't exist). "Memorizing" previous values is what we have just done with user variables, and therefore there is nothing new here.

If we want to compute the average number of days between two successive hires by department, we can write something like this with Oracle:

   2  from  (select DEPTNO,
   3               HIREDATE - LAG(HIREDATE, 1)
   4                             over (partition by  DEPTNO
   5                                   order by HIREDATE)  HIRE_INTERVAL
   6         from EMPLOYEES)
   7   group by DEPTNO
   8  /
   ----------  ------------------
           10          6.6842104
           20          3.5556686
           30         3.52857866
           40         3.50300597
           50         3.44334974
           60         7.12665996
6 rows  selected.
Elapsed:  00:00:00.06

By applying the techniques we have used for ranks,  we create this with MySQL:

 mysql>  select DEPTNO, avg(HIRE_INTERVAL)
       -> from (select DEPTNO,
       ->              if (@dept = DEPTNO,
       ->                     datediff(HIREDATE, @hd) +  least(0, @hd := HIREDATE),
       ->                     NULL + least(0, @dept :=  DEPTNO) + (@hd := NULL))
       ->                                                      HIRE_INTERVAL
       ->        from EMPLOYEES,
       ->            (select (@dept := 0)) as a
       ->        order by DEPTNO, HIREDATE) as b
       -> group by DEPTNO;
   |     10 |     6.6858237547893 | 
   |     20 |     3.5564598168871 | 
   |     30 |     3.5303643724696 | 
   |     40 |     3.5002506265664 | 
   |     50 |     3.4406111384919 | 
   |     60 |     7.1329243353783 | 
   6 rows  in set (0.06 sec)

We must be very careful not to forget the order by clause in the subquery aliased as b, since this is what matches the order by in the over () part of the analytic function. You may notice that although the data is the same, the numerical values are not identical in Oracle and in MySQl, though they're very close. This is because the Oracle difference of two dates gives a fractional number of days, while the datediff() MySQL function ignores the time part. I have been too lazy to adjust one or the other...

Foreseeing Values

The opposite of LAG() is the LEAD() function, which returns the offsetth next value. It's easy to cumulate or save values we have just fetched from the database. But dealing with values that haven't been displayed yet requires buffering.

Very often—and LEAD() illustrates this case extremely well—little more is required to get the desired result than inverting the sort order in the over () clause with Oracle, or in the subquery with MySQL. Nevertheless, there are some cases that require attention. I mentioned when I introduced analytic functions that the over ()  clause can also include a range restriction. In all the previous examples, I have applied the function to the full set of rows defined by the partition by expression. There are several cases, however, when you may wish to further restrict the scope of application, for instance when computing a cumulative amount.

The most common range restrictions are:

  • RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, usually shortened to RANGE UNBOUNDED PRECEDING, which simply means "don't compute on all the partition—only from the beginning of the partition up to and including the current row" (what we would want for a cumulated amount).
  • RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING, usually shortened to RANGE UNBOUNDED FOLLOWING, which is the opposite of the previous one and might be useful for computing how much remains to be paid on a loan at various dates, for instance.
  • ROWS BETWEEN n PRECEDING AND n FOLLOWING, which is the type of expression we would use to compute a moving average.

RANGE UNBOUNDED PRECEDING presents no challenge; we can best solve it with user variables. As I said when comparing LEAD() with LAG(), RANGE UNBOUNDED FOLLOWING is mostly a case of inverting the ordering, at least temporarily before reordering again.

Windows that are centered on the current row are more difficult to handle—at least, more difficult to handle efficiently—because we encounter once again a case of strong correlation with the current row. Let's take, for instance, the computation of the average salary of the current employee and of the two employees who were hired before him in the same department, and of the two employees who were hired after him in the same department.

 SQL>  select DEPTNO,
     3          SAL,
     4          round(AVG(SAL) over (partition by DEPTNO
     5                              order by HIREDATE
     6                              rows between 2  preceding
     7                                       and 2  following), 2) MOVING_AVERAGE
     8  from  EMPLOYEES
     9   order by DEPTNO, HIREDATE, EMPNO;

           60        3284 ANDERSON             15-FEB-07    4443.75        3196.43
           60        3336 DUNBAR               16-FEB-07    2167.19        3191.85
           60        4687 LUO                   18-FEB-07     6457.5        4594.29
           60        6840 JACKSON              28-FEB-07    1995.17        4599.79
           60       10417 NEILL                 28-FEB-07    7907.86        4480.14
           60        6232 CARLSON              01-MAR-07    4471.24        3985.81
           60        9697 LYNCH                 02-MAR-07    1568.95        4649.35
10000  rows selected.
Elapsed:  00:00:00.83

Finding, for each row, the two rows preceding it and the two rows following it by date isn't obvious (it would be easier if date intervals were regular, but that isn't the case). Our first attempts at computing the rank has convinced you, I hope, that correlated subqueries that fire for every row and search the same table using unindexed criteria is performance suicide; we don't want to scan 10,000 times 10,000 rows.
Can we exploit what we have done so far? We can try, for instance, computing a row number, then self-joining with a condition on the row numbers,  but we still end up joining a 10,000 row set to another 10,000 row set on a computed column. This is not likely to be very fast, and definitely unlikely to compare favorably to the sub-second Oracle execution (and display) time.

You must realize that when you have a sliding window (as is the case here, since we have a window of five rows), each row intervenes more than one time, even if it displayed only once. If we consider employee #4687 (LUO) in the Oracle output, her attributes are displayed only once but her salary actually intervenes not only in the computation of the average displayed on "her" line, but also in the computation of the average and the two lines preceding and two lines following. Actually, with the exception of the very first and very last employees hired  in each department, the salary on each row is used five times. Can we find an easy and inexpensive way to see each row appear five times? We can, and here is how.
First, let's use the techniques we have already seen to compute a simple row number. We need to number everything, so we use a user variable.

  select  if (@dept = DEPTNO, 
              @rn := @rn + 1,
              @rn := 1 + least(0, @dept :=  DEPTNO)) rn, 
 from EMPLOYEES  as e,
      (select (@dept := 0)) as z
 order  by e.DEPTNO, e.HIREDATE

Pages: 1, 2, 3

Next Pagearrow