Another Analytic Example
Another commonly used Oracle analytic function is:
LAG(expression, offset) over (partition by ... order by ...)
which returns the offsetth previous value for the current row's expression with regard to what is specified by the
over () expression. The most commonly used offset is 1, and
LAG() is very convenient for computing a difference between two successive measures of the same cumulative counter (it returns NULL when the offsetth previous value doesn't exist). "Memorizing" previous values is what we have just done with user variables, and therefore there is nothing new here.
If we want to compute the average number of days between two successive hires by department, we can write something like this with Oracle:
SQL> select DEPTNO, AVG(HIRE_INTERVAL) 2 from (select DEPTNO, 3 HIREDATE - LAG(HIREDATE, 1) 4 over (partition by DEPTNO 5 order by HIREDATE) HIRE_INTERVAL 6 from EMPLOYEES) 7 group by DEPTNO 8 /
DEPTNO AVG(HIRE_INTERVAL) ---------- ------------------ 10 6.6842104 20 3.5556686 30 3.52857866 40 3.50300597 50 3.44334974 60 7.12665996
6 rows selected.
Elapsed: 00:00:00.06 SQL>
By applying the techniques we have used for ranks, we create this with MySQL:
mysql> select DEPTNO, avg(HIRE_INTERVAL) -> from (select DEPTNO, -> if (@dept = DEPTNO, -> datediff(HIREDATE, @hd) + least(0, @hd := HIREDATE), -> NULL + least(0, @dept := DEPTNO) + (@hd := NULL)) -> HIRE_INTERVAL -> from EMPLOYEES, -> (select (@dept := 0)) as a -> order by DEPTNO, HIREDATE) as b -> group by DEPTNO; +--------+--------------------+ | DEPTNO | avg(HIRE_INTERVAL) | +--------+--------------------+ | 10 | 6.6858237547893 | | 20 | 3.5564598168871 | | 30 | 3.5303643724696 | | 40 | 3.5002506265664 | | 50 | 3.4406111384919 | | 60 | 7.1329243353783 | +--------+--------------------+ 6 rows in set (0.06 sec)
We must be very careful not to forget the
order by clause in the subquery aliased as b, since this is what matches the
order by in the
over () part of the analytic function. You may notice that although the data is the same, the numerical values are not identical in Oracle and in MySQl, though they're very close. This is because the Oracle difference of two dates gives a fractional number of days, while the
datediff() MySQL function ignores the time part. I have been too lazy to adjust one or the other...
The opposite of
LAG() is the
LEAD() function, which returns the offsetth next value. It's easy to cumulate or save values we have just fetched from the database. But dealing with values that haven't been displayed yet requires buffering.
LEAD() illustrates this case extremely well—little more is required to get the desired result than inverting the sort order in the
over () clause with Oracle, or in the subquery with MySQL. Nevertheless, there are some cases that require attention. I mentioned when I introduced analytic functions that the
over () clause can also include a range restriction. In all the previous examples, I have applied the function to the full set of rows defined by the
partition by expression. There are several cases, however, when you may wish to further restrict the scope of application, for instance when computing a cumulative amount.
The most common range restrictions are:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, usually shortened to
RANGE UNBOUNDED PRECEDING, which simply means "don't compute on all the partition—only from the beginning of the partition up to and including the current row" (what we would want for a cumulated amount).
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING, usually shortened to
RANGE UNBOUNDED FOLLOWING, which is the opposite of the previous one and might be useful for computing how much remains to be paid on a loan at various dates, for instance.
ROWS BETWEEN n PRECEDING AND n FOLLOWING, which is the type of expression we would use to compute a moving average.
RANGE UNBOUNDED PRECEDING presents no challenge; we can best solve it with user variables. As I said when comparing
RANGE UNBOUNDED FOLLOWING is mostly a case of inverting the ordering, at least temporarily before reordering again.
Windows that are centered on the current row are more difficult to handle—at least, more difficult to handle efficiently—because we encounter once again a case of strong correlation with the current row. Let's take, for instance, the computation of the average salary of the current employee and of the two employees who were hired before him in the same department, and of the two employees who were hired after him in the same department.
SQL> select DEPTNO, 2 EMPNO, LASTNAME, HIREDATE, 3 SAL, 4 round(AVG(SAL) over (partition by DEPTNO 5 order by HIREDATE 6 rows between 2 preceding 7 and 2 following), 2) MOVING_AVERAGE 8 from EMPLOYEES 9 order by DEPTNO, HIREDATE, EMPNO;
[snip] 60 3284 ANDERSON 15-FEB-07 4443.75 3196.43 60 3336 DUNBAR 16-FEB-07 2167.19 3191.85 60 4687 LUO 18-FEB-07 6457.5 4594.29 60 6840 JACKSON 28-FEB-07 1995.17 4599.79 60 10417 NEILL 28-FEB-07 7907.86 4480.14 60 6232 CARLSON 01-MAR-07 4471.24 3985.81 60 9697 LYNCH 02-MAR-07 1568.95 4649.35
10000 rows selected.
Elapsed: 00:00:00.83 SQL>
Finding, for each row, the two rows preceding it and the two rows following it by date isn't obvious (it would be easier if date intervals were regular, but that isn't the case). Our first attempts at computing the rank has convinced you, I hope, that correlated subqueries that fire for every row and search the same table using unindexed criteria is performance suicide; we don't want to scan 10,000 times 10,000 rows.
Can we exploit what we have done so far? We can try, for instance, computing a row number, then self-joining with a condition on the row numbers, but we still end up joining a 10,000 row set to another 10,000 row set on a computed column. This is not likely to be very fast, and definitely unlikely to compare favorably to the sub-second Oracle execution (and display) time.
You must realize that when you have a sliding window (as is the case here, since we have a window of five rows), each row intervenes more than one time, even if it displayed only once. If we consider employee #4687 (LUO) in the Oracle output, her attributes are displayed only once but her salary actually intervenes not only in the computation of the average displayed on "her" line, but also in the computation of the average and the two lines preceding and two lines following. Actually, with the exception of the very first and very last employees hired in each department, the salary on each row is used five times. Can we find an easy and inexpensive way to see each row appear five times? We can, and here is how.
First, let's use the techniques we have already seen to compute a simple row number. We need to number everything, so we use a user variable.
select if (@dept = DEPTNO, @rn := @rn + 1, @rn := 1 + least(0, @dept := DEPTNO)) rn, e.DEPTNO, e.EMPNO, e.LASTNAME, e.HIREDATE, e.SAL from EMPLOYEES as e, (select (@dept := 0)) as z order by e.DEPTNO, e.HIREDATE