AddThis Social Bookmark Button

Print

Emulating Analytic (AKA Ranking) Functions with MySQL
Pages: 1, 2, 3, 4

MySQL now:

mysql>select *
    -> from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
    ->              (select 1 + count(*)
    ->               from EMPLOYEES b
    ->               where b.DEPTNO = a.DEPTNO
    ->                 and b.SAL > a.SAL) RANK
    ->       from EMPLOYEES as a) as x
    -> where x.RANK <= 5
    -> order by x.DEPTNO, x.RANK;
+--------+-------+------------+-----------+---------+------+
| DEPTNO | EMPNO | LASTNAME   | FIRSTNAME | SAL     | RANK |
+--------+-------+------------+-----------+---------+------+
|     10 |  4751 | WOLFORD    | KATHY     | 7997.34 |    1 |
|     10 | 10517 | HARRIS     | GARLAND   | 7993.74 |    2 |
  [snip]
|     60 |  4013 | BURMEISTER | HATTIE    | 7973.04 |    4 |
|     60 |  1472 | LADD       | JEFFREY   |  7952.5 |    5 |
+--------+-------+------------+-----------+---------+------+
30 rows in set (5 min 29.12 sec)

Exactly the same result (<blush>except for the slightly embarrassing line under the table</blush>.) Let's take a closer look at what we are doing. Our subquery that computes the rank is correlated: that means that it refers to the current row of EMPLOYEES, through both the DEPTNO and the SAL columns. We need them to compute how many employees in the same department earn a bigger salary than the current employee. In other words, we must fire the query for each of the 10,000 rows we inspect. Now, when I created my table, I defined EMPNO as the primary key, and created an additional index on LASTNAME, since I expected it to be a frequent search criterion. Neither DEPTNO nor SAL are indexed, in either database. Indexing DEPTNO, given the distribution, is pointless, because it isn't selective enough. Ten thousand times a full scan over a ten thousand row table: one hundred million rows inspected in all. Somewhat more visible than fourteen scans of a fourteen row table...

Let me be upfront about this, I dislike adding indexes. People often underestimate their impact on inserts, deletes, and (sometimes) updates, which can be very high. I don't want to fix a query and break the nightly batch program. There are also cases when I couldn't index. Suppose that instead of having departments, we have stockmarkets. Instead of having employees, we have securities, and instead of ranking salaries, we want to rank the difference between the current price and yesterday's closing price, as a percentage. There is a high update rate, and even if we wanted to, MySQL doesn't allow you to index a computed column. What would we do? Never mind, this isn't a real life assignment, just an ONLamp article. Let's just try to save face and index the two columns that we need to correlate the main query and the subquery.

mysql> create index tempidx on EMPLOYEES(DEPTNO, SAL);
Query OK, 10000 rows affected (0.14 sec)
Records: 10000  Duplicates: 0  Warnings: 0

When we run the query again, here is the bottom line:

30 rows in set (26.48 sec)

We have avoided having to commit hara-kiri (just) but compared to Oracle, it's still far from glorious. Shall we have to kowtow to the mighty Oracle? Let's not throw in the towel yet.

Getting the Top Values Efficiently, Take One

When we run into a performance problem, there is no such thing as RTFMing. We are trying to get with MySQL the same results as SQL extensions that Oracle has, and MySQL hasn't. Instead of using an otherwise squeaky clean SQL construct such as the correlated subquery in the select list, couldn't we find extensions that MySQL has and Oracle hasn't to help us? Perhaps it would be wise to restate the problem. We want to compare the employee in the current row to the employees in the same department. How would we do it manually? I don't know about you, but I don't think that I would compute for each employee in turn how many people make more in the same department. I'd rather establish a full list of employees in the same department, and sort them by salary (which is probably what Oracle does behind the scene); and I'd reuse the same list for each employee in the department instead of recomputing the list each time. What we need is a sorted list.

There is precisely a function that builds a list, GROUP_CONCAT, and interestingly it is a function that Oracle natively lacks (even if it is possible to create such a function as a user-written function). The MySQL documentation provides the full syntax as:

GROUP_CONCAT([DISTINCT] expr [,expr ...]
             [ORDER BY {unsigned_integer | col_name | expr}
                 [ASC | DESC] [,col_name ...]]
             [SEPARATOR str_val])

Have you noticed the order by passed as a parameter? Doesn't it ring a bell? There is of course a catch: the resulting string has a limited length (although adjustable through the group_concat_max_len variable), 1024 characters by default. I shall return to this topic in a moment, but since for the time being I am only interested by the first five items in the list (since I want the people who have the top five salaries) the default maximum length is much more than I need.

I now have found how to construct my sorted list. But how shall I determine the position of the employee described by the current row inside this list? We have another function (which once again Oracle lacks) that fits our requirements, FIND_IN_SET(str,strlist). You should take note that in this function, the separator used in strlist is required to be a comma, which means that we must use a comma (the default) as the separator in our call to GROUP_CONCAT(). So, if whatever we need to store in our list may contain a comma, we must make use of REPLACE() to substitute this comma for an innocuous character whenever we add an item to the list, or search an item in the list.

Properly equipped with these two functions, I can now write and run my query (after having dropped the index on DEPTNO and SAL to be in the same context as Oracle):

mysql> select *
    -> from (select e.DEPTNO,
    ->              e.EMPNO,
    ->              e.LASTNAME,
    ->              e.FIRSTNAME,
    ->              e.SAL,
    ->              find_in_set(e.SAL, x.SALLIST) RANK
    ->       from EMPLOYEES as e,
    ->            (select DEPTNO, group_concat(SAL
    ->                                         order by SAL desc) SALLIST
    ->             from EMPLOYEES
    ->             group by DEPTNO) as x
    ->       where e.DEPTNO = x.DEPTNO) as z
    -> where RANK <= 5
    ->   and RANK > 0
    -> order by DEPTNO, RANK;

A word about the list we build: we haved simply concatenated the salaries by decreasing order, letting MySQL implicitly convert float values to character strings. Since we have about 1,600 employees per department, the default length of 1024 characters is grossly insufficient to build a complete list. However, as I have said previously, we don't need the complete list, we just need the first five values. Nevertheless, we must anticipate that salaries who are not in the top 120 or whereabout will fail to be stored, and therefore will not be found by the FIND_IN_SET() function (which will return 0.) Hence the necessity for an additional condition that RANK must be strictly positive. We get the same result as usual:

+--------+-------+------------+-----------+---------+------+
| DEPTNO | EMPNO | LASTNAME   | FIRSTNAME | SAL     | RANK |
+--------+-------+------------+-----------+---------+------+
|     10 |  4751 | WOLFORD    | KATHY     | 7997.34 |    1 |
|     10 | 10517 | HARRIS     | GARLAND   | 7993.74 |    2 |
    [snip]
|     60 |  4013 | BURMEISTER | HATTIE    | 7973.04 |    4 |
|     60 |  1472 | LADD       | JEFFREY   |  7952.5 |    5 |
+--------+-------+------------+-----------+---------+------+
30 rows in set, 1 warning (0.30 sec)

Same result, except that this time the performance, without being quite as excellent as Oracle's, is very decent. We get a warning, in punishment of our sin of trying to build lists that are more than 1K in length, but otherwise we get precisely what we want, with an SQL usage that is both relatively elegant and efficient. We have in the GROUP_CONCAT()/FIND_IN_SET() combination a kind of MySQL crypto-analytic function that enables us to answer almost all the "what are the top (or bottom) n" questions, as long as n isn't too large, and can be used to emulate several Oracle analytic functions. For instance, Oracle can handle both rank() and dense_rank(); the former focuses on what is ranked, and the latter on how it is ranked. Let's suppose that Tom, Dick, and Harry have obtained the following scores at a game:

Tom

   825

Dick

    825

Harry

  750

Harry's rank() is 3, because rank() focuses on what is ranked, the people, and there are two people who have a better score than Harry. Both Tom and Dick have rank 1. However, dense_rank() focuses on how people are ranked, which is to say scores. And here, Harry's dense_rank() is 2, since there is only one score value higher than 750; "dense"  refers to the fact that there is no gap in rank numbers. If you want to emulate dense_rank(), all you need to do is to add a distinct to the expression in group_concat(). And if you want to emulate row_number(), that assigns a unique number within the group defined by the partition clause, all you need to do is build a list using unique identifiers. In our example, keeping in mind that we accept that we don't need more values than what the longest string returned by group_concat() is able to hold, we can emulate:

   row_number() over (partition by deptno
                      order by hiredate)

by using

    FIND_IN_SET(EMPNO, EMPNOLIST)
with EMPNOLIST built as
    GROUP_CONCAT(EMPNO ORDER BY HIREDATE) as EMPNOLIST

associated with the usual group by DEPTNO. A word of caution though, as with in our example with SUM(), any additional restrictive condition that Oracle would need only once will have to be added to the group by query as well with MySQL.

In my next article, we'll look at another way to do this same analytical query, as well as how to implement the LAG function and how to foresee values in queries.

Stephane Faroult first discovered the relational model and the SQL language in 1983.


Return to ONLamp.com.


-->