How to Optimize Rank Data in MySQL
Pages: 1, 2, 3, 4, 5, 6, 7
How Efficient Is This Compared to the Alternatives?
The strategy I just showed you takes one pass through the table. That is, it's an O(n) algorithm in the size of the table (as I showed, the best case is O(1), but the worst case is bounded above by n). This is as good as you can get when you have to calculate a rank for each row.
The alternatives I know of are quadratic, or O(n2) in the size of the table. For example:
-- This query is illegal in MySQL, but even if it weren't,
-- it would take a very long time to run.
update score_ranked as o
set rank_in_game = (
select count(distinct score) from score_ranked as i
where i.score >= o.score and i.game = o.game);
-- This is legal, but creates an un-indexed temp table (even worse!)
-- inside the subquery. Don't try this at home!
update score_ranked as o
set rank_in_game = (
select count(distinct score) from (select * from score_ranked) as i
where i.score >= o.score and i.game = o.game);
-- This is still a cross join because of the <=
update score_ranked as o
inner join (
select l.game, l.score, count(distinct r.score) as rank
from score_ranked as l
left outer join score_ranked as r on l.game = r.game
and l.score <= r.score
group by l.game, l.score
) as i on o.game = i.game and o.score = i.score
set o.rank_in_game = i.rank;
If that were the best you could do in SQL, what are your other options? One is to simulate cursors in a programming language like PHP. This means selecting the entire table and looping through the rows, calculating each row's rank and sending an UPDATE back to the MySQL server. I advise against this. It's not quadratic, but it's death by 100,000 cuts.
Failing that, you have to abandon rank columns, and settle for all the COUNT() queries and limits with offsets. In my experience as a consultant, this is the path most people take, and it doesn't scale. If it did, they wouldn't have called me for help.
Think about it this way: the leaderboards (reads) for Design #1 are O(n) in the size of the table, and writes are O(1). If I eliminate COUNT and LIMIT with OFFSET queries, the reads are O(1) and writes are O(n). If I postulate that there are many more reads than writes, I've probably made the right decision.
Does the Rank Column Really Help?
It's not enough to just rely on gut feeling, look at query plans, or run queries and see how long they take. Profiling and benchmarking are essential to understanding whether "optimizations" actually help. The first thing I always do is profile the queries on an otherwise quiet machine so I can measure how much work they really do. This is why I wrote the MySQL Query Profiler.
I profiled the three sets of queries to test the leaderboards. Because gamer 1010 is fairly high in the rankings in those leaderboards, I also profiled gamer 755, who is very low in the rankings, to get a sense of the two extremes. Here is a sampling of the most relevant results:
| Design | Design #1 | Design #2 | Design #3 | |||
|---|---|---|---|---|---|---|
| Gamer | 1010 | 755 | 1010 | 755 | 1010 | 755 |
| Time | 0.03 | 0.2 | 0.05 | 0.2 | 0.2 | 0.01 |
| Queries | 4 | 4 | 3 | 3 | 2 | 2 |
| Optimizer Cost | 21793 | 21793 | 5177 | 5163 | 2 | 2 |
| Table locks | 5 | 5 | 5 | 5 | 3 | 3 |
| Table scans | 1 | 1 | 2 | 2 | 1 | 0 |
| Index range scans | 2 | 2 | 1 | 1 | 0 | 1 |
| Rows sorted | 320 | 10000 | 40 | 18 | 20 | 8 |
| Rows read | 30943 | 59988 | 20403 | 20051 | 27482 | 39 |
| Fixed reads | 320 | 10000 | 20 | 9 | 20 | 8 |
| Next-row reads | 20002 | 20002 | 42 | 20 | 8728 | 9 |
| Bookmark lookup | 10010 | 10010 | 32 | 22 | 8734 | 14 |
| Full scans initiated | 1 | 1 | 0 | 0 | 0 | |
| Next in index | 610 | 19975 | 20309 | 20000 | 10000 | 8 |
| Temp table inserts | 10000 | 10000 | 40 | 18 | 8727 | 8 |
It looks like some of the queries might not be using indexes the same way on the two gamers, but I don't want to get into early optimization. I just want to look for a general improvement with the ranking columns. The profiling data shows why it's not a good idea to rely only on query plans and execution times; the query times aren't that different, but the other numbers show how different the workload is.


