In a recent dialogue on the firstname.lastname@example.org mail group, I ran into a good prototype for a recurring class of SQL tuning problems. The coder had two versions of SQL getting dramatically different performance levels for a very frequent single-row query, and wanted to know the reason for the difference, citing logical and physical I/O figures for each alternative. To make a long story short, neither version was near optimal, and the optimal solution came from a strategy I've applied to a recurring class of problems, which I describe below. First, I present the original problem, summarized with his permission from the note by Mark Bobak (thanks, Mark!):
There are two tables involved:
ADDS_USERS, which has
AUSR_ID as the primary key, and
The original query was:
SELECT A.AUSR_LOGIN_SCREEN_NAME FROM ADDS_USERS A, AUTHORIZED_IP_ADDRESSES B WHERE A.AUSR_ID = B.AUSR_ID AND :B1 BETWEEN B.AIA_IP_ADDRESS_START AND B.AIA_IP_ADDRESS_END;
Autotrace apparently misrepresented the execution plan, which was the original source of confusion, and the reason for the original note. The trace statistics for the query were correct, however, and are edited here just to show the important statistics:
Statistics ---------------------------------------------------------- 1321 consistent gets 864 physical reads 1 rows processed
1321 buffer gets were too many, not to mention the physical I/O, considering the frequency with which this was to be executed.
Therefore, Mark tried re-creating
ADDS_USERS ordered by
AUSR_ID, to improve
AUSR_PK index clustering factor. This didn't help. Creating
AUTHORIZED_IP_ADDRESSES as an index-organized table was also useless.
So Mark went back to SQL hacking, and finally came up with the following:
SELECT (select A.AUSR_LOGIN_SCREEN_NAME FROM ADDS_USERS A where a.ausr_id = b.ausr_id) from AUTHORIZED_IP_ADDRESSES B WHERE :B1 BETWEEN B.AIA_IP_ADDRESS_START AND B.AIA_IP_ADDRESS_END;
which produced the (accurate) plan:
Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=110 Bytes=3080) 1 0 TABLE ACCESS (BY INDEX ROWID) OF 'ADDS_USERS' (Cost=2 Card=1 Bytes=14) 2 1 INDEX (UNIQUE SCAN) OF 'AUSR_PK' (UNIQUE) (Cost=1 Card=1) 3 0 INDEX (RANGE SCAN) OF 'AIA_INDX_PR01' (NON-UNIQUE) (Cost=4 Card=3D110 Bytes=3D3080)
Statistics were much improved:
Statistics ---------------------------------------------------------- 67 consistent gets 0 physical reads 1 rows processed
Consistent gets were down to 67 from 1321, but I noticed that neither plan delivered a path that maximized efficiency for a single-row query. Here is the approach I recommend for this class of problems.
A class of tables I'll call range-keyed tables are laid out with one row per range of values, where the ranges of values are non-overlapping and (usually) inclusive, meaning they leave no "holes" in the midst of the ranges if the ranges are laid out end-to-end, so to speak. (These ranges must not overlap even at the endpoints, unless one end of the range is defined as non-inclusive.) The most common examples of this involve date ranges, although this particular example came from a case of IP-address ranges. Thus, to point to a row, you say:
:bindvariable between lowcol and highcol
or (where one range's
highcol equals the next range's
(:bindvariable > lowcol and : bindvariable <= highcol).
I'll show this another way.
highcol must reflect values belonging to some order-able set, some set capable of being mapped to points on a line, such as a number line or a timeline. Taking the case where one range's
highcol value is the next range's
lowcol value, the ranges would look something like Figure 1.
Figure 1. Ranges represented by rows 1 through 4
Most commonly, the designer chooses to create an index on (
lowcol, highcol), but sometimes the order is reversed, or just one of these is indexed.
The problem is that any such index strategy means that the index range scan to find the single range that meets the query condition is likely to have to read about half of the index, if you ask for a range in the middle of the whole set of ranges. (With date ranges, you can often escape most of the dilemma because applications tend to ask for the most-recent range, or one of the most-recent ranges, anyway, so an index leading with
highcol will see a very short range scan, usually. If, by contrast, you index (
Low_Date, High_Date), in that order, you will guarantee that most index range scans will read the entire range, since every range will have
Low_Date lower than the recent date inside the most-recent date range.)
E. F. Codd discussed (in The Relational Model for Database Management, Version 2, Addison Wesley) a useful set of comparators (
least-greater-than-or-equal-to) that turn out to be ideal for this problem. With these available, you'd just ask for an index range scan with:
highcol least-greater-than-or-equal-to :bindvariable
and the database would find the very first (least) value of
highcol >= :bindvariable in an index range scan on a conventional index of
highcol, and would then quit, because that's the row you want. The cost, if you look at the I/O, would be no different than an index unique scan! The database design for inclusive, non-overlapping ranges would no longer even require
lowcol would be implied by the previous range's
highcol, and the database would automatically avoid potentially thorny corrupt-data issues with accidentally overlapping ranges!
Sadly, I've never found a database that has implemented this lovely concept. Instead, the best I've found is the following admittedly hacky solution I recommend for use on Oracle:
(Forewarning: the following solution may offend the relational purists among you. Yes, I know it's a hack.) Let's say the index on
highcol is called
highcolind. If you want to quickly find the row you're after, the query you need is:
SELECT /*+ INDEX(t highcolind) */ <whatever> FROM rangetable t WHERE highcol >= :bindvariable AND lowcol <= :bindvariable AND ROWNUM = 1
Here, the database won't keep searching for more ranges that might enclose
:bindvariable once it finds the first one, and the range scan on
highcolind finds the first one in the first index entry it sees, as long as the database follows the hint! (If it fails to follow the hint, performance will suffer, clearly, but the database will still return the right row, since it will keep looking until it satisfies both conditions on
:bindvariable. This will not find multiple ranges that include
:bindvariable, though, if the table contains overlapping ranges--I have no easy answer to find multiple ranges if your ranges overlap. If your ranges have holes, it will also have to do the entire range scan from
:bindvariable to the end of the range before it will conclude that no range contains the particular value of
:bindvariable you're looking with--that's why having no holes in the ranges helps.)
To build this into a more-complex query, you'd do something like:
SELECT /*+ ORDERED <maybe other hints> */ <whatever> FROM (SELECT /*+ INDEX(t highcolind) */ * FROM rangetable t WHERE highcol >= :bindvariable AND lowcol <= :bindvariable AND ROWNUM = 1) t, taba a, tabb b, ... WHERE t.fkey = a.pkey ...
If you use this strategy, be sure to watch the endpoint cases. Mark's database design had endpoints that were inclusive, so he used
BETWEEN to find the single range desired, and (presumably) the application guaranteed that the next range's value of
AIA_IP_ADDRESS_START was the next legal address after the previous range's
AIA_IP_ADDRESS_END. If these were Oracle
DATE-type columns, the
Low_Date for the next range would be one second after the
High_Date for the previous range. If, on the other hand, the
lowcol value for one range is equal to the
highcol value of the previous range (as shown in Figure 1), then you must not use
BETWEEN, and must use
> for the comparison with the endpoint that is non-inclusive. For example, if
High_Date is the last included time for one range, and
Low_Date for the next range equals
High_Date for the previous range, then the range condition (which is non-inclusive on the low end) will look like this:
WHERE High_Date >= :bindvariable AND Low_Date < :bindvariable
In Mark's particular case, a new version of the query would be:
SELECT /*+ ORDERED USE_NL(A) */ A.AUSR_LOGIN_SCREEN_NAME FROM (SELECT /*+ INDEX(B AIA_IP_ADDRESS_END_INDEX) */ * FROM AUTHORIZED_IP_ADDRESSES B WHERE :B1 BETWEEN B.AIA_IP_ADDRESS_START AND B.AIA_IP_ADDRESS_END AND ROWNUM = 1) IV, ADDS_USERS A WHERE A.AUSR_ID = IV.AUSR_ID;
which joins a single row from the range-keyed table to the users table. When Mark implemented this strategy, the fixed version of the query required just five logical I/Os, while the better version of the earlier two versions required 67 logical I/Os. More importantly, the fixed version read just a single row within each of its five logical I/Os, while most of the 67 logical I/Os of the better of the earlier queries were expensive range reads in index leaf blocks, which covered every row entry in the leaf block, burning far more CPU than a single-row read in a logical I/O.
Although I created this trick for use on Oracle, it should work on MySQL; use the clause
“LIMIT 1” at the end of the query (or at the subquery inside the FROM clause), in place of the condition
“ROWNUM=1”, and use the hint
USE INDEX(<Keylist>) (or FORCE INDEX(<Keylist>) ) at the end of the table-reference, and the hint
/*! STRAIGHT JOIN */ to force the join order (in place of
ORDERED), if necessary. Thus, the generic case (for MySQL versions 4.1+) above becomes:
SELECT /*! STRAIGHT JOIN <maybe other hints> */ <whatever> FROM (SELECT * FROM rangetable t FORCE INDEX(highcol) WHERE highcol >= <bindvariable> AND lowcol <= <bindvariable> LIMIT 1) t INNER JOIN taba a ON t.fkey = a.pkey INNER JOIN tabb b ... WHERE ...
Similar tricks are likely possible in other open databases, as long as there is a way to force a join order and to stop a query, especially a subquery in a
FROM clause, at the first row. I’d like to invite the readers to submit the equivalent versions for PostgreSQL, DB2, Firebird, and others.
Dan Tow is an independent consultant, operating under the banner SingingSQL (www.singingsql.com). His experience solving Oracle-related performance problems goes all the way back to his 1989 hire by Oracle Corporation. He has a Ph.D. in chemical engineering from the University of Wisconsin at Madison.
Return to the Linux DevCenter.
Copyright © 2009 O'Reilly Media, Inc.