It may seem surprising to state it so, but the
FROM clause of
SQL statements seems to be one of the most often misused parts of SQL queries.
Misused? How is that possible? We put into the
FROM clause all the
tables to join together in a query, don't we?
Well, well, well. Not quite. At the risk of sounding pedantic, perhaps a bit of (applied) theory would be welcome.
Relational theory calls relations what most practitioners know as tables--a list of unordered statements (the rows) about the various qualities (the column values) associated with some entities uniquely identified by the primary key. When you join tables, you derive other relations from the known relations, just like you often derive a theorem from other theorems. This is basically what joins are about, returning related data from different sources and creating new relations.
However, when you look closely, you very often realize that the returned
data comes from only some of the tables listed in the
typically something in the guise of:
select distinct a.CUSTOMER_ID, a.CUSTOMER_NAME from CUSTOMERS a, ORDERS b where a.ZIP_CODE in ... and b.ORDERED_DATE >= ... and b.CUSTOMER_ID = a.CUSTOMER_ID order by a.CUSTOMER_NAME
This is a very simple example of a pattern you meet over and over. Isn't this a very fine query, which does just what it asks?
In my view, this is a very bad query; it's logically flawed, which isn't very surprising, and its execution plan, execution time, and various statistics prove that it also performs badly. Poor performance is more often than not the direct consequence of poor logic. The good news is that it's possible to improve the speed of such a slow query significantly. (As anybody who has ever successfully tuned a SQL query can testify, this is an area where significant means an order of magnitude or two, not 20 percent).
Where is the logical flaw? As stated above, this is a typical query in which
all the data comes from a single table:
Interestingly, it returns a
CUSTOMER_ID. There's no prize for
guessing that this is likely to be the primary key, uniquely identifying each
CUSTOMERS. However, we need a
customer may have ordered several times in the period of interest. In other
words, we are fixing with
DISTINCT a problem of our own
This is a "Penelope" type of query. Penelope, as told in Homer's Odyssey, was the wife of Ulysses, king of Ithaca. Hoping for the return of her husband from Troy, she kept suitors at bay by pretending that she would consider them once she completed her weaving--which she kept going by undoing at night what she had done during the day.
Like Penelope, we are doing a lot of work by executing a regular join and bringing back many rows, only to undo a large part of it in order to retrieve the resulting set. (The "best" examples of Penelope queries are usually met with queries involving complex views, however.)
In this case, the
ORDERS table doesn't belong to the
FROM clause. It belongs to the
WHERE clause, because
it appears only for the purpose of filtering on the
column. Since it belongs to the
WHERE clause, it's better as a
subquery--either correlated, which means that it refers to the current
row of the main query and therefore fires each time the check must be
performed, or uncorrelated, which means that it can execute independently of
the main query.
In our simple example, a correlated subquery would be:
select a.CUSTOMER_ID, a.CUSTOMER_NAME from CUSTOMERS a where a.ZIP_CODE in ... and exists (select NULL from ORDERS b where b.CUSTOMER_ID = a.CUSTOMER_ID and b.ORDERED_DATE >= ...) order by a.CUSTOMER_NAME
while an uncorrelated subquery would be:
select a.CUSTOMER_ID, a.CUSTOMER_NAME from CUSTOMERS a where a.ZIP_CODE in ... and a.CUSTOMER_ID in (select b.CUSTOMER_ID from ORDERS b where b.ORDERED_DATE >= ...) order by a.CUSTOMER_NAME
It may seem surprising, but on a very simple example such as this one, the RDBMS optimizer--that part of the code in charge of finding the most efficient way to execute the query--may sometimes perfectly decide to transform a subquery back into a join, for sound reasons that are beyond the scope of this article.
However, when it does this, the optimizer does it on purpose. It should be obvious that the really complicated case involves many tables. Then the complexity of singling out each possible rewriting increases exponentially. As the optimizing phase is part of the overall response time, what the optimizer tries as a fraction of what is possible falls dramatically when possibilities widen.
The result is that the more complicated the query, the higher the risk that the optimizer will take the wrong decision. That is when writing queries in a sensible way becomes all-important.
Because we have several ways to write the filtering subquery, we of course have to consider whether one is significantly better than the other.
Over the past 20 years, many SQL tuning books and articles have advocated the use of
NOT EXISTS over
NOT IN. (Interestingly,
many such elements of popular technical culture have roots in what used to be
true facts long ago but which are not necessarily true anymore.) True or not, this assertion has become in the mind of many a SQL developer "Use
IN." In some cases, this is justifiable,
but it's not a good rule or even good practice.
As its name implies, a correlated subquery (the inner query) depends on an outer query that fires it. How often it fires depends on the circumstances.
The defining characteristic of a correlated subquery is that it refers to at
least one value from the current row of the outer query--
CUSTOMER_ID in our example above. If we have no other filtering
condition, we will have to evaluate the correlated subquery for each row we
inspect during the execution of the outer query. In that case, it will
probably cost much more than the
as a sophisticated optimizer will probably notice. If there are other, simpler
filtering conditions to apply prior to this correlated subquery, we can fire it
only for the candidate rows that have successfully passed all the other
conditions. If those other conditions are highly selective, we may need to
fire the subquery for only a handful of rows more than in our final resulting set,
applying an efficient search criterion (once again, the
CUSTOMER_ID) each time.
The uncorrelated subquery, on the other hand, doesn't refer to the outer
query. As a result, it executes only once. The important question is how many
rows this subquery will return. If it's a (relatively) small number, it is
likely that the engine will use an index (on
CUSTOMERS(CUSTOMER_ID) in our example). It is interesting to point
out that in the correlated subquery case, it would have been
ORDERS(CUSTOMER_ID). If, however, the subquery returns a really
huge number of rows, the engine will use some kind of hash or merge join. This
may prove much more efficient than a correlated subquery if the final, global
result set is large, since a correlated subquery executes at least as
many times as the number of rows returned, and possibly much more if the other
criteria are not very selective.
To summarize, both correlated and uncorrelated subqueries may each in turn prove to be more efficient. The former wins when the final result set is relatively small and the other criteria are efficient. The latter works best when dealing with large tables from which you expect many rows. When in doubt, test with both.
In short, when trying to correct a query that returns duplicate rows,
a quick fix of adding a
DISTINCT is definitely the solution to
avoid. Unfortunately, it is widely practiced. Searching the code for
DISTINCT with joins is a way to improve code performance
Remember that the
FROM clause of a query should refer to only two types of tables:
Any table that does not fall in either of those two categories should appear in a subquery. For instance, given a join such as:
... FROM T1, T2, T3, T4, T5, T6 WHERE ... ...
the various links between the various tables may allow us to draw a diagram such as:
T1 - T2 - T3 -T6 | T4 - T5
Assume that the data returned comes from
T4. Those two tables have their place in the FROM
clause, obviously. So does
T2, since it allows the joining
T4. The "end of link" tables, however--
T5--belong to subqueries.
These simple rules produce effective results when applied intelligently.
Stephane Faroult first discovered the relational model and the SQL language in 1983.
Return to the MySQL DevCenter
Copyright © 2009 O'Reilly Media, Inc.