ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


How to Misuse SQL's FROM Clause

by Stephane Faroult
09/30/2004

FROM what?

It may seem surprising to state it so, but the FROM clause of SQL statements seems to be one of the most often misused parts of SQL queries. Misused? How is that possible? We put into the FROM clause all the tables to join together in a query, don't we?

Well, well, well. Not quite. At the risk of sounding pedantic, perhaps a bit of (applied) theory would be welcome.

How to Misuse the FROM Clause

Relational theory calls relations what most practitioners know as tables--a list of unordered statements (the rows) about the various qualities (the column values) associated with some entities uniquely identified by the primary key. When you join tables, you derive other relations from the known relations, just like you often derive a theorem from other theorems. This is basically what joins are about, returning related data from different sources and creating new relations.

However, when you look closely, you very often realize that the returned data comes from only some of the tables listed in the FROM clause, typically something in the guise of:

select distinct a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a,
     ORDERS    b
where a.ZIP_CODE in ...
  and b.ORDERED_DATE >= ...
  and b.CUSTOMER_ID = a.CUSTOMER_ID
order by a.CUSTOMER_NAME

This is a very simple example of a pattern you meet over and over. Isn't this a very fine query, which does just what it asks?

In my view, this is a very bad query; it's logically flawed, which isn't very surprising, and its execution plan, execution time, and various statistics prove that it also performs badly. Poor performance is more often than not the direct consequence of poor logic. The good news is that it's possible to improve the speed of such a slow query significantly. (As anybody who has ever successfully tuned a SQL query can testify, this is an area where significant means an order of magnitude or two, not 20 percent).

Related Reading

MySQL Reference Manual
Documentation from the source
By Michael "Monty" Widenius, David Axmark, MySQL AB

Where is the logical flaw? As stated above, this is a typical query in which all the data comes from a single table: CUSTOMERS. Interestingly, it returns a CUSTOMER_ID. There's no prize for guessing that this is likely to be the primary key, uniquely identifying each row in CUSTOMERS. However, we need a DISTINCT; one customer may have ordered several times in the period of interest. In other words, we are fixing with DISTINCT a problem of our own design.

This is a "Penelope" type of query. Penelope, as told in Homer's Odyssey, was the wife of Ulysses, king of Ithaca. Hoping for the return of her husband from Troy, she kept suitors at bay by pretending that she would consider them once she completed her weaving--which she kept going by undoing at night what she had done during the day.

Like Penelope, we are doing a lot of work by executing a regular join and bringing back many rows, only to undo a large part of it in order to retrieve the resulting set. (The "best" examples of Penelope queries are usually met with queries involving complex views, however.)

An Improved Query

In this case, the ORDERS table doesn't belong to the FROM clause. It belongs to the WHERE clause, because it appears only for the purpose of filtering on the ORDERED_DATE column. Since it belongs to the WHERE clause, it's better as a subquery--either correlated, which means that it refers to the current row of the main query and therefore fires each time the check must be performed, or uncorrelated, which means that it can execute independently of the main query.

In our simple example, a correlated subquery would be:

select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
  and exists (select NULL
              from ORDERS b
              where b.CUSTOMER_ID = a.CUSTOMER_ID
                and b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME

while an uncorrelated subquery would be:

select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
  and a.CUSTOMER_ID in (select b.CUSTOMER_ID
                        from ORDERS b
                        where b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME

It may seem surprising, but on a very simple example such as this one, the RDBMS optimizer--that part of the code in charge of finding the most efficient way to execute the query--may sometimes perfectly decide to transform a subquery back into a join, for sound reasons that are beyond the scope of this article.

However, when it does this, the optimizer does it on purpose. It should be obvious that the really complicated case involves many tables. Then the complexity of singling out each possible rewriting increases exponentially. As the optimizing phase is part of the overall response time, what the optimizer tries as a fraction of what is possible falls dramatically when possibilities widen.

The result is that the more complicated the query, the higher the risk that the optimizer will take the wrong decision. That is when writing queries in a sensible way becomes all-important.

Correlated or Uncorrelated?

Because we have several ways to write the filtering subquery, we of course have to consider whether one is significantly better than the other.

Over the past 20 years, many SQL tuning books and articles have advocated the use of NOT EXISTS over NOT IN. (Interestingly, many such elements of popular technical culture have roots in what used to be true facts long ago but which are not necessarily true anymore.) True or not, this assertion has become in the mind of many a SQL developer "Use EXISTS, not IN." In some cases, this is justifiable, but it's not a good rule or even good practice.

As its name implies, a correlated subquery (the inner query) depends on an outer query that fires it. How often it fires depends on the circumstances.

The defining characteristic of a correlated subquery is that it refers to at least one value from the current row of the outer query--CUSTOMER_ID in our example above. If we have no other filtering condition, we will have to evaluate the correlated subquery for each row we inspect during the execution of the outer query. In that case, it will probably cost much more than the JOIN and DISTINCT, as a sophisticated optimizer will probably notice. If there are other, simpler filtering conditions to apply prior to this correlated subquery, we can fire it only for the candidate rows that have successfully passed all the other conditions. If those other conditions are highly selective, we may need to fire the subquery for only a handful of rows more than in our final resulting set, applying an efficient search criterion (once again, the CUSTOMER_ID) each time.

The uncorrelated subquery, on the other hand, doesn't refer to the outer query. As a result, it executes only once. The important question is how many rows this subquery will return. If it's a (relatively) small number, it is likely that the engine will use an index (on CUSTOMERS(CUSTOMER_ID) in our example). It is interesting to point out that in the correlated subquery case, it would have been ORDERS(CUSTOMER_ID). If, however, the subquery returns a really huge number of rows, the engine will use some kind of hash or merge join. This may prove much more efficient than a correlated subquery if the final, global result set is large, since a correlated subquery executes at least as many times as the number of rows returned, and possibly much more if the other criteria are not very selective.

To summarize, both correlated and uncorrelated subqueries may each in turn prove to be more efficient. The former wins when the final result set is relatively small and the other criteria are efficient. The latter works best when dealing with large tables from which you expect many rows. When in doubt, test with both.

Conclusion

In short, when trying to correct a query that returns duplicate rows, a quick fix of adding a DISTINCT is definitely the solution to avoid. Unfortunately, it is widely practiced. Searching the code for DISTINCT with joins is a way to improve code performance easily.

Remember that the FROM clause of a query should refer to only two types of tables:

Any table that does not fall in either of those two categories should appear in a subquery. For instance, given a join such as:

...
FROM T1, T2, T3, T4, T5, T6
WHERE ...
...

the various links between the various tables may allow us to draw a diagram such as:

T1 - T2 - T3 -T6
      |
     T4 - T5

Assume that the data returned comes from T1 and T4. Those two tables have their place in the FROM clause, obviously. So does T2, since it allows the joining T1 to T4. The "end of link" tables, however--T3, T6, and T5--belong to subqueries.

These simple rules produce effective results when applied intelligently.

Stephane Faroult first discovered the relational model and the SQL language in 1983.


Return to the MySQL DevCenter

Copyright © 2009 O'Reilly Media, Inc.