Wrong-Errors Bugs: A New Class of Bug?by Dan Tow, author of SQL Tuning
Above all else, we count on databases to reflect the truth consistently, or at least to reflect the table data perfectly. The database cannot be blamed when an application (or the end users of an application) place inaccurate data in its tables, but a database must accurately report the data it holds. Therefore, bugs are not all created equal; there are bugs, and there are wrong-rows bugs, bugs that silently misrepresent the data that the tables hold. Even the craziest, most obscure corner case that potentially misrepresents your data should rightly bring a loud chorus: "The emperor has no clothes!" We depend on the database, above all, not to lie.
Unnecessary errors are comparatively benign. If the database returns an unnecessary error, at least we haven't been lied to, and we know that we need to take corrective action. A wrong-rows bug might lead to months, even years, of expensive business mistakes before anyone notices. The database vendors rightly attach so much importance to fixing wrong-rows bugs that when a query returns rows other than those needed by the application, it almost always indicates a mistake in the application SQL, not in the RDBMS vendor code!
Setting aside the wrong-rows bugs, most ordinary RDBMS bugs consist of the database returning an unnecessary error; for example, disconnecting your session without good cause. Not all errors come from RDBMS bugs, however. For example, if you select an expression that calls for division by zero, the database is supposed to return an error.
A New Class of Bug
In this article, I propose the recognition of a new class of bug, a class that
is not generally considered a bug at all. Specifically, I propose that errors
such as attempted conversions of unconvertible values or division-by-zero should
appear only when absolutely necessary, when any execution plan conceivable would
encounter the error. An SQL statement that returned an unnecessary error (an
error that would not result from every conceivable path to the data) would be
guilty of this new class of bug, a wrong-errors bug. A common means to demonstrate
a wrong-rows bug is simply to show that the same query returns different rows
depending on which execution plan it follows. Except for unusual, non-relational
operators such as the Oracle
ROWNUM operator, we recognize that however much
developers might argue what rows a query should return, there must surely be
just one correct result, not one result when the database follows one path,
and another result when it follows another path! I propose that function errors,
such as attempted conversions of unconvertible values or division-by-zero, ought
to be equally consistent! Furthermore, I propose that these wrong-errors bugs
are nearly as serious as wrong-rows bugs, even though the current standards
do not treat them as bugs at all! I have two lines of argument for why these
ought to be considered bugs, and why they are so serious:
The idealistic argument: Errors are just as important a part of the result of a query as the rows returned. Whether a query returns an error should be just as well-defined by the SQL standard as which rows the query returns, and just as independent of the path the optimizer chooses to reach the data!
The pragmatic argument: Execution plans change all the time, unpredictably, and it is very costly to have a mission-critical process suddenly, mysteriously error out three hours before your quarter close! It's extraordinarily hard to test for a problem that might only occur long after release, when some customer encounters a data distribution that leads to an execution plan that reaches a row that triggers an error.
Since the behavior I propose is new and subtly inconsistent with past accepted database behavior, instantly classifying the old behavior as buggy may be asking a lot. I concede that, at best, we'll need a clean migration path, perhaps a SQL standard calling for the new behavior, and configuration parameters that would allow developers to choose to keep the old behavior if they felt it important. However, I believe that our past acceptance of the old behavior was more historical accident than anything else, and that if we'd begun with the database behavior I propose, we'd view failure to follow that behavior as a serious bug.
Here's a manufactured example of the problem, on Oracle, to make this concrete:
SELECT O.Status, O.Order_Date, O.Order_ID, O.Customer_ID, E.Last_Name, E.First_Name, E.Phone_Extension FROM Orders O, Ext_Expediters E WHERE (O.Customer_ID=:1 OR :1 IS NULL) AND O.Char_Extension_Col01=E.Numerical_Expediter_ID AND (O.Expediter_ID=:2 OR :2 IS NULL) ORDER BY E.Last_Name, E.First_Name, E.Phone_Extension, O.Status, O.Order_Date
This is a custom query joining a packaged
Orders table to a custom table created
by the IT department to add expediting functionality to the packaged application.
The optimal order of the join would depend on which of the two bind variables,
:2, was provided with a non-null value. Given
different data distributions and bind-variable choices, the optimizer might,
or might not, choose a hash join that would read all rows in
including rows for which an implicitly added function, such as:
used to make the join type-consistent, would encounter a character value that
could not convert to a number, triggering an error. We want the optimizer to
have the freedom to choose the fastest join order and join method, and even
to change that choice as the data distributions and bind variables change from
one execution to the next. However, we also wish to avoid the risk that a change
to the execution plan, or a single row of bad data (bad data that does not even
WHERE-clause conditions), leads this query to error out at a horribly
inconvenient time, in production, when it had always worked in testing.
Eliminating Unnecessary Errors
Here's a breakdown of unnecessary errors, from the easiest to eliminate, to the hardest:
Errors outside of the
WHEREclause: If you have error-prone expressions irrelevant to determining the rows returned, for example in the
SELECTclause, error-avoidance is trivial--just postpone the evaluation of the expressions until all joins and other conditions evaluate to
TRUE. Since the rows returned are completely determined by the SQL, independent of the execution plan, whatever errors result at the final stage of the execution plan are necessary errors.
Errors in single-table filter conditions: If a single-table filter in a
WHEREclause (for example,
TO_NUMBER(A.Char_Col) = 1) yields an error during evaluation, one alternative would be to simply mark the row as potentially an error, and only return the error if the row survives the rest of the conditions in the query. I'll show later that there are better alternatives, though.
Errors in join clauses: If a join condition (for example,
TO_NUMBER(A.Char_Col)=B.Num_Col) yields an error, it can become impossible to proceed with the rest of the execution plan without assigning some unambiguous truth value to the condition, since the remaining steps in the plan likely require values from the joined-to table columns. This class of error is the hardest to handle without compromise to the stated goal that error behavior must be independent of the chosen execution plan. I will focus on this case, and the rest will follow.
Consider, then, a simple-looking join, between dissimilar column types:
I hear the groans, already--no foreign key should ever be deliberately configured
with a type inconsistent with the primary key it points to, and a join such
as this often points to a serious database design error. This is not a discussion
of the ideal world, though--I'm talking about the real world. In the real world,
we must cope with outright design errors. Certainly, we would never forgive
a wrong-rows bug just because it happened only against flawed database design,
and flawed database design does not excuse wrong-errors bugs, either. There
are even comparatively benign cases where these joins come up: tables in packaged
applications often have some extra flexibility built in in the form of generic
character-type columns (
Attribute1, Attribute2, ... , for example).
Character-type is the natural choice for these columns, since you can convert character strings to numbers and dates, but numbers and dates are not so flexible. When the customer adds custom functionality to such generic applications, it is natural to make use of these generic columns as foreign keys pointing to custom-table primary keys with dissimilar types--few applications are designed top-down as a unified whole.
Simple-looking mixed-type joins like this turn out to be far more problematic
than they appear. What does it even mean to say two columns of different types
match? Vendors disagree: Oracle always converts the character-type column, with
TO_NUMBER(), to match a number, or a default-format
TO_DATE() to match a date. DB2, on the other hand, simply returns
an error, refusing to even attempt the match unless the developer makes the
conversion explicit on one side or the other. (This has the virtue of forcing
the developer to think about the issue, and possibly even to correct a functional
or design bug revealed by the type inconsistency, but it is inconvenient when
the issue requires no special work.) The type conversion that enables such a
join to proceed, whether it is implicitly added under the covers or explicitly
added by the developer, generally only allows an efficient nested-loops plan
to follow the join from the converted side to the unconverted side. If we enabled
conversion on either side of the join, the optimizer would have a much better
chance to find a fast path to the data. Unfortunately, this surfaces a subtle
wrong-rows issue: The rows returned by
won't even necessarily match, so one of the sets of rows must be wrong! The
problem is that
TO_CHAR(TO_NUMBER(C)) doesn't necessarily equal
C. For example:
The same can even happen with character strings reflecting dates. For example, with Oracle's default U.S. date format:
So, granting that the two alternative forms,
TypeA=ConvertBToA(TypeB), are not functionally interchangeable, which is correct
for a given query? Oracle assumes it should convert the character-type, unless
the SQL explicitly converts the other side. DB2 refuses to assume one side or
the other is correct and requires an explicit conversion in the SQL. Both behaviors
are consistent, and avoid returned-rows results that change depending on the
chosen execution plan. Consistency does not guarantee correctness, though. What
are the odds that the developer writing this SQL has actually thought through
which version of the join handles all of the corner cases correctly? Even if the
original developer thought it through, will every developer who maintains the
SQL also know to handle these cases correctly? (Failure to think through these
details is especially likely where the developer has not even made the conversion
explicit, as Oracle allows!)
Pages: 1, 2