Published on ONLamp.com (http://www.onlamp.com/)


Wrong-Errors Bugs: A New Class of Bug?

by Dan Tow, author of SQL Tuning
09/02/2004

Above all else, we count on databases to reflect the truth consistently--or at least to reflect their table data perfectly. The database cannot be blamed when an application (or the end users of an application) places inaccurate data in its tables, but a database must accurately report the data it holds. Therefore, bugs are not all created equal; there are bugs, and there are wrong-rows bugs--bugs that silently misrepresent the data that the tables hold. Even the craziest, most obscure corner case that potentially misrepresents your data should rightly bring a loud chorus: "The emperor has no clothes!" We depend on the database, above all, not to lie.

Unnecessary errors are comparatively benign. If the database returns an unnecessary error, at least we haven't been lied to, and we know that we need to take corrective action. A wrong-rows bug might lead to months, even years, of expensive business mistakes before anyone notices. The database vendors rightly attach so much importance to fixing wrong-rows bugs that, in practice, when a query returns rows other than those the application needs, it almost always indicates a mistake in the application SQL, not in the RDBMS vendor code!

Setting aside the wrong-rows bugs, most ordinary RDBMS bugs consist of the database returning an unnecessary error; for example, disconnecting your session without good cause. Not all errors come from RDBMS bugs, however. For example, if you select an expression that calls for division by zero, the database is supposed to return an error.
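
On Oracle, for instance, this error is mandatory no matter what path the database takes to the data:

SELECT 1/0 FROM DUAL;
-- ORA-01476: divisor is equal to zero
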

A New Class of Bug

In this article, I propose the recognition of a new class of bug, a class that is not generally considered a bug at all. Specifically, I propose that errors such as attempted conversions of unconvertible values or division-by-zero should appear only when absolutely necessary--when any conceivable execution plan would encounter the error. An SQL statement that returned an unnecessary error (an error that would not result from every conceivable path to the data) would be guilty of this new class of bug: a wrong-errors bug.

A common way to demonstrate a wrong-rows bug is simply to show that the same query returns different rows depending on which execution plan it follows. Setting aside unusual, non-relational operators such as Oracle's ROWNUM, we recognize that however much developers might argue about which rows a query should return, there must surely be just one correct result--not one result when the database follows one path, and another when it follows another! I propose that function errors, such as attempted conversions of unconvertible values or division-by-zero, ought to be equally consistent. Furthermore, I propose that these wrong-errors bugs are nearly as serious as wrong-rows bugs, even though the current standards do not treat them as bugs at all! I have two lines of argument for why these ought to be considered bugs, and why they are so serious:

  1. The idealistic argument: Errors are just as important a part of the result of a query as the rows returned. Whether a query returns an error should be just as well-defined by the SQL standard as which rows the query returns, and just as independent of the path the optimizer chooses to reach the data!

  2. The pragmatic argument: Execution plans change all the time, unpredictably, and it is very costly to have a mission-critical process suddenly, mysteriously error out three hours before your quarter close! It's extraordinarily hard to test for a problem that might only occur long after release, when some customer encounters a data distribution that leads to an execution plan that reaches a row that triggers an error.

Since the behavior I propose is new and subtly inconsistent with past accepted database behavior, instantly classifying the old behavior as buggy may be asking a lot. I concede that, at best, we'll need a clean migration path, perhaps a SQL standard calling for the new behavior, and configuration parameters that would allow developers to choose to keep the old behavior if they felt it important. However, I believe that our past acceptance of the old behavior was more historical accident than anything else, and that if we'd begun with the database behavior I propose, we'd view failure to follow that behavior as a serious bug.

An Example

Here's a manufactured example of the problem, on Oracle, to make this concrete:


SELECT O.Status, O.Order_Date, O.Order_ID, O.Customer_ID,
       E.Last_Name, E.First_Name, E.Phone_Extension
FROM Orders O, Ext_Expediters E
WHERE (O.Customer_ID = :1 OR :1 IS NULL)
  AND O.Char_Extension_Col01 = E.Numerical_Expediter_ID
  AND (O.Expediter_ID = :2 OR :2 IS NULL)
ORDER BY E.Last_Name, E.First_Name, E.Phone_Extension, O.Status, O.Order_Date

This is a custom query joining a packaged Orders table to a custom table created by the IT department to add expediting functionality to the packaged application. The optimal order of the join would depend on which of the two bind variables, :1 or :2, was provided with a non-null value. Given different data distributions and bind-variable choices, the optimizer might, or might not, choose a hash join that would read all rows in Orders, potentially including rows for which an implicitly added function, such as:


TO_NUMBER(O.Char_Extension_Col01)

used to make the join type-consistent, would encounter a character value that could not convert to a number, triggering an error. We want the optimizer to have the freedom to choose the fastest join order and join method, and even to change that choice as the data distributions and bind variables change from one execution to the next. However, we also wish to avoid the risk that a change to the execution plan, or a single row of bad data (bad data that does not even meet the WHERE-clause conditions), leads this query to error out at a horribly inconvenient time, in production, when it had always worked in testing.
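
To make the hazard concrete, imagine a single hypothetical bad row in Orders (the column names come from the query above; ORA-01722 is Oracle's invalid-number error):

INSERT INTO Orders (Order_ID, Customer_ID, Char_Extension_Col01)
VALUES (1001, 42, 'RUSH-JONES');  -- a name, not a number, in the generic column

-- A nested-loops plan driven from Ext_Expediters might never touch this row.
-- A hash join that scans all of Orders evaluates TO_NUMBER('RUSH-JONES')
-- for every row it reads, and fails:
-- ORA-01722: invalid number
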

Eliminating Unnecessary Errors

Here's a breakdown of unnecessary errors, from the easiest to eliminate, to the hardest:

  1. Errors outside of the WHERE clause: If you have error-prone expressions irrelevant to determining the rows returned, for example in the SELECT clause, error-avoidance is trivial--just postpone the evaluation of those expressions until all joins and other conditions evaluate to TRUE (see the sketch following this list). Since the rows returned are completely determined by the SQL, independent of the execution plan, whatever errors remain at the final stage of the execution plan are necessary errors.

  2. Errors in single-table filter conditions: If a single-table filter in a WHERE clause (for example, TO_NUMBER(A.Char_Col) = 1) yields an error during evaluation, one alternative would be to simply mark the row as potentially an error, and only return the error if the row survives the rest of the conditions in the query. I'll show later that there are better alternatives, though.

  3. Errors in join clauses: If a join condition (for example, TO_NUMBER(A.Char_Col)=B.Num_Col) yields an error, it can become impossible to proceed with the rest of the execution plan without assigning some unambiguous truth value to the condition, since the remaining steps in the plan likely require values from the joined-to table columns. This class of error is the hardest to handle without compromise to the stated goal that error behavior must be independent of the chosen execution plan. I will focus on this case, and the rest will follow.
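
Here is a minimal sketch of case 1 (Total_Amount and Line_Count are hypothetical columns):

SELECT Order_ID, Total_Amount/Line_Count Avg_Line_Amount
FROM Orders
WHERE Status = 'OPEN';

-- If every order with Line_Count = 0 happens to fail the Status filter, this
-- query need never error: the division belongs only to rows that survive the
-- WHERE clause, so evaluating it last makes the error behavior independent
-- of the execution plan.
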

Consider, then, a simple-looking join, between dissimilar column types:


A.Char_Col=B.Num_Col

I hear the groans already--no foreign key should ever be deliberately configured with a type inconsistent with the primary key it points to, and a join such as this often points to a serious database design error. This is not a discussion of the ideal world, though--I'm talking about the real world, where we must cope with outright design errors. Certainly, we would never forgive a wrong-rows bug just because it happened only against flawed database design, and flawed database design does not excuse wrong-errors bugs, either. There are even comparatively benign cases where these joins come up: tables in packaged applications often have some extra flexibility built in, in the form of generic character-type columns (Attribute1, Attribute2, ..., for example).

Character-type is the natural choice for these columns, since you can convert character strings to numbers and dates, but numbers and dates are not so flexible. When the customer adds custom functionality to such generic applications, it is natural to make use of these generic columns as foreign keys pointing to custom-table primary keys with dissimilar types--few applications are designed top-down as a unified whole.

Simple-looking mixed-type joins like this turn out to be far more problematic than they appear. What does it even mean to say two columns of different types match? Vendors disagree: Oracle always converts the character-type column, with a default-format TO_NUMBER() to match a number, or a default-format TO_DATE() to match a date. DB2, on the other hand, simply returns an error, refusing to even attempt the match unless the developer makes the conversion explicit on one side or the other. (This has the virtue of forcing the developer to think about the issue, and possibly even to correct a functional or design bug revealed by the type inconsistency, but it is inconvenient when the issue requires no special work.)

The type conversion that enables such a join to proceed, whether added implicitly under the covers or explicitly by the developer, generally allows an efficient nested-loops plan in only one direction: from the converted side to the unconverted side. If conversion were enabled on either side of the join, the optimizer would have a much better chance of finding a fast path to the data. Unfortunately, this surfaces a subtle wrong-rows issue: the rows returned by


TO_NUMBER(A.Char_Col)=B.Num_Col

and by


A.Char_Col=TO_CHAR(B.Num_Col)

won't even necessarily match, so one of the sets of rows must be wrong! The problem is that TO_CHAR(TO_NUMBER(C)) doesn't necessarily equal C. For example:


TO_CHAR(TO_NUMBER('037'))='37'

The same can even happen with character strings reflecting dates. For example, with Oracle's default U.S. date format:


TO_CHAR(TO_DATE('25-aug-03'))='25-AUG-03'
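
You can check the numeric divergence for yourself on Oracle's DUAL table:

SELECT 'match' FROM DUAL WHERE TO_NUMBER('037') = 37;
-- one row returned

SELECT 'match' FROM DUAL WHERE '037' = TO_CHAR(37);
-- no rows selected
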

So, granting that the two alternative forms, ConvertAToB(TypeA)=TypeB and TypeA=ConvertBToA(TypeB), are not functionally interchangeable, which is correct for a given query? Oracle assumes it should convert the character-type side, unless the SQL explicitly converts the other side. DB2 refuses to assume that either side is correct and requires an explicit conversion in the SQL. Both behaviors are consistent, and both avoid returned-rows results that change with the chosen execution plan. Consistency does not guarantee correctness, though. What are the odds that the developer writing this SQL has actually thought through which version of the join handles all of the corner cases correctly? Even if the original developer thought it through, will every developer who maintains the SQL also know to handle these cases correctly? (Failure to think through these details is especially likely where the developer has not even made the conversion explicit, as Oracle allows!)

Inequalities are a special case:


A.Char_Col>B.Num_Col

means something completely different depending on whether you convert the number side or the character side, since character strings and their equivalent numbers sort completely differently. In this case, Oracle's default is probably the only way to go--if you are comparing a character value with either a number or a date in an inequality, you almost surely want the more meaningful non-character sort order, and the database should convert the character side, unless the developer makes the conversion explicit on the other side, or the character column stores values in a format guaranteed to sort the same way as the date or the number.
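
To see how different the two interpretations are, another quick check on Oracle's DUAL table:

SELECT 'character order' FROM DUAL WHERE '9' > '10';
-- one row returned: as strings, '9' sorts after '10'

SELECT 'numeric order' FROM DUAL WHERE 9 > 10;
-- no rows selected: as numbers, 9 < 10
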

If I put together a wish list for the perfect way to handle mixed-type joins on equalities, it would have several requirements:

  1. The join should be able to follow nested loops to a simple index on either side of the join, to choose the join order for best performance, even when the conversion is explicit on one (the wrong) side.

  2. The chosen conversion should never result in conversion errors where errors could have been avoided by converting the other side of the join.

  3. The rows returned by a plan that converts one side of the join must automatically match the rows returned by a plan that converts the other side of the join, without the developer having to think about the problem.

With a solution that delivers all three of these requirements, we can have our cake and eat it, too! Constraints turn out to be the answer.

If a character column always stored a number or a date in the same format, and all applicable character strings would successfully convert to the required type using that format, then conversions in both directions would produce the same results, and would avoid errors altogether. Given constraints that enforced such consistent formats, developers could safely ignore these subtleties. (As opposed to the current practice of ignoring them at their peril!) The simplest case would, for example, constrain the values of character-type Attribute1 to strings of digits (only) without left-hand zeros, strings where TO_CHAR(TO_NUMBER(Attribute1))=Attribute1, without a conversion error along the way during the evaluation of TO_NUMBER(Attribute1).

The constraints will often be more complex than this, however. Often, a table stores a mix of subtypes, and only some of those subtypes use the character column to store a numerical foreign key, while others store unconvertible strings (names, for example). In such cases, the constraint must apply only to the subtypes where the character-type column stores numbers, and those subtypes should be identified explicitly (with a column named Type, for example) or, if necessary, with some more complex expression on some combination of table columns.
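
Such a subtype-specific constraint might look like the following sketch (Type is the hypothetical subtype column, and the digits-only test anticipates the kludge detailed later in this article):

ALTER TABLE Orders ADD CONSTRAINT Ext_Col01_Numeric_For_Expedited CHECK (
    Type != 'EXPEDITED'                                      -- constrain only this subtype
    OR Char_Extension_Col01 IS NULL
    OR (LTRIM(Char_Extension_Col01, '0123456789') IS NULL    -- digits only
        AND SUBSTR(Char_Extension_Col01, 1, 1) != '0'));     -- no left-hand zero
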

To take advantage of these subtype-specific constraints, developers will need to restrict on the subtype in the SQL, but they should be doing that anyway--if the query joins the character column to a number, the intent is surely to join only to those subtypes. You wouldn't want to join to the wrong subtype just because a couple of rows in that subtype happened, more or less accidentally, to hold character strings that convert to valid numbers.

Even with current limitations in databases' handling of errors, these constraints are a good idea--there is surely no reason to store numbers as characters in inconsistent formats, and it is surely a bad idea to allow unconvertible strings to be stored where you expect a string to convert to a number. If you currently have rows that violate such constraints, it is surely a good idea to find them and correct them. Unfortunately, most likely no individual knows every case where such constraints should be created on a complex legacy database, nor will you likely find this documented. Currently, we find these errors, at best (assuming super-diligent follow-up on errors), when the application happens to encounter a bad row and return an error, an expensive and unreliable way to uncover the vulnerability.
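
Before creating such a constraint on a legacy table, you can hunt down the rows that would violate it with a query like this (again with the hypothetical names from the sketch above):

SELECT Order_ID, Char_Extension_Col01
FROM Orders
WHERE Type = 'EXPEDITED'
  AND Char_Extension_Col01 IS NOT NULL
  AND (LTRIM(Char_Extension_Col01, '0123456789') IS NOT NULL  -- non-digit characters
       OR SUBSTR(Char_Extension_Col01, 1, 1) = '0');          -- or a left-hand zero
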

I propose an error-safe mode for SQL connections: in error-safe mode, conversions of columns (whether implicit or explicit) would not even be allowed in the WHERE clause unless a constraint already exists that guarantees the success of the conversion, and guarantees that the seemingly equivalent conversion on the other side of the equality really is perfectly equivalent. (For inequalities, the constraint would guarantee success of the conversion from character to number or date, but would not allow implicit conversions of the other side where that would result in a different set of rows owing to a non-equivalent sort.) Developers would work with this mode enforced (via a new init.ora parameter in Oracle, for example) while developing or enhancing the application, and many errors would surface even when testing against toy data volumes that would never yield errors without the parameter.

However, each of these errors would point to a potential corner-case bug that would otherwise be horribly hard to find, and to a fairly simple constraint that would forever prevent that bug. Even on legacy production systems that mostly prevent changes by the customer, the customer could safely add these constraints, cleaning up bad data uncovered by this process, as needed, to meet the constraints. (You would not believe the number of February 30ths, April 31sts, and non-leap-year February 29ths I once found in a legacy-database character column! Cleaning these up can only be a good thing.)

Whenever the database found a mixed-type join, then, whether the conversions were explicit or implicit, it would find in the constraint definition a declared format in which to expect the character-stored numbers or dates, and it could safely convert either side of the comparison and join in either direction, with nested loops where these are optimal.

You can already kludge together useful type-format constraints with existing RDBMS functions. For example, in Oracle, you can verify that a character string forms a simple positive integer (or is null) with the pair of conditions:


(LTRIM(Char_Col, '0123456789') IS NULL         -- nothing remains after trimming digits
 AND NVL(SUBSTR(Char_Col, 1, 1), '1') != '0')  -- and there is no left-hand zero

If we want to see these constraints everywhere they belong, and if we want the RDBMS to recognize them and take advantage of them to safely free up alternate join orders, such constraints should not be left as complex exercises for the user. Instead, let's assume that the RDBMS vendors create special-purpose functions for the purpose. For example, they could create a function PSEUDO_NUMBER(Char_Expr, Fmt) that returns 'TRUE' if and only if Char_Expr is a string that could result from Oracle's function TO_CHAR(Number_Expr, Fmt), where Fmt is a string that specifies a recognized number format. For example, PSEUDO_NUMBER('0456', '9999')='FALSE', because TO_CHAR(456, '9999') renders 456 as '456' (with at most a leading blank for the sign), and no number at all would yield the left-hand '0' in '0456' under the format '9999'. A similar PSEUDO_DATE() function would verify that a string encodes a date in the specified format.
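
Given such a function, the hand-rolled check above collapses into a single readable condition; a constraint using it might look like this (PSEUDO_NUMBER being, again, the proposed function, not an existing one):

ALTER TABLE Orders ADD CONSTRAINT Ext_Col01_Pseudo_Number CHECK (
    Char_Extension_Col01 IS NULL
    OR PSEUDO_NUMBER(Char_Extension_Col01, '9999999999') = 'TRUE');
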

Here, then, is the path from the current behavior to the new behavior I suggest for handling type-conversion errors and probably preventing at least 95 percent of the wrong-errors bugs, beginning with changes that the RDBMS vendors would need to make.

  1. Create a couple of new parameters, settable at both the session and the system levels. (This requires new functionality from the database vendor; a sketch of how such parameters might look follows this list.) The first parameter would specify simply that any error outside of the WHERE clause, or in a simple filter condition, should be discarded if the row is discarded later in the execution plan. I think the new behavior (returning only errors in fully joined rows that survive all WHERE-clause conditions) is a safe new default, but we need the parameter in case an application requires consistency with past behavior. The second parameter, not on by default (for now), would trigger errors in any equality that matches a character-type column with a number-type or date-type value, unless a constraint guarantees that the match will be error-free and equivalent whichever side of the equality is type-converted.

  2. Create new functions, such as PSEUDO_NUMBER and PSEUDO_DATE, that easily allow correct-format checks in constraints. This requires new functionality from the database vendor.

  3. Have the database recognize the new constraints, and take advantage of them not only to trigger errors with the new type-conversions-guaranteed parameter, but also to permit conversions on either side of a dissimilar-types-matching equality, enabling more degrees of freedom for the optimizer. This requires new functionality from the database vendor.

  4. Set both new parameters on, in a controlled development test environment. Play back as much of your application SQL as possible in this test environment, checking for errors, and create constraints that eliminate every error triggered by unsafe type conversions.

  5. Roll the new constraints into production, fixing bad data as needed to enable the new constraints, and fixing any application flaws that lead to the bad data.

  6. Set at least the new error-postponing parameter in production. If you're really serious about preventing wrong-errors bugs, set both new parameters in production. The second parameter will trigger occasional errors when you encounter new SQL that was never tested in the development environments with that parameter set, but the errors will be consistent, regardless of the execution plan and regardless of the data, and will always point to a new constraint required to avoid future wrong-errors bugs.
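
If a vendor exposed these as session-level parameters, step 1 might surface to developers as something like the following (both parameter names are pure invention on my part):

ALTER SESSION SET postpone_filter_errors = TRUE;          -- hypothetical parameter
ALTER SESSION SET require_conversion_constraints = TRUE;  -- hypothetical parameter
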

Remaining Wrong-Errors Vulnerabilities

Type conversions are 95 percent of the battle with wrong-errors bugs, and they require special treatment if we are to both eliminate these bugs and help the optimizer have maximum opportunities to find the best execution plan. Most of the rest of the problem is handled well enough by postponing errors until we see whether the execution plan discards the row before it is complete. A simple new generic function could help eliminate almost all of the rest of the problem, handling division-by-zero, tangent-of-90-degrees, and so on.

I propose a new function, TRAPPED_ERROR(), which could wrap around any expression at all, returning 'TRUE' if that expression would trigger an error (trapping the error, rather than returning it) and 'FALSE' if the expression evaluates without an error. For example:


SELECT Num_Col1, Num_Col2 FROM Experimental_Results
WHERE TRAPPED_ERROR(LN(Num_Col1/Num_Col2))='TRUE'

would yield those rows where Num_Col1 or Num_Col2 was zero, or where Num_Col1/Num_Col2 was negative, or even where the database hit an overflow error because the ratio Num_Col1/Num_Col2 was just too large for it to evaluate.

This function would help you find bad rows much more easily than you currently can, and with it you could easily code SQL defensively wherever you have the potential for a wrong-errors bug. Even in the absence of the other new functionality I propose above, this new function would greatly help in working around wrong-errors problems.
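
Until a vendor supplies TRAPPED_ERROR() natively, you can approximate it for specific expression types with a stored function. Here is a minimal Oracle PL/SQL sketch for the LN-ratio case above; the function name and its narrow scope are my own invention, and a truly generic version, accepting any expression at all, is exactly what only the vendor could supply:

CREATE OR REPLACE FUNCTION TRAPPED_LN_ERROR(p_num1 NUMBER, p_num2 NUMBER)
RETURN VARCHAR2 DETERMINISTIC
IS
  v_result NUMBER;
BEGIN
  v_result := LN(p_num1 / p_num2);  -- attempt the error-prone evaluation
  RETURN 'FALSE';                   -- evaluated cleanly
EXCEPTION
  WHEN OTHERS THEN
    RETURN 'TRUE';                  -- trap any error: divide-by-zero, LN domain, overflow
END;
/

With it, the example query above becomes:

SELECT Num_Col1, Num_Col2 FROM Experimental_Results
WHERE TRAPPED_LN_ERROR(Num_Col1, Num_Col2) = 'TRUE';
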

Acknowledgments

Jonathan Gennick was a great help getting this rolling, and in good shape for publication. Thanks, Jonathan! My wife, Parva Oskoui, had very useful suggestions as well.

Dan Tow is an independent consultant, operating under the banner SingingSQL (www.singingsql.com). His experience solving Oracle-related performance problems goes all the way back to his 1989 hire by Oracle Corporation. He has a Ph.D. in chemical engineering from the University of Wisconsin at Madison.


Copyright © 2009 O'Reilly Media, Inc.