100% Correct Cost Estimation Leads to the Best SQL Optimization - 10 Common Misconceptions in SQL Tuning #10

This blog is the finale in this series about common misconceptions surrounding SQL tuning. I selected this topic as my last one because it is a very common misconception in a DBA or developers mind. It is a common misconception that the main reason that the internal database SQL optimizer cannot find a good query plan for a SQL statement is due to incorrect cost estimation by the database SQL optimizer. I have to emphasize that incorrect resource estimation for a SQL statement by the database SQL optimizer has many reasons. Outdated statistics is only one of the reasons. That is why there are a lot of people on the Internet giving the advice of updating statistics, when users are really asking for help to tune poorly performing SQL statements. But did you know that in certain environments, sometimes bad statistics or wrong cost estimations may actually lead to a better execution plan?

What Is Cost Estimation?

Cost estimation is the key component of modern cost-based SQL optimizers. The old rule-based SQL optimizers used a set of rules to construct or select the best execution plans for SQL statements and it was very sensitive to SQL syntax. Some market experts or tuning tools are still using this concept to help people to tune SQL in cost-based SQL optimizers. Cost estimation is a mathematics model that tries to predict the resource consumption of a given execution plan of a SQL statement. The goal being that the database cost-based SQL optimizer can select the lowest cost execution plan within all potential plans that a SQL can generate during SQL optimization process.

Why You Still Need to Tune Your SQL, Even When There Is 100% Correct Cost Estimation?

You may find that the cost estimation for simple SQL statements can be up to 100% correct in most database SQL optimizers on the market. Oracle12c even provides new features named "Adaptive Plan" which uses partial statistics to select the final execution plan during execution and uses fully executed SQL statistics feedback for the next SQL optimization (re-optimization). If the cost estimations for some SQL statements are no longer a problem then why are there still some problematic SQL statements that need tuning? To explain this problem we have to understand the limitations of modern cost-based SQL optimizers:

Estimated Cost is not equal to Real Cost

As I stated previously, estimated cost is not equal to the actual resource consumption (I call it "real cost"). Even if the database SQL optimizer has 100% correct statistics, cost calculation is only a simplified model for resource consumption forecast before the SQL is executed. The model is not perfect and still has a lot of loopholes. For example, each plan step's (full scan, range scan, nest loop…) primary cost is not adaptive to the environment. The relationship between estimated cardinality and the final cost calculation in each step is variant and not necessarily a linear relationship. Sometimes the relationship may even vary depending on hardware configuration and data distribution. To summarize, you should not rely solely on a SQL statement’s estimated cost as a predictor of actual performance.

SQL Syntax Limits the Best Plan Generation

Most database SQL optimizers are embedded with a SQL transformation engine that can rewrite a SQL statement using a better syntax format that allows the later stage of plans exploration to become easier and deeper. For example an "IN" sub-query can be transformed to a join syntax query by most database SQL optimizers, but sometimes it is difficult for complex syntax SQL statements to be transformed into a better syntax. In this case, the best plan may not appear in plan space no matter how good the cost estimation is. Manual SQL rewrite may be necessary at this moment.

SQL Plan Space is not Big Enough

The plan space size is one of measurements of how well a database SQL optimizer is built. A bigger plan space means that the database SQL optimizer can provide more potential query plans for a SQL statement so the chances to find a better one are higher. I can still remember how some database vendors could provide only "Nested Loop" for a two table join around 15 years ago. If a SQL statement could be optimized by a Hash Join or Sort Merge Join it could not be found in this smaller sized plan space. Of course, a bigger plan space will expose the cost estimation accuracy problem, that's why Oracle’s SQL optimizer gives you a lot of hints to control which execution plan you want to use in your specific environment.

Good Cost Estimation Does Not Necessarily Result in Good SQL Optimization

There are still some other reasons such as a potential good plan is trimmed during complex SQL optimization; database internal SQL transformation overrides developer's natural data processing flow. So that good cost estimation may not result in good SQL optimization in any database SQL optimizer. Cost estimation is only the prerequisite of the current cost-based SQL optimizers and it is especially true when you are handling complex and problematic SQL statements. I have seen some users of our product "Dell SQL Optimizer" generate alternative rewrites for a bad performing SQL and then just test SQL alternatives with lower cost only. By doing this, they may be overlooking a lot of potentially good performing SQL alternatives that have been generated by our tool.