In this article you will learn how Oracle 12c dynamic sampling works. I will present cases where it helps the optimizer produce good estimates as well as cases where it does not. Only single-table selectivity and the group by clause are considered in this first part of a series of articles on Adaptive Dynamic Sampling; the effect of dynamic sampling on joins will be covered in Part II.

1. Single table estimation

This is the query I will be using throughout this article:

SELECT count(*) 
FROM   customers 
WHERE  cust_state_province = 'CA' 
AND    country_id=52790; 
  
select segment_name, bytes/1024/1024 MB 
from user_segments 
where segment_name= 'CUSTOMERS'; 
  
SEGMENT_NAME                 MB 
-------------------- ---------- 
CUSTOMERS                    13 
  
The two columns (cust_state_province and country_id) involved in the predicate part of the above query have histograms. Below is the execution plan of this query when executed for the first time: 

SQL_ID  303kpr2tf9x3j, child number 0 
------------------------------------- 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |   1115 |   3341 | 
-------------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)) 
  
As you can see, there is a significant difference between the estimated and actual rows (Starts * E-Rows != A-Rows). 
  
If we execute the same query again, we will get the following execution plan: 
  
SQL_ID  303kpr2tf9x3j, child number 1 
------------------------------------- 
SELECT count(*)   FROM   sh.customers   WHERE  cust_state_province = 
'CA' AND    country_id=52790 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |   3341 |   3341 | 
-------------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)) 
  
Note 
----- 
   - statistics feedback used for this statement 
 

The 12c statistics feedback feature (the equivalent of the 11g cardinality feedback) kicked in and produced a very good estimate, creating a new child cursor n°1 with what is certainly an optimal execution plan. 
  
This 12c statistics feedback was possible and successful because, during the first execution, Oracle realized that the estimates it had made were not close to what the query actually generated at run time. This is why it set the is_reoptimizable property of child cursor n°0 to 'Y', as shown below: 
   
SQL> select child_number, is_reoptimizable 
     from v$sql 
     where sql_id = '303kpr2tf9x3j'; 
  
CHILD_NUMBER I 
------------ - 
0            Y 
  
As such, subsequent query executions will benefit from the information gleaned by the first execution. 
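If you want to see this flag from the library cache yourself, a minimal sketch using the USE_FEEDBACK_STATS column of v$sql_shared_cursor (and the same sql_id as above) is enough:

-- Sketch: which child cursors were marked or rebuilt because of
-- statistics feedback; 'Y' flags the estimate/actual mismatch
select child_number, use_feedback_stats
from   v$sql_shared_cursor
where  sql_id = '303kpr2tf9x3j';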
  
However, there are many real-life cases where tables have no statistics at all and where dynamic statistics (dynamic sampling in 11g) kicks in. What estimates would the optimizer come up with if asked to optimize the above query against the customers table in that kind of situation? This is what we are going to check by setting the new 12c dynamic_sampling level 11 and executing the initial query again: 
  
SQL> alter session set optimizer_dynamic_sampling =11; 
  
SQL_ID  fvdd4sy2zrdkq, child number 2 
------------------------------------- 
SELECT count(*)   FROM   customers   WHERE  cust_state_province = 
'CA' AND    country_id=52790 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |   3424 |   3341 | 
-------------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)) 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=AUTO) 
  
As you can notice, a new child cursor n°2 has been created. In this case the optimizer's estimate is quite good, but not as perfect as the one obtained with statistics feedback (cursor n°1). Of course, changing an optimizer parameter forces the CBO to compile a new execution plan using the new optimizer environment. 

Now consider what estimates the CBO would have produced if the shared pool had been flushed and adequate statistics had been gathered, including histograms and extended statistics for the columns involved in the predicate part, as shown below: 
  
BEGIN 
  dbms_stats.gather_table_stats 
    ('sh' 
    ,'customers' 
    ,method_opt => 'for all columns size skewonly 
                    for columns(cust_state_province,country_id) 
                    size skewonly' 
    ); 
END; 
/ 
  
alter session set statistics_level=all; 
  
alter system flush shared_pool; 
  
SELECT count(*) 
FROM   customers 
WHERE  cust_state_province = 'CA' 
AND    country_id=52790; 
  
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')); 
  
SQL_ID  303kpr2tf9x3j, child number 0 
------------------------------------- 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |   3341 |   3341 | 
-------------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)) 
  
The CBO's estimate is in this case perfect (E-Rows = A-Rows). This is thanks to the extended statistics gathered on the cust_state_province and country_id columns (see the call to the dbms_stats package above). 
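To check that the column group really exists after the dbms_stats call, you can query USER_STAT_EXTENSIONS, as in this quick sketch (to be run as the SH user):

-- Sketch: list the extended statistics (column groups) defined on CUSTOMERS
select extension_name, extension
from   user_stat_extensions
where  table_name = 'CUSTOMERS';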

Let's now enable automatic dynamic sampling (ADS) and see how the CBO reacts: 
alter system flush shared_pool; 
alter session set optimizer_dynamic_sampling=11; 
  
SELECT count(*) 
FROM   customers 
WHERE  cust_state_province = 'CA' 
AND    country_id=52790; 
  
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')); 
  
SQL_ID  303kpr2tf9x3j, child number 
------------------------------------- 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |           |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |   3424 |   3341 | 
-------------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)) 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=AUTO) 
  
One remark has to be emphasized here: despite the presence of adequate statistics, when we instructed the CBO to use dynamic sampling it obeyed the instruction and neglected those statistics. The optimizer probably assumes that with ADS activated it has enough power and tools to produce very good estimates. In my opinion, however, the optimizer should detect the presence of the column group statistics and use them instead of ADS. This resembles the working mechanism of SQL hints: when the optimizer is hinted to follow a certain path, it will follow that path whatever its cost, as long as it is valid. 

Before digging into more details about this situation where both adequate statistics and dynamic sampling at level 11 are present, let me explain where the optimizer got this 3424 cardinality from. 

As you know, in 12c the CBO trace file (event 10053) does not contain much information about dynamic sampling. Here is what the corresponding trace file contains: 

SINGLE TABLE ACCESS PATH 
Single Table Cardinality Estimation for CUSTOMERS[CUSTOMERS] 
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE 
*** 2014-09-22 02:47:11.625 
** Performing dynamic sampling initial checks. ** 
** Not using old style dynamic sampling since ADS is enabled. 
Column (#11): 
NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500.000000, PopBktCnt:55500.000000, PopValCnt:145, NDV:145 
Column (#11): CUST_STATE_PROVINCE(VARCHAR2) 
AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144 
Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145  ActualVal: yes 
Column (#13): 
NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500.000000, PopBktCnt:55500.000000, PopValCnt:19, NDV:19 
Column (#13): COUNTRY_ID(NUMBER) 
AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769.000000 Max: 52791.000000 
Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19  ActualVal: yes 
Table: CUSTOMERS  Alias: CUSTOMERS 
Card: Original: 55500.000000    >>Single Tab Card adjusted from 1114.870631 to 3423.832500 due to adaptive dynamic sampling 
Rounded: 3424  Computed: 3423.832500  Non Adjusted: 1114.870631 
  
Note how the trace file indicates that adaptive dynamic sampling has been used to adjust the cardinality from 1114.870631 to 3423.832500. The following question then poses itself: how has this 3423.832500 value been determined? 
When I enable SQL trace (event 10046) I get the following information: 
  
SQL ID: 8z1xwcv87jxb2 
Plan Hash: 2105012006 
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring 
  optimizer_features_enable(default) no_parallel result_cache(snapshot=3600) 
  */ SUM(C1) 
FROM 
 (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "CUSTOMERS") */ 1 AS C1 
  FROM "SH"."CUSTOMERS" SAMPLE BLOCK(51.5796, 8) SEED(1) "CUSTOMERS" WHERE 
  ("CUSTOMERS"."CUST_STATE_PROVINCE"='CA') AND ("CUSTOMERS"."COUNTRY_ID"= 
  52790)) innerQuery 
  
call     count       cpu    elapsed   disk   query  current    rows 
------- ------    ------ ----------  ----- ------  -------  ------ 
Parse        1      0.00       0.00      0      0        0       0 
Execute      1      0.00       0.00      0      0        0       0 
Fetch        1      0.00       0.00      0    798        0       1 
------- ------    ------ ----------  ----- ------  -------  ------ 
total        3      0.00       0.00      0    798        0       1 
  
Misses in library cache during parse: 1 
Optimizer mode: ALL_ROWS 
Parsing user id: 105     (recursive depth: 1) 
  
Rows     Row Source Operation 
-------  --------------------------------------------------- 
      1  RESULT CACHE  7t2g6717d8zkq49mdxstnxqnax (cr=798 pr=0 pw=0 time=6509 us) 
      1   SORT AGGREGATE (cr=798 pr=0 pw=0 time=6407 us) 
   1766    TABLE ACCESS SAMPLE CUSTOMERS (cr=798 pr=0 pw=0 time=6236 us cost=304 
           size=48244 card=1723) 
  
As you may have noticed from the RESULT CACHE operation in the above row source execution plan, the result has been stored in the result cache for future use (shared by all subsequent executions of the query). The Oracle sampling mechanism identifies a cardinality of 1766, as shown in the same row source execution plan. We can also point out that this sampling mechanism uses an intriguing second argument (8), which represents the sampling group size used to sample the customers table (SAMPLE BLOCK(51.5796, 8)). This SAMPLE BLOCK(percent, sample_group_size) option is not yet documented by Oracle. It seems that when dynamic sampling is set at the new 12c level (11), Oracle uses a sample_group_size of 8, while for other dynamic sampling levels it uses a sample_group_size of 1.
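If you are curious, you can replay a similar sampling query yourself. A sketch using the documented single-argument form of the SAMPLE BLOCK clause (the two-argument form being undocumented, I stick to the documented syntax; the count matches the recursive query only with the same percentage and seed):

-- Sketch: mimic the recursive ADS query with documented SAMPLE BLOCK syntax
SELECT count(*)
FROM   sh.customers SAMPLE BLOCK (51.5796) SEED (1)
WHERE  cust_state_province = 'CA'
AND    country_id = 52790;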

And, again, a new legitimate question poses itself: where does this 51.5796 value come from? In order to estimate the customers table cardinality, Oracle actually scans about half of the table (hence the 51.5796 percent). This method will, of course, add an extra time penalty to the optimization phase of the underlying query, but subsequent executions will benefit from this sampled estimate without the parsing penalty it requires. This method is known as the adaptive sampling method. I will prepare another article about calculating the (optimal) sample size, the exact adaptive sampling algorithm, and how Oracle has implemented it. 

So we can see that with a sampling percentage of 51.5796% the row source cardinality found by Oracle is 1766. Oracle then extrapolated the cardinality that a 100% sample would have produced: 1766 * 100 / 51.5796 = 3423.83, rounded to 3424. 
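The arithmetic is easy to verify with a quick sketch from DUAL:

-- Sketch: scale the sampled row count (1766) up to a 100% sample
select 1766 * 100 / 51.5796        as computed,  -- 3423.8325, the trace's "Computed" value
       round(1766 * 100 / 51.5796) as rounded    -- 3424, the E-Rows in the plan
from   dual;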

Bear in mind that with dynamic statistics set at a level different from 11, the optimizer counts the Actual Sample Size (ASS) together with the Filtered Sample Cardinality (FSC) and then calculates the selectivity using:

selectivity = FSC/ASS

and the estimated cardinality via the following formula:

cardinality = num_rows * selectivity
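As a purely hypothetical illustration of this pre-level-11 arithmetic (the sample figures below are invented for the example, not taken from any trace):

-- Sketch with invented figures: a 5000-row sample (ASS) of which
-- 300 rows survive the filter (FSC), scaled up to num_rows = 55500
select 300 / 5000           as selectivity,  -- FSC / ASS = 0.06
       55500 * (300 / 5000) as cardinality   -- num_rows * selectivity = 3330
from   dual;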

Starting from 12c, with dynamic statistics set at the new level 11, the CBO does not calculate the selectivity of the predicate. Instead, it starts by determining the optimal sample size, continues by estimating a fraction (a portion) of the whole result set, and finally extrapolates the whole result set from that fraction. 

At this stage of the investigation a new question arises: what will happen when the size of the customers table increases? 

Increasing the table size naturally implies a reduction of the sample size because sampling a large segment is always a time-consuming operation. Will this sample size reduction imply a reduction in the precision of the cardinality estimate? A picture being worth a thousand words, let's explore this plausible case via a concrete example: 

create table t_s as select * from dba_objects; 
  
insert into t_s select * from t_s; 
  
insert into t_s select * from t_s; 
  
insert into t_s select * from t_s; 
  
insert into t_s select * from t_s; 
  
insert into t_s select * from t_s; 
  
insert into t_s select * from t_s; 
  
select segment_name, bytes/1024/1024 MB 
from user_segments 
where segment_name='T_S'; 
  
SEGMENT_NAME            MB 
--------------- ---------- 
T_S                 776.00 
  
exec dbms_stats.gather_table_stats('sh','t_s', 
method_opt=>'for all columns size skewonly'); 
  
Against this simple model I will issue two different queries and get their execution plans using dynamic statistics set at level 11 as shown below: 

First query 

  
select count(*) from t_s where owner='SYS' and object_type='TABLE' 
  
SQL_ID  ckutv7jm28ru9, child number 0 
------------------------------------- 
  
--------------------------------------------------------------- 
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | 
--------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |      |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| T_S  |      1 |  57840 |  82048 | 
--------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("OBJECT_TYPE"='TABLE' AND "OWNER"='SYS')) 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=AUTO) 
  

Second query 

  
select count(*) from t_s where owner='SH' and object_type='TABLE' 
  
SQL_ID  2w2wwxaznd0kg, child number 0 
------------------------------------- 
  
--------------------------------------------------------------- 
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | 
--------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |      |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| T_S  |      1 |    371 |   1600 | 
--------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("OWNER"='SH' AND "OBJECT_TYPE"='TABLE')) 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=AUTO) 
  
As you can see from the above execution plans, when I used a table (T_S) with a large segment size (776MB), Oracle was unable to get accurate estimates using dynamic statistics set at level 11. This is in contrast to the case of the CUSTOMERS table, with its relatively small size (13MB), where the CBO estimate was approximately correct. In the case of the T_S table, Oracle scanned a very small part of the segment and then guessed the whole query cardinality. 
  
I also want to emphasize something else important, which concerns the relationship between dynamic statistics (at level 11) and the presence of skewed data (columns with histograms) in the sampled segment. People still wrongly assume that the CBO, and particularly the new dynamic statistics sampling method, is able to compensate for missing statistics (like histograms and extended statistics) and produce a correct cardinality. This, of course, is not possible using dynamic statistics set at level 11. Moreover, we saw above that even when extended statistics and adequate histograms are present, when Oracle is instructed to use dynamic statistics at level 11 it bypasses these existing dictionary statistics and uses only information it can get during execution plan compilation, which obviously excludes getting histograms and extended statistics dynamically at parse time. 
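To see exactly which dictionary statistics ADS is bypassing here, a quick sketch against USER_TAB_COL_STATISTICS lists the histograms gathered on T_S:

-- Sketch: which T_S columns carry histograms, i.e. the statistics
-- that dynamic sampling at level 11 ignores at parse time
select column_name, histogram, num_distinct
from   user_tab_col_statistics
where  table_name = 'T_S';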
  
To confirm this, if, for the above T_S table example, I disable automatic dynamic sampling and create extended statistics for the owner and object_type columns, the execution plan of the second query looks like this: 
  
SQL_ID  2w2wwxaznd0kg, child number 0 
------------------------------------- 
select count(*) from t_s where owner='SH' and object_type='TABLE' 
  
--------------------------------------------------------------- 
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | 
--------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |      |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| T_S  |      1 |   1600 |   1600 | 
--------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("OWNER"='SH' AND "OBJECT_TYPE"='TABLE')) 
  
Here the cardinality estimate is perfect, thanks to the column group statistics on (owner, object_type). 
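For reference, here is a sketch of how the column group can be created and gathered (the extension name Oracle returns is system generated; level 2 is the default sampling level I reverted to):

-- Sketch: back to the default sampling level, then create and
-- gather the (owner, object_type) column group on T_S
alter session set optimizer_dynamic_sampling = 2;

select dbms_stats.create_extended_stats('sh', 't_s', '(owner,object_type)')
from   dual;

exec dbms_stats.gather_table_stats('sh','t_s', method_opt=>'for all columns size skewonly');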
  
All the above observations and investigations concern dynamic statistics at level 11; what happens now if I drop the extended column statistics and instruct the CBO to use dynamic statistics at a lower level (9, for example)? 
  
BEGIN 
  DBMS_STATS.DROP_EXTENDED_STATS('sh', 't_s', 
                                 '("OWNER","OBJECT_TYPE")'); 
END; 
/ 

  
alter system flush shared_pool; 
exec  DBMS_RESULT_CACHE.FLUSH; 
alter session set optimizer_dynamic_sampling=9; 
  
select count(*) from t_s where owner='SH' and object_type='TABLE'; 
  
SQL_ID  avw6r9kqfvx9r, child number 
------------------------------------- 
select count(*) from t_s where owner='SH' and object_type='TABLE' 
  
--------------------------------------------------------------- 
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | 
--------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |      |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| T_S  |      1 |   1763 |   1600 | 
--------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("OWNER"='SH' AND "OBJECT_TYPE"='TABLE')) 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=9) 
  
With dynamic statistics set at a lower level (9) we get a better estimate than with dynamic statistics set at the new 12c level (11). 
  
If I now re-create the extended statistics for the owner and object_type columns, gather statistics with histograms, and instruct the CBO to use dynamic statistics at level 9, I end up with the following execution plan: 
  
exec dbms_stats.gather_table_stats('sh','t_s',method_opt=>'for all columns size skewonly for columns(owner, object_type) size skewonly'); 
  
alter system flush shared_pool; 
exec DBMS_RESULT_CACHE.FLUSH; 
alter session set optimizer_dynamic_sampling=9; 
  
select count(*) from t_s where owner='SH' and object_type='TABLE'; 
  
SQL_ID  avw6r9kqfvx9r, child number 
------------------------------------- 
select count(*) from t_s where owner='SH' and object_type='TABLE' 
  
Plan hash value: 2004207966 
  
--------------------------------------------------------------- 
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | 
--------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |      |      1 |        |      1 | 
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 | 
|*  2 |   TABLE ACCESS FULL| T_S  |      1 |   1600 |   1600 | 
--------------------------------------------------------------- 
  
Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - filter(("OWNER"='SH' AND "OBJECT_TYPE"='TABLE')) 
  
As you can see, the optimizer did not use dynamic sampling at level 9 in the presence of statistics, even when instructed to do so. This means that when optimizer_dynamic_sampling is set at a level different from 11, the CBO can detect the presence of accurate (or unrepresentative) statistics and decide whether it is appropriate to use or bypass the dynamic statistics feature. This is not the case when dynamic statistics is set at the new level 11: there, dynamic statistics supersedes every other available statistic and, unfortunately and wrongly, relies on dynamic sampling alone for its cardinality estimates. 
  

2. Group By clause

I will be using the following query in order to investigate the behavior of dynamic statistics in the presence of a group by clause: 
SELECT count(*) 
FROM   sh.customers 
group by COUNTRY_ID,CUST_STATE_PROVINCE; 
  
The execution plan of this query is given by: 

SQL_ID  gp0cpb1kutsr9, child number 0 
------------------------------------- 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |    145 | 
|   1 |  HASH GROUP BY     |           |      1 |   1949 |    145 | 
|   2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |  55500 |  55500 | 
-------------------------------------------------------------------- 
  
The optimizer, having no clue about the correlation between country_id and cust_state_province, produced a wrong estimate for the HASH GROUP BY operation n°1 above. 
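The real number of groups is easy to check directly, as in this sketch:

-- Sketch: the true number of (country_id, cust_state_province)
-- combinations, i.e. the rows the HASH GROUP BY must produce
select count(*)
from   (select distinct country_id, cust_state_province
        from   sh.customers);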

Let's then instruct the CBO to use the new 12c dynamic sampling level, execute the query again, and note what happens: 

alter system flush shared_pool; 
exec DBMS_RESULT_CACHE.FLUSH; 
alter session set optimizer_dynamic_sampling=11; 
  
SQL_ID  gp0cpb1kutsr9, child number 
------------------------------------- 
  
-------------------------------------------------------------------- 
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows | 
-------------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |           |      1 |        |    145 | 
|   1 |  HASH GROUP BY     |           |      1 |    145 |    145 | 
|   2 |   TABLE ACCESS FULL| CUSTOMERS |      1 |  55500 |  55500 | 
-------------------------------------------------------------------- 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=AUTO) 
  
As you can observe in the above execution plan, the HASH GROUP BY operation n°1 is now perfectly estimated by the CBO. 
From the corresponding 10046 trace file we can isolate the following interesting information: 

SQL ID: 7b9ma1hfgd25c 
Plan Hash: 922166714 
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring 
  optimizer_features_enable(default) no_parallel result_cache(snapshot=3600) 
  */ SUM(C1) 
FROM 
 (SELECT /*+ qb_name("innerQuery") */ 1 AS C1 FROM (SELECT COUNT(*) "COUNT(*)" 
  FROM "SH"."CUSTOMERS" "CUSTOMERS" GROUP BY "CUSTOMERS"."COUNTRY_ID", 
  "CUSTOMERS"."CUST_STATE_PROVINCE") "VW_DIS_1") innerQuery 
  
  
call     count    cpu  elapsed   disk   query  current   rows 
------- ------  ----- -------- ------ -------  -------  ----- 
Parse        1   0.00     0.00      0       0        0      0 
Execute      1   0.00     0.00      0       0        0      0 
Fetch        1   0.01     0.01      0    1522        0      1 
------- ------  ----- -------- ------ -------  -------  ----- 
total        3   0.01     0.01      0    1522        0      1 
  
Misses in library cache during parse: 1 
Optimizer mode: ALL_ROWS 
Parsing user id: 105     (recursive depth: 1) 
  
Rows     Row Source Operation 
-------  --------------------------------------------------- 
      1  RESULT CACHE  6rw3j6xwtuzga8umg05k5624x2 (cr=1522 pr=0 pw=0 time=18273 us) 
      1   SORT AGGREGATE (cr=1522 pr=0 pw=0 time=18205 us) 
    145    VIEW  VM_NWVW_0 (cr=1522 pr=0 pw=0 time=18183 us cost=591 size=5847 card=1949) 
    145     HASH GROUP BY (cr=1522 pr=0 pw=0 time=18156 us cost=591 size=31184 card=1949) 
  55500      TABLE ACCESS FULL CUSTOMERS (cr=1522 pr=0 pw=0 time=7690 us cost=586 size=888000 card=55500) 
 

Interesting. In order to get an accurate estimate of the group by operation, Oracle executed the query and got 145. Executing the query becomes a step in the plan optimization process! Moreover, Oracle actually executed this query twice (you will have to take my word for this, since I have not posted the part of the trace file that shows this double execution: the first execution with "VW_DIS_1" and the second one visible in the 10053 trace file with "VW_DIS_2"). I still have not figured out why Oracle needs this double query execution at parse time. Generally, when the first sample is not accurate, a second sampling is done, resulting in a second query execution. But in this particular case there has been no sampling (or we could consider it a 100% sample). I think that in our case the second execution is useless and doesn't need to exist. 
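If you want to hunt for these recursive sampling queries on your own system, they are easy to spot in v$sql thanks to their DS_SVC comment. A sketch (the prefix match keeps the monitoring query from finding itself):

-- Sketch: locate the recursive ADS queries still cached in the shared pool
select sql_id, executions, sql_text
from   v$sql
where  sql_text like 'SELECT /* DS_SVC */%';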

 

Having said that, I now want to extend this group by test to a larger table. For that, I created a CUST table as follows: 

create table cust as select * from customers;

and did several inserts (using insert into cust select * from cust) until the table size reached 760MB, as sketched below. 
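A minimal sketch of that doubling loop (six doublings take the original 55,500 rows to the 3,552K rows visible in the final plan):

begin
  -- double the table six times: 55500 * 2^6 = 3552000 rows
  for i in 1 .. 6 loop
    insert into cust select * from cust;
  end loop;
  commit;
end;
/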

SEGMENT_NAME            MB 
--------------- ---------- 
CUST                760.00 
  
Below is the corresponding 10046 trace file for the same group by query, this time against the new, bigger CUST table: 

SQL ID: 5qa7dunz4udqc 
Plan Hash: 3471111787 
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring 
  optimizer_features_enable(default) no_parallel result_cache(snapshot=3600) 
  */ SUM(C1) 
FROM 
 (SELECT /*+ qb_name("innerQuery") */ 1 AS C1 FROM (SELECT COUNT(*) "COUNT(*)" 
  FROM "SH"."CUST" "CUST" GROUP BY "CUST"."COUNTRY_ID", 
  "CUST"."CUST_STATE_PROVINCE") "VW_DIS_1") innerQuery 
  
call     count    cpu elapsed    disk   query current   rows 
------- ------  ----- ------- ------- ------- -------  ----- 
Parse        1   0.00    0.00       0       0       0      0 
Execute      1   0.00    0.00       0       0       0      0 
Fetch        1   1.31    1.31   64914   72386       0      0 
------- ------  ----- ------- ------- ------- -------  ----- 
total        3   1.31    1.31   64914   72386       0      0 
  
Rows     Row Source Operation 
-------  --------------------------------------------------- 
      0  RESULT CACHE  8j9xaaqaqzg8q419n87dbp51vu (cr=0 pr=0 pw=0 time=1 us) 
      0   SORT AGGREGATE (cr=0 pr=0 pw=0 time=31 us) 
      0    VIEW  VM_NWVW_0 (cr=0 pr=0 pw=0 time=27 us cost=591 size=5847 card=1949) 
      0     HASH GROUP BY (cr=0 pr=0 pw=0 time=26 us cost=591 size=31184 card=1949) 
2524125      TABLE ACCESS FULL CUST (cr=72386 pr=64914 pw=0 time=849426 us cost=586 size=888000 card=55500) 
  
You can easily see that this time Oracle did not fully execute the query. The optimizer considers that this method would not be efficient because the object is quite big. So it decided to use the old method to estimate the group by cardinality, with which it generated the following execution plan: 
  
SQL_ID  1j4cppr9wz91r, child number 
------------------------------------- 
SELECT count(*)   FROM   sh.cust  group by 
COUNTRY_ID,CUST_STATE_PROVINCE 
  
--------------------------------------------------------------- 
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | 
--------------------------------------------------------------- 
|   0 | SELECT STATEMENT   |      |      1 |        |    145 | 
|   1 |  HASH GROUP BY     |      |      1 |   1949 |    145 | 
|   2 |   TABLE ACCESS FULL| CUST |      1 |  55500 |   3552K| 
--------------------------------------------------------------- 
  
Note 
----- 
   - dynamic statistics used: dynamic sampling (level=AUTO) 
  
However, the above note wrongly indicates that dynamic statistics at level 11 has been used. This is not completely true, because during parsing the CBO abandoned the sampling method on account of the huge table size, which, according to the CBO, would make sampling inefficient. 
Notice also that this last 10046 trace file contains a few SQL statements that tried to estimate the cardinality of the CUST table and to determine its column statistics (num_distinct, etc.), as reproduced below: 

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring 
optimizer_features_enable(default) no_parallel  */ SUM(C1) 
FROM 
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "CUST")  */ 1 AS C1 FROM 
"SH"."CUST" SAMPLE BLOCK(0.826182, 8) SEED(1)  "CUST") innerQuery 
  
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring 
optimizer_features_enable(default) no_parallel  */ sum(vsize(C1))/count(*) , 
COUNT(DISTINCT C2), sum(vsize(C3))/count(*) , COUNT(DISTINCT C4) 
FROM 
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "CUST")  */ 
"CUST"."CUST_STATE_PROVINCE" AS C1, "CUST"."COUNTRY_ID" AS C2, 
"CUST"."COUNTRY_ID" AS C3, "CUST"."CUST_STATE_PROVINCE" AS C4 FROM 
"SH"."CUST" SAMPLE BLOCK(0.826182, 8) SEED(1)  "CUST") innerQuery 
  
These queries were ultimately ignored during plan optimization. 

3. Conclusion

Automatic dynamic sampling (ADS) at level 11 can be efficient and generate accurate cardinality estimates for small tables. The main principle of ADS at this particular level is the estimation of a fraction of the result set (a subset of the query) followed by extrapolation to the whole result set. Therefore, with increasing table size, the CBO will gradually lose estimation accuracy. This is why, in most real-life systems, this feature will probably not be helpful, particularly if there is significant data skew. What also seems inconvenient is that, in contrast to dynamic statistics set at a level other than 11, when the CBO is instructed to use dynamic statistics at level 11 it ignores adequate and representative statistics, including histograms and extended column group statistics. In such a situation, a different dynamic statistics level can be more accurate than the new ADS level. Finally, queries using a group by clause on big tables will not use dynamic sampling at level 11 even when instructed to do so, and even though the note that accompanies the corresponding execution plan says the contrary.

Read Part 2 of this series.

About the Author

Chinar Aliyev

Oracle RDBMS specialist with 10+ years of experience as a developer and DBA. Experienced with OLTP and data warehouse systems (both RAC and single instance). Specializing in database design, performance tuning and high availability.
