This is the third article of my 12c Adaptive Sampling series. Previously, we saw in Part 1 how ADS works for a single table (also for a group by clause) and in Part 2 for joins.  As you know join cardinality is calculated (as Jonathan Lewis has explained in his book)

Join Selectivity =
((num_rows(t1) – num_nulls(t1.c1)) / num_rows(t1)) *
((num_rows(t2) – num_nulls(t2.c2)) / num_rows(t2)) /
greater(num_distinct(t1.c1), num_distinct(t2.c2))

Join Cardinality =
Join Selectivity *
filtered cardinality(t1) * filtered cardinality(t2)

So, to estimate Join cardinality you need to estimate three factors: the cardinality of both tables included in the join and join selectivity. Can this mechanism benefit from dynamic sampling? Of course, estimating cardinality of both tables can benefit from dynamic sampling, but what about join selectivity? The answer is yes. Using dynamic sampling, the DBMS can estimate or calculate column statistics, the number of distinct values and the number of null values. And this information is enough to estimate join cardinality. In the previous articles of this series we used t2 and t3 tables. In this part we also will use these tables but their size are greater than previously. To increase the segment size I have used several insert into select from clauses, so our case is:

SEGMENT_NAME       MB (user_segments)
————       ——-
T2                 3110.00
T3                 3589.00

TABLE_NAME     NUM_ROWS       BLOCKS (user_tables)
T3             14208000       231391
T2             11810048       204150

TABLE_NAME  COLUMN_NAME  NUM_DISTINCT HISTOGRAM (user_tab_col_statistics)
T2          OBJECT_NAME  53744        NONE
T3          TABLE_NAME   9004         NONE

TABLE_NAME         STALE_STATS (user_tab_statistics)
————       ———–
T2                 YES
T3                 YES

And we will see the following query

select
count(*)
from
t2
,t3
where
–t2.owner=t3.owner and
t2.object_name=t3.table_name

Now, I have decided to set the NDV of t2.object_name to 15000. In reality, it is 53744. This means that the number of distinct values in the dictionary will be less than actual number of distinct values. The object statistics are stale, but the number of distinct values 53744 represents the reality, because several insert into select from statements do not change the number of distinct values of the object_name column. But why did I decide to change the column statistic? Soon you will know that. I am trying to change it as:

DECLARE
l_distcnt  NUMBER
l_density  NUMBER;
l_nullcnt  NUMBER
l_srec     DBMS_STATS.StatRec;
l_avgclen  NUMBER;
BEGIN
DBMS_STATS.get_column_stats (
ownname => 'sh',
tabname => 't2',
colname => 'object_name',
distcnt => l_distcnt,
density => l_density,
nullcnt => l_nullcnt,
srec    => l_srec,
avgclen => l_avgclen);

l_distcnt:=15000;
l_density:=1/15000;

DBMS_STATS.set_column_stats (
ownname => 'sh',
tabname => 't2',
colname => 'object_name',
distcnt => l_distcnt,
density => l_density,
nullcnt => l_nullcnt,
srec    => l_srec,
avgclen => l_avgclen);
END;
/

First, I want to note that Oracle completely ignores dynamic sampling statistics for computing any cardinality estimation. For the Q3 query from the 10046 trace file I got the following (we can see that for both tables but I'm providing it only for the T3 table):

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
optimizer_features_enable(default) no_parallel  */ sum(vsize(C1))/count(*) ,
substrb(dump(max(substrb(C2,1,32)), 16,0,32), 1,120) ,
substrb(dump(min(substrb(C3,1,32)), 16,0,32), 1,120) , SUM(C4),
COUNT(DISTINCT C5)
FROM
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3")  */ "T3"."TABLE_NAME"
AS C1, "T3"."TABLE_NAME" AS C2, "T3"."TABLE_NAME" AS C3, CASE WHEN
("T3"."TABLE_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T3"."TABLE_NAME" AS
C5 FROM "T3" SAMPLE BLOCK(0.174483, 8) SEED(1)  "T3") innerQuery

Sometimes, we see that the sample size is increased (generally doubled), as below:

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
optimizer_features_enable(default) no_parallel  */ sum(vsize(C1))/count(*) ,
substrb(dump(max(substrb(C2,1,32)), 16,0,32), 1,120) ,
substrb(dump(min(substrb(C3,1,32)), 16,0,32), 1,120) , SUM(C4),
COUNT(DISTINCT C5)
FROM
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3")  */ "T3"."TABLE_NAME"
AS C1, "T3"."TABLE_NAME" AS C2, "T3"."TABLE_NAME" AS C3, CASE WHEN
("T3"."TABLE_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T3"."TABLE_NAME" AS
C5 FROM "T3" SAMPLE BLOCK(0.348966, 8) SEED(2)  "T3") innerQuery

What does this SQL mean? This statement uses the estimate/calculate column (join) statistics which are involved in the join (the approach can also be applied for the filter predicates).  And this is the third method for estimating joins cardinality. Here COUNT(DISTINCT C5) indicates the number of distinct values of the join (or filter) column and  SUM(C4) is the number of null values. I have removed all columns from the SQL except the NDV for t3 table, then:

SQL> SELECT
2    COUNT(DISTINCT C5) as num_dist
3  FROM
4   (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2")  */ "T2"."OBJECT_NAME"
5    AS C1, "T2"."OBJECT_NAME" AS C2, "T2"."OBJECT_NAME" AS C3, CASE WHEN
6    ("T2"."OBJECT_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T2"."OBJECT_NAME" AS
7    C5 FROM "T2" SAMPLE BLOCK(0.402778, 8) SEED(2)  "T2") innerQuery;
NUM_DIST
———-
37492

I updated the NDV to 15000 for that column, and the plan was

——————————————–
| Id  | Operation           | Name | Rows  |
——————————————–
|   0 | SELECT STATEMENT    |      |     1 |
|   1 |  SORT AGGREGATE     |      |     1 |
|*  2 |   HASH JOIN         |      |    11G|
|   3 |    TABLE ACCESS FULL| T2   |    11M|
|   4 |    TABLE ACCESS FULL| T3   |    14M|
——————————————–
Predicate Information (identified by operation id):
—————————————————
2 – access("T2"."OBJECT_NAME"="T3"."TABLE_NAME")
Note
—–
– Dynamic statistics used: dynamic sampling (level=AUTO)

And  from the 10053 trace file

Best NL cost: 2037555914706.766602
resc: 2037555914706.766602  resc_io: 2030576990663.999756  resc_cpu: 75477063522522944
resp: 2037555914706.766602  resp_io: 2030576990663.999756  resc_cpu: 75477063522522944
SPD: Return code in qosdDSDirSetup: NOCTX, estType = JOIN
Join Card:  11186477465.600000 = outer (11810048.000000) * inner (14208000.000000) * sel (6.6667e-05)

As you can see, the optimizer did not use the number of distinct values that were computed using the sampling; instead it used num_distinct from the dictionary. The optimizer has such a mechanism but it does not use it as expected; however the object statistics in the dictionary are STALE; also using sampling the number of distinct values (37492) is greater than dba_tab_col_statistics.num_distinct (15000). This means, in this instance dynamic sampling gives us more correct information than the dictionary, but the optimizer ignores that fact. It should not happen.

Conclusion

ADS will help if your tables are small; otherwise it will not help or can be bad. Depending on the size of the tables involved in the joins and the predicate type (join/filter), dynamic sampling can be completely ignored. Also, the optimizer tries (in some cases) to estimate column statistics for join selectivity but it cannot advantage of that. I hope this situation will be fixed in an upcoming release. In addition, ADS increase the parse time of the statements; therefore it can produce additional concurrency in OLTP environments such as latches and mutexes.