Oct 5, 2018 8:24:11 AM by Anju Garg
The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key. A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.
The main source of benefit achieved by using an index is due to reduced number of I/Os required to satisfy a query. Selectivity of an index is one of the factors which determine the reduction in I/O that can be achieved by using it and, hence, its effectiveness in optimizing performance. The higher the selectivity, the fewer rows returned by the index scan and the faster the query engine can reduce the size of the result set. Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache resulting in various wait events such as db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc. Optimizer may prefer to perform a Full Table Scan rather than use a highly unselective index; in which case, we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. On the other hand, if the index is selective, optimizer prefers to access the table via index. Hence, it is desirable to have indexes with fairly high selectivity to make them more useful.
In this article, I will explore the scenario in which there are few values in the indexed column but data distribution in the column is skewed; i.e., although on the whole the index seems to be unselective, it may still be selective with respect to sparsely occurring values.
I will demonstrate that
SYS>drop table hr.sales purge;
create table hr.sales as select * from sh.sales
order by channel_id, cust_id;
SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,
count(distinct(cust_id))/count(*)cust_id_selectivity,
count(distinct channel_id)distinct_channel_id,
count(distinct(channel_id))/count(*)channel_id_selectivity
from hr.sales;
NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY
---------- ---------------- ------------------- ------------------- ----------------------
918843 7059 .007682488 4 4.3533E-06
SYS>select distinct channel_id , count(*)
from hr.sales
group by channel_id order by channel_id;
CHANNEL_ID COUNT(*)
---------- ----------
2 258025
3 540328
4 118416
9 2074
It can be seen that the data in CHANNEL_ID column is skewed and there are few rows for CHANNEL_ID = 9 as compared to the other values; i.e., if we create an index on CHANNEL_ID it will be selective with respect to certain values, whereas it is unselective with respect to others.
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns size 1', Cascade => true, no_invalidate => false);
SYS>Create index hr.channel_id_idx on hr.sales(channel_id);
Now, if we search for a CHANNEL_ID = 3, it can be seen that the optimizer assumes data to be uniformly distributed and estimates 25% of the rows (918K/4 = 229K) to be returned and hence prefers to perform a full table scan rather than use the unselective index. Note that the statistics shown are from the second execution to eliminate parsing and other costs.
SYS>select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 3;
Execution Plan
----------------------------------------------------------
Plan hash value: 781590677
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("S"."CHANNEL_ID"=3)
If we search for CHANNEL_ID = 9, having few rows, and examine the execution plan and statistics, it can be seen that the optimizer again prefers to perform a full table scan even though it results in a large number of physical and logical I/Os.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9;
set autot off
2074 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 781590677
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("S"."CHANNEL_ID"=9)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4574 consistent gets
4431 physical reads
0 redo size
38498 bytes sent via SQL*Net to client
1937 bytes received via SQL*Net from client
140 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2074 rows processed
If we force the use of index by means of hint, observe that the consistent gets have dropped considerably from 4574 to 296 and physical reads have been eliminated. It means that although it would be beneficial to use an index, the optimizer still thinks full table search is better since it does not know about the skew in the data (We did not create a histogram on CHANNEL_ID while collecting statistics for the hr.sales table).
SYS> set autotrace traceonly
select /*+ index (s) */ s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9;
set autot off
Execution Plan
----------------------------------------------------------
Plan hash value: 3317832025
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 229K| 2691K| 1562 (1)| 00:00:19 |
| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 229K| 2691K| 1562 (1)| 00:00:19 |
|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 229K| | 453 (1)| 00:00:06 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("S"."CHANNEL_ID"=9)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
296 consistent gets
0 physical reads
0 redo size
38498 bytes sent via SQL*Net to client
1937 bytes received via SQL*Net from client
140 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2074 rows processed
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columns CHANNEL_ID size 4, cust_id size auto',Cascade => true, no_invalidate => false);
SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM dba_TAB_COL_STATISTICS
WHERE owner = 'HR' and TABLE_NAME='SALES';
TABLE_NAME COLUMN_NAME NUM_DISTINCT HISTOGRAM
--------------- ------------------------------ ------------ ---------------
SALES PROD_ID 72 NONE
SALES CUST_ID 7059 NONE
SALES TIME_ID 1460 NONE
SALES CHANNEL_ID 4 FREQUENCY
SALES PROMO_ID 4 NONE
SALES QUANTITY_SOLD 1 NONE
SALES AMOUNT_SOLD 3586 NONE
7 rows selected.
Now that a histogram on CHANNEL_ID is in place, the optimizer is aware of the skew; i.e., it knows that there are too many rows for channel_id = 3 and relatively few rows for CHANNEL_ID = 9.
Now, if we search for CHANNEL_ID = 3, the optimizer almost correctly estimates 542K rows to be returned and appropriately chooses to perform a full table scan.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 3;
set autot off
540328 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 781590677
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 542K| 6356K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 542K| 6356K| 1231 (1)| 00:00:15 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("S"."CHANNEL_ID"=3)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
40277 consistent gets
4431 physical reads
0 redo size
8897553 bytes sent via SQL*Net to client
396650 bytes received via SQL*Net from client
36023 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
540328 rows processed
On searching for CHANNEL_ID = 9, the optimizer estimates 1437 rows to be returned, which is quite close to the actual number of 2074, and obviously much better than its earlier estimate of 229k. As a result, it correctly chooses to perform an index scan with only 296 consistent gets.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9;
set autot off
2074 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3317832025
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1437 | 17244 | 12 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1437 | 17244 | 12 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 1437 | | 5 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("S"."CHANNEL_ID"=9)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
296 consistent gets
0 physical reads
0 redo size
38498 bytes sent via SQL*Net to client
1937 bytes received via SQL*Net from client
140 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2074 rows processed
2074 rows selected.
Thus, if data distribution is skewed in a column, a histogram should be created on the column to make the optimizer aware of the skew. With a histogram in place, the optimizer may prefer to use even an unselective index for sparsely occurring values, while it may ignore the index for values which occur frequently. This in turn means that what matters is the selectivity of the index with respect to the values being queried most. It might be beneficial to create even an unselective index on column(s) having few values if
Tags: Oracle
Written by Anju Garg
Anju Garg is an Oracle Ace with over 14 years of experience in IT Industry in various roles. Since 2010, she has been involved in teaching and has trained more than a hundred DBAs from across the world in various core DBA technologies like RAC, Data guard, Performance Tuning, SQL statement tuning, Database Administration etc.
She is a regular speaker at Sangam and OTNYathra. She writes articles about Oracle and is one of the reviewers of the following book published by Pearson Oracle Problem-Solving and Troubleshooting Handbook
She is certified for :
She is passionate about learning and has keen interest in RAC and Performance Tuning. She shares her knowledge via her technical blog at http://oracleinaction.com/
We use cookies to improve your experience with our site. By continuing to use this site, you consent to our use of cookies. Learn more.