The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key. A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.
The main source of benefit achieved by using an index is due to reduced number of I/Os required to satisfy a query. Selectivity of an index is one of the factors which determine the reduction in I/O that can be achieved by using it and, hence, its effectiveness in optimizing performance. The higher the selectivity, the fewer rows returned by the index scan and the faster the query engine can reduce the size of the result set. Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache resulting in various wait events such as db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc. Optimizer may prefer to perform a Full Table Scan rather than use a highly unselective index; in which case, we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. On the other hand, if the index is selective, optimizer prefers to access the table via index. Hence, it is desirable to have indexes with fairly high selectivity to make them more useful.
In this article, I will explore the scenario in which there are few values in the indexed column but data distribution in the column is skewed; i.e., although on the whole the index seems to be unselective, it may still be selective with respect to sparsely occurring values.
I will demonstrate that
- In the case of skewed data distribution,
- If a histogram is not available on the indexed column, an unselective index may not be used at all
- If a histogram is available on the indexed column, an unselective index may be
- Used for sparsely occurring values
- Ignored for frequently occurring values
- It might be beneficial to create an index even on column(s) having few values if
- The data distribution in the column(s) is skewed, and
- A histogram is available on skewed column(s), and
- The application queries sparsely occurring values most of the time.
Demonstration
- Create hr.sales table which is a copy of sh.sales having data ordered by CHANNEL_ID, CUST_ID. Note that column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.
SYS>drop table hr.sales purge;
create table hr.sales as select * from sh.sales
order by channel_id, cust_id;
SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,
count(distinct(cust_id))/count(*)cust_id_selectivity,
count(distinct channel_id)distinct_channel_id,
count(distinct(channel_id))/count(*)channel_id_selectivity
from hr.sales;
NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY
———- —————- ——————- ——————- ———————-
918843 7059 .007682488 4 4.3533E-06
- Let us check how data is distributed across four values in the CHANNEL_ID column.
SYS>select distinct channel_id , count(*)
from hr.sales
group by channel_id order by channel_id;
CHANNEL_ID COUNT(*)
———- ———-
2 258025
3 540328
4 118416
9 2074
It can be seen that the data in CHANNEL_ID column is skewed and there are few rows for CHANNEL_ID = 9 as compared to the other values; i.e., if we create an index on CHANNEL_ID it will be selective with respect to certain values, whereas it is unselective with respect to others.
- Gather optimizer statistics for the hr.sales table. Note that I have not gathered a histogram for column CHANNEL_ID.
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns size 1', Cascade => true, no_invalidate => false);
- Let us Create an index on column CHANNEL_ID, which is highly unselective since there are only four distinct values in the column
SYS>Create index hr.channel_id_idx on hr.sales(channel_id);
Now, if we search for a CHANNEL_ID = 3, it can be seen that the optimizer assumes data to be uniformly distributed and estimates 25% of the rows (918K/4 = 229K) to be returned and hence prefers to perform a full table scan rather than use the unselective index. Note that the statistics shown are from the second execution to eliminate parsing and other costs.
SYS>select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 3;
Execution Plan
———————————————————-
Plan hash value: 781590677
—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————
| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |
—————————————————————————
Predicate Information (identified by operation id):
—————————————————
1 – filter("S"."CHANNEL_ID"=3)
If we search for CHANNEL_ID = 9, having few rows, and examine the execution plan and statistics, it can be seen that the optimizer again prefers to perform a full table scan even though it results in a large number of physical and logical I/Os.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9;
set autot off
2074 rows selected.
Execution Plan
———————————————————-
Plan hash value: 781590677
—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————
| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |
—————————————————————————
Predicate Information (identified by operation id):
—————————————————
1 – filter("S"."CHANNEL_ID"=9)
Statistics
———————————————————-
0 recursive calls
0 db block gets
4574 consistent gets
4431 physical reads
0 redo size
38498 bytes sent via SQL*Net to client
1937 bytes received via SQL*Net from client
140 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2074 rows processed
If we force the use of index by means of hint, observe that the consistent gets have dropped considerably from 4574 to 296 and physical reads have been eliminated. It means that although it would be beneficial to use an index, the optimizer still thinks full table search is better since it does not know about the skew in the data (We did not create a histogram on CHANNEL_ID while collecting statistics for the hr.sales table).
SYS> set autotrace traceonly
select/*+ index (s) */ s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9;
set autot off
Execution Plan
———————————————————-
Plan hash value: 3317832025
———————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
———————————————————————————————-
| 0 | SELECT STATEMENT | | 229K| 2691K| 1562 (1)| 00:00:19 |
| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 229K| 2691K| 1562 (1)| 00:00:19 |
|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 229K| | 453 (1)| 00:00:06 |
———————————————————————————————-
Predicate Information (identified by operation id):
—————————————————
2 – access("S"."CHANNEL_ID"=9)
Statistics
———————————————————-
0 recursive calls
0 db block gets
296 consistent gets
0 physical reads
0 redo size
38498 bytes sent via SQL*Net to client
1937 bytes received via SQL*Net from client
140 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2074 rows processed
- Let us create a histogram on CHANNEL_ID to inform the optimizer about the skew.
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columnsCHANNEL_ID size 4, cust_id size auto’,Cascade => true, no_invalidate => false);
SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM dba_TAB_COL_STATISTICS
WHERE owner = 'HR' and TABLE_NAME='SALES';
TABLE_NAME COLUMN_NAME NUM_DISTINCT HISTOGRAM
————— —————————— ———— —————
SALES PROD_ID 72 NONE
SALES CUST_ID 7059 NONE
SALES TIME_ID 1460 NONE
SALES CHANNEL_ID 4 FREQUENCY
SALES PROMO_ID 4 NONE
SALES QUANTITY_SOLD 1 NONE
SALES AMOUNT_SOLD 3586 NONE
7 rows selected.
Now that a histogram on CHANNEL_ID is in place, the optimizer is aware of the skew; i.e., it knows that there are too many rows for channel_id = 3 and relatively few rows for CHANNEL_ID = 9.
Now, if we search for CHANNEL_ID = 3, the optimizer almost correctly estimates 542K rows to be returned and appropriately chooses to perform a full table scan.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 3;
set autot off
540328 rows selected.
Execution Plan
———————————————————-
Plan hash value: 781590677
—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————
| 0 | SELECT STATEMENT | | 542K| 6356K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 542K| 6356K| 1231 (1)| 00:00:15 |
—————————————————————————
Predicate Information (identified by operation id):
—————————————————
1 – filter("S"."CHANNEL_ID"=3)
Statistics
———————————————————-
0 recursive calls
0 db block gets
40277 consistent gets
4431 physical reads
0 redo size
8897553 bytes sent via SQL*Net to client
396650 bytes received via SQL*Net from client
36023 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
540328 rows processed
On searching for CHANNEL_ID = 9, the optimizer estimates 1437 rows to be returned, which is quite close to the actual number of 2074, and obviously much better than its earlier estimate of 229k. As a result, it correctly chooses to perform an index scan with only 296 consistent gets.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9;
set autot off
2074 rows selected.
Execution Plan
———————————————————-
Plan hash value: 3317832025
———————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
———————————————————————————————-
| 0 | SELECT STATEMENT | | 1437 | 17244 | 12 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1437 | 17244 | 12 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 1437 | | 5 (0)| 00:00:01 |
———————————————————————————————-
Predicate Information (identified by operation id):
—————————————————
2 – access("S"."CHANNEL_ID"=9)
Statistics
———————————————————-
0 recursive calls
0 db block gets
296 consistent gets
0 physical reads
0 redo size
38498 bytes sent via SQL*Net to client
1937 bytes received via SQL*Net from client
140 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2074 rows processed
2074 rows selected.
Thus, if data distribution is skewed in a column, a histogram should be created on the column to make the optimizer aware of the skew. With a histogram in place, the optimizer may prefer to use even an unselective index for sparsely occurring values, while it may ignore the index for values which occur frequently. This in turn means that what matters is the selectivity of the index with respect to the values being queried most. It might be beneficial to create even an unselective index on column(s) having few values if
- The data distribution in the column(s) is skewed and
- A histogram is available on skewed column(s) and
- The application queries sparsely occurring values most of the time.
Summary:
- In case of skewed data distribution,
- If a histogram is not available on an indexed column , an unselective index may not be used at all
- If a histogram is available on an indexed column, an unselective index may be
- Used for sparsely occurring values
- Ignored for frequently occurring values
- It might be beneficial to create index even on column(s) having few values if
- The data distribution in the column(s) is skewed, and
- A histogram is available on the skewed column(s). and
- The application queries sparsely occurring values most of the time.
Start the discussion at forums.toadworld.com