A B-Tree index – short for balanced tree index – is typically the most common type of index found in most Oracle databases. Just like the index of a book, it stores the values in the indexed column(s) along with the corresponding ROWID’s. These indexes may be used by the optimizer to execute a SQL query. One of the key factors considered by the optimizer to choose between a full table scan and an index scan is the selectivity of an index.
What is index selectivity?
Index selectivity is defined as the fraction of rows in a table having the same value for the indexed key. An index is considered highly selective if it has few rows for each index entry. On the other hand, an unselective index has many rows for each index entry. Best possible selectivity is when each row has a different value in the indexed column(s), whereas an index is considered least selective when all the rows have same value in indexed column(s).
Example of indexes with good selectivity are:
- Unique indexes on NOT NULL columns have the highest selectivity, as there is only one row for each index entry.
Examples of indexes with poor selectivity are:
- Index on the column Gender, which stores just two unique values (M/F)
- Indexes on columns storing Boolean data; e.g., Y/N, 1/0, T/F
Another way of understanding selectivity is as a measure of the uniqueness of the data in the indexed column(s).
- Higher selectivity means
- More unique data
- Fewer duplicates
- Fewer number of rows for each key value
- Lower selectivity means
- Less unique data
- More duplicates
- Larger number of rows for each key value
In short, a highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.
When and why is an index used?
In absence of an index, all the table blocks will need to be scanned to retrieve the rows satisfying the query predicate. However, if an index is used, only those table data blocks need to be visited which contain the ROWID’s returned by the index. Thus, indexes facilitate faster data retrieval by providing random lookups and easy access to data ordered by indexed column(s). The main source of benefit achieved by using an index is due to
- Reduced number of I/Os required to satisfy a query
- Reduced requirement for buffers in the buffer cache
Unselective vs. Selective Indexes
If the index being used is unselective, it will return a large number of rows for each Key value. This could cause:
- A higher number of index blocks to be visited
- A higher number of table block visits, which could further be escalated by a high clustering factor of the index
Both the above factors could even lead the optimizer to choose Full table scan over index access because of lower cost of FTS. This means that we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. Consider for example, an index having two distinct values in the key column(s). Assuming uniform data distribution, 50% ( = (1/2)*100) of the rows in a table have the same value for the indexed key. In such a case, it is more efficient to perform a full table scan than accessing the table via index.
If the index is not highly unselective, the optimizer may choose to use the index but in that case,
- A large number of index and data blocks would initially be read into the buffer cache, resulting in waits for db file sequential read, Free buffers and Latch: Cache Buffer LRU Chain.
- The buffer cache might fall short of space due to the application’s unnecessarily requiring many more blocks, resulting in excessive linking / unlinking of buffers in the Cache Buffer chains, causing waits for Latch: Cache Buffer Chain wait event.
- The query would read too many data blocks into the buffer cache, keeping in wait other sessions that want to access one or more of those same blocks, thereby resulting in Buffer Busy waits.
However, if the index is selective, there would be few rows for each Key value. As a result, we can access the same table data by visiting fewer indexes as well as table blocks, thereby reducing the number of I/Os and hence improving performance. For example, if an index has 100 distinct values in the key column(s), only 1% of the rows in a table have the same value for the indexed key. Since this index is quite selective, it is highly likely that the optimizer uses this index while searching for a key value.
Moreover, the requirement for buffers in the buffer cache would be quite less, leading to lesser waits for various buffer cache related wait events. The higher the selectivity of an index, the more likely it is that the optimizer chooses to access the table using an index rather than performing a full table scan.
Hence, selectivity of an index determines its effectiveness in optimizing performance. The higher the selectivity, the fewer rows are returned by the index scan and the faster the query engine can reduce the size of the result set. It is desirable to have indexes with fairly high selectivity to avoid performance issues.
It is worth mentioning here that
- Oracle implicitly creates B-tree indexes on the columns having unique and primary key integrity constraints.
- A B-tree index does not store NULL entries.
- As a general guideline, B-tree indexes should be created on the columns that are often queried for less than 15% of the table's rows.
Now, I will demonstrate that the optimizer prefers to
- Perform a Full Table Scan rather than use a highly unselective index.
- Access a table via index if the index is selective.
Demonstration
- Create hr.sales table, which is a copy of sh.sales having data ordered by CHANNEL_ID, CUST_ID. Note that the column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.
SYS>drop table hr.sales purge;
create table hr.sales as select * from sh.sales
order by channel_id, cust_id;
SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,
count(distinct(cust_id))/count(*)cust_id_selectivity,
count(distinct channel_id)distinct_channel_id,
count(distinct(channel_id))/count(*)channel_id_selectivity
from hr.sales;
NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY
———- —————- ——————- ——————- ———————-
918843 7059 .007682488 4 4.3533E-06
- Gather optimizer statistics for the hr.sales table
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns size 1', Cascade => true, no_invalidate => false);
- Find out four distinct values of CHANNEL_ID
SYS>select distinct channel_id
from hr.sales;
CHANNEL_ID
———-
2
4
3
9
- Let us create an index on column CHANNEL_ID, which is highly unselective since there are only four distinct values in the column
SYS>Create index hr.channel_id_idx on hr.sales(channel_id);
Now, if we search for a CHANNEL_ID, it can be seen that the optimizer assumes data to be uniformly distributed and estimates that 25% of the rows (918K/4 = 229K) to be returned and hence prefers to perform a full table scan. Note that the statistics shown are from the second execution to eliminate parsing and other costs.
SYS>select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 3;
Execution Plan
———————————————————-
Plan hash value: 781590677
—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————
| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |
—————————————————————————
Predicate Information (identified by operation id):
—————————————————
1 – filter("S"."CHANNEL_ID"=3)
Hence, we saw thatthe optimizer prefers to perform a Full Table Scan rather than use a highly unselective index.
Now let us repeat the same exercise on the highly selective CUST_ID column.
- Create an index on column CHANNEL_ID, which is highly selective, as there are 7049 distinct values in the column.
SYS>Create index hr.cust_id_idx on hr.sales(cust_id);
- Now, if we search for a CUST_ID, it can be seen that the optimizer, assuming the data to be uniformly distributed, estimates that only 130 rows (= 918843/7049 ) out of a total of 918843 will be returned and hence prefers to access the table via index.
SYS>select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.cust_id = 100750;
Execution Plan
———————————————————-
Plan hash value: 2826200558
——————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
——————————————————————————————-
| 0 | SELECT STATEMENT | | 130 | 1560 | 6 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 130 | 1560 | 6 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | CUST_ID_IDX | 130 | | 3 (0)| 00:00:01 |
——————————————————————————————-
Predicate Information (identified by operation id):
—————————————————
2 – access("S"."CUST_ID"=100750)
Hence, we saw that the optimizer prefers to access a table via index if the index is selective.
It is desirable to create indexes having fairly high selectivity to avoid performance issues like
- Poor query performance
- Slower DML operations due to index maintenance overhead
- Various wait events related to buffer cache
Summary
- The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key.
- A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.
- The main source of benefit achieved by using an index is due to the reduced number of I/Os required to satisfy a query.
- The selectivity of an index determines its effectiveness in optimizing performance.
- Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache, resulting in various wait events like db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc.
- The optimizer prefers to perform a Full Table Scan rather than use a highly unselective index.
- The optimizer prefers to access a table via index if the index is selective.
- It is desirable to have indexes with fairly high selectivity to make them more useful.
References:
https://docs.oracle.com/cd/B10501_01/server.920/a96533/data_acc.htm
Start the discussion at forums.toadworld.com