Sep 5, 2018 3:45:31 PM by Anju Garg
A B-Tree index – short for balanced tree index – is typically the most common type of index found in most Oracle databases. Just like the index of a book, it stores the values in the indexed column(s) along with the corresponding ROWID’s. These indexes may be used by the optimizer to execute a SQL query. One of the key factors considered by the optimizer to choose between a full table scan and an index scan is the selectivity of an index.
Index selectivity is defined as the fraction of rows in a table having the same value for the indexed key. An index is considered highly selective if it has few rows for each index entry. On the other hand, an unselective index has many rows for each index entry. Best possible selectivity is when each row has a different value in the indexed column(s), whereas an index is considered least selective when all the rows have same value in indexed column(s).
Example of indexes with good selectivity are:
Examples of indexes with poor selectivity are:
Another way of understanding selectivity is as a measure of the uniqueness of the data in the indexed column(s).
In short, a highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.
In absence of an index, all the table blocks will need to be scanned to retrieve the rows satisfying the query predicate. However, if an index is used, only those table data blocks need to be visited which contain the ROWID’s returned by the index. Thus, indexes facilitate faster data retrieval by providing random lookups and easy access to data ordered by indexed column(s). The main source of benefit achieved by using an index is due to
If the index being used is unselective, it will return a large number of rows for each Key value. This could cause:
Both the above factors could even lead the optimizer to choose Full table scan over index access because of lower cost of FTS. This means that we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. Consider for example, an index having two distinct values in the key column(s). Assuming uniform data distribution, 50% ( = (1/2)*100) of the rows in a table have the same value for the indexed key. In such a case, it is more efficient to perform a full table scan than accessing the table via index.
If the index is not highly unselective, the optimizer may choose to use the index but in that case,
However, if the index is selective, there would be few rows for each Key value. As a result, we can access the same table data by visiting fewer indexes as well as table blocks, thereby reducing the number of I/Os and hence improving performance. For example, if an index has 100 distinct values in the key column(s), only 1% of the rows in a table have the same value for the indexed key. Since this index is quite selective, it is highly likely that the optimizer uses this index while searching for a key value.
Moreover, the requirement for buffers in the buffer cache would be quite less, leading to lesser waits for various buffer cache related wait events. The higher the selectivity of an index, the more likely it is that the optimizer chooses to access the table using an index rather than performing a full table scan.
Hence, selectivity of an index determines its effectiveness in optimizing performance. The higher the selectivity, the fewer rows are returned by the index scan and the faster the query engine can reduce the size of the result set. It is desirable to have indexes with fairly high selectivity to avoid performance issues.
It is worth mentioning here that
Now, I will demonstrate that the optimizer prefers to
SYS>drop table hr.sales purge;
create table hr.sales as select * from sh.sales
order by channel_id, cust_id;
SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,
count(distinct(cust_id))/count(*)cust_id_selectivity,
count(distinct channel_id)distinct_channel_id,
count(distinct(channel_id))/count(*)channel_id_selectivity
from hr.sales;
NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY
---------- ---------------- ------------------- ------------------- ----------------------
918843 7059 .007682488 4 4.3533E-06
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns size 1', Cascade => true, no_invalidate => false);
SYS>select distinct channel_id
from hr.sales;
CHANNEL_ID
----------
2
4
3
9
SYS>Create index hr.channel_id_idx on hr.sales(channel_id);
Now, if we search for a CHANNEL_ID, it can be seen that the optimizer assumes data to be uniformly distributed and estimates that 25% of the rows (918K/4 = 229K) to be returned and hence prefers to perform a full table scan. Note that the statistics shown are from the second execution to eliminate parsing and other costs.
SYS>select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 3;
Execution Plan
----------------------------------------------------------
Plan hash value: 781590677
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |
|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("S"."CHANNEL_ID"=3)
Hence, we saw that the optimizer prefers to perform a Full Table Scan rather than use a highly unselective index.
Now let us repeat the same exercise on the highly selective CUST_ID column.
SYS>Create index hr.cust_id_idx on hr.sales(cust_id);
SYS>select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.cust_id = 100750;
Execution Plan
----------------------------------------------------------
Plan hash value: 2826200558
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 130 | 1560 | 6 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 130 | 1560 | 6 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | CUST_ID_IDX | 130 | | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("S"."CUST_ID"=100750)
Hence, we saw that the optimizer prefers to access a table via index if the index is selective.
It is desirable to create indexes having fairly high selectivity to avoid performance issues like
References:
https://docs.oracle.com/cd/B10501_01/server.920/a96533/data_acc.htm
Tags: Oracle
Written by Anju Garg
Anju Garg is an Oracle Ace with over 14 years of experience in IT Industry in various roles. Since 2010, she has been involved in teaching and has trained more than a hundred DBAs from across the world in various core DBA technologies like RAC, Data guard, Performance Tuning, SQL statement tuning, Database Administration etc.
She is a regular speaker at Sangam and OTNYathra. She writes articles about Oracle and is one of the reviewers of the following book published by Pearson Oracle Problem-Solving and Troubleshooting Handbook
She is certified for :
She is passionate about learning and has keen interest in RAC and Performance Tuning. She shares her knowledge via her technical blog at http://oracleinaction.com/
We use cookies to improve your experience with our site. By continuing to use this site, you consent to our use of cookies. Learn more.