Unselective Indexes: Selectivity

    Sep 5, 2018 4:45:31 PM by Anju Garg

    A B-Tree index – short for balanced tree index – is typically the most common type of index found in most Oracle databases. Just like the index of a book, it stores the values in the indexed column(s) along with the corresponding ROWID’s.  These indexes may be used by the optimizer to execute a SQL query. One of the key factors considered by the optimizer to choose between a full table scan and an index scan is the selectivity of an index.

    What is index selectivity?

    Index selectivity is defined as the fraction of rows in a table having the same value for the indexed key.  An index is considered highly selective if it has few rows for each index entry. On the other hand, an unselective index has many rows for each index entry.  Best possible selectivity is when each row has a different value in the indexed column(s), whereas an index is considered least selective when all the rows have same value in indexed column(s). 

    Example of indexes with good selectivity are:

    • Unique indexes on NOT NULL columns have the highest selectivity, as there is only one row for each index entry.

    Examples of indexes with poor selectivity are:

    • Index on the column Gender, which stores just two unique values (M/F)
    • Indexes on columns storing Boolean data; e.g., Y/N, 1/0, T/F

    Another way of understanding selectivity is as a measure of the uniqueness of the data in the indexed column(s).

    • Higher selectivity means
      • More unique data
      • Fewer duplicates
      • Fewer number of rows for each key value
    • Lower selectivity means
      • Less unique data
      • More duplicates
      • Larger number of rows for each key value

    In short, a highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.

    When and why is an index used?

    In absence of an index, all the table blocks will need to be scanned to retrieve the rows satisfying the query predicate. However, if an index is used, only those table data blocks need to be visited which contain the ROWID’s returned by the index.  Thus, indexes facilitate faster data retrieval by providing random lookups and easy access to data ordered by indexed column(s). The main source of benefit achieved by using an index is due to

    • Reduced number of I/Os required to satisfy a query
    • Reduced requirement for buffers in the buffer cache

    Unselective vs. Selective Indexes

    If the index being used is unselective, it will return a large number of rows for each Key value. This could cause:

    • A higher number of index blocks to be visited
    • A higher number of table block visits, which could further be escalated by a high clustering factor of the index

    Both the above factors could even lead the optimizer to choose Full table scan over index access because of lower cost of FTS. This means that we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. Consider for example, an index having two distinct values in the key column(s). Assuming uniform data distribution, 50% ( = (1/2)*100) of the rows in a table have the same value for the indexed key. In such a case, it is more efficient to perform a full table scan than accessing the table via index.

    If the index is not highly unselective, the optimizer may choose to use the index but in that case,

    • A large number of index and data blocks would initially be read into the buffer cache, resulting in waits for db file sequential read, Free buffers and Latch: Cache Buffer LRU Chain.
    • The buffer cache might fall short of space due to the application’s unnecessarily requiring many more blocks, resulting in excessive linking / unlinking of buffers in the Cache Buffer chains, causing waits for Latch: Cache Buffer Chain wait event.
    • The query would read too many data blocks into the buffer cache, keeping in wait other sessions that want to access one or more of those same blocks, thereby resulting in Buffer Busy waits.

    However, if the index is selective, there would be few rows for each Key value. As a result, we can access the same table data by visiting fewer indexes as well as table blocks, thereby reducing the number of I/Os  and hence improving performance.  For example, if an index has 100 distinct values in the key column(s), only 1% of the rows in a table have the same value for the indexed key. Since this index is quite selective, it is highly likely that the optimizer uses this index while searching for a key value. 

    Moreover, the requirement for buffers in the buffer cache would be quite less, leading to lesser waits for various buffer cache related wait events. The higher the selectivity of an index, the more likely it is that the optimizer chooses to access the table using an index rather than performing a full table scan.

    Hence, selectivity of an index determines its effectiveness in optimizing performance. The higher the selectivity, the fewer rows are returned by the index scan and the faster the query engine can reduce the size of the result set. It is desirable to have indexes with fairly high selectivity to avoid performance issues.

    It is worth mentioning here that

    • Oracle implicitly creates B-tree indexes on the columns having unique and primary key integrity constraints.
    • A B-tree index does not store NULL entries.
    • As a general guideline, B-tree indexes should be created on the columns that are often queried for less than 15% of the table's rows. 

    Now, I will demonstrate that the optimizer prefers to

    • Perform a Full Table Scan rather than use a highly unselective index.
    • Access a table via index if the index is selective.

    Demonstration

    • Create hr.sales table, which is a copy of sh.sales having data ordered by CHANNEL_ID, CUST_ID. Note that the column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.

    SYS>drop table hr.sales purge;

        create table hr.sales as select * from sh.sales

        order by channel_id, cust_id;

     

    SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,

        count(distinct(cust_id))/count(*)cust_id_selectivity,

        count(distinct channel_id)distinct_channel_id,

        count(distinct(channel_id))/count(*)channel_id_selectivity 

         from hr.sales;

     

      NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY

    ---------- ---------------- ------------------- ------------------- ----------------------

        918843             7059          .007682488                   4             4.3533E-06

     

    • Gather optimizer statistics for the hr.sales table

    SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns  size 1', Cascade => true, no_invalidate => false);

    • Find out four distinct values of CHANNEL_ID

    SYS>select distinct channel_id

        from hr.sales;

     

    CHANNEL_ID

    ----------

             2

             4

             3

             9

    • Let us create an index on column CHANNEL_ID, which is highly unselective since there are only four distinct values in the column

    SYS>Create index hr.channel_id_idx on hr.sales(channel_id);

    Now, if we search for a CHANNEL_ID, it can be seen that the optimizer assumes data to be uniformly distributed and estimates that 25% of the rows (918K/4 = 229K) to be returned and hence prefers   to perform a full table scan. Note that the statistics shown are from the second execution to eliminate parsing and other costs.

    SYS>select s.channel_id, s.cust_id, s.prod_id

        from hr.sales s

        where s.channel_id = 3;

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 781590677

     

    ---------------------------------------------------------------------------

    | Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

    ---------------------------------------------------------------------------

    |   0 | SELECT STATEMENT  |       |   229K|  2691K|  1231   (1)| 00:00:15 |

    |*  1 |  TABLE ACCESS FULL| SALES |   229K|  2691K|  1231   (1)| 00:00:15 |

    ---------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

     

       1 - filter("S"."CHANNEL_ID"=3)

    Hence, we saw that the optimizer prefers to perform a Full Table Scan rather than use a highly unselective index.

    Now let us repeat the same exercise on the highly selective CUST_ID column.

    • Create an index on column CHANNEL_ID, which is highly selective, as there are 7049 distinct values in the column.

    SYS>Create index hr.cust_id_idx on hr.sales(cust_id);

    • Now, if we search for a CUST_ID, it can be seen that the optimizer, assuming the data to be uniformly distributed, estimates that only 130 rows (= 918843/7049 ) out of a total of 918843 will be returned and hence prefers to access the table via index.

    SYS>select s.channel_id, s.cust_id, s.prod_id

        from hr.sales s

        where s.cust_id = 100750;

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 2826200558

     

    -------------------------------------------------------------------------------------------

    | Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |

    -------------------------------------------------------------------------------------------

    |   0 | SELECT STATEMENT            |             |   130 |  1560 |     6   (0)| 00:00:01 |

    |   1 |  TABLE ACCESS BY INDEX ROWID| SALES       |   130 |  1560 |     6   (0)| 00:00:01 |

    |*  2 |   INDEX RANGE SCAN          | CUST_ID_IDX |   130 |       |     3   (0)| 00:00:01 |

    -------------------------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

     

       2 - access("S"."CUST_ID"=100750)

    Hence, we saw that the optimizer prefers to access a table via index if the index is selective.

    It is desirable to create indexes having fairly high selectivity to avoid performance issues like

    • Poor query performance
    • Slower DML operations due to index maintenance overhead
    • Various wait events related to buffer cache

    Summary

    • The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key.
    • A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.
    • The main source of benefit achieved by using an index is due to the reduced number of I/Os required to satisfy a query.
    • The selectivity of an index determines its effectiveness in optimizing performance.
    • Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache, resulting in various wait events like db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc.
    • The optimizer prefers to perform a Full Table Scan rather than use a highly unselective index.
    • The optimizer prefers to access a table via index if the index is selective.
    • It is desirable to have indexes with fairly high selectivity to make them more useful.

     

    References:

    https://docs.oracle.com/cd/B10501_01/server.920/a96533/data_acc.htm

     

     

    Tags: Oracle

    Anju Garg

    Written by Anju Garg

    Anju Garg is an Oracle Ace with over 14 years of experience in IT Industry in various roles. Since 2010, she has been involved in teaching and has trained more than a hundred DBAs from across the world in various core DBA technologies like RAC, Data guard, Performance Tuning, SQL statement tuning, Database Administration etc.

    She is a regular speaker at Sangam and OTNYathra. She writes articles about Oracle and is one of the reviewers of the following book published by Pearson Oracle Problem-Solving and Troubleshooting Handbook

    She is certified for :

    • Oracle 9i Database Administration OCP
    • Oracle 11g Database Administration OCP
    • Oracle 11g Performance Tuning OCE
    • Oracle 11g R2 RAC OCE
    • Oracle 11g SQL Tuning OCE
    • Oracle 12c Database Administration OCP
    • Oracle Real Application Clusters 12c Certified Implementation Specialist

    She is passionate about learning and has keen interest in RAC and Performance Tuning. She shares her knowledge via her technical blog at http://oracleinaction.com/