Unselective indexes and skewed data distribution

    Oct 5, 2018 9:24:11 AM by Anju Garg

    The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key.  A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.

    The main source of benefit achieved by using an index is due to reduced number of I/Os required to satisfy a query. Selectivity of an index is one of the factors which determine the reduction in I/O that can be achieved by using it and, hence, its effectiveness in optimizing performance. The higher the selectivity, the fewer rows returned by the index scan and the faster the query engine can reduce the size of the result set. Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache resulting in various wait events such as db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc. Optimizer may prefer to perform a Full Table Scan rather than use a highly unselective index; in which case, we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. On the other hand, if the index is selective, optimizer prefers to access the table via index. Hence, it is desirable to have indexes with fairly high selectivity to make them more useful.

    In this article, I will explore the scenario in which there are few values in the indexed column but data distribution in the column is skewed; i.e., although on the whole the index seems to be unselective, it may still be selective with respect to sparsely occurring values.  

    I will demonstrate that

    • In the case of skewed data distribution,
      • If a histogram is not available on the indexed column, an unselective index may not be used at all
      • If a histogram is available on the indexed column, an unselective index may be
        • Used for sparsely occurring values
        • Ignored for frequently occurring values
    • It might be beneficial to create an index even on column(s) having few values if
      • The data distribution in the column(s) is skewed, and
      • A histogram is available on skewed column(s), and
      • The application queries sparsely occurring values most of the time.

    Demonstration

    • Create hr.sales table which is a copy of sh.sales having data ordered by CHANNEL_ID, CUST_ID. Note that column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.

    SYS>drop table hr.sales purge;

        create table hr.sales as select * from sh.sales

        order by channel_id, cust_id;

     

    SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,

        count(distinct(cust_id))/count(*)cust_id_selectivity,

        count(distinct channel_id)distinct_channel_id,

        count(distinct(channel_id))/count(*)channel_id_selectivity 

         from hr.sales;

     

      NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY

    ---------- ---------------- ------------------- ------------------- ----------------------

        918843             7059          .007682488                   4             4.3533E-06

     

    • Let us check how data is distributed across four values in the CHANNEL_ID column.

    SYS>select distinct channel_id , count(*)

        from hr.sales

        group by channel_id order by channel_id;  

       

    CHANNEL_ID   COUNT(*)

    ---------- ----------

             2     258025

             3     540328

             4     118416

             9       2074

    It can be seen that the data in CHANNEL_ID column is skewed and there are few rows for CHANNEL_ID = 9 as compared to the other values; i.e., if we create an index on CHANNEL_ID it will be selective with respect to certain values, whereas it is unselective with respect to others.

    • Gather optimizer statistics for the hr.sales table. Note that I have not gathered a histogram for column CHANNEL_ID.

    SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns  size 1', Cascade => true, no_invalidate => false);

    • Let us Create an index on column CHANNEL_ID, which is highly unselective since there are only four distinct values in the column

    SYS>Create index hr.channel_id_idx on hr.sales(channel_id);

    Now, if we search for a CHANNEL_ID = 3, it can be seen that the optimizer assumes data to be uniformly distributed and estimates 25% of the rows (918K/4 = 229K) to be returned and hence  prefers to perform a full table scan rather than use the unselective index. Note that the statistics shown are from the second execution to eliminate parsing and other costs.

    SYS>select s.channel_id, s.cust_id, s.prod_id

        from hr.sales s

        where s.channel_id = 3;

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 781590677

     

    ---------------------------------------------------------------------------

    | Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

    ---------------------------------------------------------------------------

    |   0 | SELECT STATEMENT  |       |   229K|  2691K|  1231   (1)| 00:00:15 |

    |*  1 |  TABLE ACCESS FULL| SALES |   229K|  2691K|  1231   (1)| 00:00:15 |

    ---------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

     

       1 - filter("S"."CHANNEL_ID"=3)

     

    If we search for CHANNEL_ID = 9, having few rows, and examine the execution plan and statistics, it can be seen that the optimizer again prefers to perform a full table scan even though it results in a large number of physical and logical I/Os.

    SYS> set autotrace traceonly

         select s.channel_id, s.cust_id, s.prod_id

         from hr.sales s

          where s.channel_id = 9;

         set autot off

     

    2074 rows selected.

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 781590677

     

    ---------------------------------------------------------------------------

    | Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

    ---------------------------------------------------------------------------

    |   0 | SELECT STATEMENT  |       |   229K|  2691K|  1231   (1)| 00:00:15 |

    |*  1 |  TABLE ACCESS FULL| SALES |   229K|  2691K|  1231   (1)| 00:00:15 |

    ---------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

       1 - filter("S"."CHANNEL_ID"=9)

     

    Statistics

    ----------------------------------------------------------

              0  recursive calls

              0  db block gets

           4574  consistent gets

           4431  physical reads

              0  redo size

          38498  bytes sent via SQL*Net to client

           1937  bytes received via SQL*Net from client

            140  SQL*Net roundtrips to/from client

              0  sorts (memory)

              0  sorts (disk)

           2074  rows processed

    If we force the use of index by means of hint, observe that the consistent gets have dropped considerably from 4574 to 296 and physical reads have been eliminated. It means that although it would be beneficial to use an index, the optimizer still thinks full table search is better since it does not know about the skew in the data (We did not create a histogram on CHANNEL_ID while collecting statistics for the hr.sales table).

    SYS> set autotrace traceonly

         select /*+ index (s) */ s.channel_id, s.cust_id, s.prod_id

         from hr.sales s

         where s.channel_id = 9;

         set autot off

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 3317832025

     

    ----------------------------------------------------------------------------------------------

    | Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |

    ----------------------------------------------------------------------------------------------

    |   0 | SELECT STATEMENT            |                |   229K|  2691K|  1562   (1)| 00:00:19 |

    |   1 |  TABLE ACCESS BY INDEX ROWID| SALES          |   229K|  2691K|  1562   (1)| 00:00:19 |

    |*  2 |   INDEX RANGE SCAN          | CHANNEL_ID_IDX |   229K|       |   453   (1)| 00:00:06 |

    ----------------------------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

       2 - access("S"."CHANNEL_ID"=9)

     

    Statistics

    ----------------------------------------------------------

              0  recursive calls

              0  db block gets

            296  consistent gets

              0  physical reads

              0  redo size

          38498  bytes sent via SQL*Net to client

           1937  bytes received via SQL*Net from client

            140  SQL*Net roundtrips to/from client

              0  sorts (memory)

              0  sorts (disk)

           2074  rows processed

    • Let us create a histogram on CHANNEL_ID to inform the optimizer about the skew.

    SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columns CHANNEL_ID size 4, cust_id size auto',Cascade => true, no_invalidate => false);

    SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM

         FROM   dba_TAB_COL_STATISTICS

         WHERE  owner = 'HR' and TABLE_NAME='SALES';

     

     

    TABLE_NAME      COLUMN_NAME                    NUM_DISTINCT HISTOGRAM

    --------------- ------------------------------ ------------ ---------------

    SALES           PROD_ID                                  72 NONE

    SALES           CUST_ID                                7059 NONE

    SALES           TIME_ID                                1460 NONE

    SALES           CHANNEL_ID                                4 FREQUENCY

    SALES           PROMO_ID                                  4 NONE

    SALES           QUANTITY_SOLD                             1 NONE

    SALES           AMOUNT_SOLD                            3586 NONE

     

    7 rows selected.

    Now that a histogram on CHANNEL_ID is in place, the optimizer is aware of the skew; i.e., it knows that there are too many rows for channel_id = 3 and relatively few rows for CHANNEL_ID = 9.

    Now, if we search for CHANNEL_ID = 3, the optimizer almost correctly estimates 542K rows to be returned and appropriately chooses to perform a full table scan.

    SYS> set autotrace traceonly

     

         select s.channel_id, s.cust_id, s.prod_id

         from hr.sales s

         where s.channel_id = 3;

     

         set autot off

     

    540328 rows selected.

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 781590677

     

    ---------------------------------------------------------------------------

    | Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

    ---------------------------------------------------------------------------

    |   0 | SELECT STATEMENT  |       |   542K|  6356K|  1231   (1)| 00:00:15 |

    |*  1 TABLE ACCESS FULL| SALES |   542K|  6356K|  1231   (1)| 00:00:15 |

    ---------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

       1 - filter("S"."CHANNEL_ID"=3)

     

    Statistics

    ----------------------------------------------------------

              0  recursive calls

              0  db block gets

          40277  consistent gets

           4431  physical reads

              0  redo size

        8897553  bytes sent via SQL*Net to client

         396650  bytes received via SQL*Net from client

          36023  SQL*Net roundtrips to/from client

              0  sorts (memory)

              0  sorts (disk)

         540328  rows processed

    On searching for CHANNEL_ID = 9, the optimizer estimates 1437 rows to be returned, which is quite close to the actual number of 2074, and obviously much better than its earlier estimate of 229k. As a result, it correctly chooses to perform an index scan with only 296 consistent gets. 

    SYS> set autotrace traceonly

     

        select s.channel_id, s.cust_id, s.prod_id

        from hr.sales s

         where s.channel_id = 9;

        set autot off

    2074 rows selected.

     

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 3317832025

     

    ----------------------------------------------------------------------------------------------

    | Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |

    ----------------------------------------------------------------------------------------------

    |   0 | SELECT STATEMENT            |                |  1437 | 17244 |    12   (0)| 00:00:01 |

    |   1 |  TABLE ACCESS BY INDEX ROWID| SALES          |  1437 | 17244 |    12   (0)| 00:00:01 |

    |*  2 |   INDEX RANGE SCAN          | CHANNEL_ID_IDX1437 |       |     5   (0)| 00:00:01 |

    ----------------------------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

     

       2 - access("S"."CHANNEL_ID"=9)

     

     

    Statistics

    ----------------------------------------------------------

              0  recursive calls

              0  db block gets

            296  consistent gets

              0  physical reads

              0  redo size

          38498  bytes sent via SQL*Net to client

           1937  bytes received via SQL*Net from client

            140  SQL*Net roundtrips to/from client

              0  sorts (memory)

              0  sorts (disk)

           2074  rows processed

     

    2074 rows selected.

    Thus, if data distribution is skewed in a column, a histogram should be created on the column to make the optimizer aware of the skew. With a histogram in place, the optimizer may prefer to use even an unselective index for sparsely occurring values, while it may ignore the index for values which occur frequently. This in turn means that what matters is the selectivity of the index with respect to the values being queried most. It might be beneficial to create even an unselective index on column(s) having few values if

    • The data distribution in the column(s) is skewed and
    • A histogram is available on skewed column(s) and
    • The application queries sparsely occurring values most of the time.

    Summary:

    • In case of skewed data distribution,
      • If a histogram is not available on an indexed column , an unselective index may not be used at all
      • If a histogram is available on an indexed column, an unselective index may be
        • Used for sparsely occurring values
        • Ignored for frequently occurring values
    • It might be beneficial to create index even on column(s) having few values if
    • The data distribution in the column(s) is skewed, and
    • A histogram is available on the skewed column(s). and
    • The application queries sparsely occurring values most of the time.

     

    Tags: Oracle

    Anju Garg

    Written by Anju Garg

    Anju Garg is an Oracle Ace with over 14 years of experience in IT Industry in various roles. Since 2010, she has been involved in teaching and has trained more than a hundred DBAs from across the world in various core DBA technologies like RAC, Data guard, Performance Tuning, SQL statement tuning, Database Administration etc. She is a regular speaker at Sangam and OTNYathra. She writes articles about Oracle and is one of the reviewers of the following book published by Pearson Oracle Problem-Solving and Troubleshooting Handbook She is certified for : Oracle 9i Database Administration OCP Oracle 11g Database Administration OCP Oracle 11g Performance Tuning OCE Oracle 11g R2 RAC OCE Oracle 11g SQL Tuning OCE Oracle 12c Database Administration OCP Oracle Real Application Clusters 12c Certified Implementation Specialist She is passionate about learning and has keen interest in RAC and Performance Tuning. She shares her knowledge via her technical blog at http://oracleinaction.com/