Index selectivity & queries with multiple equality conditions

    Nov 5, 2018 10:35:17 AM by Anju Garg

    Performance of a query having multiple equality type predicates improves as it switches from using the index on an unselective column to a selective one.

    It is very common to issue queries with multiple equality predicates, such as

    Where col1 = <> and col2 = <> and …

    It is highly likely that the columns being referenced in the predicates have different selectivities; i.e, whereas some (highly unselective) columns have very few distinct values, others (more selective) have a large number of distinct values. Some columns with the highest selectivity may have a unique constraint defined so that each row has a distinct value.

    In such a scenario, if we have the option to create single column B-tree index, which index will perform better: the index on a column with more unique data (more selective) or more duplicate data (less selective)?

    In this article I will explore above scenario by means of a simple query having two predicates referencing columns that have contrasting selectivities.

    Overview

    I will create a table hr.sales which is copy of sh.sales having data ordered by columns channel_id, cust_id. The column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.

    We will issue a query having two equality predicates; e.g.,

    Where CHANNEL_ID = < >   And CUST_ID = < >

    Subsequently we will examine performance of the query if we have

    • An (unselective) index on CHANNEL_ID only
    • An (selective) index on CUST_ID only

    Demonstration

    • Create a table sales table which is a copy of sh.sales having data ordered by channel_id, cust_id. Verify that the column channel_id has four distinct values, whereas the column cust_id has 7059 distinct values. This means that an index on channel_id will be less selective than that on cust_id.

    SYS>drop table hr.sales purge;

        create table hr.sales as select * from sh.sales

        order by channel_id, cust_id;

     

    SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,

        count(distinct(cust_id))/count(*)cust_id_selectivity,

        count(distinct channel_id)distinct_channel_id,

        count(distinct(channel_id))/count(*)channel_id_selectivity 

         from hr.sales;

     

      NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY

    ---------- ---------------- ------------------- ------------------- ----------------------

        918843             7059          .007682488                   4             4.3533E-06

    The data in the column channel_id is skewed and there are few rows for channel_id = 9 as compared to the other values; i.e., the index on channel_id is selective with respect to certain values, whereas it is unselective with respect to others.

    SYS>select distinct channel_id , count(*)

        from hr.sales

        group by channel_id order by channel_id;  

        

    CHANNEL_ID   COUNT(*)

    ---------- ----------

             2     258025

             3     540328

             4     118416

             9       2074

    • Let us create a histogram on channel_id to inform the optimizer about the skew.

    SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columns CHANNEL_ID size 4, cust_id size auto',Cascade => true, no_invalidate => false);

    SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM

         FROM   dba_TAB_COL_STATISTICS

         WHERE  owner = 'HR' and TABLE_NAME='SALES';

     

    TABLE_NAME      COLUMN_NAME                    NUM_DISTINCT HISTOGRAM

    --------------- ------------------------------ ------------ ---------------

    SALES           PROD_ID                                  72 NONE

    SALES           CUST_ID                                7059 NONE

    SALES           TIME_ID                                1460 NONE

    SALES           CHANNEL_ID                                4 FREQUENCY

    SALES           PROMO_ID                                  4 NONE

    SALES           QUANTITY_SOLD                             1 NONE

    SALES           AMOUNT_SOLD                            3586 NONE

     

    7 rows selected.

    • Let us create an index on column channel_id, which is quite unselective, as there are only four distinct values in the column.

    SYS>Create index hr.channel_id_idx on hr.sales(channel_id);

    • Let us find out the number of rows for
    • four values of channel_id
    • cust_id = 100750

    SYS>select distinct channel_id , count(*)

        from hr.sales

        group by channel_id order by channel_id;  

       

    CHANNEL_ID   COUNT(*)

    ---------- ----------

             2     258025

             3     540328

             4     118416

             9       2074

     

    SYS>select count(*) from hr.sales where cust_id = 100750;

     

      COUNT(*)

    ----------

             2

    Now, I will perform a lookup for CHANNEL_ID = 9 and CUST_ID = 100750 to check the index usage and the cost of execution. Note that the statistics shown are from the second execution to eliminate parsing and other costs.

    SYS> set autotrace traceonly

     

        select s.channel_id, s.cust_id, s.prod_id

        from hr.sales s

         where s.channel_id = 9 and s.cust_id = 100750;

     

        set autot off

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 3317832025

     

    ----------------------------------------------------------------------------------------------

    | Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |

    ----------------------------------------------------------------------------------------------

    |   0 | SELECT STATEMENT            |                |     1 |    21 |    18   (0)| 00:00:01 |

    |*  1 |  TABLE ACCESS BY INDEX ROWID| SALES          |     1 |    21 |    18   (0)| 00:00:01 |

    |*  2 |   INDEX RANGE SCAN          | CHANNEL_ID_IDX |  2248 |       |     7   (0)| 00:00:01 |

    ----------------------------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

     

       1 - filter("S"."CUST_ID"=100750)

       2 - access("S"."CHANNEL_ID"=9)

     

     

    Statistics

    ----------------------------------------------------------

              0  recursive calls

              0  db block gets

             20  consistent gets

              0  physical reads

              0  redo size

            600  bytes sent via SQL*Net to client

            419  bytes received via SQL*Net from client

              2  SQL*Net roundtrips to/from client

              0  sorts (memory)

              0  sorts (disk)

              2  rows processed

    Observe that

    • The range scan of the index on CHANNEL_ID returns ROWIDs of the rows having channel_id = 9 (Line 2)
    • The rows in the SALES table corresponding to returned ROWIDs are searched for rows with CUST_ID = 100750 (Line 1)
    • The row(s) having channel_id = 9 and CUST_ID = 100750 are finally returned (Line 0)
    • As CHANNEL_ID_IDX is less selective, its scan (Line 2) returns a large number of rows (2074), requiring a large number of index blocks to be visited. The subsequent filter operation (Line 1) performed on 2074 rows requires visiting many table blocks. The total cost of the query is 20 consistent gets.

    It is worth mentioning here that the rows and cost shown in the execution plan are optimizer estimates. These might vary from the actual figures. For example, there are 2074 rows having channel_id = 9 but the optimizer estimates 2248 rows to be returned.

    • Let us now dropthe index on CHANNEL_ID and create an index on CUST_ID, which has higher selectivity

    SYS> Drop index hr.channel_id_idx;

        Create index hr.cust_id_idx on hr.sales(cust_id);

    • Now, again perform a lookup for CHANNEL_ID = 9 and CUST_ID = 100750 and examine the execution plan and statistics

    SYS>set autot trace

        select s.channel_id, s.cust_id, s.prod_id

        from hr.sales s

         where s.channel_id = 9 and s.cust_id = 100750;

        set autot off

     

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 2826200558

     

    -------------------------------------------------------------------------------------------

    | Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |

    -------------------------------------------------------------------------------------------

    |   0 | SELECT STATEMENT            |             |     1 |    21 |     6   (0)| 00:00:01 |

    |*  1 |  TABLE ACCESS BY INDEX ROWID| SALES       |     1 |    21 |     6   (0)| 00:00:01 |

    |*  2 |   INDEX RANGE SCAN          | CUST_ID_IDX |   130 |       |     3   (0)| 00:00:01 |

    -------------------------------------------------------------------------------------------

     

    Predicate Information (identified by operation id):

    ---------------------------------------------------

     

       1 - filter("S"."CHANNEL_ID"=9)

       2 - access("S"."CUST_ID"=100750)

     

     

    Statistics

    ----------------------------------------------------------

              0  recursive calls

              0  db block gets

              6  consistent gets

              0  physical reads

              0  redo size

            600  bytes sent via SQL*Net to client

            419  bytes received via SQL*Net from client

              2  SQL*Net roundtrips to/from client

              0  sorts (memory)

              0  sorts (disk)

              2  rows processed

    Observe that

    • The range scan of the index on CUST_ID returns ROWIDs of  rows having CUST_ID =100750 (line 2)
    • The rows in the SALES table corresponding to returned ROWIDs are searched for rows with channel_id = 9 (line 1)
    • The row(s) having channel_id = 9 and CUST_ID = 100750 are finally returned (Line 0)
    • As CUST_ID_IDX is more selective, its scan (Line 2) returns a small number of rows (2) requiring a small number of index blocks to be visited. Subsequently, during the filter operation (Line 1) performed on the two rows, a maximum of two table blocks needs to be visited. This results in a much lower cost of only six consistent gets.

    Again note that the rows and cost shown in the execution plan, which are optimizer estimates, differ from the actual figures. There are 2 rows having CUST_ID =100750 but optimizer estimates 130 rows to be returned.

    Hence, we saw that in the case of a query having multiple equality predicates, using the index on the column with the higher selectivity (CUST_ID) causes

    • Fewer index blocks to be visited
    • Fewer rows to be returned as a result of index scan
    • Fewer table blocks to be visited
    • A filter operation to be performed on fewer rows

    Thus, use of a selective index reduces I/O and hence cost of the query. Performance of a query having multiple equality type predicates improves as it switches from using the index on an unselective column (CHANNEL_ID) to a selective one (CUST_ID).

    Hence, in the case of a query having multiple equality predicates, it is advisable to create an index on the more selective column referenced in the predicate.

    Summary

    • For a query having multiple equality type predicates
      • The performance improves as it switches from using the index on an unselective column to a selective one.
      • It is advisable to create an index on the more selective column referenced in the predicate.

    In the next article, I will explore the use of concatenated indexes to improve the performance of queries having multiple equality type predicates.

     

     

     

    Tags: Oracle

    Anju Garg

    Written by Anju Garg

    Anju Garg is an Oracle Ace with over 14 years of experience in IT Industry in various roles. Since 2010, she has been involved in teaching and has trained more than a hundred DBAs from across the world in various core DBA technologies like RAC, Data guard, Performance Tuning, SQL statement tuning, Database Administration etc.

    She is a regular speaker at Sangam and OTNYathra. She writes articles about Oracle and is one of the reviewers of the following book published by Pearson Oracle Problem-Solving and Troubleshooting Handbook

    She is certified for :

    • Oracle 9i Database Administration OCP
    • Oracle 11g Database Administration OCP
    • Oracle 11g Performance Tuning OCE
    • Oracle 11g R2 RAC OCE
    • Oracle 11g SQL Tuning OCE
    • Oracle 12c Database Administration OCP
    • Oracle Real Application Clusters 12c Certified Implementation Specialist

    She is passionate about learning and has keen interest in RAC and Performance Tuning. She shares her knowledge via her technical blog at http://oracleinaction.com/