Toad World Blog

Unselective indexes and skewed data distribution

Oct 5, 2018 9:24:11 AM by Anju Garg

The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key.  A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.

The main source of benefit achieved by using an index is due to reduced number of I/Os required to satisfy a query. Selectivity of an index is one of the factors which determine the reduction in I/O that can be achieved by using it and, hence, its effectiveness in optimizing performance. The higher the selectivity, the fewer rows returned by the index scan and the faster the query engine can reduce the size of the result set. Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache resulting in various wait events such as db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc. Optimizer may prefer to perform a Full Table Scan rather than use a highly unselective index; in which case, we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. On the other hand, if the index is selective, optimizer prefers to access the table via index. Hence, it is desirable to have indexes with fairly high selectivity to make them more useful.

In this article, I will explore the scenario in which there are few values in the indexed column but data distribution in the column is skewed; i.e., although on the whole the index seems to be unselective, it may still be selective with respect to sparsely occurring values.  

I will demonstrate that

  • In the case of skewed data distribution,
    • If a histogram is not available on the indexed column, an unselective index may not be used at all
    • If a histogram is available on the indexed column, an unselective index may be
      • Used for sparsely occurring values
      • Ignored for frequently occurring values
  • It might be beneficial to create an index even on column(s) having few values if
    • The data distribution in the column(s) is skewed, and
    • A histogram is available on skewed column(s), and
    • The application queries sparsely occurring values most of the time.

Demonstration

  • Create hr.sales table which is a copy of sh.sales having data ordered by CHANNEL_ID, CUST_ID. Note that column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.

SYS>drop table hr.sales purge;

    create table hr.sales as select * from sh.sales

    order by channel_id, cust_id;

 

SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,

    count(distinct(cust_id))/count(*)cust_id_selectivity,

    count(distinct channel_id)distinct_channel_id,

    count(distinct(channel_id))/count(*)channel_id_selectivity 

     from hr.sales;

 

  NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY

---------- ---------------- ------------------- ------------------- ----------------------

    918843             7059          .007682488                   4             4.3533E-06

 

  • Let us check how data is distributed across four values in the CHANNEL_ID column.

SYS>select distinct channel_id , count(*)

    from hr.sales

    group by channel_id order by channel_id;  

   

CHANNEL_ID   COUNT(*)

---------- ----------

         2     258025

         3     540328

         4     118416

         9       2074

It can be seen that the data in CHANNEL_ID column is skewed and there are few rows for CHANNEL_ID = 9 as compared to the other values; i.e., if we create an index on CHANNEL_ID it will be selective with respect to certain values, whereas it is unselective with respect to others.

  • Gather optimizer statistics for the hr.sales table. Note that I have not gathered a histogram for column CHANNEL_ID.

SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns  size 1', Cascade => true, no_invalidate => false);

  • Let us Create an index on column CHANNEL_ID, which is highly unselective since there are only four distinct values in the column

SYS>Create index hr.channel_id_idx on hr.sales(channel_id);

Now, if we search for a CHANNEL_ID = 3, it can be seen that the optimizer assumes data to be uniformly distributed and estimates 25% of the rows (918K/4 = 229K) to be returned and hence  prefers to perform a full table scan rather than use the unselective index. Note that the statistics shown are from the second execution to eliminate parsing and other costs.

SYS>select s.channel_id, s.cust_id, s.prod_id

    from hr.sales s

    where s.channel_id = 3;

Execution Plan

----------------------------------------------------------

Plan hash value: 781590677

 

---------------------------------------------------------------------------

| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

---------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |       |   229K|  2691K|  1231   (1)| 00:00:15 |

|*  1 |  TABLE ACCESS FULL| SALES |   229K|  2691K|  1231   (1)| 00:00:15 |

---------------------------------------------------------------------------

 

Predicate Information (identified by operation id):

---------------------------------------------------

 

   1 - filter("S"."CHANNEL_ID"=3)

 

If we search for CHANNEL_ID = 9, having few rows, and examine the execution plan and statistics, it can be seen that the optimizer again prefers to perform a full table scan even though it results in a large number of physical and logical I/Os.

SYS> set autotrace traceonly

     select s.channel_id, s.cust_id, s.prod_id

     from hr.sales s

      where s.channel_id = 9;

     set autot off

 

2074 rows selected.

 

Execution Plan

----------------------------------------------------------

Plan hash value: 781590677

 

---------------------------------------------------------------------------

| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

---------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |       |   229K|  2691K|  1231   (1)| 00:00:15 |

|*  1 |  TABLE ACCESS FULL| SALES |   229K|  2691K|  1231   (1)| 00:00:15 |

---------------------------------------------------------------------------

 

Predicate Information (identified by operation id):

---------------------------------------------------

   1 - filter("S"."CHANNEL_ID"=9)

 

Statistics

----------------------------------------------------------

          0  recursive calls

          0  db block gets

       4574  consistent gets

       4431  physical reads

          0  redo size

      38498  bytes sent via SQL*Net to client

       1937  bytes received via SQL*Net from client

        140  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

       2074  rows processed

If we force the use of index by means of hint, observe that the consistent gets have dropped considerably from 4574 to 296 and physical reads have been eliminated. It means that although it would be beneficial to use an index, the optimizer still thinks full table search is better since it does not know about the skew in the data (We did not create a histogram on CHANNEL_ID while collecting statistics for the hr.sales table).

SYS> set autotrace traceonly

     select /*+ index (s) */ s.channel_id, s.cust_id, s.prod_id

     from hr.sales s

     where s.channel_id = 9;

     set autot off

 

Execution Plan

----------------------------------------------------------

Plan hash value: 3317832025

 

----------------------------------------------------------------------------------------------

| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |                |   229K|  2691K|  1562   (1)| 00:00:19 |

|   1 |  TABLE ACCESS BY INDEX ROWID| SALES          |   229K|  2691K|  1562   (1)| 00:00:19 |

|*  2 |   INDEX RANGE SCAN          | CHANNEL_ID_IDX |   229K|       |   453   (1)| 00:00:06 |

----------------------------------------------------------------------------------------------

 

Predicate Information (identified by operation id):

---------------------------------------------------

   2 - access("S"."CHANNEL_ID"=9)

 

Statistics

----------------------------------------------------------

          0  recursive calls

          0  db block gets

        296  consistent gets

          0  physical reads

          0  redo size

      38498  bytes sent via SQL*Net to client

       1937  bytes received via SQL*Net from client

        140  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

       2074  rows processed

  • Let us create a histogram on CHANNEL_ID to inform the optimizer about the skew.

SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columns CHANNEL_ID size 4, cust_id size auto',Cascade => true, no_invalidate => false);

SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM

     FROM   dba_TAB_COL_STATISTICS

     WHERE  owner = 'HR' and TABLE_NAME='SALES';

 

 

TABLE_NAME      COLUMN_NAME                    NUM_DISTINCT HISTOGRAM

--------------- ------------------------------ ------------ ---------------

SALES           PROD_ID                                  72 NONE

SALES           CUST_ID                                7059 NONE

SALES           TIME_ID                                1460 NONE

SALES           CHANNEL_ID                                4 FREQUENCY

SALES           PROMO_ID                                  4 NONE

SALES           QUANTITY_SOLD                             1 NONE

SALES           AMOUNT_SOLD                            3586 NONE

 

7 rows selected.

Now that a histogram on CHANNEL_ID is in place, the optimizer is aware of the skew; i.e., it knows that there are too many rows for channel_id = 3 and relatively few rows for CHANNEL_ID = 9.

Now, if we search for CHANNEL_ID = 3, the optimizer almost correctly estimates 542K rows to be returned and appropriately chooses to perform a full table scan.

SYS> set autotrace traceonly

 

     select s.channel_id, s.cust_id, s.prod_id

     from hr.sales s

     where s.channel_id = 3;

 

     set autot off

 

540328 rows selected.

 

Execution Plan

----------------------------------------------------------

Plan hash value: 781590677

 

---------------------------------------------------------------------------

| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |

---------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |       |   542K|  6356K|  1231   (1)| 00:00:15 |

|*  1 TABLE ACCESS FULL| SALES |   542K|  6356K|  1231   (1)| 00:00:15 |

---------------------------------------------------------------------------

 

Predicate Information (identified by operation id):

---------------------------------------------------

   1 - filter("S"."CHANNEL_ID"=3)

 

Statistics

----------------------------------------------------------

          0  recursive calls

          0  db block gets

      40277  consistent gets

       4431  physical reads

          0  redo size

    8897553  bytes sent via SQL*Net to client

     396650  bytes received via SQL*Net from client

      36023  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

     540328  rows processed

On searching for CHANNEL_ID = 9, the optimizer estimates 1437 rows to be returned, which is quite close to the actual number of 2074, and obviously much better than its earlier estimate of 229k. As a result, it correctly chooses to perform an index scan with only 296 consistent gets. 

SYS> set autotrace traceonly

 

    select s.channel_id, s.cust_id, s.prod_id

    from hr.sales s

     where s.channel_id = 9;

    set autot off

2074 rows selected.

 

 

Execution Plan

----------------------------------------------------------

Plan hash value: 3317832025

 

----------------------------------------------------------------------------------------------

| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |                |  1437 | 17244 |    12   (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| SALES          |  1437 | 17244 |    12   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | CHANNEL_ID_IDX1437 |       |     5   (0)| 00:00:01 |

----------------------------------------------------------------------------------------------

 

Predicate Information (identified by operation id):

---------------------------------------------------

 

   2 - access("S"."CHANNEL_ID"=9)

 

 

Statistics

----------------------------------------------------------

          0  recursive calls

          0  db block gets

        296  consistent gets

          0  physical reads

          0  redo size

      38498  bytes sent via SQL*Net to client

       1937  bytes received via SQL*Net from client

        140  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

       2074  rows processed

 

2074 rows selected.

Thus, if data distribution is skewed in a column, a histogram should be created on the column to make the optimizer aware of the skew. With a histogram in place, the optimizer may prefer to use even an unselective index for sparsely occurring values, while it may ignore the index for values which occur frequently. This in turn means that what matters is the selectivity of the index with respect to the values being queried most. It might be beneficial to create even an unselective index on column(s) having few values if

  • The data distribution in the column(s) is skewed and
  • A histogram is available on skewed column(s) and
  • The application queries sparsely occurring values most of the time.

Summary:

  • In case of skewed data distribution,
    • If a histogram is not available on an indexed column , an unselective index may not be used at all
    • If a histogram is available on an indexed column, an unselective index may be
      • Used for sparsely occurring values
      • Ignored for frequently occurring values
  • It might be beneficial to create index even on column(s) having few values if
  • The data distribution in the column(s) is skewed, and
  • A histogram is available on the skewed column(s). and
  • The application queries sparsely occurring values most of the time.

 

Tags: Oracle

Anju Garg

Written by Anju Garg

Anju Garg is an Oracle Ace with over 14 years of experience in IT Industry in various roles. Since 2010, she has been involved in teaching and has trained more than a hundred DBAs from across the world in various core DBA technologies like RAC, Data guard, Performance Tuning, SQL statement tuning, Database Administration etc.

She is a regular speaker at Sangam and OTNYathra. She writes articles about Oracle and is one of the reviewers of the following book published by Pearson Oracle Problem-Solving and Troubleshooting Handbook

She is certified for :

  • Oracle 9i Database Administration OCP
  • Oracle 11g Database Administration OCP
  • Oracle 11g Performance Tuning OCE
  • Oracle 11g R2 RAC OCE
  • Oracle 11g SQL Tuning OCE
  • Oracle 12c Database Administration OCP
  • Oracle Real Application Clusters 12c Certified Implementation Specialist

She is passionate about learning and has keen interest in RAC and Performance Tuning. She shares her knowledge via her technical blog at http://oracleinaction.com/