Unselective indexes and skewed data distribution

The selectivity of an index is defined as the fraction of rows in a table having the same value for the indexed key. A highly selective index has few rows for each index entry and an unselective index has many rows for each index entry.

The main source of benefit achieved by using an index is due to reduced number of I/Os required to satisfy a query. Selectivity of an index is one of the factors which determine the reduction in I/O that can be achieved by using it and, hence, its effectiveness in optimizing performance. The higher the selectivity, the fewer rows returned by the index scan and the faster the query engine can reduce the size of the result set. Use of unselective indexes can cause a large number of index and data blocks to occupy the buffer cache resulting in various wait events such as db file sequential read, buffer busy wait, Cache Buffer Chain latch, Latch: Cache Buffer LRU Chain, etc. Optimizer may prefer to perform a Full Table Scan rather than use a highly unselective index; in which case, we will have to bear the overhead of slower DML operations and maintenance of an unselective index even though it may not be utilized at all. On the other hand, if the index is selective, optimizer prefers to access the table via index. Hence, it is desirable to have indexes with fairly high selectivity to make them more useful.

In this article, I will explore the scenario in which there are few values in the indexed column but data distribution in the column is skewed; i.e., although on the whole the index seems to be unselective, it may still be selective with respect to sparsely occurring values.

I will demonstrate that

In the case of skewed data distribution,
- If a histogram is not available on the indexed column, an unselective index may not be used at all
- If a histogram is available on the indexed column, an unselective index may be
  - Used for sparsely occurring values
  - Ignored for frequently occurring values
It might be beneficial to create an index even on column(s) having few values if
- The data distribution in the column(s) is skewed, and
- A histogram is available on skewed column(s), and
- The application queries sparsely occurring values most of the time.

Demonstration

Create hr.sales table which is a copy of sh.sales having data ordered by CHANNEL_ID, CUST_ID. Note that column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.

SYS>drop table hr.sales purge;

create table hr.sales as select * from sh.sales

order by channel_id, cust_id;

SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,

count(distinct(cust_id))/count(*)cust_id_selectivity,

count(distinct channel_id)distinct_channel_id,

count(distinct(channel_id))/count(*)channel_id_selectivity

from hr.sales;

NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY

———- —————- ——————- ——————- ———————-

918843 7059 .007682488 4 4.3533E-06

Let us check how data is distributed across four values in the CHANNEL_ID column.

SYS>select distinct channel_id , count(*)

from hr.sales

group by channel_id order by channel_id;

CHANNEL_ID COUNT(*)

———- ———-

2 258025

3 540328

4 118416

9 2074

It can be seen that the data in CHANNEL_ID column is skewed and there are few rows for CHANNEL_ID = 9 as compared to the other values; i.e., if we create an index on CHANNEL_ID it will be selective with respect to certain values, whereas it is unselective with respect to others.

Gather optimizer statistics for the hr.sales table. Note that I have not gathered a histogram for column CHANNEL_ID.

SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For all columns size 1', Cascade => true, no_invalidate => false);

Let us Create an index on column CHANNEL_ID, which is highly unselective since there are only four distinct values in the column

SYS>Create index hr.channel_id_idx on hr.sales(channel_id);

Now, if we search for a CHANNEL_ID = 3, it can be seen that the optimizer assumes data to be uniformly distributed and estimates 25% of the rows (918K/4 = 229K) to be returned and hence prefers to perform a full table scan rather than use the unselective index. Note that the statistics shown are from the second execution to eliminate parsing and other costs.

SYS>select s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 3;

Execution Plan

———————————————————-

Plan hash value: 781590677

—————————————————————————

—————————————————————————

| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |

|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |

—————————————————————————

Predicate Information (identified by operation id):

—————————————————

1 – filter("S"."CHANNEL_ID"=3)

If we search for CHANNEL_ID = 9, having few rows, and examine the execution plan and statistics, it can be seen that the optimizer again prefers to perform a full table scan even though it results in a large number of physical and logical I/Os.

SYS> set autotrace traceonly

select s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 9;

set autot off

2074 rows selected.

Execution Plan

———————————————————-

Plan hash value: 781590677

—————————————————————————

—————————————————————————

| 0 | SELECT STATEMENT | | 229K| 2691K| 1231 (1)| 00:00:15 |

|* 1 | TABLE ACCESS FULL| SALES | 229K| 2691K| 1231 (1)| 00:00:15 |

—————————————————————————

Predicate Information (identified by operation id):

—————————————————

1 – filter("S"."CHANNEL_ID"=9)

Statistics

———————————————————-

0 recursive calls

0 db block gets

4574 consistent gets

4431 physical reads

0 redo size

38498 bytes sent via SQL*Net to client

1937 bytes received via SQL*Net from client

140 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

2074 rows processed

If we force the use of index by means of hint, observe that the consistent gets have dropped considerably from 4574 to 296 and physical reads have been eliminated. It means that although it would be beneficial to use an index, the optimizer still thinks full table search is better since it does not know about the skew in the data (We did not create a histogram on CHANNEL_ID while collecting statistics for the hr.sales table).

SYS> set autotrace traceonly

select/*+ index (s) */ s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 9;

set autot off

Execution Plan

———————————————————-

Plan hash value: 3317832025

———————————————————————————————-

———————————————————————————————-

| 0 | SELECT STATEMENT | | 229K| 2691K| 1562 (1)| 00:00:19 |

| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 229K| 2691K| 1562 (1)| 00:00:19 |

|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 229K| | 453 (1)| 00:00:06 |

———————————————————————————————-

Predicate Information (identified by operation id):

—————————————————

2 – access("S"."CHANNEL_ID"=9)

Statistics

———————————————————-

0 recursive calls

0 db block gets

296 consistent gets

0 physical reads

0 redo size

38498 bytes sent via SQL*Net to client

1937 bytes received via SQL*Net from client

140 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

2074 rows processed

Let us create a histogram on CHANNEL_ID to inform the optimizer about the skew.

SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columnsCHANNEL_ID size 4, cust_id size auto’,Cascade => true, no_invalidate => false);

SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM

FROM dba_TAB_COL_STATISTICS

WHERE owner = 'HR' and TABLE_NAME='SALES';

TABLE_NAME COLUMN_NAME NUM_DISTINCT HISTOGRAM

————— —————————— ———— —————

SALES PROD_ID 72 NONE

SALES CUST_ID 7059 NONE

SALES TIME_ID 1460 NONE

SALES CHANNEL_ID 4 FREQUENCY

SALES PROMO_ID 4 NONE

SALES QUANTITY_SOLD 1 NONE

SALES AMOUNT_SOLD 3586 NONE

7 rows selected.

Now that a histogram on CHANNEL_ID is in place, the optimizer is aware of the skew; i.e., it knows that there are too many rows for channel_id = 3 and relatively few rows for CHANNEL_ID = 9.

Now, if we search for CHANNEL_ID = 3, the optimizer almost correctly estimates 542K rows to be returned and appropriately chooses to perform a full table scan.

SYS> set autotrace traceonly

select s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 3;

set autot off

540328 rows selected.

Execution Plan

———————————————————-

Plan hash value: 781590677

—————————————————————————

—————————————————————————

| 0 | SELECT STATEMENT | | 542K| 6356K| 1231 (1)| 00:00:15 |

|* 1 | TABLE ACCESS FULL| SALES | 542K| 6356K| 1231 (1)| 00:00:15 |

—————————————————————————

Predicate Information (identified by operation id):

—————————————————

1 – filter("S"."CHANNEL_ID"=3)

Statistics

———————————————————-

0 recursive calls

0 db block gets

40277 consistent gets

4431 physical reads

0 redo size

8897553 bytes sent via SQL*Net to client

396650 bytes received via SQL*Net from client

36023 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

540328 rows processed

On searching for CHANNEL_ID = 9, the optimizer estimates 1437 rows to be returned, which is quite close to the actual number of 2074, and obviously much better than its earlier estimate of 229k. As a result, it correctly chooses to perform an index scan with only 296 consistent gets.

SYS> set autotrace traceonly

select s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 9;

set autot off

2074 rows selected.

Execution Plan

———————————————————-

Plan hash value: 3317832025

———————————————————————————————-

———————————————————————————————-

| 0 | SELECT STATEMENT | | 1437 | 17244 | 12 (0)| 00:00:01 |

| 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1437 | 17244 | 12 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 1437 | | 5 (0)| 00:00:01 |

———————————————————————————————-

Predicate Information (identified by operation id):

—————————————————

2 – access("S"."CHANNEL_ID"=9)

Statistics

———————————————————-

0 recursive calls

0 db block gets

296 consistent gets

0 physical reads

0 redo size

38498 bytes sent via SQL*Net to client

1937 bytes received via SQL*Net from client

140 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

2074 rows processed

2074 rows selected.

Thus, if data distribution is skewed in a column, a histogram should be created on the column to make the optimizer aware of the skew. With a histogram in place, the optimizer may prefer to use even an unselective index for sparsely occurring values, while it may ignore the index for values which occur frequently. This in turn means that what matters is the selectivity of the index with respect to the values being queried most. It might be beneficial to create even an unselective index on column(s) having few values if

The data distribution in the column(s) is skewed and
A histogram is available on skewed column(s) and
The application queries sparsely occurring values most of the time.

Summary:

In case of skewed data distribution,
- If a histogram is not available on an indexed column , an unselective index may not be used at all
- If a histogram is available on an indexed column, an unselective index may be
  - Used for sparsely occurring values
  - Ignored for frequently occurring values
It might be beneficial to create index even on column(s) having few values if
The data distribution in the column(s) is skewed, and
A histogram is available on the skewed column(s). and
The application queries sparsely occurring values most of the time.

Demonstration

Summary:

About the Author

Anju Garg

About the Author

Start the discussion at forums.toadworld.com