Index selectivity & queries with multiple equality conditions

Performance of a query having multiple equality type predicates improves as it switches from using the index on an unselective column to a selective one.

It is very common to issue queries with multiple equality predicates, such as

Where col1 = <> and col2 = <> and …

It is highly likely that the columns being referenced in the predicates have different selectivities; i.e, whereas some (highly unselective) columns have very few distinct values, others (more selective) have a large number of distinct values. Some columns with the highest selectivity may have a unique constraint defined so that each row has a distinct value.

In such a scenario, if we have the option to create single column B-tree index, which index will perform better: the index on a column with more unique data (more selective) or more duplicate data (less selective)?

In this article I will explore above scenario by means of a simple query having two predicates referencing columns that have contrasting selectivities.

Overview

I will create a table hr.sales which is copy of sh.sales having data ordered by columns channel_id,cust_id. The column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.

We will issue a query having two equality predicates; e.g.,

Where CHANNEL_ID = < > And CUST_ID = < >

Subsequently we will examine performance of the query if we have

An (unselective) index on CHANNEL_ID only
An (selective) index on CUST_ID only

Demonstration

Create a table sales table which is a copy of sh.sales having data ordered by channel_id, cust_id. Verify that the column channel_id has four distinct values, whereas the column cust_id has 7059 distinct values. This means that an index on channel_id will be less selective than that on cust_id.

SYS>drop table hr.sales purge;

create table hr.sales as select * from sh.sales

order by channel_id, cust_id;

SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,

count(distinct(cust_id))/count(*)cust_id_selectivity,

count(distinct channel_id)distinct_channel_id,

count(distinct(channel_id))/count(*)channel_id_selectivity

from hr.sales;

NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY

———- —————- ——————- ——————- ———————-

918843 7059 .007682488 4 4.3533E-06

The data in the column channel_id is skewed and there are few rows for channel_id = 9 as compared to the other values; i.e., the index on channel_id is selective with respect to certain values, whereas it is unselective with respect to others.

SYS>select distinct channel_id , count(*)

from hr.sales

group by channel_id order by channel_id;

CHANNEL_ID COUNT(*)

———- ———-

2 258025

3 540328

4 118416

9 2074

Let us create a histogram on channel_id to inform the optimizer about the skew.

SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columns CHANNEL_ID size 4, cust_id size auto’,Cascade => true, no_invalidate => false);

SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM

FROM dba_TAB_COL_STATISTICS

WHERE owner = 'HR' and TABLE_NAME='SALES';

TABLE_NAME COLUMN_NAME NUM_DISTINCT HISTOGRAM

————— —————————— ———— —————

SALES PROD_ID 72 NONE

SALES CUST_ID 7059 NONE

SALES TIME_ID 1460 NONE

SALES CHANNEL_ID 4 FREQUENCY

SALES PROMO_ID 4 NONE

SALES QUANTITY_SOLD 1 NONE

SALES AMOUNT_SOLD 3586 NONE

7 rows selected.

Let us create an index on column channel_id, which is quite unselective, as there are only four distinct values in the column.

SYS>Create index hr.channel_id_idx on hr.sales(channel_id);

Let us find out the number of rows for
four values of channel_id
cust_id = 100750

SYS>select distinct channel_id , count(*)

from hr.sales

group by channel_id order by channel_id;

CHANNEL_ID COUNT(*)

———- ———-

2 258025

3 540328

4 118416

9 2074

SYS>select count(*) from hr.sales where cust_id = 100750;

COUNT(*)

———-

Now, I will perform a lookup for CHANNEL_ID = 9 and CUST_ID = 100750 to check the index usage and the cost of execution. Note that the statistics shown are from the second execution to eliminate parsing and other costs.

SYS> set autotrace traceonly

select s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 9 and s.cust_id = 100750;

set autot off

Execution Plan

———————————————————-

Plan hash value: 3317832025

———————————————————————————————-

———————————————————————————————-

| 0 | SELECT STATEMENT | | 1 | 21 | 18 (0)| 00:00:01 |

|* 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1 | 21 | 18 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 2248 | | 7 (0)| 00:00:01 |

———————————————————————————————-

Predicate Information (identified by operation id):

—————————————————

1 – filter("S"."CUST_ID"=100750)

2 – access("S"."CHANNEL_ID"=9)

Statistics

———————————————————-

0 recursive calls

0 db block gets

20 consistent gets

0 physical reads

0 redo size

600 bytes sent via SQL*Net to client

419 bytes received via SQL*Net from client

2 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

2 rows processed

Observe that

The range scan of the index on CHANNEL_ID returns ROWIDs of the rows having channel_id = 9 (Line 2)
The rows in the SALES table corresponding to returned ROWIDs are searched for rows with CUST_ID = 100750 (Line 1)
The row(s) having channel_id = 9 and CUST_ID = 100750 are finally returned (Line 0)
As CHANNEL_ID_IDX is less selective, its scan (Line 2) returns a large number of rows (2074), requiring a large number of index blocks to be visited. The subsequent filter operation (Line 1) performed on 2074 rows requires visiting many table blocks. The total cost of the query is 20 consistent gets.

It is worth mentioning here that the rows and cost shown in the execution plan are optimizer estimates. These might vary from the actual figures. For example, there are 2074 rows having channel_id = 9 but the optimizer estimates 2248 rows to be returned.

Let us now dropthe index on CHANNEL_ID and create an index on CUST_ID, which has higher selectivity

SYS> Drop index hr.channel_id_idx;

Create index hr.cust_id_idx on hr.sales(cust_id);

Now, again perform a lookup for CHANNEL_ID = 9 and CUST_ID = 100750 and examine the execution plan and statistics

SYS>set autot trace

select s.channel_id, s.cust_id, s.prod_id

from hr.sales s

where s.channel_id = 9 and s.cust_id = 100750;

set autot off

Execution Plan

———————————————————-

Plan hash value: 2826200558

——————————————————————————————-

——————————————————————————————-

| 0 | SELECT STATEMENT | | 1 | 21 | 6 (0)| 00:00:01 |

|* 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1 | 21 | 6 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | CUST_ID_IDX | 130 | | 3 (0)| 00:00:01 |

——————————————————————————————-

Predicate Information (identified by operation id):

—————————————————

1 – filter("S"."CHANNEL_ID"=9)

2 – access("S"."CUST_ID"=100750)

Statistics

———————————————————-

0 recursive calls

0 db block gets

6 consistent gets

0 physical reads

0 redo size

600 bytes sent via SQL*Net to client

419 bytes received via SQL*Net from client

2 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

2 rows processed

Observe that

The range scan of the index on CUST_ID returns ROWIDs of rows having CUST_ID =100750 (line 2)
The rows in the SALES table corresponding to returned ROWIDs are searched for rows with channel_id = 9 (line 1)
The row(s) having channel_id = 9 and CUST_ID = 100750 are finally returned (Line 0)
As CUST_ID_IDX is more selective, its scan (Line 2) returns a small number of rows (2) requiring a small number of index blocks to be visited. Subsequently, during the filter operation (Line 1) performed on the two rows, a maximum of two table blocks needs to be visited. This results in a much lower cost of only six consistent gets.

Again note that the rows and cost shown in the execution plan, which are optimizer estimates, differ from the actual figures. There are 2 rows having CUST_ID =100750 but optimizer estimates 130 rows to be returned.

Hence, we saw that in the case of a query having multiple equality predicates, using the index on the column with the higher selectivity (CUST_ID) causes

Fewer index blocks to be visited
Fewer rows to be returned as a result of index scan
Fewer table blocks to be visited
A filter operation to be performed on fewer rows

Thus, use of a selective index reduces I/O and hence cost of the query. Performance of a query having multiple equality type predicates improves as it switches from using the index on an unselective column (CHANNEL_ID) to a selective one (CUST_ID).

Hence, in the case of a query having multiple equality predicates, it is advisable to create an index on the more selective column referenced in the predicate.

Summary

For a query having multiple equality type predicates
- The performance improves as it switches from using the index on an unselective column to a selective one.
- It is advisable to create an index on the more selective column referenced in the predicate.

In the next article, I will explore the use of concatenated indexes to improve the performance of queries having multiple equality type predicates.

Overview

Demonstration

Summary

About the Author

Anju Garg

About the Author

Start the discussion at forums.toadworld.com