Performance of a query having multiple equality type predicates improves as it switches from using the index on an unselective column to a selective one.
It is very common to issue queries with multiple equality predicates, such as
Where col1 = <> and col2 = <> and …
It is highly likely that the columns being referenced in the predicates have different selectivities; i.e, whereas some (highly unselective) columns have very few distinct values, others (more selective) have a large number of distinct values. Some columns with the highest selectivity may have a unique constraint defined so that each row has a distinct value.
In such a scenario, if we have the option to create single column B-tree index, which index will perform better: the index on a column with more unique data (more selective) or more duplicate data (less selective)?
In this article I will explore above scenario by means of a simple query having two predicates referencing columns that have contrasting selectivities.
Overview
I will create a table hr.sales which is copy of sh.sales having data ordered by columns channel_id,cust_id. The column CHANNEL_ID has four distinct values, whereas column CUST_ID has 7059 distinct values. This means that an index on CHANNEL_ID will be less selective than that on CUST_ID.
We will issue a query having two equality predicates; e.g.,
Where CHANNEL_ID = < > And CUST_ID = < >
Subsequently we will examine performance of the query if we have
- An (unselective) index on CHANNEL_ID only
- An (selective) index on CUST_ID only
Demonstration
- Create a table sales table which is a copy of sh.sales having data ordered by channel_id, cust_id. Verify that the column channel_id has four distinct values, whereas the column cust_id has 7059 distinct values. This means that an index on channel_id will be less selective than that on cust_id.
SYS>drop table hr.sales purge;
create table hr.sales as select * from sh.sales
order by channel_id, cust_id;
SYS>select count(*) num_rows, count(distinct cust_id) distinct_cust_id,
count(distinct(cust_id))/count(*)cust_id_selectivity,
count(distinct channel_id)distinct_channel_id,
count(distinct(channel_id))/count(*)channel_id_selectivity
from hr.sales;
NUM_ROWS DISTINCT_CUST_ID CUST_ID_SELECTIVITY DISTINCT_CHANNEL_ID CHANNEL_ID_SELECTIVITY
———- —————- ——————- ——————- ———————-
918843 7059 .007682488 4 4.3533E-06
The data in the column channel_id is skewed and there are few rows for channel_id = 9 as compared to the other values; i.e., the index on channel_id is selective with respect to certain values, whereas it is unselective with respect to others.
SYS>select distinct channel_id , count(*)
from hr.sales
group by channel_id order by channel_id;
CHANNEL_ID COUNT(*)
———- ———-
2 258025
3 540328
4 118416
9 2074
- Let us create a histogram on channel_id to inform the optimizer about the skew.
SYS>exec dbms_stats.gather_table_stats('HR','SALES', method_opt => 'For columns CHANNEL_ID size 4, cust_id size auto’,Cascade => true, no_invalidate => false);
SYS>SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM dba_TAB_COL_STATISTICS
WHERE owner = 'HR' and TABLE_NAME='SALES';
TABLE_NAME COLUMN_NAME NUM_DISTINCT HISTOGRAM
————— —————————— ———— —————
SALES PROD_ID 72 NONE
SALES CUST_ID 7059 NONE
SALES TIME_ID 1460 NONE
SALES CHANNEL_ID 4 FREQUENCY
SALES PROMO_ID 4 NONE
SALES QUANTITY_SOLD 1 NONE
SALES AMOUNT_SOLD 3586 NONE
7 rows selected.
- Let us create an index on column channel_id, which is quite unselective, as there are only four distinct values in the column.
SYS>Create index hr.channel_id_idx on hr.sales(channel_id);
- Let us find out the number of rows for
- four values of channel_id
- cust_id = 100750
SYS>select distinct channel_id , count(*)
from hr.sales
group by channel_id order by channel_id;
CHANNEL_ID COUNT(*)
———- ———-
2 258025
3 540328
4 118416
9 2074
SYS>select count(*) from hr.sales where cust_id = 100750;
COUNT(*)
———-
2
Now, I will perform a lookup for CHANNEL_ID = 9 and CUST_ID = 100750 to check the index usage and the cost of execution. Note that the statistics shown are from the second execution to eliminate parsing and other costs.
SYS> set autotrace traceonly
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9 and s.cust_id = 100750;
set autot off
Execution Plan
———————————————————-
Plan hash value: 3317832025
———————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
———————————————————————————————-
| 0 | SELECT STATEMENT | | 1 | 21 | 18 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1 | 21 | 18 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | CHANNEL_ID_IDX | 2248 | | 7 (0)| 00:00:01 |
———————————————————————————————-
Predicate Information (identified by operation id):
—————————————————
1 – filter("S"."CUST_ID"=100750)
2 – access("S"."CHANNEL_ID"=9)
Statistics
———————————————————-
0 recursive calls
0 db block gets
20 consistent gets
0 physical reads
0 redo size
600 bytes sent via SQL*Net to client
419 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2 rows processed
Observe that
- The range scan of the index on CHANNEL_ID returns ROWIDs of the rows having channel_id = 9 (Line 2)
- The rows in the SALES table corresponding to returned ROWIDs are searched for rows with CUST_ID = 100750 (Line 1)
- The row(s) having channel_id = 9 and CUST_ID = 100750 are finally returned (Line 0)
- As CHANNEL_ID_IDX is less selective, its scan (Line 2) returns a large number of rows (2074), requiring a large number of index blocks to be visited. The subsequent filter operation (Line 1) performed on 2074 rows requires visiting many table blocks. The total cost of the query is 20 consistent gets.
It is worth mentioning here that the rows and cost shown in the execution plan are optimizer estimates. These might vary from the actual figures. For example, there are 2074 rows having channel_id = 9 but the optimizer estimates 2248 rows to be returned.
- Let us now dropthe index on CHANNEL_ID and create an index on CUST_ID, which has higher selectivity
SYS> Drop index hr.channel_id_idx;
Create index hr.cust_id_idx on hr.sales(cust_id);
- Now, again perform a lookup for CHANNEL_ID = 9 and CUST_ID = 100750 and examine the execution plan and statistics
SYS>set autot trace
select s.channel_id, s.cust_id, s.prod_id
from hr.sales s
where s.channel_id = 9 and s.cust_id = 100750;
set autot off
Execution Plan
———————————————————-
Plan hash value: 2826200558
——————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
——————————————————————————————-
| 0 | SELECT STATEMENT | | 1 | 21 | 6 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES | 1 | 21 | 6 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | CUST_ID_IDX | 130 | | 3 (0)| 00:00:01 |
——————————————————————————————-
Predicate Information (identified by operation id):
—————————————————
1 – filter("S"."CHANNEL_ID"=9)
2 – access("S"."CUST_ID"=100750)
Statistics
———————————————————-
0 recursive calls
0 db block gets
6 consistent gets
0 physical reads
0 redo size
600 bytes sent via SQL*Net to client
419 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2 rows processed
Observe that
- The range scan of the index on CUST_ID returns ROWIDs of rows having CUST_ID =100750 (line 2)
- The rows in the SALES table corresponding to returned ROWIDs are searched for rows with channel_id = 9 (line 1)
- The row(s) having channel_id = 9 and CUST_ID = 100750 are finally returned (Line 0)
- As CUST_ID_IDX is more selective, its scan (Line 2) returns a small number of rows (2) requiring a small number of index blocks to be visited. Subsequently, during the filter operation (Line 1) performed on the two rows, a maximum of two table blocks needs to be visited. This results in a much lower cost of only six consistent gets.
Again note that the rows and cost shown in the execution plan, which are optimizer estimates, differ from the actual figures. There are 2 rows having CUST_ID =100750 but optimizer estimates 130 rows to be returned.
Hence, we saw that in the case of a query having multiple equality predicates, using the index on the column with the higher selectivity (CUST_ID) causes
- Fewer index blocks to be visited
- Fewer rows to be returned as a result of index scan
- Fewer table blocks to be visited
- A filter operation to be performed on fewer rows
Thus, use of a selective index reduces I/O and hence cost of the query. Performance of a query having multiple equality type predicates improves as it switches from using the index on an unselective column (CHANNEL_ID) to a selective one (CUST_ID).
Hence, in the case of a query having multiple equality predicates, it is advisable to create an index on the more selective column referenced in the predicate.
Summary
- For a query having multiple equality type predicates
- The performance improves as it switches from using the index on an unselective column to a selective one.
- It is advisable to create an index on the more selective column referenced in the predicate.
In the next article, I will explore the use of concatenated indexes to improve the performance of queries having multiple equality type predicates.
Start the discussion at forums.toadworld.com