DOAG Conference 2017

Monday, January 02, 2017

Removing Unnecessary Indexes: 1. The Need for Extended Statistics

This is the first post in a series about unnecessary indexes and some of the challenges that they present 

I have always noted that it is remarkably easy to add indexes to systems, especially OLTP systems. However, removing them is generally more difficult.
The reward for removing an unused, or at least an unnecessary, index is that you no longer need to maintain it during DML operations, thus reducing I/O and not loading index blocks into the buffer cache. However, the risk is that performance will degrade somewhere, because an execution plan changes.
I will start by looking at a class of indexes that are easily identified as candidates for removal. If the columns in a non-unique index are the same as the leading columns in another index then it is reasonable to remove it because any query that uses it could also use the other index. We cannot drop unique indexes because they are need to enforce a unique or primary key constraint.
However, there is a catch. A multi-column index might be used by the optimizer to provide the number of distinct keys without being referenced in the execution plan. Here is a simple example.
CREATE TABLE t 
(a NUMBER
,b NUMBER
,c NUMBER
,d NUMBER
,CONSTRAINT t_pk PRIMARY KEY (a)
);

CREATE INDEX t_ab  ON t (a,b);
CREATE INDEX t_bc  ON t (b,c);
CREATE INDEX t_bcd ON t (b,c,d);
  • T_BC is a subset of T_BCD, so it is redundant and could be dropped.
  • (correction 7.1.2017) The primary key index T_PK is a subset of T_AB.  It is also redundant because the primary key constraint, on A alone, could be altered to use index T_AB.
ALTER TABLE t MODIFY PRIMARY KEY USING INDEX t_ab;
However, I am going to leave the default primary key index in place for the rest of this demonstration.  I will now populate the table, and gather statistics but not histograms.
INSERT /*+APPEND*/ INTO t
WITH x AS (
SELECT  rownum-1 n FROM DUAL connect by level <= 1E5)
SELECT  n
, MOD(n,100)
, ROUND(MOD(n,100),-1)
, dbms_random.value(1,100)
FROM   x;
EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1');
The table has 100,000 rows, column B has 100 distinct, and column C has 11 distinct values. But there is a close correlation between them. The value of C is the value of B rounded off to the nearest 10.
column column_name format a20
SELECT column_name, num_distinct, num_buckets, num_nulls, histogram 
FROM user_tab_col_statistics
WHERE table_name = 'T'
ORDER BY 1
/

COLUMN_NAME         NUM_DISTINCT NUM_BUCKETS  NUM_NULLS HISTOGRAM                                            
-------------------- ------------ ----------- ---------- ---------------
A                         100000           1          0 NONE
B                            100           1          0 NONE
C                             11           1          0 NONE
D                         100000           1          0 NONE
We can also see that index T_BC has 100 distinct keys.
SELECT index_name, uniqueness, visibility, distinct_keys, clustering_factor
FROM user_indexes WHERE table_name = 'T'
/

INDEX_NAME UNIQUENES VISIBILIT DISTINCT_KEYS CLUSTERING_FACTOR
---------- --------- --------- ------------- -----------------
T_PK       UNIQUE    VISIBLE          100000               525
T_AB       NONUNIQUE VISIBLE          100000               525
T_BC       NONUNIQUE VISIBLE             100             52500
T_BCD      NONUNIQUE VISIBLE          100000             99894
Let's look at the execution plans for some simple queries: This first query uses index T_BC because. It doesn't even need to look up the table because column A is not null.
SELECT COUNT(A) FROM t WHERE b = 42 AND c = 40          
    
Plan hash value: 2931606879
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |       |     3 (100)|          |      1 |00:00:00.01 |       5 |
|   1 |  SORT AGGREGATE   |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |       5 |
|*  2 |   INDEX RANGE SCAN| T_BC |      1 |   1000 |  6000 |     3   (0)| 00:00:01 |   1000 |00:00:00.01 |       5 |
--------------------------------------------------------------------------------------------------------------------
This query references column D, which is nullable, so it uses index T_BCD.
SELECT COUNT(D) FROM t WHERE b = 42 AND c = 40

Plan hash value: 2099617883
--------------------------------------------------------------------------------------------------------------------- 
| Id  | Operation         | Name  | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT  |       |      1 |        |       |     8 (100)|          |      1 |00:00:00.01 |       8 | 
|   1 |  SORT AGGREGATE   |       |      1 |      1 |    28 |            |          |      1 |00:00:00.01 |       8 | 
|*  2 |   INDEX RANGE SCAN| T_BCD |      1 |   1000 | 28000 |     8   (0)| 00:00:01 |   1000 |00:00:00.01 |       8 |
---------------------------------------------------------------------------------------------------------------------
Note that in both cases the optimizer estimated it would receive 1000 rows, and it did.

Invisible Indexes 

Rather than drop index T_BC, I am going to make it invisible, and repeat the test. I would always do this in a real production situation before dropping the index later. If removing the index does have undesirable consequences it can instantly be made visible again, whereas rebuilding it may take time and cause contention on a large and active table.
ALTER INDEX t_bc INVISIBLE;
The execution plan for the first query has changed to use T_BCD (which is perfectly reasonable), but the estimated number of rows is now just 91, not the 1000 that were actually found. Although it didn't make a difference to this execution plan, it is this misestimate which could cause plan regressions in more complex cases.
SELECT COUNT(A) FROM t WHERE b = 42 AND c = 40          

Plan hash value: 2099617883                                   
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name  | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |      1 |        |       |     3 (100)|          |      1 |00:00:00.01 |       8 |
|   1 |  SORT AGGREGATE   |       |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |       8 |
|*  2 |   INDEX RANGE SCAN| T_BCD |      1 |     91 |   546 |     3   (0)| 00:00:01 |   1000 |00:00:00.01 |       8 |
---------------------------------------------------------------------------------------------------------------------
The other query produces the same plan, with the same misestimate of the number of rows.
SELECT COUNT(D) FROM t WHERE b = 42 AND c = 40          
    
Plan hash value: 2099617883                                   
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name  | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |      1 |        |       |     3 (100)|          |      1 |00:00:00.02 |       8 |
|   1 |  SORT AGGREGATE   |       |      1 |      1 |    28 |            |          |      1 |00:00:00.02 |       8 |
|*  2 |   INDEX RANGE SCAN| T_BCD |      1 |     91 |  2548 |     3   (0)| 00:00:01 |   1000 |00:00:00.01 |       8 |
---------------------------------------------------------------------------------------------------------------------
The optimizer misestimates of the number of rows because it can no longer use index T_BC to tell it that there are only 100 distinct values on the table for the combination of columns B and C.

Extended Statistics

Now I will create extended statistics on the table
select dbms_stats.create_extended_stats(null,'T','(B,C)') from dual;
EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1');
Or I could just have done this. It comes to the same.
EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 1 (B,C)');
The extended statistics on the combination of columns B and C tells the optimizer that there are 100 distinct values for these columns. Note that the extended statistics only have a single bucket.
COLUMN_NAME                    NUM_DISTINCT NUM_BUCKETS  NUM_NULLS HISTOGRAM
------------------------------ ------------ ----------- ---------- -----------
A                                    100000           1          0 NONE
B                                       100           1          0 NONE
C                                        11           1          0 NONE
D                                    100000           1          0 NONE
SYS_STU3$58HONF9VK$$69P2OW4P4X          100           1          0 NONE

SELECT * FROM user_stat_extensions WHERE table_name = 'T'

TABLE_NAME EXTENSION_NAME                 EXTENSION            CREATO DRO
---------- ------------------------------ -------------------- ------ ---
T          SYS_STU3$58HONF9VK$$69P2OW4P4X ("B","C")            USER   YES
Now the optimizer reverts to correctly estimating the number of rows as 1000.
SELECT COUNT(A) FROM t WHERE b = 42 AND c = 40                                                                    
                                                                                                                        
Plan hash value: 2099617883                                                                                             
---------------------------------------------------------------------------------------------------------------------   
| Id  | Operation         | Name  | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |   
---------------------------------------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |       |      1 |        |       |     8 (100)|          |      1 |00:00:00.01 |       8 |   
|   1 |  SORT AGGREGATE   |       |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |       8 |   
|*  2 |   INDEX RANGE SCAN| T_BCD |      1 |   1000 |  6000 |     8   (0)| 00:00:01 |   1000 |00:00:00.01 |       8 |   
---------------------------------------------------------------------------------------------------------------------   

SELECT COUNT(D) FROM t WHERE b = 42 AND c = 40                                                                    
                                                                                                                        
Plan hash value: 2099617883 
---------------------------------------------------------------------------------------------------------------------   
| Id  | Operation         | Name  | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |   
---------------------------------------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |       |      1 |        |       |     8 (100)|          |      1 |00:00:00.02 |       8 |   
|   1 |  SORT AGGREGATE   |       |      1 |      1 |    28 |            |          |      1 |00:00:00.02 |       8 |   
|*  2 |   INDEX RANGE SCAN| T_BCD |      1 |   1000 | 28000 |     8   (0)| 00:00:01 |   1000 |00:00:00.01 |       8 |   
---------------------------------------------------------------------------------------------------------------------   

Conclusion

  • If you are going to drop a redundant multi-column index, at least replace it with extended statistics on that combination of columns.
  • If you are going to add columns to an existing index rather than create another, perhaps to satisfy a query without the need to visit the table, then again consider creating extended statistics on the original list of columns.
Acknowledgement: my thanks to Jonathan Lewis for his blog: Index Usage, and for prompting the correction.

No comments :