Monday, January 06, 2020

Sparse Indexing

This is the first of two blog posts that discuss sparse and partial indexing.

Problem Statement

It is not an uncommon requirement to find rows that match a rare value in a column with a small number of distinct values.  So, the distribution of values is skewed.  A typical example is a status column where an application processes newer rows that are a relatively small proportion of the table because over time the majority of rows have been processed and are at the final status.
An index is effective at finding the rare values, but it is usually more efficient to scan the table for the common values.  A histogram is would almost certainly be required on such a column.  However, if you build an index on the column you have to index all the rows.  The index is, therefore, larger and requires more maintenance.  Could we not just index the rare values for which we want to use the index to find?
  • Oracle does not index null values. If we could engineer that the common value was null, and then the index would only contain the rare values.  This is sometimes called sparse indexing and is discussed in this blog.
  • Or we could separate the rare and common values into different index partitions, and build only the index partition(s) for the rare values.  This is called partial indexing and is discussed in the next blog.
As usual, this is not a new subject and other people have written extensively on these subjects, and I will provide links.  However, I want to draw some of the issues together.

Sparse Indexing

The ideas discussed in this section are based on the principle that Oracle indexes do not include rows where the key values are null.  

Store Null Values in the Database?

One option is to engineer the application to use null as the common status value.  However, this means that the column in question has to be nullable, and you may require different logic because the comparison to null is always false.
CREATE TABLE t 
(key    NUMBER NOT NULL
,status VARCHAR2(1)
,other  VARCHAR2(1000) 
,CONSTRAINT t_pk PRIMARY KEY(key)
);

INSERT /*+APPEND*/ INTO t 
SELECT rownum
, CASE WHEN rownum<=1e6-42 THEN NULL /*common status*/
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END CASE
FROM dual
CONNECT BY level <= 1e6;

CREATE INDEX t_status ON t (status);
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
I have created a test table with 1000000 rows. 10 rows have status R, and 32 rows have status A. The rest have status NULL. I have indexed the status column, and also created a histogram on it when I collected statistics.
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/ 

S   COUNT(*)
- ----------
      999958
R         10
A         32
I can see from the statistics that I have 1000000 rows in the primary key index, but only 42 rows in the status index because it only contains the not null values. Therefore, it is much smaller, having only a single leaf block, whereas the primary key index has 1875 leaf blocks.
SELECT index_name, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T';

INDEX_NAME   NUM_ROWS LEAF_BLOCKS
---------- ---------- -----------
T_PK          1000000        1875
T_STATUS           42           1
There are some problems with this approach.

Not All Index Columns are Null 
If any of the index columns are not null, then there is an entry in the index for the row, and there is no saving of space. It is not uncommon to add additional columns to such an index, either for additional filtering, or to avoid accessing the table by satisfying the query from the index.
CREATE INDEX t_status2 ON t (status,other);
SELECT index_name, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME   NUM_ROWS LEAF_BLOCKS
---------- ---------- -----------
T_PK          1000000        1875
T_STATUS           42           1
T_STATUS2     1000000        9081

Null Logic
If, for example, I want to find the rows that do not have status A, then a simple inequality does not find the null statuses because comparison to null is always false.
SELECT status, COUNT(*)
FROM   t 
WHERE  status != 'A'
GROUP BY status
/

S   COUNT(*)
- ----------
R         10
Instead, I would have to explicitly code for the null values.
SELECT status, COUNT(*)
FROM   t 
WHERE  status != 'A' OR status IS NULL
GROUP BY status
/

S   COUNT(*)
- ----------
      999958
R         10
This additional complexity is certainly one reason why developers shy away from this approach in custom applications. It is almost impossible to retrofit it into an existing or packaged application. 

Function-Based Indexes 

It is possible to build an index on a function, such that that function to evaluates to null for the common values. This time my test table still has 1,000,000 rows. The status column is now not nullable.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000) 
,CONSTRAINT t_pk PRIMARY KEY(key)
)
/
INSERT /*+APPEND*/ INTO t
SELECT rownum
, CASE WHEN rownum<=1e6-42 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END CASE
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
10 rows have status A, and 32 rows have status O. The rest have status C.
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/
S   COUNT(*)
- ----------
R         10
C     999958
A         32
I will build a simple index on status, and a second index on a function of status that decodes the common status C back to NULL;
CREATE INDEX t_status ON t (status);
CREATE INDEX t_status_fn ON t (DECODE(status,'C',NULL,status));
As before, with the null column, the function-based index has only a single leaf block, the other indexes are much larger because they contain all 1 million rows.
SELECT index_name, index_type, num_rows, leaf_blocks 
from user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME   INDEX_TYPE                    NUM_ROWS LEAF_BLOCKS
------------ --------------------------- ---------- -----------
T_PK         NORMAL                         1000000        1875
T_STATUS     NORMAL                         1000000        1812
T_STATUS_FN  FUNCTION-BASED NORMAL               42           1
If I query the table for the common status, Oracle quite reasonably full scans the table.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999958

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2446   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |   999K|    52M|  2446   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='C')
If I query for the rare status value, it will use the normal index to look that up.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 1997248105
-------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |          |     1 |    55 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |          |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T        |    10 |   550 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS |    10 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("STATUS"='R')
Now I will make that index invisible, and the optimizer can only choose to full scan the table. It cannot use the function-based index because the query does not match the function.
ALTER INDEX t_status INVISIBLE;
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2445   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |    10 |   550 |  2445   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='R')
Instead, I must change the query to reference the function in the function-based index, and then the optimizer chooses the function-based index, even if I make the normal index visible again. Note that the function is shown in the access operation in the predicate section.
ALTER INDEX t_status VISIBLE;
SELECT COUNT(other) FROM t WHERE DECODE(status,'C',null,status)='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2511618215
----------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |             |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |             |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T           |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_FN |    21 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(DECODE("STATUS",'C',NULL,"STATUS")='R')

Invisible Virtual Columns 

Function-based indexes are implemented using an invisible virtual column. You can even reference that virtual column in a query. However, the name of the column is system generated, so you may not want to include it in your application.
SELECT * FROM user_stat_extensions WHERE table_name = 'T';

TABLE_NAME EXTENSION_NAME  EXTENSION                                CREATO DRO
---------- --------------- ---------------------------------------- ------ ---
T          SYS_NC00004$    (DECODE("STATUS",'C',NULL,"STATUS"))     SYSTEM NO

SELECT SYS_NC00004$, COUNT(*) FROM t group by SYS_NC00004$;

S   COUNT(*)
- ----------
      999958
R         10
A         32
Instead, you could create a virtual column and then index it. The resulting index is still function-based because it references the function inside the virtual column. From Oracle 12c it is also possible to make a column invisible. I would recommend doing so in case you have any insert statements without explicit column lists, otherwise, you might get ORA-00947: not enough values.
ALTER TABLE t ADD virtual_status VARCHAR2(1) INVISIBLE 
GENERATED ALWAYS AS (DECODE(status,'C',null,status));
CREATE INDEX t_status_virtual ON t (virtual_status);

SELECT index_name, index_type, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME       INDEX_TYPE                    NUM_ROWS LEAF_BLOCKS
---------------- --------------------------- ---------- -----------
T_PK             NORMAL                         1000000        1875
T_STATUS         NORMAL                         1000000        1812
T_STATUS_VIRTUAL FUNCTION-BASED NORMAL               42           1
The only difference between this and previous function-based index example is that now you can control the name of the virtual column, and you can easily reference it in the application. 
If you have only ever referenced the virtual column in the application, and never the function, then it is also easy to change the function.  Although you would have to rebuild the index.
SELECT COUNT(other) FROM t WHERE virtual_status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3855131553
---------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                  |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T                |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_VIRTUAL |    21 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("VIRTUAL_STATUS"='R')
If you have already created function-based indexes and referenced the function in the application you can replace them with an index on a named virtual column and the index will still be used.
SELECT COUNT(other) FROM t WHERE DECODE(status,'C',null,status)='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3855131553
---------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                  |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T                |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_VIRTUAL |    21 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T"."VIRTUAL_STATUS"='R')

Conclusion 

A function-based index, preferably on an explicitly named and created virtual column, will permit you to build an index on just the rare values in a column. Making the virtual column invisible will prevent errors during insert statements without explicit column lists. However, you will still need to alter the application SQL to reference either the virtual column or the function that generates it.

No comments :