Tuesday, January 21, 2020

Analysing Database Time with Active Session History for Statements with On-line Optimizer Statistics Gathering Operations

I have been looking into the performance of on-line statistics collection. When statistics are collected on-line there is an extra OPTIMIZER STATISTICS GATHERING operation in the execution plan. However, I have noticed that the presence or absence of this operation does not change the hash value of the plan. This has consequences for profiling DB time by execution plan line and then describing that line from a captured plan.

OPTIMIZER STATISTICS GATHERING Operation

From 12c, statistics are collected on-line during either a create-table-as-select operation or the initial direct-path insert into a new segment.  Below are two statements whose execution plans have the same plan hash value but actually differ; the differences are in areas that do not contribute to the plan hash value.
  • The first statement performs on-line statistics gathering, so its plan includes the OPTIMIZER STATISTICS GATHERING operation; the second does not.
  • Note also that the statements insert into different tables, and that does not alter the plan hash value either. However, if the data had been queried from different tables, that would have produced different plan hash values.
INSERT /*+APPEND PARALLEL(i)*/ INTO T2 i
SELECT /*+*/ * FROM t1 s

Plan hash value: 90348617
---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |       |       |   178K(100)|          |       |       |
|   1 |  LOAD AS SELECT                  | T2   |       |       |            |          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |   100M|  4005M|   178K  (1)| 00:00:07 |       |       |
|   3 |    PARTITION RANGE ALL           |      |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
|   4 |     TABLE ACCESS STORAGE FULL    | T1   |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
---------------------------------------------------------------------------------------------------------
INSERT /*+APPEND PARALLEL(i) NO_GATHER_OPTIMIZER_STATISTICS*/ into T3 i
SELECT /*+*/ * FROM t1 s

Plan hash value: 90348617
----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT            |      |       |       |   178K(100)|          |       |       |
|   1 |  LOAD AS SELECT             | T3   |       |       |            |          |       |       |
|   2 |   PARTITION RANGE ALL       |      |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
|   3 |    TABLE ACCESS STORAGE FULL| T1   |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
----------------------------------------------------------------------------------------------------
I find that it is often useful to profile database time from DBA_HIST_ACTIVE_SESS_HISTORY (or V$ACTIVE_SESSION_HISTORY) by line in the execution plan, in order to see how much time was consumed by each operation. I can then join the profile to DBA_HIST_SQL_PLAN (or V$SQL_PLAN) to see the operation for each line. So long as I also join these tables by SQL_ID, any answer I get will be correct, but I may not always get an answer.
column inst_id heading 'Inst|Id' format 99
column sql_plan_line_id heading 'SQL Plan|Line ID'
column sql_plan_hash_value heading 'SQL Plan|Hash Value'
column ash_secs heading 'ASH|Secs' format 999
break on sql_id skip 1
WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
,      SUM(10) ash_secs /* each AWR ASH sample represents 10s of DB time */
FROM   dba_hist_active_sess_history h
WHERE  h.sql_plan_hash_value = 90348617
AND    h.sql_id IN('g7awpb71jbup1','c2dy3rmnqp7d7','drrbxctf8t5nz','7140frhyu42t5')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN dba_hist_sql_plan p
  ON p.dbid = h.dbid
  and p.sql_id = h.sql_id
  AND p.plan_hash_value = h.sql_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/
If the plan was not captured into AWR or is no longer in the library cache, I don't get a description of the operations in the plan.
                SQL Plan   SQL Plan  ASH
SQL_ID           Line ID Hash Value Secs OPERATION
------------- ---------- ---------- ---- --------------------------------
0s4ruucw2wvsw          0   90348617    4 INSERT STATEMENT
                       1   90348617   77 LOAD AS SELECT
                       2   90348617   25 OPTIMIZER STATISTICS GATHERING
                       3   90348617   11 PARTITION RANGE
                       4   90348617   24 TABLE ACCESS

33x8fjppwh095          0   90348617    2 INSERT STATEMENT
                       1   90348617   89 LOAD AS SELECT
                       2   90348617   10 PARTITION RANGE
                       3   90348617   20 TABLE ACCESS

7140frhyu42t5          0   90348617    1
                       1   90348617   83
                       2   90348617    8
                       3   90348617   28

9vky53vhy5740          0   90348617    3
                       1   90348617   89
                       2   90348617   23
                       3   90348617    9
                       4   90348617   22
Normally, I would look for another SQL_ID that produced the same plan hash value. However, for an execution plan that only sometimes includes on-line statistics gathering, the operations may not match correctly because the OPTIMIZER STATISTICS GATHERING operation changes the line IDs.
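Once a suitable alternative SQL_ID has been identified, its captured plan can be displayed from AWR with the documented DBMS_XPLAN.DISPLAY_AWR function (sql_id, plan_hash_value, db_id, format). A minimal sketch, using one of the SQL_IDs from my example:

```sql
-- Display the plan captured in AWR for another SQL_ID that produced
-- plan hash value 90348617 (substitute your own SQL_ID and DBID).
SELECT * FROM table(dbms_xplan.display_awr('c2dy3rmnqp7d7', 90348617, NULL, 'BASIC'));
```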
WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
,      SUM(10) ash_secs
FROM   dba_hist_active_sess_history h
WHERE  h.sql_plan_hash_value = 90348617
AND    h.sql_id IN('g7awpb71jbup1','c2dy3rmnqp7d7','drrbxctf8t5nz','7140frhyu42t5')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
), p as (
SELECT DISTINCT dbid, plan_hash_value, id, operation
FROM   dba_hist_sql_plan
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN p
  ON p.dbid = h.dbid
  AND p.plan_hash_value = h.sql_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/
If I just join the ASH profile to a distinct list of line ID and operation for the same plan hash value, but matching any SQL_ID, I can get duplicate rows, starting at the line with the OPTIMIZER STATISTICS GATHERING operation, because I have different plans with the same plan hash value.
                           SQL Plan   SQL Plan  ASH
      DBID SQL_ID           Line ID Hash Value Secs OPERATION
---------- ------------- ---------- ---------- ---- ------------------------------
1278460406 7140frhyu42t5          1   90348617   80 LOAD AS SELECT
1278460406                        2   90348617   10 OPTIMIZER STATISTICS GATHERING
1278460406                        2   90348617   10 PARTITION RANGE
1278460406                        3   90348617   30 PARTITION RANGE
1278460406                        3   90348617   30 TABLE ACCESS
...
To mitigate this problem, in the following query I check that the maximum plan line ID for which I have ASH data matches the maximum line ID (i.e. the number of lines) in any alternative plan with the same plan hash value.
WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
,      SUM(10) ash_secs
FROM   dba_hist_active_sess_history h
WHERE  h.sql_plan_hash_value = 90348617
AND    h.sql_id IN('g7awpb71jbup1','c2dy3rmnqp7d7','drrbxctf8t5nz','7140frhyu42t5')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
), x as (
SELECT h.*
,      MAX(sql_plan_line_id) OVER (PARTITION BY h.dbid, h.sql_id) plan_lines
,      p1.operation
FROM   h
  LEFT OUTER JOIN dba_hist_sql_plan p1
  ON  p1.dbid = h.dbid
  AND p1.sql_id = h.sql_id
  AND p1.plan_hash_value = h.sql_plan_hash_value
  AND p1.id = h.sql_plan_line_id
)
SELECT x.*
, (SELECT p2.operation
   FROM   dba_hist_sql_plan p2
   WHERE  p2.dbid = x.dbid
   AND    p2.plan_hash_value = x.sql_plan_hash_value
   AND    p2.id = x.sql_plan_line_id
   AND    p2.sql_id IN(
                   SELECT p.sql_id
                   FROM   dba_hist_sql_plan p
                   WHERE  p.dbid = x.dbid
                   AND    p.plan_hash_value = x.sql_plan_hash_value
                   GROUP BY p.dbid, p.sql_id
                   HAVING MAX(p.id) = x.plan_lines)
   AND    rownum = 1) operation2
FROM   x
ORDER BY 1,2,3
/
Now, I get an operation description for every line ID (provided the same plan was captured for a different SQL_ID).
                           SQL Plan   SQL Plan  ASH
      DBID SQL_ID           Line ID Hash Value Secs PLAN_LINES OPERATION                        OPERATION2
---------- ------------- ---------- ---------- ---- ---------- -------------------------------- ------------------------------
1278460406 7140frhyu42t5          1   90348617   80          3                                  LOAD AS SELECT
1278460406                        2   90348617   10          3                                  PARTITION RANGE
1278460406                        3   90348617   30          3                                  TABLE ACCESS

1278460406 c2dy3rmnqp7d7          1   90348617  520          4 LOAD AS SELECT                   LOAD AS SELECT
1278460406                        2   90348617  100          4 OPTIMIZER STATISTICS GATHERING   OPTIMIZER STATISTICS GATHERING
1278460406                        3   90348617   80          4 PARTITION RANGE                  PARTITION RANGE
1278460406                        4   90348617  280          4 TABLE ACCESS                     TABLE ACCESS
1278460406                            90348617   30          4

1278460406 drrbxctf8t5nz          1   90348617  100          4                                  LOAD AS SELECT
1278460406                        2   90348617   10          4                                  OPTIMIZER STATISTICS GATHERING
1278460406                        3   90348617   10          4                                  PARTITION RANGE
1278460406                        4   90348617   50          4                                  TABLE ACCESS

1278460406 g7awpb71jbup1          1   90348617  540          3 LOAD AS SELECT                   LOAD AS SELECT
1278460406                        2   90348617   60          3 PARTITION RANGE                  PARTITION RANGE
1278460406                        3   90348617   90          3 TABLE ACCESS                     TABLE ACCESS
1278460406                            90348617   20          3
However, this approach, while better, is still not perfect. There may not be sufficient DB time on the last line of the execution plan for it ever to be sampled, in which case I may not choose a valid alternative plan.

Autonomous & Cloud Databases

Automatic on-line statistics gathering is becoming more common.
  • In the Autonomous Data Warehouse, Oracle has set _optimizer_gather_stats_on_load_all=TRUE, so statistics are collected on every direct-path insert. 
  • From 19c, on Engineered Systems (both in the cloud and on-premises), Real-Time statistics are collected during conventional DML (on inserts, updates and some deletes), also using the OPTIMIZER STATISTICS GATHERING operation. Again, the presence or absence of this operation does not affect the execution plan hash value.
SQL_ID  f0fsghg088k3q, child number 0
-------------------------------------
INSERT INTO t2 SELECT * FROM t1

Plan hash value: 589593414
---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |       |       |  1879 (100)|          |       |       |
|   1 |  LOAD TABLE CONVENTIONAL         | T2   |       |       |            |          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |  1000K|    40M|  1879   (1)| 00:00:01 |       |       |
|   3 |    PARTITION RANGE ALL           |      |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
|   4 |     TABLE ACCESS STORAGE FULL    | T1   |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
---------------------------------------------------------------------------------------------------------
SQL_ID  360pwsfmdkxf4, child number 0
-------------------------------------
INSERT /*+NO_GATHER_OPTIMIZER_STATISTICS*/ INTO t3 SELECT * FROM t1

Plan hash value: 589593414
----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT            |      |       |       |  1879 (100)|          |       |       |
|   1 |  LOAD TABLE CONVENTIONAL    | T3   |       |       |            |          |       |       |
|   2 |   PARTITION RANGE ALL       |      |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
|   3 |    TABLE ACCESS STORAGE FULL| T1   |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
----------------------------------------------------------------------------------------------------

Wednesday, January 15, 2020

On-Line Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference

I have come across a quirk whereby a table statistics preference that specifies METHOD_OPT for some named columns disables on-line statistics gathering.  At the very least, this behaviour is not documented.  I have reproduced it in Oracle 12.1.0.2 and 19.3.

Demonstration 

I will create two identical tables, but on the first table, I will specify a table statistic preference to collect a histogram on column C.
set serveroutput on verify on autotrace off
CREATE TABLE t1(a number, b varchar2(1000), c number);
CREATE TABLE t2(a number, b varchar2(1000), c number);
exec dbms_stats.set_table_prefs(user,'t1','METHOD_OPT','FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 C');
To make the test rerunnable, I will truncate each table, delete any statistics (because truncate does not delete statistics) and then populate the table again in direct-path mode.
TRUNCATE TABLE t1;
EXEC dbms_stats.delete_table_stats(user,'T1');
INSERT /*+APPEND*/ INTO t1 
SELECT ROWNUM a, TO_CHAR(TO_DATE(rownum,'J'),'Jsp') b, CEIL(SQRT(rownum)) c 
FROM dual CONNECT BY level <= 1e5;

TRUNCATE TABLE t2;
EXEC dbms_stats.delete_table_stats(user,'T2');
INSERT /*+APPEND*/ INTO t2
SELECT ROWNUM a, TO_CHAR(TO_DATE(rownum,'J'),'Jsp') b, CEIL(SQRT(rownum)) c 
FROM dual CONNECT BY level <= 1e5;
COMMIT;
I expect to get statistics on both tables.
alter session set nls_date_Format = 'hh24:mi:ss dd/mm/yy';
column table_name format a10
column column_name format a11
SELECT table_name, num_rows, last_analyzed FROM user_tables WHERE table_name LIKE 'T_' ORDER BY 1;
SELECT table_name, column_name, num_distinct, histogram, num_buckets FROM user_tab_columns WHERE table_name LIKE 'T_' ORDER BY 1,2;
But I only get table and column statistics on T2, the one without the statistics preference.
TABLE_NAME   NUM_ROWS LAST_ANALYZED
---------- ---------- -----------------
T1
T2             100000 10:08:30 15/01/20

Table Column
Name  Name   NUM_DISTINCT HISTOGRAM       NUM_BUCKETS
----- ------ ------------ --------------- -----------
T1    A                   NONE
T1    B                   NONE
T1    C                   NONE
T2    A            100000 NONE                      1
T2    B             98928 NONE                      1
T2    C               317 NONE                      1
It appears that I don't get statistics on T1 because I have specified a METHOD_OPT table statistics preference that names specific columns. The preference doesn't have to specify creating a histogram; it might equally prevent one from being created.
For example, this preference does not disable on-line statistics collection.
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR ALL COLUMNS SIZE 1');
But these preferences do disable on-line statistics collection.
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR COLUMNS SIZE 1 B C');
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR COLUMNS SIZE 1 A B C');
I have not found any other statistics preferences (for other DBMS_STATS parameters) that cause this behaviour.

Conclusion 

Table preferences are recommended as a method of controlling statistics collection declaratively and consistently, without having to hard-code DBMS_STATS parameters into every script that gathers statistics ad-hoc. A table statistics preference ensures that every time statistics are collected on a particular table, they are collected consistently, albeit perhaps in a way that differs from the default.
However, take the example of an ETL process loading data into a data warehouse. If you rely on on-line statistics gathering to collect table statistics as a part of a data load process, you must now be careful not to disable statistics collection during the load with METHOD_OPT statistics preferences.
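As a precaution, the effective preference can be checked before the load with the documented DBMS_STATS.GET_PREFS function. A minimal sketch, using table T1 from my demonstration:

```sql
-- Report the METHOD_OPT preference in force for a table before a bulk load.
-- A column-specific value here would disable on-line statistics gathering.
SELECT dbms_stats.get_prefs('METHOD_OPT', user, 'T1') method_opt FROM dual;
```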

Autonomous & 19c Addendum

(Updated 21st January 2020) In the Oracle Autonomous Data Warehouse, in order to make statistics collection as self-managing as possible, Oracle sets two undocumented parameters.
  • _optimizer_gather_stats_on_load_hist=TRUE - so that histograms are created on every column when on-line statistics are gathered
  • _optimizer_gather_stats_on_load_all=TRUE - so that on-line statistics are collected for every direct-path insert, not just the first one.
Creating a column specific METHOD_OPT statistics preference disables this behaviour.
From 19c, on Engineered systems, Oracle collects Real-Time statistics.  The METHOD_OPT preference does not disable this behaviour, but you only get Real-Time statistics if you have conventional statistics in the first place.  If you don't get statistics after a bulk load, then you won't get Real-Time statistics until conventional statistics have been explicitly gathered, for example, by the maintenance job.
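Real-Time statistics can be distinguished in the data dictionary by the NOTES column on the statistics views. A hedged sketch (in 19c the value STATS_ON_CONVENTIONAL_DML marks statistics gathered during conventional DML):

```sql
-- Identify statistics gathered in real time during conventional DML.
SELECT table_name, num_rows, notes
FROM   user_tab_statistics
WHERE  notes = 'STATS_ON_CONVENTIONAL_DML';
```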

(Update 22nd January 2020) This has been filed as bug 30787109.

Monday, January 06, 2020

Partial Indexing

This is the second of two blog posts that discuss sparse and partial indexing.

Problem Statement

(This is the same problem statement as for sparse indexing)
It is not an uncommon requirement to find rows that match a rare value in a column with a small number of distinct values.  So, the distribution of values is skewed.  A typical example is a status column where an application processes newer rows that are a relatively small proportion of the table because over time the majority of rows have been processed and are at the final status.
An index is effective at finding the rare values, but it is usually more efficient to scan the table for the common values.  A histogram would almost certainly be required on such a column.  However, if you build an index on the column, you have to index all the rows.  The index is, therefore, larger and requires more maintenance.  Could we not index just the rare values that we want the index to find?
  • Oracle does not index entirely null keys. If we could engineer that the common value was null, then the index would only contain the rare values.  This is sometimes called sparse indexing and was discussed in the previous blog.
  • Or we could separate the rare and common values into different index partitions, and build only the index partition(s) for the rare values.  This is called partial indexing and is discussed in this blog.
As usual, this is not a new subject and other people have written extensively on these subjects, and I will provide links.  However, I want to draw some of the issues together.
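The sparse-indexing alternative from the previous post can be sketched in a couple of lines. This assumes a table t with a status column whose common value is 'C'; the index and the query must use the same expression, so the common value never appears in the index:

```sql
-- Sparse index: NULLIF maps the common status to NULL, which is not indexed.
CREATE INDEX t_status_sparse ON t(NULLIF(status,'C'));
-- Queries must use the matching expression to pick up the index.
SELECT COUNT(other) FROM t WHERE NULLIF(status,'C') = 'R';
```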

Partition Table and Locally Partitioned Partial Index 

I could partition the table on the status column. Here, I have used list partitioning, because the common status sorts between the two rare statuses, so I only need two partitions, not three. From Oracle 12.1, I can specify INDEXING ON or OFF on the table and on individual partitions, so that later I can build partial local indexes only on some partitions.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000)
,CONSTRAINT t_pk PRIMARY KEY(key)
) INDEXING OFF 
PARTITION BY LIST (status)
(PARTITION t_status_rare   VALUES ('R','A') INDEXING ON
,PARTITION t_status_common VALUES (DEFAULT) 
) ENABLE ROW MOVEMENT 
/
INSERT /*+APPEND*/ INTO t --(key, status)
SELECT rownum
, CASE WHEN rownum<=1e6-1000 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END status
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
Here, Oracle eliminates the common status partition and scans only the rare status partition (partition 1). Note that I don't even have an index at this point, so simply partitioning the table can be effective.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    58 |     4   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |    10 |   580 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |    10 |   580 |     4   (0)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='R')

However, now when the application updates the status from R (rare) to C (common), the row must be moved between partitions. It is necessary to enable row movement on the table, otherwise an error will be generated. There is also additional overhead in moving the row: it is effectively deleted from one partition and inserted into the other.
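For example, without ENABLE ROW MOVEMENT, a status change that crosses the partition boundary fails. A hypothetical illustration, using the key of one of the rare rows from my test data:

```sql
-- With ENABLE ROW MOVEMENT this update deletes the row from t_status_rare
-- and re-inserts it into t_status_common; without it, Oracle raises
-- ORA-14402: updating partition key column would cause a partition change.
UPDATE t SET status = 'C' WHERE key = 999991;
```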
In this test, I have increased the frequency of one of the rare statuses. Otherwise, the optimizer determines that it is cheaper just to scan the table partition than use the index!
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/
S   COUNT(*)
- ----------
R         10
A        990
C     999000
Note that I have already specified INDEXING OFF on the table and INDEXING ON on the rare statuses partition. Now I can just build a locally partitioned partial index.
CREATE INDEX t_status ON t(status) LOCAL INDEXING PARTIAL;
Note that only partition T_STATUS_RARE is physically built, and it contains only a single extent. Partition T_STATUS_COMMON exists but is unusable, and its segment has not been physically built; it contains no rows and no leaf blocks.
SELECT partition_name, status, num_rows, leaf_blocks
from user_ind_partitions where index_name = 'T_STATUS';

PARTITION_NAME       STATUS     NUM_ROWS LEAF_BLOCKS
-------------------- -------- ---------- -----------
T_STATUS_COMMON      UNUSABLE          0           0
T_STATUS_RARE        USABLE         1000           2

SELECT segment_name, partition_name, blocks
FROM user_segments WHERE segment_name = 'T_STATUS';

SEGMENT_NAME PARTITION_NAME           BLOCKS
------------ -------------------- ----------
T_STATUS     T_STATUS_RARE                 8

SELECT segment_name, partition_name, segment_type, extent_id, blocks
FROM user_extents WHERE segment_name = 'T_STATUS';

SEGMENT_NAME PARTITION_NAME       SEGMENT_TYPE        EXTENT_ID     BLOCKS
------------ -------------------- ------------------ ---------- ----------
T_STATUS     T_STATUS_RARE        INDEX PARTITION             0          8
Scans for the common status value can only full scan the table partition because there is no index to use.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999000

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    55 |  2444   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    55 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |   998K|    52M|  2444   (1)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |   998K|    52M|  2444   (1)| 00:00:01 |     2 |     2 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='C')
To query the rare value Oracle does use the index on the rare values partition.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3051124889
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |          |     1 |    58 |     2   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                             |          |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE                     |          |    10 |   580 |     2   (0)| 00:00:01 |   KEY |   KEY |
|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T        |    10 |   580 |     2   (0)| 00:00:01 |     1 |     1 |
|*  4 |     INDEX RANGE SCAN                        | T_STATUS |    10 |       |     1   (0)| 00:00:01 |     1 |     1 |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("STATUS"='R')
However, it is not worth using the index for the slightly more common status A.  Here, Oracle full scans the table partition.
SELECT COUNT(other) FROM t WHERE status='A';

COUNT(OTHER)
------------
         990

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    58 |     4   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |   990 | 57420 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |   990 | 57420 |     4   (0)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='A')
Note that to use partial indexing I also had to partition the table.

Globally Partitioned Index with Zero-Sized Unusable Partitions 

Since Oracle 11.2.0.4, it has been possible to achieve the same effect without partitioning the table, thus avoiding the overhead of row movement.
This feature also worked in earlier versions, but Oracle built a single extent for each unusable partition.
Here, I will recreate a non-partitioned table.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000)
,CONSTRAINT t_pk PRIMARY KEY(key)
) 
/
INSERT /*+APPEND*/ INTO t
SELECT rownum
, CASE WHEN rownum<=1e6-1000 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END status
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;

exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');

SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/ 

S   COUNT(*)
- ----------
R         10
C     999000
A        990
It is not possible to create a globally list-partitioned index. Oracle simply does not support it.
CREATE INDEX t_status ON t(status)
GLOBAL PARTITION BY LIST (status)
(PARTITION t_status_rare   VALUES ('R','A')
,PARTITION t_status_common VALUES (DEFAULT)
);

GLOBAL PARTITION BY LIST (status)
                    *
ERROR at line 2:
ORA-14151: invalid table partitioning method
You can create a global range- or hash-partitioned index. However, it is unlikely that the column values will hash conveniently into separate partitions. In this example, I would still have needed to create 4 hash partitions and build 2 of them.
WITH x as (
SELECT status, COUNT(*) freq
FROM   t 
GROUP BY status
) SELECT x.*
, dbms_utility.get_hash_value(status,0,2)
, dbms_utility.get_hash_value(status,0,4)
FROM x
/ 

S       FREQ DBMS_UTILITY.GET_HASH_VALUE(STATUS,0,2) DBMS_UTILITY.GET_HASH_VALUE(STATUS,0,4)
- ---------- --------------------------------------- ---------------------------------------
R        990                                       1                                       1
C    1009000                                       0                                       0
A         10                                       0                                       2
It is easier to create a globally range-partitioned index, although in my example the common status lies between the two rare statuses, so I need to create three partitions. I will create the index unusable and then rebuild just the two rare-status partitions.
CREATE INDEX t_status ON t(status)
GLOBAL PARTITION BY RANGE (status)
(PARTITION t_status_rare1  VALUES LESS THAN ('C')
,PARTITION t_status_common VALUES LESS THAN ('D')
,PARTITION t_status_rare2  VALUES LESS THAN (MAXVALUE)
) UNUSABLE;
ALTER INDEX t_status REBUILD PARTITION t_status_rare1;
ALTER INDEX t_status REBUILD PARTITION t_status_rare2;
The index partition for the common status is unusable so Oracle can only full scan the table.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999000

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2445   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |   999K|    52M|  2445   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='C')
However, for the rare statuses, Oracle scans the index and looks up each of the table rows.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2558590380
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |     1 |    55 |     2   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                       |          |     1 |    55 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE              |          |    10 |   550 |     2   (0)| 00:00:01 |     3 |     3 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T        |    10 |   550 |     2   (0)| 00:00:01 |       |       |
|*  4 |     INDEX RANGE SCAN                  | T_STATUS |    10 |       |     1   (0)| 00:00:01 |     3 |     3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("STATUS"='R')

Conclusion 

The advantage of this global partitioning approach is that it does not require any change to application code, and it does not involve partitioning the table. However, you will have to remember not to rebuild the unusable partitions. Otherwise, they will have to be maintained as the table changes until you make them unusable again, and they will consume space that you will only get back by recreating the entire index.
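For completeness, if the common-status partition is accidentally rebuilt, it can be returned to the unusable state so that it is no longer maintained by DML on the table. A minimal sketch:

```sql
-- Mark the common-status partition unusable again.
ALTER INDEX t_status MODIFY PARTITION t_status_common UNUSABLE;
```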
NB: Partitioning is a licensed option that is only available on the Enterprise Edition of the database.

Sparse Indexing

This is the first of two blog posts that discuss sparse and partial indexing.

Problem Statement

It is not an uncommon requirement to find rows that match a rare value in a column with a small number of distinct values.  So, the distribution of values is skewed.  A typical example is a status column where an application processes newer rows that are a relatively small proportion of the table because over time the majority of rows have been processed and are at the final status.
An index is effective at finding the rare values, but it is usually more efficient to scan the table for the common values.  A histogram would almost certainly be required on such a column.  However, if you build an index on the column you have to index all the rows.  The index is, therefore, larger and requires more maintenance.  Could we not index just the rare values that we want the index to find?
  • Oracle does not index null values. If we could engineer the application so that the common value was null, then the index would only contain the rare values.  This is sometimes called sparse indexing and is discussed in this blog.
  • Or we could separate the rare and common values into different index partitions, and build only the index partition(s) for the rare values.  This is called partial indexing and is discussed in the next blog.
As usual, this is not a new subject and other people have written extensively on these subjects, and I will provide links.  However, I want to draw some of the issues together.

Sparse Indexing

The ideas discussed in this section are based on the principle that Oracle indexes do not include rows where the key values are null.  

Store Null Values in the Database?

One option is to engineer the application to use null as the common status value.  However, this means that the column in question has to be nullable, and you may require different logic because a comparison to null never evaluates to true.
CREATE TABLE t 
(key    NUMBER NOT NULL
,status VARCHAR2(1)
,other  VARCHAR2(1000) 
,CONSTRAINT t_pk PRIMARY KEY(key)
);

INSERT /*+APPEND*/ INTO t (key, status)
SELECT rownum
, CASE WHEN rownum<=1e6-42 THEN NULL /*common status*/
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END
FROM dual
CONNECT BY level <= 1e6;

CREATE INDEX t_status ON t (status);
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
I have created a test table with 1,000,000 rows. 10 rows have status R, and 32 rows have status A. The rest have status NULL. I have indexed the status column, and also created a histogram on it when I collected statistics.
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/ 

S   COUNT(*)
- ----------
      999958
R         10
A         32
I can see from the statistics that I have 1000000 rows in the primary key index, but only 42 rows in the status index because it only contains the not null values. Therefore, it is much smaller, having only a single leaf block, whereas the primary key index has 1875 leaf blocks.
SELECT index_name, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T';

INDEX_NAME   NUM_ROWS LEAF_BLOCKS
---------- ---------- -----------
T_PK          1000000        1875
T_STATUS           42           1
There are some problems with this approach.

Not All Index Columns are Null 
If any of the index columns are not null, then there is an entry in the index for the row, and there is no saving of space. It is not uncommon to add additional columns to such an index, either for additional filtering, or to avoid accessing the table by satisfying the query from the index.
CREATE INDEX t_status2 ON t (status,other);
SELECT index_name, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME   NUM_ROWS LEAF_BLOCKS
---------- ---------- -----------
T_PK          1000000        1875
T_STATUS           42           1
T_STATUS2     1000000        9081

Null Logic
If, for example, I want to find the rows that do not have status A, then a simple inequality does not find the null statuses because a comparison to null never evaluates to true.
SELECT status, COUNT(*)
FROM   t 
WHERE  status != 'A'
GROUP BY status
/

S   COUNT(*)
- ----------
R         10
Instead, I would have to explicitly code for the null values.
SELECT status, COUNT(*)
FROM   t 
WHERE  status != 'A' OR status IS NULL
GROUP BY status
/

S   COUNT(*)
- ----------
      999958
R         10
This additional complexity is certainly one reason why developers shy away from this approach in custom applications. It is almost impossible to retrofit it into an existing or packaged application. 
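As an aside, Oracle's LNNVL() function, which returns TRUE when its condition evaluates to FALSE or unknown, offers another way to catch the null statuses without an explicit IS NULL predicate. A sketch against the same table:

```sql
-- LNNVL(status = 'A') is TRUE where the condition is FALSE (status 'R')
-- or unknown (status is null), so this returns the same rows as the
-- explicit "status != 'A' OR status IS NULL" predicate.
SELECT status, COUNT(*)
FROM   t
WHERE  LNNVL(status = 'A')
GROUP BY status;
```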

Function-Based Indexes 

It is possible to build an index on a function that evaluates to null for the common values. This time my test table again has 1,000,000 rows, but the status column is now not nullable.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000) 
,CONSTRAINT t_pk PRIMARY KEY(key)
)
/
INSERT /*+APPEND*/ INTO t
SELECT rownum
, CASE WHEN rownum<=1e6-42 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
10 rows have status R, and 32 rows have status A. The rest have status C.
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/
S   COUNT(*)
- ----------
R         10
C     999958
A         32
I will build a simple index on status, and a second index on a function of status that decodes the common status C back to NULL.
CREATE INDEX t_status ON t (status);
CREATE INDEX t_status_fn ON t (DECODE(status,'C',NULL,status));
As before with the null column, the function-based index has only a single leaf block, while the other indexes are much larger because they contain all 1,000,000 rows.
SELECT index_name, index_type, num_rows, leaf_blocks 
from user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME   INDEX_TYPE                    NUM_ROWS LEAF_BLOCKS
------------ --------------------------- ---------- -----------
T_PK         NORMAL                         1000000        1875
T_STATUS     NORMAL                         1000000        1812
T_STATUS_FN  FUNCTION-BASED NORMAL               42           1
If I query the table for the common status, Oracle quite reasonably full scans the table.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999958

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2446   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |   999K|    52M|  2446   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='C')
If I query for the rare status value, it will use the normal index to look that up.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 1997248105
-------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |          |     1 |    55 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |          |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T        |    10 |   550 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS |    10 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("STATUS"='R')
Now I will make that index invisible, and the optimizer can only choose to full scan the table. It cannot use the function-based index because the query does not match the function.
ALTER INDEX t_status INVISIBLE;
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2445   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |    10 |   550 |  2445   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='R')
Instead, I must change the query to reference the function in the function-based index, and then the optimizer chooses the function-based index, even if I make the normal index visible again. Note that the function is shown in the access operation in the predicate section.
ALTER INDEX t_status VISIBLE;
SELECT COUNT(other) FROM t WHERE DECODE(status,'C',null,status)='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2511618215
----------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |             |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |             |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T           |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_FN |    21 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(DECODE("STATUS",'C',NULL,"STATUS")='R')

Invisible Virtual Columns 

Function-based indexes are implemented using an invisible virtual column. You can even reference that virtual column in a query. However, the name of the column is system generated, so you may not want to include it in your application.
SELECT * FROM user_stat_extensions WHERE table_name = 'T';

TABLE_NAME EXTENSION_NAME  EXTENSION                                CREATO DRO
---------- --------------- ---------------------------------------- ------ ---
T          SYS_NC00004$    (DECODE("STATUS",'C',NULL,"STATUS"))     SYSTEM NO

SELECT SYS_NC00004$, COUNT(*) FROM t group by SYS_NC00004$;

S   COUNT(*)
- ----------
      999958
R         10
A         32
Instead, you could create a virtual column and then index it. The resulting index is still function-based because it references the function inside the virtual column. From Oracle 12c it is also possible to make a column invisible. I would recommend doing so in case you have any insert statements without explicit column lists; otherwise, you might get ORA-00947: not enough values.
ALTER TABLE t ADD virtual_status VARCHAR2(1) INVISIBLE 
GENERATED ALWAYS AS (DECODE(status,'C',null,status));
CREATE INDEX t_status_virtual ON t (virtual_status);

SELECT index_name, index_type, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME       INDEX_TYPE                    NUM_ROWS LEAF_BLOCKS
---------------- --------------------------- ---------- -----------
T_PK             NORMAL                         1000000        1875
T_STATUS         NORMAL                         1000000        1812
T_STATUS_VIRTUAL FUNCTION-BASED NORMAL               42           1
The only difference between this and the previous function-based index example is that now you can control the name of the virtual column, and you can easily reference it in the application. 
If you have only ever referenced the virtual column in the application, and never the function, then it is also easy to change the function, although you would have to rebuild the index.
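For example, to also treat a hypothetical second common status 'X' as null, the change could look like the sketch below. The index must be dropped first, because Oracle will not allow the expression of an indexed virtual column to be modified.

```sql
-- Drop the dependent index, change the virtual column expression,
-- then recreate the index. Status 'X' is hypothetical here.
DROP INDEX t_status_virtual;
ALTER TABLE t MODIFY (virtual_status AS (
  CASE WHEN status IN ('C','X') THEN NULL ELSE status END));
CREATE INDEX t_status_virtual ON t (virtual_status);
```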
SELECT COUNT(other) FROM t WHERE virtual_status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3855131553
---------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                  |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T                |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_VIRTUAL |    21 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("VIRTUAL_STATUS"='R')
If you have already created function-based indexes and referenced the function in the application, you can replace them with an index on a named virtual column, and the index will still be used.
SELECT COUNT(other) FROM t WHERE DECODE(status,'C',null,status)='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3855131553
---------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                  |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T                |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_VIRTUAL |    21 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T"."VIRTUAL_STATUS"='R')

Conclusion 

A function-based index, preferably on an explicitly named and created virtual column, will permit you to build an index on just the rare values in a column. Making the virtual column invisible will prevent errors during insert statements without explicit column lists. However, you will still need to alter the application SQL to reference either the virtual column or the function that generates it.