Tuesday, January 21, 2020

Analysing Database Time with Active Session History for Statements with On-line Optimizer Statistics Gathering Operations

I have been looking into the performance of on-line statistics collection. When statistics are collected on-line there is an extra OPTIMIZER STATISTICS GATHERING operation in the execution plan. However, I have noticed that the presence or absence of this operation does not change the hash value of the plan. This has consequences for profiling DB time by execution plan line and then describing that line from a captured plan.

OPTIMIZER STATISTICS GATHERING Operation

From 12c, statistics are collected on-line during either a create-table-as-select operation or the initial direct-path insert into a new segment. Below, I have two different statements whose execution plans have the same plan hash value but actually differ. So, the differences are in areas that do not contribute to the plan hash value.
  • The first statement performs online statistics gathering, and so the plan includes the OPTIMIZER STATISTICS GATHERING operation, the second does not.
  • Note also that the statements insert into different tables, and that does not alter the plan hash value either. However, if the data was queried from different tables that would have produced a different plan hash value.
INSERT /*+APPEND PARALLEL(i)*/ into T2 i SELECT /*+*/ * FROM t1 s

Plan hash value: 90348617
---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |       |       |   178K(100)|          |       |       |
|   1 |  LOAD AS SELECT                  | T2   |       |       |            |          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |   100M|  4005M|   178K  (1)| 00:00:07 |       |       |
|   3 |    PARTITION RANGE ALL           |      |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
|   4 |     TABLE ACCESS STORAGE FULL    | T1   |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
---------------------------------------------------------------------------------------------------------
INSERT /*+APPEND PARALLEL(i) NO_GATHER_OPTIMIZER_STATISTICS*/ into T3 i
SELECT /*+*/ * FROM t1 s

Plan hash value: 90348617
----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT            |      |       |       |   178K(100)|          |       |       |
|   1 |  LOAD AS SELECT             | T3   |       |       |            |          |       |       |
|   2 |   PARTITION RANGE ALL       |      |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
|   3 |    TABLE ACCESS STORAGE FULL| T1   |   100M|  4005M|   178K  (1)| 00:00:07 |     1 |1048575|
----------------------------------------------------------------------------------------------------
I find that it is often useful to profile database time from DBA_HIST_ACTIVE_SESS_HISTORY (or v$active_session_history) by line in the execution plan, in order to see how much time was consumed by the different operations. I can then join the profile to DBA_HIST_SQL_PLAN (or v$sql_plan) to see the operation for each line. So long as I also join these tables by SQL_ID, the answer I get will be correct, but I may not always get an answer.
column inst_id heading 'Inst|Id' format 99
column sql_plan_line_id heading 'SQL Plan|Line ID'
column sql_plan_hash_value heading 'SQL Plan|Hash Value'
column ash_secs heading 'ASH|Secs' format 999
break on sql_id skip 1
with h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
,      SUM(10) ash_secs /* DBA_HIST ASH samples are 10 seconds apart */
FROM   dba_hist_active_sess_history h
WHERE  h.sql_plan_hash_value = 90348617
AND    h.sql_id IN('g7awpb71jbup1','c2dy3rmnqp7d7','drrbxctf8t5nz','7140frhyu42t5')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN dba_hist_sql_plan p
  ON p.dbid = h.dbid
  and p.sql_id = h.sql_id
  AND p.plan_hash_value = h.sql_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/
If the plan was not captured into AWR or is no longer in the library cache, I don't get a description of the operations in the plan.
                SQL Plan   SQL Plan  ASH
SQL_ID           Line ID Hash Value Secs OPERATION
------------- ---------- ---------- ---- --------------------------------
0s4ruucw2wvsw          0   90348617    4 INSERT STATEMENT
                       1   90348617   77 LOAD AS SELECT
                       2   90348617   25 OPTIMIZER STATISTICS GATHERING
                       3   90348617   11 PARTITION RANGE
                       4   90348617   24 TABLE ACCESS

33x8fjppwh095          0   90348617    2 INSERT STATEMENT
                       1   90348617   89 LOAD AS SELECT
                       2   90348617   10 PARTITION RANGE
                       3   90348617   20 TABLE ACCESS

7140frhyu42t5          0   90348617    1
                       1   90348617   83
                       2   90348617    8
                       3   90348617   28

9vky53vhy5740          0   90348617    3
                       1   90348617   89
                       2   90348617   23
                       3   90348617    9
                       4   90348617   22
Normally, I would look for another SQL_ID that produced the same plan hash value. However, for an execution plan that only sometimes includes on-line statistics gathering, the operations may not match correctly because the OPTIMIZER STATISTICS GATHERING operation changes the line IDs.
WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
,      SUM(10) ash_secs
FROM   dba_hist_active_sess_history h
WHERE  h.sql_plan_hash_value = 90348617
AND    h.sql_id IN('g7awpb71jbup1','c2dy3rmnqp7d7','drrbxctf8t5nz','7140frhyu42t5')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
), p as (
SELECT DISTINCT dbid, plan_hash_value, id, operation
from dba_hist_sql_plan
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN p
  ON p.dbid = h.dbid
  AND p.plan_hash_value = h.sql_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/
If I just join the ASH profile to a distinct list of line IDs and operations for the same plan hash value, matching any SQL_ID, I can get duplicate rows, starting at the line with the OPTIMIZER STATISTICS GATHERING operation, because I have different plans with the same plan hash value.
                           SQL Plan   SQL Plan  ASH
      DBID SQL_ID           Line ID Hash Value Secs OPERATION
---------- ------------- ---------- ---------- ---- ------------------------------
1278460406 7140frhyu42t5          1   90348617   80 LOAD AS SELECT
1278460406                        2   90348617   10 OPTIMIZER STATISTICS GATHERING
1278460406                        2   90348617   10 PARTITION RANGE
1278460406                        3   90348617   30 PARTITION RANGE
1278460406                        3   90348617   30 TABLE ACCESS
...
To mitigate this problem, in the following query I check that the maximum plan line ID for which I have ASH data matches the maximum line ID (i.e. the number of lines) in any alternative plan with the same hash value.
WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
,      SUM(10) ash_secs
FROM   dba_hist_active_sess_history h
WHERE  h.sql_plan_hash_value = 90348617
AND    h.sql_id IN('g7awpb71jbup1','c2dy3rmnqp7d7','drrbxctf8t5nz','7140frhyu42t5')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value
), x as (
SELECT h.*
,      MAX(sql_plan_line_id) OVER (PARTITION BY h.dbid, h.sql_id) plan_lines
,      p1.operation
FROM   h
  LEFT OUTER JOIN dba_hist_sql_plan p1
  ON  p1.dbid = h.dbid
  AND p1.sql_id = h.sql_id
  AND p1.plan_hash_value = h.sql_plan_hash_value
  AND p1.id = h.sql_plan_line_id
)
SELECT x.*
, (SELECT p2.operation
   FROM   dba_hist_sql_plan p2
   WHERE  p2.dbid = x.dbid
   AND    p2.plan_hash_value = x.sql_plan_hash_value
   AND    p2.id = x.sql_plan_line_id
   AND    p2.sql_id IN(
                   SELECT p.sql_id
                   FROM   dba_hist_sql_plan p
                   WHERE  p.dbid = x.dbid
                   AND    p.plan_hash_value = x.sql_plan_hash_value
                   GROUP BY p.dbid, p.sql_id
                   HAVING MAX(p.id) = x.plan_lines)
   AND    rownum = 1) operation2
FROM   x
ORDER BY 1,2,3
/
Now, I get an operation description for every line ID (provided the same plan was captured for a different SQL_ID).
                           SQL Plan   SQL Plan  ASH
      DBID SQL_ID           Line ID Hash Value Secs PLAN_LINES OPERATION                        OPERATION2
---------- ------------- ---------- ---------- ---- ---------- -------------------------------- ------------------------------
1278460406 7140frhyu42t5          1   90348617   80          3                                  LOAD AS SELECT
1278460406                        2   90348617   10          3                                  PARTITION RANGE
1278460406                        3   90348617   30          3                                  TABLE ACCESS

1278460406 c2dy3rmnqp7d7          1   90348617  520          4 LOAD AS SELECT                   LOAD AS SELECT
1278460406                        2   90348617  100          4 OPTIMIZER STATISTICS GATHERING   OPTIMIZER STATISTICS GATHERING
1278460406                        3   90348617   80          4 PARTITION RANGE                  PARTITION RANGE
1278460406                        4   90348617  280          4 TABLE ACCESS                     TABLE ACCESS
1278460406                            90348617   30          4

1278460406 drrbxctf8t5nz          1   90348617  100          4                                  LOAD AS SELECT
1278460406                        2   90348617   10          4                                  OPTIMIZER STATISTICS GATHERING
1278460406                        3   90348617   10          4                                  PARTITION RANGE
1278460406                        4   90348617   50          4                                  TABLE ACCESS

1278460406 g7awpb71jbup1          1   90348617  540          3 LOAD AS SELECT                   LOAD AS SELECT
1278460406                        2   90348617   60          3 PARTITION RANGE                  PARTITION RANGE
1278460406                        3   90348617   90          3 TABLE ACCESS                     TABLE ACCESS
1278460406                            90348617   20          3
However, this approach, while better, is still not perfect. There may not be sufficient DB time on the last line of the execution plan for it to be sampled, in which case I may not identify a valid alternative plan.
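Where the plan has been captured into AWR under another SQL_ID, it can also be described directly with DBMS_XPLAN. This is my addition, not from the original post; the SQL_ID used here is one of those from the examples above:

```sql
-- Display the AWR-captured plan for a given SQL_ID and plan hash value
SELECT * FROM TABLE(dbms_xplan.display_awr(
  sql_id          => 'c2dy3rmnqp7d7',
  plan_hash_value => 90348617));
```

This only works if at least one SQL_ID with that plan hash value was captured into AWR, which is exactly the situation the query above is designed to detect.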

Autonomous & Cloud Databases

Automatic on-line statistics gathering is becoming more common.
  • In the Autonomous Data Warehouse, Oracle has set _optimizer_gather_stats_on_load_all=TRUE, so statistics are collected on every direct-path insert. 
  • From 19c, on Engineered Systems (both in the cloud and on-premises), Real-Time statistics are collected during conventional DML (on inserts, updates and some deletes), also using the OPTIMIZER STATISTICS GATHERING operation. Again, the presence or absence of this operation does not affect the execution plan hash value.
SQL_ID  f0fsghg088k3q, child number 0
-------------------------------------
INSERT INTO t2 SELECT * FROM t1

Plan hash value: 589593414
---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |       |       |  1879 (100)|          |       |       |
|   1 |  LOAD TABLE CONVENTIONAL         | T2   |       |       |            |          |       |       |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |  1000K|    40M|  1879   (1)| 00:00:01 |       |       |
|   3 |    PARTITION RANGE ALL           |      |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
|   4 |     TABLE ACCESS STORAGE FULL    | T1   |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
---------------------------------------------------------------------------------------------------------
SQL_ID  360pwsfmdkxf4, child number 0
-------------------------------------
INSERT /*+NO_GATHER_OPTIMIZER_STATISTICS*/ INTO t3 SELECT * FROM t1

Plan hash value: 589593414
----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT            |      |       |       |  1879 (100)|          |       |       |
|   1 |  LOAD TABLE CONVENTIONAL    | T3   |       |       |            |          |       |       |
|   2 |   PARTITION RANGE ALL       |      |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
|   3 |    TABLE ACCESS STORAGE FULL| T1   |  1000K|    40M|  1879   (1)| 00:00:01 |     1 |1048575|
----------------------------------------------------------------------------------------------------

Wednesday, January 15, 2020

On-Line Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference

I have come across a quirk whereby a table statistics preference that specifies METHOD_OPT for particular named columns disables on-line statistics gathering. This behaviour is not documented. I have reproduced it in Oracle 12.1.0.2 and 19.3.

Demonstration 

I will create two identical tables, but on the first table I will specify a table statistics preference to collect a histogram on column C.
set serveroutput on verify on autotrace off
CREATE TABLE t1(a number, b varchar2(1000), c number);
CREATE TABLE t2(a number, b varchar2(1000), c number);
exec dbms_stats.set_table_prefs(user,'t1','METHOD_OPT','FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 C');
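To confirm the preference has been set, DBMS_STATS.GET_PREFS can be queried (a check I have added; it is not part of the original demonstration):

```sql
-- Returns the effective METHOD_OPT preference for table T1
SELECT dbms_stats.get_prefs('METHOD_OPT', user, 'T1') method_opt FROM dual;
```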
To make the test rerunnable, I will truncate each table, delete any statistics (because truncate does not delete statistics) and then populate the table again in direct-path mode.
TRUNCATE TABLE t1;
EXEC dbms_stats.delete_table_stats(user,'T1');
INSERT /*+APPEND*/ INTO t1 
SELECT ROWNUM a, TO_CHAR(TO_DATE(rownum,'J'),'Jsp') b, CEIL(SQRT(rownum)) c 
FROM dual CONNECT BY level <= 1e5;

TRUNCATE TABLE t2;
EXEC dbms_stats.delete_table_stats(user,'T2');
INSERT /*+APPEND*/ INTO t2
SELECT ROWNUM a, TO_CHAR(TO_DATE(rownum,'J'),'Jsp') b, CEIL(SQRT(rownum)) c 
FROM dual CONNECT BY level <= 1e5;
COMMIT;
I expect to get statistics on both tables.
alter session set nls_date_Format = 'hh24:mi:ss dd/mm/yy';
column table_name format a10
column column_name format a11
SELECT table_name, num_rows, last_analyzed FROM user_tables WHERE table_name LIKE 'T_' ORDER BY 1;
SELECT table_name, column_name, num_distinct, histogram, num_buckets FROM user_tab_columns WHERE table_name LIKE 'T_' ORDER BY 1,2;
But I only get table and column statistics on T2, the one without the statistics preference.
TABLE_NAME   NUM_ROWS LAST_ANALYZED
---------- ---------- -----------------
T1
T2             100000 10:08:30 15/01/20

Table Column
Name  Name   NUM_DISTINCT HISTOGRAM       NUM_BUCKETS
----- ------ ------------ --------------- -----------
T1    A                   NONE
T1    B                   NONE
T1    C                   NONE
T2    A            100000 NONE                      1
T2    B             98928 NONE                      1
T2    C               317 NONE                      1
It appears that I don't get statistics on T1 because I specified a METHOD_OPT table statistics preference that names specific columns. The preference doesn't have to create a histogram; it might equally prevent one from being created.
For example, this preference does not disable on-line statistics collection.
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR ALL COLUMNS SIZE 1');
But these preferences do disable on-line statistics collection.
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR COLUMNS SIZE 1 B C');
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR COLUMNS SIZE 1 A B C');
I have not found any other statistics preferences (for other DBMS_STATS parameters) that cause this behaviour.

Conclusion 

Table preferences are recommended as a method of controlling statistics collection declaratively and consistently. You don't have to specify parameters to DBMS_STATS in every script that collects statistics ad hoc. Table statistics preferences ensure that every time statistics are collected on a particular table, they are collected consistently, albeit perhaps differently from the default.
However, take the example of an ETL process loading data into a data warehouse. If you rely on on-line statistics gathering to collect table statistics as a part of a data load process, you must now be careful not to disable statistics collection during the load with METHOD_OPT statistics preferences.
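If you rely on on-line statistics gathering, it may therefore be worth scanning for column-specific METHOD_OPT preferences. This query is my addition; it simply looks for preference values that name specific columns:

```sql
-- Find tables with a column-specific METHOD_OPT statistics preference
SELECT owner, table_name, preference_value
FROM   dba_tab_stat_prefs
WHERE  preference_name = 'METHOD_OPT'
AND    UPPER(preference_value) LIKE '%FOR COLUMNS%';
```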

Autonomous & 19c Addendum

(Updated 21st January 2020) In the Oracle Autonomous Data Warehouse, in order to make statistics collection as self-managing as possible, Oracle sets two undocumented parameters.
  • _optimizer_gather_stats_on_load_hist=TRUE - so that histograms are created on every column when on-line statistics are gathered
  • _optimizer_gather_stats_on_load_all=TRUE - so that on-line statistics are collected for every direct-path insert, not just the first one.
Creating a column specific METHOD_OPT statistics preference disables this behaviour.
From 19c, on Engineered systems, Oracle collects Real-Time statistics.  The METHOD_OPT preference does not disable this behaviour, but you only get Real-Time statistics if you have conventional statistics in the first place.  If you don't get statistics after a bulk load, then you won't get Real-Time statistics until conventional statistics have been explicitly gathered, for example, by the maintenance job.

(Update 22nd January 2020) This has been filed as bug 30787109.

Monday, January 06, 2020

Partial Indexing

This is the second of two blog posts that discuss sparse and partial indexing.

Problem Statement

(This is the same problem statement as for sparse indexing.)
It is not an uncommon requirement to find rows that match a rare value in a column with a small number of distinct values.  So, the distribution of values is skewed.  A typical example is a status column where an application processes newer rows that are a relatively small proportion of the table because over time the majority of rows have been processed and are at the final status.
An index is effective at finding the rare values, but it is usually more efficient to scan the table for the common values. A histogram would almost certainly be required on such a column. However, if you build an index on the column, you have to index all the rows. The index is therefore larger and requires more maintenance. Could we not index only the rare values that we want the index to find?
  • Oracle does not index entirely null keys. If we could engineer that the common value was null, then the index would only contain the rare values. This is sometimes called sparse indexing and was discussed in the previous blog.
  • Or we could separate the rare and common values into different index partitions, and build only the index partition(s) for the rare values.  This is called partial indexing and is discussed in this blog.
As usual, this is not a new subject and other people have written extensively on these subjects, and I will provide links.  However, I want to draw some of the issues together.
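For completeness, the first approach can be sketched as a function-based index that maps the common value to NULL. This is a hypothetical illustration I have added (the previous post covers the technique properly), using the same table and statuses as the demonstration below:

```sql
-- Index only the rare statuses by nulling out the common value 'C'
CREATE INDEX t_status_sparse ON t(NULLIF(status,'C'));
-- Queries must use the same expression in order to use the index
SELECT COUNT(*) FROM t WHERE NULLIF(status,'C') = 'R';
```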

Partition Table and Locally Partitioned Partial Index 

I could partition the table on the status column. Here, I have used list partitioning: the two rare values can share a single list partition, so I only need two partitions, not three, even though the common status sorts between the two rare statuses. From Oracle 12.1, I can specify INDEXING ON or OFF on the table and on individual partitions, so that later I can build partial local indexes only on some partitions.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000)
,CONSTRAINT t_pk PRIMARY KEY(key)
) INDEXING OFF 
PARTITION BY LIST (status)
(PARTITION t_status_rare   VALUES ('R','A') INDEXING ON
,PARTITION t_status_common VALUES (DEFAULT) 
) ENABLE ROW MOVEMENT 
/
INSERT /*+APPEND*/ INTO t --(key, status)
SELECT rownum
, CASE WHEN rownum<=1e6-1000 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
Here Oracle eliminated the common status partition and only scanned the rare status partition (partition 1). Note that I don't even have an index at this point.  So simply partitioning the table can be effective.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    58 |     4   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |    10 |   580 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |    10 |   580 |     4   (0)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='R')

However, when the application updates the status from R (rare) to C (common), the row must be moved between partitions. It is necessary to enable row movement on the table, otherwise an error will be raised. There is also additional overhead in moving the row: it is effectively deleted from one partition and inserted into the other.
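This can be illustrated as follows (my addition, not part of the original test; the error is the one Oracle raises when a partition key update would move a row while row movement is disabled):

```sql
-- Without row movement, changing the partition key raises an error
ALTER TABLE t DISABLE ROW MOVEMENT;
UPDATE t SET status = 'C' WHERE status = 'R';
-- ORA-14402: updating partition key column would cause a partition change
ALTER TABLE t ENABLE ROW MOVEMENT;
```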
In this test, I have increased the frequency of one of the rare statuses; otherwise, the optimizer determines that it is cheaper just to scan the table partition than to use the index!
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/
S   COUNT(*)
- ----------
R         10
A        990
C     999000
Note that I have already specified INDEXING OFF on the table and INDEXING ON on the rare statuses partition. Now I can just build a locally partitioned partial index.
CREATE INDEX t_status ON t(status) LOCAL INDEXING PARTIAL;
Note that only partition T_STATUS_RARE is physically built, and it contains just a single extent. Partition T_STATUS_COMMON exists but is unusable, and its segment has not been physically built. It contains no rows and no leaf blocks.
SELECT partition_name, status, num_rows, leaf_blocks
from user_ind_partitions where index_name = 'T_STATUS';

PARTITION_NAME       STATUS     NUM_ROWS LEAF_BLOCKS
-------------------- -------- ---------- -----------
T_STATUS_COMMON      UNUSABLE          0           0
T_STATUS_RARE        USABLE         1000           2

SELECT segment_name, partition_name, blocks
FROM user_segments WHERE segment_name = 'T_STATUS';

SEGMENT_NAME PARTITION_NAME           BLOCKS
------------ -------------------- ----------
T_STATUS     T_STATUS_RARE                 8

SELECT segment_name, partition_name, segment_type, extent_id, blocks
FROM user_extents WHERE segment_name = 'T_STATUS';

SEGMENT_NAME PARTITION_NAME       SEGMENT_TYPE        EXTENT_ID     BLOCKS
------------ -------------------- ------------------ ---------- ----------
T_STATUS     T_STATUS_RARE        INDEX PARTITION             0          8
Scans for the common status value can only full scan the table partition because there is no index to use.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999000

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    55 |  2444   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    55 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |   998K|    52M|  2444   (1)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |   998K|    52M|  2444   (1)| 00:00:01 |     2 |     2 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='C')
To query the rare value Oracle does use the index on the rare values partition.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3051124889
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |          |     1 |    58 |     2   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                             |          |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE                     |          |    10 |   580 |     2   (0)| 00:00:01 |   KEY |   KEY |
|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T        |    10 |   580 |     2   (0)| 00:00:01 |     1 |     1 |
|*  4 |     INDEX RANGE SCAN                        | T_STATUS |    10 |       |     1   (0)| 00:00:01 |     1 |     1 |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("STATUS"='R')
However, it is not worth using the index for the slightly more common status A.  Here, Oracle full scans the table partition.
SELECT COUNT(other) FROM t WHERE status='A';

COUNT(OTHER)
------------
         990

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    58 |     4   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |   990 | 57420 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |   990 | 57420 |     4   (0)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='A')
Note that to use partial indexing I also had to partition the table.

Globally Partitioned Index with Zero-Sized Unusable Partitions 

Since Oracle 11.2.0.4, it has been possible to achieve the same effect without partitioning the table, thus avoiding the overhead of row movement.
This feature also worked in earlier versions, but Oracle built a single extent for each unusable partition.
Here, I will recreate a non-partitioned table.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000)
,CONSTRAINT t_pk PRIMARY KEY(key)
) 
/
INSERT /*+APPEND*/ INTO t
SELECT rownum
, CASE WHEN rownum<=1e6-1000 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;

exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');

SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/ 

S   COUNT(*)
- ----------
R         10
C     999000
A        990
It is not possible to create a globally list-partitioned index. Oracle simply does not support it.
CREATE INDEX t_status ON t(status)
GLOBAL PARTITION BY LIST (status)
(PARTITION t_status_rare   VALUES ('R','A')
,PARTITION t_status_common VALUES (DEFAULT)
);

GLOBAL PARTITION BY LIST (status)
                    *
ERROR at line 2:
ORA-14151: invalid table partitioning method
You can create a globally range- or hash-partitioned index. However, it is unlikely that the column values will hash conveniently into separate partitions. In this example, I would still have needed to create 4 hash partitions and build 2 of them.
WITH x as (
SELECT status, COUNT(*) freq
FROM   t 
GROUP BY status
) SELECT x.*
, dbms_utility.get_hash_value(status,0,2)
, dbms_utility.get_hash_value(status,0,4)
FROM x
/ 

S       FREQ DBMS_UTILITY.GET_HASH_VALUE(STATUS,0,2) DBMS_UTILITY.GET_HASH_VALUE(STATUS,0,4)
- ---------- --------------------------------------- ---------------------------------------
R        990                                       1                                       1
C    1009000                                       0                                       0
A         10                                       0                                       2
It is easier to create a globally range-partitioned index, although in my example the common status lies between the two rare statuses, so I need three partitions. I will create the index unusable and then rebuild the two rare-status partitions.
CREATE INDEX t_status ON t(status)
GLOBAL PARTITION BY RANGE (status)
(PARTITION t_status_rare1  VALUES LESS THAN ('C')
,PARTITION t_status_common VALUES LESS THAN ('D')
,PARTITION t_status_rare2  VALUES LESS THAN (MAXVALUE)
) UNUSABLE;
ALTER INDEX t_status REBUILD PARTITION t_status_rare1;
ALTER INDEX t_status REBUILD PARTITION t_status_rare2;
The index partition for the common status is unusable, so Oracle can only full scan the table.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999000

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2445   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |   999K|    52M|  2445   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='C')
However, for the rare statuses, Oracle scans the index and looks up each of the table rows.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2558590380
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |     1 |    55 |     2   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                       |          |     1 |    55 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE              |          |    10 |   550 |     2   (0)| 00:00:01 |     3 |     3 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T        |    10 |   550 |     2   (0)| 00:00:01 |       |       |
|*  4 |     INDEX RANGE SCAN                  | T_STATUS |    10 |       |     1   (0)| 00:00:01 |     3 |     3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("STATUS"='R')

Conclusion 

The advantage of this global partitioning approach is that it does not require any change to application code, nor does it involve partitioning the table. However, you will have to remember not to rebuild the unusable partitions. Otherwise, they will have to be maintained as the table changes until you make them unusable again, and they will consume space that you will only get back by recreating the entire index.
NB: Partitioning is a licenced option that is only available on the Enterprise Edition of the database.

Sparse Indexing

This is the first of two blog posts that discuss sparse and partial indexing.

Problem Statement

It is not an uncommon requirement to find rows that match a rare value in a column with a small number of distinct values.  So, the distribution of values is skewed.  A typical example is a status column where an application processes newer rows that are a relatively small proportion of the table because over time the majority of rows have been processed and are at the final status.
An index is effective at finding the rare values, but it is usually more efficient to scan the table for the common values.  A histogram would almost certainly be required on such a column.  However, if you build an index on the column, you have to index all the rows.  The index is, therefore, larger and requires more maintenance.  Could we not index just the rare values that we actually want to find?
  • Oracle does not index rows where all the key values are null. If we could engineer that the common value was null, then the index would contain only the rare values.  This is sometimes called sparse indexing and is discussed in this blog.
  • Or we could separate the rare and common values into different index partitions, and build only the index partition(s) for the rare values.  This is called partial indexing and is discussed in the next blog.
As usual, this is not a new subject; other people have written extensively about it, and I will provide links.  However, I want to draw some of the issues together.

Sparse Indexing

The ideas discussed in this section are based on the principle that Oracle B-tree indexes do not include rows where all of the key values are null.

Store Null Values in the Database?

One option is to engineer the application to use null as the common status value.  However, this means that the column in question has to be nullable, and you may require different logic because a comparison to null never evaluates to true.
CREATE TABLE t 
(key    NUMBER NOT NULL
,status VARCHAR2(1)
,other  VARCHAR2(1000) 
,CONSTRAINT t_pk PRIMARY KEY(key)
);

INSERT /*+APPEND*/ INTO t 
SELECT rownum
, CASE WHEN rownum<=1e6-42 THEN NULL /*common status*/
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;

CREATE INDEX t_status ON t (status);
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
I have created a test table with 1000000 rows. 10 rows have status R, and 32 rows have status A. The rest have status NULL. I have indexed the status column, and also created a histogram on it when I collected statistics.
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/ 

S   COUNT(*)
- ----------
      999958
R         10
A         32
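As a cross-check, the distribution produced by the CASE expression in the insert can be reproduced outside the database. This is an illustrative Python sketch of the same logic, not part of the original test:

```python
from collections import Counter

N = 1_000_000

def status(rn):
    # Mirrors the CASE expression in the INSERT statement
    if rn <= N - 42:
        return None   # common status, stored as NULL
    elif rn <= N - 10:
        return 'A'
    else:
        return 'R'

counts = Counter(status(rn) for rn in range(1, N + 1))
print(counts[None], counts['A'], counts['R'])   # 999958 32 10
```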
I can see from the statistics that I have 1000000 rows in the primary key index, but only 42 rows in the status index because it only contains the not null values. Therefore, it is much smaller, having only a single leaf block, whereas the primary key index has 1875 leaf blocks.
SELECT index_name, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T';

INDEX_NAME   NUM_ROWS LEAF_BLOCKS
---------- ---------- -----------
T_PK          1000000        1875
T_STATUS           42           1
There are some problems with this approach.

Not All Index Columns are Null 
If any of the index columns are not null, then there is an entry in the index for the row, and there is no saving of space. It is not uncommon to add additional columns to such an index, either for additional filtering, or to avoid accessing the table by satisfying the query from the index.
CREATE INDEX t_status2 ON t (status,other);
SELECT index_name, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME   NUM_ROWS LEAF_BLOCKS
---------- ---------- -----------
T_PK          1000000        1875
T_STATUS           42           1
T_STATUS2     1000000        9081

Null Logic
If, for example, I want to find the rows that do not have status A, then a simple inequality does not find the null statuses because a comparison to null never evaluates to true.
SELECT status, COUNT(*)
FROM   t 
WHERE  status != 'A'
GROUP BY status
/

S   COUNT(*)
- ----------
R         10
Instead, I would have to explicitly code for the null values.
SELECT status, COUNT(*)
FROM   t 
WHERE  status != 'A' OR status IS NULL
GROUP BY status
/

S   COUNT(*)
- ----------
      999958
R         10
This additional complexity is certainly one reason why developers shy away from this approach in custom applications. It is almost impossible to retrofit it into an existing or packaged application. 
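The three-valued logic behind this can be sketched outside SQL. In the sketch below, a comparison involving NULL yields `None` (standing in for SQL's UNKNOWN) rather than `True` or `False`, and a WHERE clause keeps only rows where the predicate is strictly true. This is an illustration in Python, not Oracle code:

```python
def sql_neq(a, b):
    # SQL three-valued logic: any comparison involving NULL is UNKNOWN
    return None if a is None or b is None else a != b

# Distribution from the example table: 999958 NULL, 10 'R', 32 'A'
rows = [None] * 999958 + ['R'] * 10 + ['A'] * 32

# WHERE status != 'A' keeps only rows where the predicate is TRUE,
# so all the NULL rows are silently filtered out
kept = [s for s in rows if sql_neq(s, 'A') is True]
print(len(kept))    # 10 (just the 'R' rows)

# WHERE status != 'A' OR status IS NULL
kept2 = [s for s in rows if sql_neq(s, 'A') is True or s is None]
print(len(kept2))   # 999968
```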

Function-Based Indexes 

It is possible to build an index on a function that evaluates to null for the common values. This time, my test table still has 1,000,000 rows, but the status column is now not nullable.
CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000) 
,CONSTRAINT t_pk PRIMARY KEY(key)
)
/
INSERT /*+APPEND*/ INTO t
SELECT rownum
, CASE WHEN rownum<=1e6-42 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');
10 rows have status R, and 32 rows have status A. The rest have status C.
SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/
S   COUNT(*)
- ----------
R         10
C     999958
A         32
I will build a simple index on status, and a second index on a function of status that decodes the common status C back to NULL.
CREATE INDEX t_status ON t (status);
CREATE INDEX t_status_fn ON t (DECODE(status,'C',NULL,status));
As before, with the null column, the function-based index has only a single leaf block, the other indexes are much larger because they contain all 1 million rows.
SELECT index_name, index_type, num_rows, leaf_blocks 
from user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME   INDEX_TYPE                    NUM_ROWS LEAF_BLOCKS
------------ --------------------------- ---------- -----------
T_PK         NORMAL                         1000000        1875
T_STATUS     NORMAL                         1000000        1812
T_STATUS_FN  FUNCTION-BASED NORMAL               42           1
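The arithmetic behind the 42-entry function-based index can be sketched as follows: an entry is stored only where the indexed expression is not null. This is an illustrative Python sketch, with the row counts taken from the example above:

```python
def indexed_value(status):
    # Mirrors DECODE(status, 'C', NULL, status)
    return None if status == 'C' else status

# Distribution from the example table: 999958 'C', 32 'A', 10 'R'
rows = ['C'] * 999958 + ['A'] * 32 + ['R'] * 10

# Only rows whose expression value is not null get an index entry
entries = [s for s in rows if indexed_value(s) is not None]
print(len(entries))   # 42, matching NUM_ROWS for T_STATUS_FN
```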
If I query the table for the common status, Oracle quite reasonably full scans the table.
SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999958

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2446   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |   999K|    52M|  2446   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='C')
If I query for the rare status value, it will use the normal index to look that up.
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 1997248105
-------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |          |     1 |    55 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |          |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T        |    10 |   550 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS |    10 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("STATUS"='R')
Now I will make that index invisible, and the optimizer can only choose to full scan the table. It cannot use the function-based index because the query does not match the function.
ALTER INDEX t_status INVISIBLE;
SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2445   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |    10 |   550 |  2445   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='R')
Instead, I must change the query to reference the function in the function-based index, and then the optimizer chooses the function-based index, even if I make the normal index visible again. Note that the function is shown in the access operation in the predicate section.
ALTER INDEX t_status VISIBLE;
SELECT COUNT(other) FROM t WHERE DECODE(status,'C',null,status)='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2511618215
----------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |             |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |             |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T           |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_FN |    21 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access(DECODE("STATUS",'C',NULL,"STATUS")='R')

Invisible Virtual Columns 

Function-based indexes are implemented using a hidden virtual column. You can even reference that virtual column in a query. However, the name of the column is system-generated, so you may not want to include it in your application.
SELECT * FROM user_stat_extensions WHERE table_name = 'T';

TABLE_NAME EXTENSION_NAME  EXTENSION                                CREATO DRO
---------- --------------- ---------------------------------------- ------ ---
T          SYS_NC00004$    (DECODE("STATUS",'C',NULL,"STATUS"))     SYSTEM NO

SELECT SYS_NC00004$, COUNT(*) FROM t group by SYS_NC00004$;

S   COUNT(*)
- ----------
      999958
R         10
A         32
Instead, you could create a virtual column and then index it. The resulting index is still function-based because it references the function inside the virtual column. From Oracle 12c, it is also possible to make a column invisible. I would recommend doing so in case you have any insert statements without explicit column lists; otherwise, you might get ORA-00947: not enough values.
ALTER TABLE t ADD virtual_status VARCHAR2(1) INVISIBLE 
GENERATED ALWAYS AS (DECODE(status,'C',null,status));
CREATE INDEX t_status_virtual ON t (virtual_status);

SELECT index_name, index_type, num_rows, leaf_blocks FROM user_indexes WHERE table_name = 'T' ORDER BY 1;

INDEX_NAME       INDEX_TYPE                    NUM_ROWS LEAF_BLOCKS
---------------- --------------------------- ---------- -----------
T_PK             NORMAL                         1000000        1875
T_STATUS         NORMAL                         1000000        1812
T_STATUS_VIRTUAL FUNCTION-BASED NORMAL               42           1
The only difference between this and the previous function-based index example is that now you can control the name of the virtual column, and you can easily reference it in the application.
If you have only ever referenced the virtual column in the application, and never the function, then it is also easy to change the function, although you would have to rebuild the index.
SELECT COUNT(other) FROM t WHERE virtual_status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3855131553
---------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                  |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T                |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_VIRTUAL |    21 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("VIRTUAL_STATUS"='R')
If you have already created function-based indexes and referenced the function in the application, you can replace them with an index on a named virtual column, and the index will still be used.
SELECT COUNT(other) FROM t WHERE DECODE(status,'C',null,status)='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3855131553
---------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    55 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |                  |     1 |    55 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T                |    21 |  1155 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_STATUS_VIRTUAL |    21 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T"."VIRTUAL_STATUS"='R')

Conclusion 

A function-based index, preferably on an explicitly named and created virtual column, will permit you to build an index on just the rare values in a column. Making the virtual column invisible will prevent errors during insert statements without explicit column lists. However, you will still need to alter the application SQL to reference either the virtual column or the function that generates it.

Monday, December 16, 2019

Extended Column Group Statistics, Composite Index Statistics, Histograms and an EDB360 Enhancement to Detect the Coincidence

In this post:
  • A simple demonstration to show the behaviour of extended statistics and how they can be disabled by the presence of histograms.  None of this is new; there are many other blog posts on this topic, and I provide links to some of them.
  • I have added an enhancement to the EDB360 utility to detect histograms on columns in extended statistics.

Introduction

'Extended statistics were introduced in Oracle 11g to allow statistics to be gathered on groups of columns, to highlight the relationship between them, or on expressions. Oracle 11gR2 makes the process of gathering extended statistics for column groups easier'. [Tim Hall: https://oracle-base.com/articles/11g/extended-statistics-enhancements-11gr2]

Example 1: Cardinality from Extended Statistics

Without extended statistics, Oracle simply multiplies the individual column selectivities together.  Here is a simple example.  I will create a table with 10000 rows, where two columns each contain 100 distinct values repeated 100 times, and the values in the two columns always match, so they correlate perfectly.  I will gather statistics, but no histograms.
create table t
(k number
,a number
,b number
,x varchar2(1000)
);

insert into t
with n as (select rownum n from dual connect by level <= 100)
select rownum, a.n, a.n, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
from n a, n b
order by a.n, b.n;

exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE 1');
I will deliberately disable optimizer feedback so that Oracle cannot learn from experience about the cardinality misestimates.
alter session set statistics_level=ALL;
alter session set "_optimizer_use_feedback"=FALSE;

select count(*) from t where a = 42 and b=42;

  COUNT(*)
----------
       100
Oracle estimates that it will get 1 row, but it actually gets 100.
It estimates 1 because, assuming the columns are independent, it multiplies their selectivities: 1/100 * 1/100 * 10000 rows = 1.
select * from table(dbms_xplan.display_cursor(null,null,format=>'ADVANCED +ALLSTATS LAST, IOSTATS -PROJECTION -OUTLINE'));

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |    26 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |      1 |    26 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------
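The optimizer's arithmetic here is easy to reproduce. Without column group statistics it assumes the predicates are independent and multiplies their selectivities; with a column group it can use the number of distinct (a,b) combinations directly. This is a sketch of the calculation only, not Oracle internals:

```python
num_rows = 10_000
ndv_a = ndv_b = 100   # distinct values in columns A and B

# Independence assumption: selectivities are multiplied together
est_independent = num_rows * (1 / ndv_a) * (1 / ndv_b)
print(est_independent)    # 1.0 -> the misestimate of 1 row

# With extended statistics, the NDV of the column group (a,b) is 100,
# because the two columns correlate perfectly
ndv_ab = 100
est_column_group = num_rows * (1 / ndv_ab)
print(est_column_group)   # 100.0 -> the correct estimate
```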
Now I will create extended statistics on the column group.  I can do that in one of two ways:
  • either by explicitly creating the definition and then creating them by gathering statistics:
SELECT dbms_stats.create_extended_stats(ownname=>user, tabname=>'t', extension=>'(a,b)')
FROM dual;
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE 1');
  • Or, I can create the extended statistics directly, in one go, by specifying them in the method_opt clause.
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE 1, FOR COLUMNS SIZE 1 (A,B)');
Now Oracle correctly estimates that the same query will fetch 100 rows, because it knows the cardinality of the two columns in combination.
Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |   600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------

Example 2: Cardinality from Index Statistics

I can get exactly the same effect by creating an index on the two columns.
drop table t purge;

create table t
(k number
,a number
,b number
,x varchar2(1000)
);

insert into t
with n as (select rownum n from dual connect by level <= 100)
select rownum, a.n, a.n, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
from n a, n b
order by a.n, b.n;

create index t_ab on t(a,b) compress;
This time I have not collected any statistics on the table; statistics are automatically collected on the index when it is built.  I have used a hint to stop the query from using the index to look up the rows.  Nonetheless, Oracle has correctly estimated that it will get 100 rows, because it has used the number of distinct keys from the index statistics.
SQL_ID  711banpfgfa18, child number 0
-------------------------------------
select /*+FULL(t)*/ count(*) from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      74 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |    26 |            |          |      1 |00:00:00.01 |      74 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |  2600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      74 |
---------------------------------------------------------------------------------------------------------------------
Note that there is nothing in the execution plan to indicate that the index statistics were used to estimate the number of rows returned!

Example 3: Histograms Disable the Use of Extended Statistics

There have long been blogs that refer to behaviour that Oracle has documented as Bug 6972291: Column group selectivity is not used when there is a histogram on one column:
'As of 10.2.0.4 CBO can use the selectivity of column groups but this option is disabled if there is a histogram defined on any of the columns of the column group.
Note:  This fix is disabled by default. To enable the fix set "_fix_control"="6972291:ON"
When ENABLED the code will use multi-column stats regardless of whether there is a histogram on one of the columns or not.  When DISABLED (default) CBO will not use multi-column stats if there is a histogram on one of the columns in the column group.'
  • Christian Antognini, 2014: https://antognini.ch/2014/02/extension-bypassed-because-of-missing-histogram/
  • Jonathan Lewis, 2012: https://jonathanlewis.wordpress.com/2012/04/11/extended-stats/
    • Maria Colgan also commented: 'This … was a deliberate design decision to prevent over-estimations when one of the values supplied is ‘out of range’. We can’t ignore the ‘out of range’ scenario just because we have a column group. Extended statistics do not contain the min, max values for the column group so we rely on the individual column statistics to check for ‘out of range’ scenarios like yours. When one of the columns is ‘out of range’, we revert back to the column statistics because we know it is going to generate a lower selectivity range and if one of the columns is ‘out of range’ then the number of rows returned will be lower or none at all, as in your example'
In this example, I explicitly create a histogram on one of the columns in my extended statistics.  However, in the real world that can happen automatically if the application references one column and not another.
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 100 A, FOR COLUMNS SIZE 1 B (A,B)');
My cardinality estimate goes back to 1 because Oracle does not use the extended statistics in the presence of a histogram on any of the constituent columns.  Exactly the same happens if the number of distinct values on the combination of columns comes from composite index statistics: a histogram similarly disables their use.
SQL_ID  8trj2kacqhm6f, child number 1
-------------------------------------
select count(*) from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |      1 |     6 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------
This is likely to happen in real-life systems because histograms can be automatically created when statistics are gathered.

Out-of-Range Predicates

If you have one or more predicates on columns that are part of an extended statistics column group, and a predicate goes out of range when compared to the column statistics, then Oracle still does not use the extended statistics (see also https://jonathanlewis.wordpress.com/2012/04/11/extended-stats/), irrespective of whether there is a histogram, or whether fix control 6972291 is set.
The extended statistics use a virtual column whose value is derived from SYS_OP_COMBINED_HASH(); you can see this in the column's default data value.  The optimizer therefore cannot use a meaningful minimum/maximum value for the column group (see also https://jonathanlewis.wordpress.com/2018/08/02/extended-histograms-2/).
Instead, Oracle applies linear decay to the density of the individual column predicates; if there is a frequency or top-frequency histogram, it takes half the density of the lowest-frequency bucket and applies linear decay to that.
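The linear-decay idea can be sketched as follows. This is an illustration of the principle only, not Oracle's exact formula: the estimate starts from a base selectivity (for a frequency histogram, half the density of the lowest-frequency bucket) and falls linearly towards zero as the predicate value moves one whole column range beyond the known minimum or maximum.

```python
def out_of_range_selectivity(value, col_min, col_max, base_sel):
    # Illustrative sketch of linear decay; NOT Oracle's exact formula.
    if col_min <= value <= col_max:
        return base_sel
    distance = (value - col_max) if value > col_max else (col_min - value)
    span = col_max - col_min
    # Selectivity decays linearly, reaching zero one whole range away
    return max(0.0, base_sel * (1 - distance / span))

# e.g. a frequency histogram whose least popular value has density 0.01,
# so the base selectivity is half of that
base = 0.01 / 2
print(out_of_range_selectivity(150, 0, 100, base))  # halfway out: 0.0025
print(out_of_range_selectivity(250, 0, 100, base))  # beyond one range: 0.0
```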

Example 4: Extended Histograms

This time I will create a histogram on my extended statistics as well as histograms on the underlying columns.
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 100 A B (A,B)');
I am back to getting the correct cardinality estimate.
SQL_ID  8trj2kacqhm6f, child number 0
-------------------------------------
select count(*) from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |   600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------
This is also something that has been blogged about previously.

Threats

In this blog, Jonathan Lewis comments (https://jonathanlewis.wordpress.com/2012/04/11/extended-stats/) on certain weaknesses in the implementations.  He also references other bloggers.
Creating or removing histograms on columns in either extended statistics or composite indexes may cause execution plans to change, because the column group statistics may stop (or start) being used.  This could happen automatically when gathering statistics, as data skew and predicate usage change.
If I drop a composite index, maybe because it is not used, or because it is redundant as a subset of another index, then I should replace it with extended column group statistics on the same set of columns.

Detecting the Problem

I have added a report to section 3c of EDB360 to detect the problem.  The SQL query is shown below.  It will report on histograms on columns in either:
  • composite indexes where there are no extended column group statistics, or
  • extended column group statistics for which there are no histograms.

3c.25. Columns with Histograms in Extended Statistics (DBA_STAT_EXTENSIONS)

 #  Owner  Table Name   Type       Index/Extension Name            NDV       Buckets  EXTENSION                        Column Name   Col NDV  Col Buckets  Col Histogram
--  -----  -----------  ---------  ------------------------------  --------  -------  -------------------------------  ------------  -------  -----------  -------------
 1  HR     JOB_HISTORY  Index      JHIST_EMP_ID_ST_DATE_PK               10           ("EMPLOYEE_ID","START_DATE")     EMPLOYEE_ID         7            7  FREQUENCY
 2  OE     INVENTORIES  Index      INVENTORY_IX                        1112           ("WAREHOUSE_ID","PRODUCT_ID")    PRODUCT_ID        208          208  FREQUENCY
 3  OE     ORDER_ITEMS  Index      ORDER_ITEMS_PK                       665           ("ORDER_ID","LINE_ITEM_ID")      ORDER_ID          105          105  FREQUENCY
 4  OE     ORDER_ITEMS  Index      ORDER_ITEMS_UK                       665           ("ORDER_ID","PRODUCT_ID")        ORDER_ID          105          105  FREQUENCY
 5  OE     ORDER_ITEMS  Index      ORDER_ITEMS_UK                       665           ("ORDER_ID","PRODUCT_ID")        PRODUCT_ID        185          185  FREQUENCY
 6  SCOTT  T            Extension  SYS_STUNA$6DVXJXTP05EH56DTIR0X       100        1  ("A","B")                        A                 100          100  FREQUENCY
 7  SCOTT  T            Extension  SYS_STUNA$6DVXJXTP05EH56DTIR0X       100        1  ("A","B")                        B                 100          100  FREQUENCY
 8  SOE    ORDERS       Index      ORD_WAREHOUSE_IX                   10270           ("WAREHOUSE_ID","ORDER_STATUS")  ORDER_STATUS       10           10  FREQUENCY
 9  SOE    ORDER_ITEMS  Index      ORDER_ITEMS_PK                  13758515           ("ORDER_ID","LINE_ITEM_ID")      LINE_ITEM_ID        7            7  FREQUENCY

(NDV = number of distinct values on the index or extension; index rows have no bucket count.)

Just because something is reported by this test does not necessarily mean that you need to change anything.
  • Provided fix control 6972291 is not enabled, should you wish to drop or alter any reported index, you at least know that it cannot be used to provide column group statistics. However, you would still need to consider any SQL that uses the index directly.
  • You might choose to add histograms to the column groups, and sometimes that will involve creating the column group statistics first. However, the number of distinct values in the column group will usually be higher than in the individual columns, and can easily exceed the maximum number of buckets in a frequency histogram. In such cases, from 12c, you may end up with either a top-frequency or a hybrid histogram.
  • Or you might choose to remove the histograms from the individual columns so that the column group statistics are used.
  • Or you might choose to enforce the status quo by setting table statistics preferences that preserve the existing histograms and prevent new ones from being introduced.
Whatever you choose to do about statistics and histogram collection, I would certainly recommend doing it declaratively, by defining a table statistics preference. For example, here I preserve the histograms on the individual columns, but also build a histogram on the column group:
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 A B (A,B)');
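A declarative sketch of the same thing uses DBMS_STATS.SET_TABLE_PREFS, so that every subsequent gather honours the preference (table and column names are from the example above):

```sql
-- Record the METHOD_OPT as a table-level preference
exec dbms_stats.set_table_prefs(user,'T','METHOD_OPT','FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 A B (A,B)');

-- Any subsequent gather picks up the preference automatically
exec dbms_stats.gather_table_stats(user,'T');
```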
  • Or, you might even enable the fix control. You can do that at session level or even statement level (but beware of inadvertently overriding any other fix controls that may already be set).
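At session level, that would look like this (the statement-level OPT_PARAM hint is shown in the execution plan example that follows):

```sql
-- Enable fix control 6972291 for the current session only
ALTER SESSION SET "_fix_control" = '6972291:ON';
```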
SQL_ID  16judk2v0uf7w, child number 0
-------------------------------------
select /*+FULL(t) OPT_PARAM('_fix_control','6972291:on')*/ count(*)
from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |   600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      OPT_PARAM('_fix_control' '6972291:1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "T"@"SEL$1")
      END_OUTLINE_DATA
  */

EDB360 Test Query

This is the SQL query that produces the report in EDB360.
WITH i as ( /*composite indexes*/
SELECT i.table_owner, i.table_name, i.owner index_owner, i.index_name, i.distinct_keys
,        '('||(LISTAGG('"'||c.column_name||'"',',') WITHIN GROUP (order by c.column_position))||')' column_list
FROM    dba_indexes i
,       dba_ind_columns c
WHERE   i.table_owner = c.table_owner
AND     i.table_name = c.table_name
AND     i.owner = c.index_owner
AND     i.index_name = c.index_name
AND     i.table_name NOT LIKE 'BIN$%'
AND     i.table_owner NOT IN ('ANONYMOUS','APEX_030200','APEX_040000','APEX_040200','APEX_SSO','APPQOSSYS','CTXSYS','DBSNMP','DIP','EXFSYS','FLOWS_FILES','MDSYS','OLAPSYS','ORACLE_OCM','ORDDATA','ORDPLUGINS','ORDSYS','OUTLN','OWBSYS')
AND     i.table_owner NOT IN ('SI_INFORMTN_SCHEMA','SQLTXADMIN','SQLTXPLAIN','SYS','SYSMAN','SYSTEM','TRCANLZR','WMSYS','XDB','XS$NULL','PERFSTAT','STDBYPERF','MGDSYS','OJVMSYS')
GROUP BY i.table_owner, i.table_name, i.owner, i.index_name, i.distinct_keys
HAVING COUNT(*) > 1 /*index with more than one column*/
), e as ( /*extended stats*/
SELECT  e.owner, e.table_name, e.extension_name
,       CAST(e.extension AS VARCHAR(1000)) extension
,       se.histogram, se.num_buckets, se.num_distinct
FROM    dba_stat_extensions e
,       dba_tab_col_statistics se
WHERE   e.creator = 'USER'
AND     se.owner = e.owner
AND     se.table_name = e.table_name
AND     se.column_name = e.extension_name
AND     e.table_name NOT LIKE 'BIN$%'
AND     e.owner NOT IN ('ANONYMOUS','APEX_030200','APEX_040000','APEX_040200','APEX_SSO','APPQOSSYS','CTXSYS','DBSNMP','DIP','EXFSYS','FLOWS_FILES','MDSYS','OLAPSYS','ORACLE_OCM','ORDDATA','ORDPLUGINS','ORDSYS','OUTLN','OWBSYS')
AND     e.owner NOT IN ('SI_INFORMTN_SCHEMA','SQLTXADMIN','SQLTXPLAIN','SYS','SYSMAN','SYSTEM','TRCANLZR','WMSYS','XDB','XS$NULL','PERFSTAT','STDBYPERF','MGDSYS','OJVMSYS')
)
SELECT  e.owner, e.table_name
,       'Extension' object_type
,       e.extension_name object_name, e.num_distinct, e.num_buckets, e.extension
,       sc.column_name
,       sc.num_distinct col_num_distinct
,       sc.num_buckets col_num_buckets
,       sc.histogram col_histogram
FROM    e
,       dba_tab_col_statistics sc
WHERE   e.histogram = 'NONE'
AND     e.extension LIKE '%"'||sc.column_name||'"%'
AND     sc.owner = e.owner
AND     sc.table_name = e.table_name
AND     sc.histogram != 'NONE'
AND     sc.num_buckets > 1 /*histogram on column*/
AND     e.num_buckets = 1 /*no histogram on extended stats*/
UNION ALL
SELECT  /*+  NO_MERGE  */ /* 3c.25 */
        i.table_owner, i.table_name
,       'Index' object_type
,       i.index_name object_name, i.distinct_keys, TO_NUMBER(null), i.column_list
,       sc.column_name
,       sc.num_distinct col_num_distinct
,       sc.num_buckets col_num_buckets
,       sc.histogram col_histogram
FROM    i
,       dba_ind_columns ic
,       dba_tab_col_statistics sc
WHERE   ic.table_owner = i.table_owner
AND     ic.table_name = i.table_name
AND     ic.index_owner = i.index_owner
AND     ic.index_name = i.index_name
AND     sc.owner = i.table_owner
AND     sc.table_name = ic.table_name
AND     sc.column_name = ic.column_name
AND     sc.histogram != 'NONE' 
AND     sc.num_buckets > 1 /*histogram on column*/
AND NOT EXISTS( /*report index if no extension*/
        SELECT 'x'
        FROM    e
        WHERE   e.owner = i.table_owner
        AND     e.table_name = i.table_name
        AND     e.extension = i.column_list)
ORDER BY 1,2,3,4;