Monday, December 16, 2019

Extended Column Group Statistics, Composite Index Statistics, Histograms and an EDB360 Enhancement to Detect the Coincidence

In this post:
  • A simple demonstration to show the behaviour of extended statistics and how it can be disabled by the presence of histograms.  None of this is new, there are many other blog posts on this topic. I provide links to some of them.
  • I have added an enhancement to the EDB360 utility to detect histograms on columns in extended statistics.

Introduction

'Extended statistics were introduced in Oracle 11g to allow statistics to be gathered on groups of columns, to highlight the relationship between them, or on expressions. Oracle 11gR2 makes the process of gathering extended statistics for column groups easier'. [Tim Hall: https://oracle-base.com/articles/11g/extended-statistics-enhancements-11gr2]

Example 1: Cardinality from Extended Statistics

Without extended statistics, Oracle will simply multiply column cardinalities together.  Here is a simple example.  I will create a table with 10000 rows, where two columns each have the same 100 rows of 100 values, so they correlate perfectly.  I will gather statistics, but no histograms.
create table t
(k number
,a number
,b number
,x varchar2(1000)
);

insert into t
with n as (select rownum n from dual connect by level <= 100)
select rownum, a.n, a.n, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
from n a, n b
order by a.n, b.n;

exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE 1');
I will deliberately disable optimizer feedback so that Oracle cannot learn from experience about the cardinality misestimates.
alter session set statistics_level=ALL;
alter session set "_optimizer_use_feedback"=FALSE;

select count(*) from t where a = 42 and b=42;

  COUNT(*)
----------
       100
Oracle estimates that it will get 1 row but actually gets 100.
It estimates 1 because it is 1/100 * 1/100 * 10000 rows
select * from table(dbms_xplan.display_cursor(null,null,format=>'ADVANCED +ALLSTATS LAST, IOSTATS -PROJECTION -OUTLINE'));

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |    26 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |      1 |    26 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------
Now I will create extended statistics on the column group.  I can do that in one of two ways:
  • either by explicitly creating the definition and then creating them by gathering statistics:
SELECT dbms_stats.create_extended_stats(ownname=>user, tabname=>'t', extension=>'(a,b)')
FROM dual;
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE 1');
  • Or, I can create extended statistics directly in one go by specifying them in the method opt clause.
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE 1, FOR COLUMNS SIZE 1 (A,B)');
Now Oracle correctly estimates that the same query will fetch 100 rows because it directly knows the cardinality for the two columns in the query.
Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |   600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------

Example 2: Cardinality from Index Statistics

I can get exactly the same effect by creating an index on the two columns.
drop table t purge;

create table t
(k number
,a number
,b number
,x varchar2(1000)
);

insert into t
with n as (select rownum n from dual connect by level <= 100)
select rownum, a.n, a.n, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
from n a, n b
order by a.n, b.n;

create index t_ab on t(a,b) compress;
This time I have not collected any statistics on the table.  Statistics are automatically collected on the index when it is built.  I have used a hint to stop the query using the index to look up the rows, nonetheless, Oracle has correctly estimated that it will get 100 rows because it has used the number of distinct keys from the index statistics.
SQL_ID  711banpfgfa18, child number 0
-------------------------------------
select /*+FULL(t)*/ count(*) from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      74 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |    26 |            |          |      1 |00:00:00.01 |      74 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |  2600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      74 |
---------------------------------------------------------------------------------------------------------------------
Note, that there is nothing in the execution plan to indicate that the index statistics were used to estimate the number of rows returned!

Example 3: Histograms Disable the Use of Extended Statistics

There have long been blogs that refer to behaviour that Oracle has documented as Bug 6972291: Column group selectivity is not used when there is a histogram on one column:
'As of 10.2.0.4 CBO can use the selectivity of column groups but this option is disabled if there is a histogram defined on any of the columns of the column group.
Note:  This fix is disabled by default. To enable the fix set "_fix_control"="6972291:ON"
When ENABLED the code will use multi-column stats regardless of whether there is a histogram on one of the columns or not.  When DISABLED (default) CBO will not use multi-column stats if there is a histogram on one of the columns in the column group.'
  • Christian Antognini, 2014: https://antognini.ch/2014/02/extension-bypassed-because-of-missing-histogram/
  • Jonathan Lewis, 2012: https://jonathanlewis.wordpress.com/2012/04/11/extended-stats/
    • Maria Colgan also commented: 'This … was a deliberate design decision to prevent over-estimations when one of the values supplied is ‘out of range’. We can’t ignore the ‘out of range’ scenario just because we have a column group. Extended statistics do not contain the min, max values for the column group so we rely on the individual column statistics to check for ‘out of range’ scenarios like yours. When one of the columns is ‘out of range’, we revert back to the column statistics because we know it is going to generate a lower selectivity range and if one of the columns is ‘out of range’ then the number of rows returned will be lower or none at all, as in your example'
In this example, I explicitly create a histogram on one of the columns in my extended statistics.  However, in the real world that can happen automatically if the application references one column and not another.
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 100 A, FOR COLUMNS SIZE 1 B (A,B)');
My cardinality estimate goes back to 1 because Oracle does use the extended statistics in the presence of a histogram on any of the constituent columns.  Exactly the same happens if the number of distinct values on the combination of columns comes from composite index statistics.  A histogram similarly disables their use.
SQL_ID  8trj2kacqhm6f, child number 1
-------------------------------------
select count(*) from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |      1 |     6 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------
This is likely to happen in real-life systems because histograms can be automatically created when statistics are gathered.

Out-of-Range Predicates

If you have one or more predicates on columns that are part of an extended statistics, and that predicate goes out of range when compared to the column statistics, then Oracle still doesn’t use the extended statistics (see also https://jonathanlewis.wordpress.com/2012/04/11/extended-stats/), irrespective of whether it has a histogram or not, or whether fix control 6972291 is set or not.
The extended histogram uses a virtual column whose value is derived from SYS_OP_COMBINED_HASH().  You can see this in the default data value for the column.  Therefore the optimizer cannot use the minimum/maximum value (see also https://jonathanlewis.wordpress.com/2018/08/02/extended-histograms-2/).
Instead, Oracle does the linear decay of the density of the column predicates, and if there is a frequency or top-frequency histogram then it uses half the density of the lowest frequency bucket and applies linear decay to that.

Example 4: Extended Histograms

This time I will create a histogram on my extended statistics as well as histograms on the underlying columns.
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 100 A B (A,B)');
I am back to getting the correct cardinality estimate.
SQL_ID  8trj2kacqhm6f, child number 0
-------------------------------------
select count(*) from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |   600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------
And this is also something that has been blogged about previously:

Threats

In this blog, Jonathan Lewis comments (https://jonathanlewis.wordpress.com/2012/04/11/extended-stats/) on certain weaknesses in the implementations.  He also references other bloggers.
Either creating or removing histograms on columns in either extended statistics or composite indexes may result in the execution plan changing because those extended statistics may change.  This could happen automatically when gathering statistics as data skew and predicate usage changes.
If I drop a composite index, maybe because it is not used, or maybe it is redundant because it is a subset of another index, then I should replace it with an extended histogram on the same set of columns.

Detecting the Problem

I have added a report to section 3c of EDB360 to detect the problem.  The SQL query is shown below.  It will report on histograms on columns in either:
  • composite indexes where there are no extended column group statistics, or
  • extended column group statistics for which there are no histograms.

3c.25. Columns with Histograms in Extended Statistics (DBA_STAT_EXTENSIONS)

#
Table
Owner
Table
Name
Object
Type
Index/Extension Name
Number of
Distinct 
Values
Number of
Buckets
EXTENSION
Column
Name
Column
Number of
Distinct
Values

Column
Number of
Buckets

Column
Histogram
Type
1
HR
JOB_HISTORY
Index
JHIST_EMP_ID_ST_DATE_PK
10
("EMPLOYEE_ID","START_DATE")
EMPLOYEE_ID
7
7
FREQUENCY
2
OE
INVENTORIES
Index
INVENTORY_IX
1112
("WAREHOUSE_ID","PRODUCT_ID")
PRODUCT_ID
208
208
FREQUENCY
3
OE
ORDER_ITEMS
Index
ORDER_ITEMS_PK
665
("ORDER_ID","LINE_ITEM_ID")
ORDER_ID
105
105
FREQUENCY
4
OE
ORDER_ITEMS
Index
ORDER_ITEMS_UK
665
("ORDER_ID","PRODUCT_ID")
ORDER_ID
105
105
FREQUENCY
5
OE
ORDER_ITEMS
Index
ORDER_ITEMS_UK
665
("ORDER_ID","PRODUCT_ID")
PRODUCT_ID
185
185
FREQUENCY
6
SCOTT
T
Extension
SYS_STUNA$6DVXJXTP05EH56DTIR0X
100
1
("A","B")
A
100
100
FREQUENCY
7
SCOTT
T
Extension
SYS_STUNA$6DVXJXTP05EH56DTIR0X
100
1
("A","B")
B
100
100
FREQUENCY
8
SOE
ORDERS
Index
ORD_WAREHOUSE_IX
10270
("WAREHOUSE_ID","ORDER_STATUS")
ORDER_STATUS
10
10
FREQUENCY
9
SOE
ORDER_ITEMS
Index
ORDER_ITEMS_PK
13758515
("ORDER_ID","LINE_ITEM_ID")
LINE_ITEM_ID
7
7
FREQUENCY

Just because something is reported by this test, does not necessarily mean that you need to change anything.
  • Providing fix control 6972291 is not enabled, should you wish to drop or alter any reported index, you at least know that it cannot be used to provide column group statistics.  Though you would still need to consider SQL that might use the index directly.
  • You might choose to add column group histograms, and sometimes that will involve adding column statistics.  However, the number of distinct values on the column group will usually be higher than on the individual columns and can easily be greater than the number of buckets you can have in a frequency histogram.  In such cases, from 12c, you may end up with either a Top-frequency histogram or a hybrid histogram.
  • Or you might choose to remove the histograms from the individual columns so that the column group statistics are used.
  • Or you might choose to enforce the status quo, by setting table statistics preferences to ensure currently existing histograms are preserved and currently, non-existent histograms are not introduced.
Whatever you choose to do regarding statistics and histogram collection, I would certainly recommend doing so declaratively, by defining a table statistic preference.  For example, here I will preserve the histograms on the columns in the column group, but I will also build a histogram on the column group:
exec dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 A B (A,B)');
  • Or, you might even enable the fix_control. You can also do that at session level or even statement level (but beware of disabling any other fix controls that may be set). 
SQL_ID  16judk2v0uf7w, child number 0
-------------------------------------
select /*+FULL(t) OPT_PARAM('_fix_control','6972291:on')*/ count(*)
from t where a = 42 and b=42

Plan hash value: 1071362934
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |    21 (100)|          |      1 |00:00:00.01 |      73 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |     6 |            |          |      1 |00:00:00.01 |      73 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |    100 |   600 |    21   (0)| 00:00:01 |    100 |00:00:00.01 |      73 |
---------------------------------------------------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      OPT_PARAM('_fix_control' '6972291:1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "T"@"SEL$1")
      END_OUTLINE_DATA
  */

EDB360 Test Query

This is the SQL query that produces the report in EDB360.
WITH i as ( /*composite indexes*/
SELECT i.table_owner, i.table_name, i.owner index_owner, i.index_name, i.distinct_keys
,        '('||(LISTAGG('"'||c.column_name||'"',',') WITHIN GROUP (order by c.column_position))||')' column_list
FROM    dba_indexes i
,       dba_ind_columns c
WHERE   i.table_owner = c.table_owner
AND     i.table_name = c.table_name
AND     i.owner = c.index_owner
AND     i.index_name = c.index_name
AND     i.table_name NOT LIKE 'BIN$%'
AND     i.table_owner NOT IN ('ANONYMOUS','APEX_030200','APEX_040000','APEX_040200','APEX_SSO','APPQOSSYS','CTXSYS','DBSNMP','DIP','EXFSYS','FLOWS_FILES','MDSYS','OLAPSYS','ORACLE_OCM','ORDDATA','ORDPLUGINS','ORDSYS','OUTLN','OWBSYS')
AND     i.table_owner NOT IN ('SI_INFORMTN_SCHEMA','SQLTXADMIN','SQLTXPLAIN','SYS','SYSMAN','SYSTEM','TRCANLZR','WMSYS','XDB','XS$NULL','PERFSTAT','STDBYPERF','MGDSYS','OJVMSYS')
GROUP BY i.table_owner, i.table_name, i.owner, i.index_name, i.distinct_keys
HAVING COUNT(*) > 1 /*index with more than one column*/
), e as ( /*extended stats*/
SELECT  e.owner, e.table_name, e.extension_name
,       CAST(e.extension AS VARCHAR(1000)) extension
,       se.histogram, se.num_buckets, se.num_distinct
FROM    dba_stat_extensions e
,       dba_tab_col_statistics se
WHERE   e.creator = 'USER'
AND     se.owner = e.owner
AND     se.table_name = e.table_name
AND     se.column_name = e.extension_name
AND     e.table_name NOT LIKE 'BIN$%'
AND     e.owner NOT IN ('ANONYMOUS','APEX_030200','APEX_040000','APEX_040200','APEX_SSO','APPQOSSYS','CTXSYS','DBSNMP','DIP','EXFSYS','FLOWS_FILES','MDSYS','OLAPSYS','ORACLE_OCM','ORDDATA','ORDPLUGINS','ORDSYS','OUTLN','OWBSYS')
AND     e.owner NOT IN ('SI_INFORMTN_SCHEMA','SQLTXADMIN','SQLTXPLAIN','SYS','SYSMAN','SYSTEM','TRCANLZR','WMSYS','XDB','XS$NULL','PERFSTAT','STDBYPERF','MGDSYS','OJVMSYS')
)
SELECT  e.owner, e.table_name
,       'Extension' object_type
,       e.extension_name object_name, e.num_distinct, e.num_buckets, e.extension
,       sc.column_name
,       sc.num_distinct col_num_distinct
,       sc.num_buckets col_num_buckets
,       sc.histogram col_histogram
FROM    e
,       dba_tab_col_statistics sc
WHERE   e.histogram = 'NONE'
AND     e.extension LIKE '%"'||sc.column_name||'"%'
AND     sc.owner = e.owner
AND     sc.table_name = e.table_name
AND     sc.histogram != 'NONE'
AND     sc.num_buckets > 1 /*histogram on column*/
AND     e.num_buckets = 1 /*no histogram on extended stats*/
UNION ALL
SELECT  /*+  NO_MERGE  */ /* 3c.25 */
        i.table_owner, i.table_name
,       'Index' object_type
,       i.index_name object_name, i.distinct_keys, TO_NUMBER(null), i.column_list
,       sc.column_name
,       sc.num_distinct col_num_distinct
,       sc.num_buckets col_num_buckets
,       sc.histogram col_histogram
From    i
,       dba_ind_columns ic
,       dba_tab_col_statistics sc
WHERE   ic.table_owner = i.table_owner
AND     ic.table_name = i.table_name
AND     ic.index_owner = i.index_owner
AND     ic.index_name = i.index_name
AND     sc.owner = i.table_owner
AND     sc.table_name = ic.table_name
AND     sc.column_name = ic.column_name
AND     sc.histogram != 'NONE' 
AND     sc.num_buckets > 1 /*histogram on column*/
AND NOT EXISTS( /*report index if no extension*/
        SELECT 'x'
        FROM    e
        WHERE   e.owner = i.table_owner
        AND     e.table_name = i.table_name
        AND     e.extension = i.column_list)
ORDER BY 1,2,3,4;

No comments :