The /*+Go-Faster*/ Oracle Blog: January 2020

Tuesday, January 28, 2020

Data Warehouse Design: Snowflake Dimensions and Lost Skew Trap

This post is part of a series that discusses some common issues in data warehouses. Originally written in 2018, but I never got round to publishing it.
While I was experimenting with the previous query I noticed that the cost of the execution plans didn't change as I changed the COUNTRY_ISO_CODE, yet the data volumes for different countries are very different.

select c.country_name
,      u.cust_state_province
,      COUNT(*) num_sales
,      SUM(s.amount_sold) total_amount_sold
from   sales s
,      customers u
,      products p
,      times t
,      countries c
WHERE  s.time_id = t.time_id
AND    s.prod_id = p.prod_id
AND    u.cust_id = s.cust_id
AND    u.country_id = c.country_id
AND    c.country_iso_code = '&&iso_country_code'
AND    p.prod_category_id = 205
and    t.fiscal_year = 1999
GROUP BY c.country_name, u.cust_state_province
ORDER BY 1,2
/

Plan hash value: 3095970037
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   Id  | Operation                                 | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT                          |                           |      1 |        |       |  1473 (100)|          |       |       |     45 |00:00:01.77 |     101K|       |       |          |
|     1 |  TEMP TABLE TRANSFORMATION                |                           |      1 |        |       |            |          |       |       |     45 |00:00:01.77 |     101K|       |       |          |
|     2 |   LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D7C68_A4BC21 |      1 |        |       |            |          |       |       |      0 |00:00:00.13 |    1889 |  1024 |  1024 |          |
|  *  3 |    HASH JOIN                              |                           |      1 |   2413 | 94107 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.10 |    1888 |  1185K|  1185K|  639K (0)|
|  *  4 |     TABLE ACCESS FULL                     | COUNTRIES                 |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|     5 |     TABLE ACCESS FULL                     | CUSTOMERS                 |      1 |  55500 |  1138K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |       |       |          |
|     6 |   SORT GROUP BY                           |                           |      1 |   2359 |   101K|  1055   (1)| 00:00:01 |       |       |     45 |00:00:01.65 |   99111 |  6144 |  6144 | 6144  (0)|
|  *  7 |    HASH JOIN                              |                           |      1 |   3597 |   154K|  1054   (1)| 00:00:01 |       |       |  64818 |00:00:01.58 |   99111 |  2391K|  1595K| 2025K (0)|
|     8 |     TABLE ACCESS FULL                     | SYS_TEMP_0FD9D7C68_A4BC21 |      1 |   2413 | 62738 |     5   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
|     9 |     VIEW                                  | VW_ST_C525CEF3            |      1 |   3597 | 64746 |  1048   (1)| 00:00:01 |       |       |  64818 |00:00:01.44 |   99111 |       |       |          |
…

Note:

There are 55500 rows on CUSTOMERS
There are 23 rows on COUNTRIES
Oracle expects 2413 rows on joining those tables

55500÷23= 2413.04, so Oracle assumes the data is evenly distributed between countries, although there are histograms on COUNTRY_ISO_CODE and COUNTRY_ID.
This is sometimes called 'lost skew'. The skew of a dimension does not pass into the cardinality calculation on the fact table.

See Jonathan Lewis' blog: Bitmap John Indexes 2.

If I replace the predicate on COUNTRY_ISO_CODE with a predicate on COUNTRY_ID then the estimate of the number of rows from customers is correctly 18520 rows. The cost of the star transformation has gone up from 1473 to 6922.

Plan hash value: 1339390240

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                           |      1 |        |       |  6922 (100)|          |       |       |     45 |00:00:01.50 |   97998 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION                |                           |      1 |        |       |            |          |       |       |     45 |00:00:01.50 |   97998 |       |       |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D7C6A_A4BC21 |      1 |        |       |            |          |       |       |      0 |00:00:00.06 |    1524 |  1024 |  1024 |          |
|   3 |    NESTED LOOPS                           |                           |      1 |  18520 |   651K|   417   (1)| 00:00:01 |       |       |  18520 |00:00:00.04 |    1523 |       |       |          |
|   4 |     TABLE ACCESS BY INDEX ROWID           | COUNTRIES                 |      1 |      1 |    15 |     1   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|   5 |      INDEX UNIQUE SCAN                    | COUNTRIES_PK              |      1 |      1 |       |     0   (0)|          |       |       |      1 |00:00:00.01 |       1 |       |       |          |
|   6 |     TABLE ACCESS FULL                     | CUSTOMERS                 |      1 |  18520 |   379K|   416   (1)| 00:00:01 |       |       |  18520 |00:00:00.03 |    1521 |       |       |          |
|   7 |   SORT GROUP BY                           |                           |      1 |   2359 |   101K|  6505   (1)| 00:00:01 |       |       |     45 |00:00:01.43 |   96473 |  6144 |  6144 | 6144  (0)|
|   8 |    HASH JOIN                              |                           |      1 |  82724 |  3554K|  6499   (1)| 00:00:01 |       |       |  64818 |00:00:01.37 |   96473 |  2391K|  1595K| 2002K (0)|
|   9 |     TABLE ACCESS FULL                     | SYS_TEMP_0FD9D7C6A_A4BC21 |      1 |  18520 |   470K|    25   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
…

In fact, I only get the star transformation if I force the issue with a STAR_TRANSFORMATION hint. Otherwise, I get the full scan plan which is much cheaper, but again the cardinality calculation on CUSTOMERS is correct.

Plan hash value: 3784979335
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name         | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |              |      1 |        |       |  1595 (100)|          |       |       |     45 |00:00:00.54 |    2065 |    472 |       |       |          |
|   1 |  SORT GROUP BY                    |              |      1 |     45 |  3510 |  1595   (3)| 00:00:01 |       |       |     45 |00:00:00.54 |    2065 |    472 |  6144 |  6144 | 6144  (0)|
|   2 |   HASH JOIN                       |              |      1 |  81133 |  6180K|  1589   (3)| 00:00:01 |       |       |  64818 |00:00:00.43 |    2065 |    472 |  2337K|  2200K| 2221K (0)|
|   3 |    TABLE ACCESS FULL              | CUSTOMERS    |      1 |  18520 |   379K|   416   (1)| 00:00:01 |       |       |  18520 |00:00:00.02 |    1521 |      0 |       |       |          |
|   4 |    HASH JOIN                      |              |      1 |  81133 |  4516K|  1172   (3)| 00:00:01 |       |       |    110K|00:00:00.35 |     544 |    472 |  2546K|  2546K| 1610K (0)|
|   5 |     TABLE ACCESS FULL             | PRODUCTS     |      1 |     26 |   208 |     3   (0)| 00:00:01 |       |       |     26 |00:00:00.01 |       4 |      0 |       |       |          |
|   6 |     HASH JOIN                     |              |      1 |    229K|    10M|  1167   (3)| 00:00:01 |       |       |    246K|00:00:00.30 |     539 |    472 |  1133K|  1133K| 1698K (0)|
|   7 |      PART JOIN FILTER CREATE      | :BF0000      |      1 |    364 |  9828 |    17   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      57 |      0 |       |       |          |
|   8 |       NESTED LOOPS                |              |      1 |    364 |  9828 |    17   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      57 |      0 |       |       |          |
|   9 |        TABLE ACCESS BY INDEX ROWID| COUNTRIES    |      1 |      1 |    15 |     1   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |      0 |       |       |          |
|  10 |         INDEX UNIQUE SCAN         | COUNTRIES_PK |      1 |      1 |       |     0   (0)|          |       |       |      1 |00:00:00.01 |       1 |      0 |       |       |          |
|  11 |        TABLE ACCESS FULL          | TIMES        |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|  12 |      PARTITION RANGE JOIN-FILTER  |              |      1 |    918K|    19M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.21 |     482 |    472 |       |       |          |
|  13 |       TABLE ACCESS FULL           | SALES        |      5 |    918K|    19M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.20 |     482 |    472 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Friday, January 24, 2020

Online Statistics Collection during Bulk Loads on Partitioned Tables

Introduction

One of the enhancements to statistics collection and management in Oracle 12c was the ability of the database will automatically collect statistics during either a create-table-as-select operation or during the initial insert into a freshly created or freshly truncated table, provide that insert is done in direct-path mode (i.e. using the APPEND hint).
When that occurs, there is an additional operation in the execution plan; OPTIMIZER STATISTICS GATHERING.

----------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |                 |       |       |   495K(100)|          |
|   1 |  LOAD AS SELECT                  |                 |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |                 |    70M|    11G|   495K  (2)| 00:00:20 |
|   3 |    TABLE ACCESS FULL             | XXXXXXXXXXXXXXX |    70M|    11G|   495K  (2)| 00:00:20 |
----------------------------------------------------------------------------------------------------

The motivation for this blog was encountering a bulk insert into a partitioned table where the statistics gathering operation consumed a very significant amount of time. Partitioning gives you more things to consider.

A Simple Test

I created a simple test that compares the time taken by online statistics collection on partitioned and non-partitioned tables, with the explicit collection of statistics using DBMS_STATS. I have four tables with the same structure.

T1: Not partitioned. Data will be copied from this table to each of the others.
T2: Partitioned. Online statistics only.
T3: Partitioned. Explicitly gathered statistics.
T4: Partitioned. Explicitly gathered incremental statistics.

CREATE TABLE T1 (a number, b varchar2(1000), c number) NOLOGGING;
CREATE TABLE T2 (a number, b varchar2(1000), c number) 
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
CREATE TABLE T3 (a number, b varchar2(1000), c number) 
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
CREATE TABLE T4 (a number, b varchar2(1000), c number) 
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;

I loaded 100 million rows into each in direct-path mode. The partitioned tables end up with 100 partitions, each with 1 million rows. I have also suppressed redo logging during the direct-path insert by creating the tables with the NOLOGGING attribute.

Online statistics will be collected on tables T1 and T2.
Online statistics collection will be suppressed on T3 and T4 by using the NO_GATHER_OPTIMIZER_STATISTICS hint. Instead, statistics will be explicitly collected on T3 and T4.
Incremental Statistics are enabled on T4, they collect partition-level statistics and calculate table-level statistics using synopses.

see also Efficient Statistics Maintenance for Partitioned Tables Using Incremental Statistics – Part 1 (Nigel Bayliss).

EXEC dbms_stats.set_table_prefs(user,'T3','INCREMENTAL','FALSE');
EXEC dbms_stats.set_table_prefs(user,'T4','INCREMENTAL','TRUE');

The following set of tests will be run for different combinations of:

Parallel hint on query, or not
Parallel hint on insert, or not
Table parallelism specified
Parallel DML enabled or disabled at session level
Column-specific METHOD_OPT table preference specified or not.

I enabled SQL trace, from which I was able to obtain the elapsed time of the various statements, and I can determine the amount of time spent on online statistics gathering from timings on the OPTIMIZER STATISTICS GATHERING operation in the execution plan in the trace.

TRUNCATE TABLE T2;
TRUNCATE TABLE T3;
EXEC dbms_stats.delete_table_stats(user,'T2');
EXEC dbms_stats.delete_table_stats(user,'T3');
EXEC dbms_stats.delete_table_stats(user,'T4');

INSERT /*+APPEND &inshint*/ into T2 i SELECT * /*+&selhint*/ from t1 s;
INSERT /*+APPEND &inshint NO_GATHER_OPTIMIZER_STATISTICS*/ into T3 i SELECT /*+&selhint*/ * from t1 s;
INSERT /*+APPEND &inshint NO_GATHER_OPTIMIZER_STATISTICS*/ into T4 i SELECT /*+&selhint*/ * from t1 s;
commit;
EXEC dbms_stats.gather_table_stats(user,'T3');
EXEC dbms_stats.gather_table_stats(user,'T4');

Quirks

It was while building this test that I discovered a couple of quirks:

Specifying a column-specific METHOD_OPT table statistics preference disables online statistics collection (bug 30787109) - see Online Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference.
12c Online Statistics Gathering Not Working For Insert As Select (Doc ID 2328896.1) – unless all columns are inserted into.

What Statistics Are Normally Collected by Online Statistics Gathering?

After just the initial insert, I can see that I have table statistics on T1 and T2, but not on T3 and T4.

SELECT table_name, num_rows from user_tables where table_name LIKE 'T_' order by 1; 

TABLE_NAME   NUM_ROWS LAST_ANALYZED
---------- ---------- -----------------
T1           10000000 14:07:36 16/01/20
T2           10000000 14:07:36 16/01/20
T3
T4

I also have column statistics on T1 and T2, but no histograms.

break on table_name skip 1
SELECT table_name, column_name, num_distinct, global_stats, histogram, num_buckets, last_analyzed
FROM user_tab_columns where table_name like 'T_' order by 1,2; 

TABLE_NAME COLUMN_NAME  NUM_DISTINCT GLO HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
---------- ------------ ------------ --- --------------- ----------- -----------------
T1         A                   10000 YES NONE                      1 14:06:58 16/01/20
           B                   10000 YES NONE                      1 14:06:58 16/01/20
           C                     100 YES NONE                      1 14:06:58 16/01/20

T2         A                   10000 YES NONE                      1 14:07:11 16/01/20
           B                   10000 YES NONE                      1 14:07:11 16/01/20
           C                     100 YES NONE                      1 14:07:11 16/01/20

T3         A                         NO  NONE
           B                         NO  NONE
           C                         NO  NONE

T4         A                         NO  NONE
           B                         NO  NONE
           C                         NO  NONE

However, I do not have any partition statistics (I have only shown the first and last partition of each table in this report).

break on table_name skip 1
SELECT table_name, partition_position, partition_name, num_rows 
FROM user_tab_partitions WHERE table_name like 'T_' ORDER BY 1,2 nulls first;

TABLE_NAME PARTITION_POSITION PARTITION_NAME         NUM_ROWS LAST_ANALYZED
---------- ------------------ -------------------- ---------- -----------------
T2                          1 T_PART
…
                          100 SYS_P20008

T3                          1 T_PART
…
                          100 SYS_P20107

T4                          1 T_PART
…
                          100 SYS_P20206

Online optimizer statistics gathering only collects statistics at table level but not partition or sub-partition level. Histograms are not collected.

See also Online Statistics Gathering for Bulk Loads in Oracle Database 12c Release 1 (12.1) – Tim Hall

From Oracle 18c, there are two undocumented parameters that modify this behaviour. Both default to false. Interestingly, both are enabled in the Oracle Autonomous Data Warehouse.

If _optimizer_gather_stats_on_load_hist=TRUE histograms are be collected on all columns during online statistics collection.
If _optimizer_gather_stats_on_load_all=TRUE statistics are collected online during every direct-path insert, not just the first one into a segment.

Do I Need Partition Statistics?

Statistics will be collected on partitions that do not have them when the automatic statistics collection job runs in the next database maintenance window. The question is whether to manage without them until then?
"The optimizer will use global or table level statistics if one or more of your queries touches two or more partitions. The optimizer will use partition level statistics if your queries do partition elimination, such that only one partition is necessary to answer each query. If your queries touch two or more partitions the optimizer will use a combination of global and partition level statistics."
– Oracle The Data Warehouse Insider Blog: Managing Optimizer Statistics in an Oracle Database 11g - Maria Colgan
It will depend upon the nature of the SQL in the application. If the optimizer does some partition elimination, and the data is not uniformly distributed across the partitions, then partition statistics are likely to be beneficial. If there is no partition elimination, then you might question whether partitioning (or at least the current partitioning strategy) is appropriate!

What is the Fastest Way to Collect Statistics on Partitioned Tables?

Let's look at how long it takes to insert data into, and then subsequently collect statistics on the tables in my example. This test was run on Oracle 19c on one compute node of a virtualised Exadata X4 machine with 16 CPUs. This table shows elapsed time and the total DB time include all parallel server processes for each operation.

Table Name	Oper-ation	Comment	Option	Serial Insert & Statistics	Parallel Insert & Statistics	Parallel SQL & Statistics	Parallel DML, Insert, Select & Statistics	Parallel DML, SQL & Statistics	Parallel Tables	Parallel Tables & DML	Parallel Tables, DML & Method Opt
			Table	NOPARALLEL	NOPARALLEL	NOPARALLEL	NOPARALLEL	NOPARALLEL	PARALLEL	PARALLEL	PARALLEL
			Insert Hint	blank	PARALLEL(i)	blank	PARALLEL(i)	blank	blank	blank	blank
			Select Hint	blank	PARALLEL(s)	PARALLEL	PARALLEL(s)	PARALLEL	blank	blank	blank
			Parallel DML	DISABLE	DISABLE	DISABLE	ENABLE	ENABLE	DISABLE	ENABLE	ENABLE
			Stats Degree	none	DEFAULT	DEFAULT	DEFAULT	DEFAULT	none	none	none
			Method Opt	none	none	none	none	none	none	none	… FOR COLUMNS SIZE 1 A
T2	Insert	Online Stats Gathering	Elapsed Time (s)	172.46	160.86	121.61	108.29	60.31	194.47	23.57	20.57
T2	Insert	Optimizer Statistics Gathering		82.71	55.59	55.90	-	-	-	-	-
T3	Insert	NO_GATHER_OPTIMIZER_STATS		125.40	156.36	124.18	20.62	29.01	199.20	20.97	21.15
T3	Explicit Stats			122.80	146.25	63.23	15.99	24.88	24.58	24.99	24.62
T4	Insert	NO_GATHER_OPTIMIZER_STATS		123.18	158.15	147.04	20.44	29.91	204.61	20.65	20.60
T4	Incremental Explicit Stats			80.51	104.85	46.05	23.42	23.14	23.21	22.60	23.03
T2	Insert	Online Stats Gathering	DB Time (s)	174	163	169	359	337	248	366	308
T3	Insert	NO_GATHER_OPTIMIZER_STATS		128	193	160	290	211	236	312	326
T3	Explicit Stats			122	146	63		265	305	262	335
T4	Insert	NO_GATHER_OPTIMIZER_STATS		126	194	167	295	205	233	304	295
T4	Incremental Explicit Stats			80	105	2	281	266	300	179	226

It is difficult to determine the actual duration of the OPTIMIZER STATISTICS GATHERING operation, short of measuring the effect of disabling it. The time in the above table has been taken from SQL trace files. That duration is always greater than the amount saved by disabling online statistics gathering with the NO_GATHER_OPTIMIZER_STATS hint. However, the amount of time accounted in Active Session History (ASH) for that line in the execution plan is usually less than the elapsed saving.

Eg. For the sequential insert, 83s was accounted for OPTIMIZER STATISTICS GATHERING in the trace, while ASH showed only 23s of database time for that line of the plan. However, perhaps the only meaningful measurement is that disabling online statistics gathering saved 47s,

DML statements, including insert statements in direct-path, only actually execute in parallel if parallel DML is enabled. Specifying a degree of parallelism on the table, or a parallel hint is not enough. Parallel DML should be enabled.

either at session level

ALTER SESSION ENABLE PARALLEL DML;

or for the individual statement.

insert /*+APPEND ENABLE_PARALLEL_DML*/ into T2 SELECT * from t1;

Specifying parallel insert with a hint, without enabling parallel DML will not improve performance and can actually degrade it.
Specifying parallel query without running the insert in parallel can also degrade performance.

Online statistics will be collected in parallel if

either the table being queried has a degree of parallelism,
or a parallel hint applies to the table being queried, or the entire statement,
or parallel DML has been enabled

Where statistics are collected explicitly (i.e. with a call to DBMS_STATS.GATHER_TABLE_STATS) they are collected in parallel if

either, the DEGREE is specified (I specified a table statistics preference),

EXEC dbms_stats.set_table_prefs(user,'T3','DEGREE','DBMS_STATS.DEFAULT_DEGREE');

or the table has a degree of parallelism.

ALTER TABLE T3 PARALLEL;

Incremental statistics are generally faster to collect because they calculate table-level statistics from partition-level statistics, saving a second pass through the data.
When parallel DML is enabled at session level, I found that the performance of statistics collection also improves.

Conclusion

Overall, the best performance was obtained when the tables were altered to use parallelism, and parallel DML was enabled; then the query, insert and statistics collection are performed in parallel.
However, the improved performance of parallelism comes at a cost. It can be a brutal way of bringing more resource to bear on an activity. A parallel operation can be expected to use more database time across all the parallel server processes than the same operation processed serially. My best results were obtained by activating all of the CPUs on the server without regard for any other activity. Too many concurrent parallel operations have the potential to overload a system. Remember also, that while the parallel attribute remains on the table any subsequent query will also run in parallel.
Suppressing online statistics collection saves total database time whether working in parallel or not. The saving in elapsed time is reduced when the insert and query are running in parallel. The time taken to explicitly collect statistics will exceed that saving because it is doing additional work to collect partition statistics not done during online statistics collection.
Using incremental statistics for partitioned tables will also reduce the total amount of work and database time required to gather statistics, but may not significantly change the elapsed time to collect statistics.
If you need table statistics but can manage without partition statistics until the next maintenance window, then online statistics collection is very effective. However, I think the general case will be to require partition statistics, so you will probably need to explicitly collect statistics instead. If you want histograms, then you will also need to explicitly collect statistics.

Tuesday, January 21, 2020

Analysing Database Time with Active Session History for Statements with Online Optimizer Statistics Gathering Operations

(Updated 30.1.2020) I have been looking into the performance of online statistics collection. This feature was introduced in 12c. When statistics are collected online there is an extra OPTIMIZER STATISTICS GATHERING operation in the execution plan. However, I have noticed that the presence or absence of this operation does not change the hash value of the plan. I am grateful to RobK for his comment pointing out that the full plan hash value does differ.
This has consequences for profiling DB time by execution plan line and then describing that line from a captured plan.

OPTIMIZER STATISTICS GATHERING Operation

When statistics are collected online during either a create-table-as-select operation or the initial direct-path insert into a new segment. Below, I have different statements, whose execution plans have the same plan hash value, but actually differ. So, the differences are in areas that do not contribute to the plan hash value.

The first statement performs online statistics gathering, and so the plan includes the OPTIMIZER STATISTICS GATHERING operation, the second does not.
Note also that the statements insert into different tables, and that does not alter the plan hash value either. However, if the data was queried from different tables that would have produced a different plan hash value.

SQL_ID  c2dy3rmnqp7d7, child number 0
-------------------------------------
insert /*+APPEND*/ into T2 select * from t1

Plan hash value: 1069440229
-----------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |      |       |       |   187K(100)|          |
|   1 |  LOAD AS SELECT                  | T2   |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |   100M|  4291M|   187K  (1)| 00:00:08 |
|   3 |    TABLE ACCESS STORAGE FULL     | T1   |   100M|  4291M|   187K  (1)| 00:00:08 |
-----------------------------------------------------------------------------------------

SQL_ID  5kr7cpvnmdwat, child number 0
-------------------------------------
insert /*+APPEND NO_GATHER_OPTIMIZER_STATISTICS*/ into T4 select * from t1

Plan hash value: 1069440229
-----------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT           |      |       |       |   187K(100)|          |
|   1 |  LOAD AS SELECT            | T4   |       |       |            |          |
|   2 |   TABLE ACCESS STORAGE FULL| T1   |   100M|  4291M|   187K  (1)| 00:00:08 |
-----------------------------------------------------------------------------------

I find that it is often useful to profile database time from DBA_HIST_ACTIVE_SESS_HISTORY (or v$active_session_history) by line in the execution plan, in order to see how much time was consumed by the different operations. I can then join the profile to DBA_HIST_SQL_PLAN (or v$sql_plan) to see what is the operation for each line. So long as I also join these tables by SQL_ID, the answer I get will be correct, but I may not always get an answer.

column inst_id heading 'Inst|Id' format 99
column sql_plan_line_id heading 'SQL Plan|Line ID'
column sql_plan_hash_value heading 'SQL Plan|Hash Value'
column ash_secs heading 'ASH|Secs' format 999
break on sql_id skip 1
with h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_Value
,      SUM(10) ash_secs
FROM   dba_hist_Active_Sess_history h
WHERE  h.sql_plan_hash_value = 1069440229
AND    h.sql_id IN('c2dy3rmnqp7d7','5kr7cpvnmdwat','g7awpb71jbup1')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_Value
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN dba_hist_sql_plan p
  ON p.dbid = h.dbid
  and p.sql_id = h.sql_id
  AND p.plan_hash_value = h.sql_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/

You can see that while all the plans have the same hash value, the plan that included the extra OPTIMIZER STATISTICS GATHERING operation produced a different full plan hash value.
If the plan was not captured into AWR or is no longer in the library cache (as is the case for SQL ID g7awpb71jbup1), I don't get a description of the operations in the plan.

                           SQL Plan   SQL Plan SQL Full Plan  ASH
      DBID SQL_ID           Line ID Hash Value    Hash Value Secs OPERATION
---------- ------------- ---------- ---------- ------------- ---- --------------------------------
1278460406 5kr7cpvnmdwat          1 1069440229    2621293920   90 LOAD AS SELECT
1278460406                        2 1069440229    2621293920   30 TABLE ACCESS

1278460406 c2dy3rmnqp7d7          1 1069440229    3552282137  170 LOAD AS SELECT
1278460406                        2 1069440229    3552282137   10 OPTIMIZER STATISTICS GATHERING
1278460406                        3 1069440229    3552282137   60 TABLE ACCESS
1278460406                          1069440229    3552282137   20

1278460406 g7awpb71jbup1          1 1069440229    2621293920   90
1278460406                        2 1069440229    2621293920   20
1278460406                          1069440229    2621293920   10

Normally, I would look for another SQL_ID that produced the same plan hash value. FULL_PLAN_HASH_VALUE is recorded on V$SQL_PLAN, so I can join that to V$ACTIVE_SESSION_HISTORY.

WITH h as (
SELECT h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_Value, h.sql_full_plan_hash_value
,      SUM(1) ash_secs
FROM   v$active_session_history h
WHERE  h.sql_plan_hash_value = 1069440229
AND    h.sql_id IN('c2dy3rmnqp7d7','g7awpb71jbup1','5kr7cpvnmdwat')
GROUP BY h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_Value, h.sql_full_plan_hash_value
), p as (
SELECT DISTINCT plan_hash_value, full_plan_hash_value, id, operation
from v$sql_plan
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN p
  ON p.plan_hash_value = h.sql_plan_hash_value
  AND p.full_plan_hash_value = h.sql_full_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/

                SQL Plan   SQL Plan SQL Full Plan  ASH
SQL_ID           Line ID Hash Value    Hash Value Secs OPERATION
------------- ---------- ---------- ------------- ---- --------------------------------
5kr7cpvnmdwat          0 1069440229    2621293920    2 INSERT STATEMENT
                       1 1069440229    2621293920   98 LOAD AS SELECT
                       2 1069440229    2621293920   24 TABLE ACCESS

c2dy3rmnqp7d7          0 1069440229    3552282137    6 INSERT STATEMENT
                       1 1069440229    3552282137  137 LOAD AS SELECT
                       2 1069440229    3552282137   46 OPTIMIZER STATISTICS GATHERING
                       3 1069440229    3552282137   74 TABLE ACCESS
                         1069440229    3552282137    1

g7awpb71jbup1          0 1069440229    2621293920    3 INSERT STATEMENT
                       1 1069440229    2621293920   93 LOAD AS SELECT
                       2 1069440229    2621293920   28 TABLE ACCESS

However, while the full plan hash value is recorded in the ASH data in AWR (in DBA_HIST_ACTIVE_SESS_HISTORY), it is not on the plan history (DBA_HIST_SQL_PLAN). So, for an execution plan that only sometimes includes online statistics gathering, the operations may not match correctly because the OPTIMIZER STATISTICS GATHERING operation changes the line IDs.

WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_Value, h.sql_full_plan_hash_value
,      SUM(10) ash_secs
FROM   dba_hist_Active_Sess_history h
WHERE  h.sql_plan_hash_value = 1069440229
AND    h.sql_id IN('c2dy3rmnqp7d7','5kr7cpvnmdwat','g7awpb71jbup1')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_Value, h.sql_full_plan_hash_value
), p as (
SELECT DISTINCT dbid, plan_hash_value, id, operation
from dba_hist_sql_plan
)
SELECT h.*, p.operation
FROM   h
  LEFT OUTER JOIN p
  ON p.dbid = h.dbid
  AND p.plan_hash_value = h.sql_plan_hash_value
  AND p.id = h.sql_plan_line_id
ORDER BY 1,2,3
/

If I just join the ASH profile to a distinct list of ID and operation for the same plan hash value but matching any SQL_ID, I can get duplicate rows returned, starting at the line with the OPTIMIZER STATISTICS GATHERING operation because I have different plans with the same plan hash value.

                           SQL Plan   SQL Plan SQL Full Plan  ASH
      DBID SQL_ID           Line ID Hash Value    Hash Value Secs OPERATION
---------- ------------- ---------- ---------- ------------- ---- --------------------------------
1278460406 5kr7cpvnmdwat          1 1069440229    2621293920   90 LOAD AS SELECT
1278460406                        2 1069440229    2621293920   30 OPTIMIZER STATISTICS GATHERING
1278460406                        2 1069440229    2621293920   30 TABLE ACCESS

1278460406 c2dy3rmnqp7d7          1 1069440229    3552282137  170 LOAD AS SELECT
1278460406                        2 1069440229    3552282137   10 TABLE ACCESS
1278460406                        2 1069440229    3552282137   10 OPTIMIZER STATISTICS GATHERING
1278460406                        3 1069440229    3552282137   60 TABLE ACCESS
1278460406                          1069440229    3552282137   20

...

To mitigate this problem, in the following SQL Query, I look for the plan in the shared pool if it is available, and failing that in AWR where I check that the maximum plan line ID for which I have ASH data matches the maximum line ID (i.e. the number of lines) in any alternative plan with the same hash value.

WITH h as (
SELECT h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value, h.sql_full_plan_hash_value
,      SUM(10) ash_secs
FROM   dba_hist_Active_Sess_history h
WHERE  h.sql_plan_hash_value = 1069440229
AND    h.sql_id IN('c2dy3rmnqp7d7','5kr7cpvnmdwat','g7awpb71jbup1')
GROUP BY h.dbid, h.sql_id, h.sql_plan_line_id, h.sql_plan_hash_value, h.sql_full_plan_hash_value
), x as (
SELECT h.*
,      MAX(sql_plan_line_id) OVER (PARTITION BY h.dbid, h.sql_id, h.sql_plan_hash_value) plan_lines
,      p1.operation
FROM   h
  LEFT OUTER JOIN dba_hist_sql_plan p1
  ON  p1.dbid = h.dbid
  AND p1.sql_id = h.sql_id
  AND p1.plan_hash_value = h.sql_plan_hash_value
  AND p1.id = h.sql_plan_line_id
)
SELECT x.*
, COALESCE(
   (SELECT p2.operation
    FROM   gv$sql_plan p2
    WHERE  p2.plan_hash_value = x.sql_plan_hash_value
    AND    p2.full_plan_hash_value = x.sql_full_plan_hash_value
    AND    p2.id = x.sql_plan_line_id
    AND    rownum = 1)
  ,(SELECT p3.operation
    FROM   dba_hist_sql_plan p3
    WHERE  p3.dbid = x.dbid
    AND    p3.plan_hash_value = x.sql_plan_hash_value
    AND    p3.id = x.sql_plan_line_id
    AND    p3.sql_id IN(
                    SELECT p.sql_id
                    FROM   dba_hist_sql_plan p
                    WHERE  p.dbid = x.dbid
                    AND    p.plan_hash_value = x.sql_plan_hash_value
                    GROUP BY p.dbid, p.sql_id
                    HAVING MAX(p.id) = x.plan_lines)
    AND    rownum = 1)
   ) operation2
FROM   x
ORDER BY 1,2,3
/

Now, I get an operation description for every line ID (if the same plan was gathered for a different SQL_ID).

                           SQL Plan   SQL Plan SQL Full Plan  ASH
      DBID SQL_ID           Line ID Hash Value    Hash Value Secs PLAN_LINES OPERATION                      OPERATION2
---------- ------------- ---------- ---------- ------------- ---- ---------- ------------------------------ ------------------------------
1278460406 5kr7cpvnmdwat          1 1069440229    2621293920   90          2 LOAD AS SELECT                 LOAD AS SELECT
1278460406                        2 1069440229    2621293920   30          2 TABLE ACCESS                   TABLE ACCESS

1278460406 c2dy3rmnqp7d7          1 1069440229    3552282137  170          3 LOAD AS SELECT                 LOAD AS SELECT
1278460406                        2 1069440229    3552282137   10          3 OPTIMIZER STATISTICS GATHERING OPTIMIZER STATISTICS GATHERING
1278460406                        3 1069440229    3552282137   60          3 TABLE ACCESS                   TABLE ACCESS
1278460406                          1069440229    3552282137   20          3

1278460406 g7awpb71jbup1          1 1069440229    2621293920   90          2                                LOAD AS SELECT
1278460406                        2 1069440229    2621293920   20          2                                TABLE ACCESS
1278460406                          1069440229    2621293920   10          2

However, this approach, while better, is still not perfect. I may not have sufficient DB time for the last line in the execution plan to be sampled, and therefore I may not choose a valid alternative plan.

Autonomous & Cloud Databases

Automatic online statistics gathering is becoming a more common occurrence.

In the Autonomous Data Warehouse, Oracle has set _optimizer_gather_stats_on_load_all=TRUE, so statistics are collected on every direct-path insert.
From 19c, on Engineered Systems (both in the cloud and on-premises), Real-Time statistics are collected during conventional DML (on inserts, updates and some deletes), also using the OPTIMIZER STATISTICS GATHERING operation. Again, the presence or absence of this operation does not affect the execution plan hash value.

Wednesday, January 15, 2020

Online Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference

I have come across a quirk where the presence of a table statistics preference that specifies METHOD_OPT that is specific to some columns disables online statistics gathering. This behaviour is at least not documented. I have reproduced this in Oracle version 12.1.0.2 and 19.3.

Demonstration

I will create two identical tables, but on the first table, I will specify a table statistic preference to collect a histogram on column C.

set serveroutput on verify on autotrace off
CREATE TABLE t1(a number, b varchar2(1000), c number);
CREATE TABLE t2(a number, b varchar2(1000), c number);
exec dbms_stats.set_table_prefs(user,'t1','METHOD_OPT','FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 C');

To make the test rerunnable, I will truncate each table, delete any statistics (because truncate does not delete statistics) and then populate the table again in direct-path mode.

TRUNCATE TABLE t1;
EXEC dbms_stats.delete_table_stats(user,'T1');
INSERT /*+APPEND*/ INTO t1 
SELECT ROWNUM a, TO_CHAR(TO_DATE(rownum,'J'),'Jsp') b, CEIL(SQRT(rownum)) c 
FROM dual CONNECT BY level <= 1e5;

TRUNCATE TABLE t2;
EXEC dbms_stats.delete_table_stats(user,'T2');
INSERT /*+APPEND*/ INTO t2
SELECT ROWNUM a, TO_CHAR(TO_DATE(rownum,'J'),'Jsp') b, CEIL(SQRT(rownum)) c 
FROM dual CONNECT BY level <= 1e5;
COMMIT;

I expect to get statistics on both tables.

alter session set nls_date_Format = 'hh24:mi:ss dd/mm/yy';
column table_name format a10
column column_name format a11
SELECT table_name, num_rows, last_analyzed FROM user_tables WHERE table_name LIKE 'T_' ORDER BY 1;
SELECT table_name, column_name, num_distinct, histogram, num_buckets FROM user_tab_columns WHERE table_name LIKE 'T_' ORDER BY 1,2;

But I only get table and column statistics on T2, the one without the statistics preference.

TABLE_NAME   NUM_ROWS LAST_ANALYZED
---------- ---------- -----------------
T1
T2             100000 10:08:30 15/01/20

Table Column
Name  Name   NUM_DISTINCT HISTOGRAM       NUM_BUCKETS
----- ------ ------------ --------------- -----------
T1    A                   NONE
T1    B                   NONE
T1    C                   NONE
T2    A            100000 NONE                      1
T2    B             98928 NONE                      1
T2    C               317 NONE                      1

It appears that I don't get statistics on T1 because I have specified a METHOD_OPT table statistic preference that is specific to some named columns. It doesn't have to specify creating a histogram, it might be preventing a histogram from being created.
For example, this preference does not disable online statistics collection.

EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR ALL COLUMNS SIZE 1');

But these preferences do disable online statistics collection.

EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR COLUMNS SIZE 1 B C');
EXEC dbms_stats.set_table_prefs(user,'t2','METHOD_OPT','FOR COLUMNS SIZE 1 A B C');

I have not found any other statistics preferences (for other DBMS_STATS parameters) that cause this behaviour.

Conclusion

Table preferences are recommended as a method of controlling statistics collection declaratively and consistently. You don't have to specify parameters to DBMS_STATS into scripts that collect statistics ad-hoc. The table statistics preferences provide a method that every time statistics are collected on a particular table, they are collected consistently, albeit in a way that may be different from the default.
However, take the example of an ETL process loading data into a data warehouse. If you rely on online statistics gathering to collect table statistics as a part of a data load process, you must now be careful not to disable statistics collection during the load with METHOD_OPT statistics preferences.

Autonomous & 19c Addendum

(Updated 21st January 2020) In the Oracle Autonomous Data Warehouse, in order to make statistics collection as self-managing as possible, Oracle sets two undocumented parameters.

_optimizer_gather_stats_on_load_hist=TRUE - so that histograms are created on every column when online statistics are gathered
_optimizer_gather_stats_on_load_all=TRUE - so that online statistics are collected for every direct-path insert, not just the first one.

Creating a column specific METHOD_OPT statistics preference disables this behaviour.
From 19c, on Engineered systems, Oracle collects Real-Time statistics. The METHOD_OPT preference does not disable this behaviour, but you only get Real-Time statistics if you have conventional statistics in the first place. If you don't get statistics after a bulk load, then you won't get Real-Time statistics until conventional statistics have been explicitly gathered, for example, by the maintenance job.

(Update 22nd January 2020) This has been filed as bug 30787109.

Monday, January 06, 2020

Partial Indexing

This is the second of two blog posts that discuss sparse and partial indexing.

Problem Statement

(This is the same problem statements as for sparse indexing)
It is not an uncommon requirement to find rows that match a rare value in a column with a small number of distinct values. So, the distribution of values is skewed. A typical example is a status column where an application processes newer rows that are a relatively small proportion of the table because over time the majority of rows have been processed and are at the final status.
An index is effective at finding the rare values, but it is usually more efficient to scan the table for the common values. A histogram is would almost certainly be required on such a column. However, if you build an index on the column you have to index all the rows. The index is, therefore, larger and requires more maintenance. Could we not just index the rare values for which we want to use the index to find?

Oracle does not index null values. If we could engineer that the common value was null, and then the index would only contain the rare values. This is sometimes called sparse indexing and was discussed in the previous blog.
Or we could separate the rare and common values into different index partitions, and build only the index partition(s) for the rare values. This is called partial indexing and is discussed in this blog.

As usual, this is not a new subject and other people have written extensively on these subjects, and I will provide links. However, I want to draw some of the issues together.

Partition Table and Locally Partitioned Partial Index

I could partition the table on the status column. Here, I have used list partitioning, because the common status is between the two rare status, so I only need two partitions not three. From Oracle 12.1, I can specify indexing on and off on the table and certain partitions so that later I can build partial local indexes only on some partitions. See also:

CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000)
,CONSTRAINT t_pk PRIMARY KEY(key)
) INDEXING OFF 
PARTITION BY LIST (status)
(PARTITION t_status_rare   VALUES ('R','A') INDEXING ON
,PARTITION t_status_common VALUES (DEFAULT) 
) ENABLE ROW MOVEMENT 
/
INSERT /*+APPEND*/ INTO t --(key, status)
SELECT rownum
, CASE WHEN rownum<=1e6-1000 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END CASE
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;
exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');

Here Oracle eliminated the common status partition and only scanned the rare status partition (partition 1). Note that I don't even have an index at this point. So simply partitioning the table can be effective.

SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    58 |     4   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |    10 |   580 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |    10 |   580 |     4   (0)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='R')

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='R')

However, now when the application updates the status from R (rare) to C (common) the row must be moved between partitions. It is necessary to enable row movement on the table, otherwise, an error will be generated. However, there is additional overhead in moving the row. It is effectively deleted from one partition and inserted into the other.
In this test, I have increased the frequency of one of the rare statuses. Otherwise, the optimizer determines that it is cheaper just to scan the table partition than use the index!

SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/
S   COUNT(*)
- ----------
R         10
A        990
C     999000

Note that I have already specified INDEXING OFF on the table and INDEXING ON on the rare statuses partition. Now I can just build a locally partitioned partial index.

CREATE INDEX t_status ON t(status) LOCAL INDEXING PARTIAL;

Note that only partition T_STATUS_RARE is physically built, and it only contains a single extent. Partition T_STATUS_COMMON exists, is unusable and the segment has not been physically built. It contains no rows and no leaf blocks.

SELECT partition_name, status, num_rows, leaf_blocks
from user_ind_partitions where index_name = 'T_STATUS';

PARTITION_NAME       STATUS     NUM_ROWS LEAF_BLOCKS
-------------------- -------- ---------- -----------
T_STATUS_COMMON      UNUSABLE          0           0
T_STATUS_RARE        USABLE         1000           2

SELECT segment_name, partition_name, blocks
FROM user_segments WHERE segment_name = 'T_STATUS';

SEGMENT_NAME PARTITION_NAME           BLOCKS
------------ -------------------- ----------
T_STATUS     T_STATUS_RARE                 8

SELECT segment_name, partition_name, segment_type, extent_id, blocks
FROM user_extents WHERE segment_name = 'T_STATUS';

SEGMENT_NAME PARTITION_NAME       SEGMENT_TYPE        EXTENT_ID     BLOCKS
------------ -------------------- ------------------ ---------- ----------
T_STATUS     T_STATUS_RARE        INDEX PARTITION             0          8

Scans for the common status value can only full scan the table partition because there is no index to use.

SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999000

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    55 |  2444   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    55 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |   998K|    52M|  2444   (1)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |   998K|    52M|  2444   (1)| 00:00:01 |     2 |     2 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='C')

To query the rare value Oracle does use the index on the rare values partition.

SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 3051124889
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |          |     1 |    58 |     2   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                             |          |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE                     |          |    10 |   580 |     2   (0)| 00:00:01 |   KEY |   KEY |
|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T        |    10 |   580 |     2   (0)| 00:00:01 |     1 |     1 |
|*  4 |     INDEX RANGE SCAN                        | T_STATUS |    10 |       |     1   (0)| 00:00:01 |     1 |     1 |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("STATUS"='R')

However, it is not worth using the index for the slightly more common status A. Here, Oracle full scans the table partition.

SELECT COUNT(other) FROM t WHERE status='A';

COUNT(OTHER)
------------
         990

Plan hash value: 2831600127
-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |     1 |    58 |     4   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |      |     1 |    58 |            |          |       |       |
|   2 |   PARTITION LIST SINGLE|      |   990 | 57420 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS FULL   | T    |   990 | 57420 |     4   (0)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("STATUS"='A')

Note that to use partial indexing I also had to partition the table.

Globally Partitioned Index with Zero-Sized Unusable Partitions

Since Oracle 11.2.0.4, it has been possible to achieve the same effect without partitioning the table, thus avoiding the overhead of row movement. See also:

Partially Index a Table – Christian Antognini 2010.
Zero-Size Unusable Indexes and the Query Optimizer – Christian Antognini 2009
Indexing new features oracle 11g release 1 and release 2 - Richard Foote 2015

This feature also worked in earlier versions, but Oracle built a single extent for each unusable partition.

Creating a "Sparse" Index – Hermant Chitale, 2010

Here, I will recreate a non-partitioned table.

CREATE TABLE t 
(key NUMBER NOT NULL
,status VARCHAR2(1) NOT NULL
,other  VARCHAR2(1000)
,CONSTRAINT t_pk PRIMARY KEY(key)
) 
/
INSERT /*+APPEND*/ INTO t
SELECT rownum
, CASE WHEN rownum<=1e6-1000 THEN 'C' 
       WHEN rownum<=1e6-10 THEN 'A' 
       ELSE 'R' END CASE
, TO_CHAR(TO_DATE(rownum,'J'),'Jsp')
FROM dual
CONNECT BY level <= 1e6;

exec sys.dbms_stats.gather_table_stats(user,'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 254 status');

SELECT status, COUNT(*)
FROM   t 
GROUP BY status
/ 

S   COUNT(*)
- ----------
R         10
C     999000
A        990

It is not possible to create a globally list-partitioned index. Oracle simply does not support it.

CREATE INDEX t_status ON t(status)
GLOBAL PARTITION BY LIST (status, id2)
(PARTITION t_status_common VALUES ('R','A')
,PARTITION t_status_rare   VALUES (DEFAULT)
);

GLOBAL PARTITION BY LIST (status)
                    *
ERROR at line 2:
ORA-14151: invalid table partitioning method

You can create a global range or hash partitioned index. It is unlikely that the hash values of the column will break down conveniently into particular hash values. In this example, I would still have needed to create 4 hash partitions and still build 2 of them.

WITH x as (
SELECT status, COUNT(*) freq
FROM   t 
GROUP BY status
) SELECT x.*
, dbms_utility.get_hash_value(status,0,2)
, dbms_utility.get_hash_value(status,0,4)
FROM x
/ 

S       FREQ DBMS_UTILITY.GET_HASH_VALUE(STATUS,0,2) DBMS_UTILITY.GET_HASH_VALUE(STATUS,0,4)
- ---------- --------------------------------------- ---------------------------------------
R        990                                       1                                       1
C    1009000                                       0                                       0
A         10                                       0                                       2

It is easier to create a globally range partitioned index. Although in my example, the common status lies between the two rare statuses, so I need to create three partitions. I will create the index unusable and build the two rare status partitions.

CREATE INDEX t_status ON t(status)
GLOBAL PARTITION BY RANGE (status)
(PARTITION t_status_rare1  VALUES LESS THAN ('C')
,PARTITION t_status_common VALUES LESS THAN ('D')
,PARTITION t_status_rare2  VALUES LESS THAN (MAXVALUE)
) UNUSABLE;
ALTER INDEX t_status REBUILD PARTITION t_status_rare1;
ALTER INDEX t_status REBUILD PARTITION t_status_rare2;

The index partition for the common status is unusable so Oracle can only full scan the table.

SELECT COUNT(other) FROM t WHERE status='C';

COUNT(OTHER)
------------
      999000

Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    55 |  2445   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    55 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |   999K|    52M|  2445   (1)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("STATUS"='C')

However, for the rare statuses, Oracle scans the index and looks up each of the table rows.

SELECT COUNT(other) FROM t WHERE status='R';

COUNT(OTHER)
------------
          10

Plan hash value: 2558590380
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |     1 |    55 |     2   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                       |          |     1 |    55 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE              |          |    10 |   550 |     2   (0)| 00:00:01 |     3 |     3 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T        |    10 |   550 |     2   (0)| 00:00:01 |       |       |
|*  4 |     INDEX RANGE SCAN                  | T_STATUS |    10 |       |     1   (0)| 00:00:01 |     3 |     3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("STATUS"='R')

Conclusion

The advantage of this global partitioning approach is that it does not require any change to application code, and it does not involve partitioning the table. However, you will have to remember not to rebuild the unusable partitions, otherwise, they will have to be maintained as the table changes until you make them unusable again and, they will consume space that you will only get back by recreating the entire index.
NB: Partitioning is a licenced option that is only available on the Enterprise Edition of the database.

Subscribe to: Posts ( Atom )

The /+Go-Faster/ Oracle Blog

Tuesday, January 28, 2020

Data Warehouse Design: Snowflake Dimensions and Lost Skew Trap

Friday, January 24, 2020

Online Statistics Collection during Bulk Loads on Partitioned Tables

Introduction

A Simple Test

Quirks

What Statistics Are Normally Collected by Online Statistics Gathering?

Do I Need Partition Statistics?

What is the Fastest Way to Collect Statistics on Partitioned Tables?

Conclusion

Tuesday, January 21, 2020

Analysing Database Time with Active Session History for Statements with Online Optimizer Statistics Gathering Operations

OPTIMIZER STATISTICS GATHERING Operation

Autonomous & Cloud Databases

Wednesday, January 15, 2020

Online Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference

Demonstration

Conclusion

Autonomous & 19c Addendum

Monday, January 06, 2020

Partial Indexing

Problem Statement

Partition Table and Locally Partitioned Partial Index

Globally Partitioned Index with Zero-Sized Unusable Partitions

Conclusion

Announcements

My Book

Oracle ACE Director

BlueSky / Twitter

Blog Archive

Tags

Other Oracle Blogs

Other PeopleSoft Blogs

OakTable Network

Oracle ACE Director

©David Kurtz