The /*+Go-Faster*/ Oracle Blog: Online Statistics Collection during Bulk Loads on Partitioned Tables

Introduction

One of the enhancements to statistics collection and management in Oracle 12c was the ability of the database will automatically collect statistics during either a create-table-as-select operation or during the initial insert into a freshly created or freshly truncated table, provide that insert is done in direct-path mode (i.e. using the APPEND hint).
When that occurs, there is an additional operation in the execution plan; OPTIMIZER STATISTICS GATHERING.

----------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |                 |       |       |   495K(100)|          |
|   1 |  LOAD AS SELECT                  |                 |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |                 |    70M|    11G|   495K  (2)| 00:00:20 |
|   3 |    TABLE ACCESS FULL             | XXXXXXXXXXXXXXX |    70M|    11G|   495K  (2)| 00:00:20 |
----------------------------------------------------------------------------------------------------

The motivation for this blog was encountering a bulk insert into a partitioned table where the statistics gathering operation consumed a very significant amount of time. Partitioning gives you more things to consider.

A Simple Test

I created a simple test that compares the time taken by online statistics collection on partitioned and non-partitioned tables, with the explicit collection of statistics using DBMS_STATS. I have four tables with the same structure.

T1: Not partitioned. Data will be copied from this table to each of the others.
T2: Partitioned. Online statistics only.
T3: Partitioned. Explicitly gathered statistics.
T4: Partitioned. Explicitly gathered incremental statistics.

CREATE TABLE T1 (a number, b varchar2(1000), c number) NOLOGGING;
CREATE TABLE T2 (a number, b varchar2(1000), c number) 
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
CREATE TABLE T3 (a number, b varchar2(1000), c number) 
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
CREATE TABLE T4 (a number, b varchar2(1000), c number) 
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;

I loaded 100 million rows into each in direct-path mode. The partitioned tables end up with 100 partitions, each with 1 million rows. I have also suppressed redo logging during the direct-path insert by creating the tables with the NOLOGGING attribute.

Online statistics will be collected on tables T1 and T2.
Online statistics collection will be suppressed on T3 and T4 by using the NO_GATHER_OPTIMIZER_STATISTICS hint. Instead, statistics will be explicitly collected on T3 and T4.
Incremental Statistics are enabled on T4, they collect partition-level statistics and calculate table-level statistics using synopses.

see also Efficient Statistics Maintenance for Partitioned Tables Using Incremental Statistics – Part 1 (Nigel Bayliss).

EXEC dbms_stats.set_table_prefs(user,'T3','INCREMENTAL','FALSE');
EXEC dbms_stats.set_table_prefs(user,'T4','INCREMENTAL','TRUE');

The following set of tests will be run for different combinations of:

Parallel hint on query, or not
Parallel hint on insert, or not
Table parallelism specified
Parallel DML enabled or disabled at session level
Column-specific METHOD_OPT table preference specified or not.

I enabled SQL trace, from which I was able to obtain the elapsed time of the various statements, and I can determine the amount of time spent on online statistics gathering from timings on the OPTIMIZER STATISTICS GATHERING operation in the execution plan in the trace.

TRUNCATE TABLE T2;
TRUNCATE TABLE T3;
EXEC dbms_stats.delete_table_stats(user,'T2');
EXEC dbms_stats.delete_table_stats(user,'T3');
EXEC dbms_stats.delete_table_stats(user,'T4');

INSERT /*+APPEND &inshint*/ into T2 i SELECT * /*+&selhint*/ from t1 s;
INSERT /*+APPEND &inshint NO_GATHER_OPTIMIZER_STATISTICS*/ into T3 i SELECT /*+&selhint*/ * from t1 s;
INSERT /*+APPEND &inshint NO_GATHER_OPTIMIZER_STATISTICS*/ into T4 i SELECT /*+&selhint*/ * from t1 s;
commit;
EXEC dbms_stats.gather_table_stats(user,'T3');
EXEC dbms_stats.gather_table_stats(user,'T4');

Quirks

It was while building this test that I discovered a couple of quirks:

Specifying a column-specific METHOD_OPT table statistics preference disables online statistics collection (bug 30787109) - see Online Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference.
12c Online Statistics Gathering Not Working For Insert As Select (Doc ID 2328896.1) – unless all columns are inserted into.

What Statistics Are Normally Collected by Online Statistics Gathering?

After just the initial insert, I can see that I have table statistics on T1 and T2, but not on T3 and T4.

SELECT table_name, num_rows from user_tables where table_name LIKE 'T_' order by 1; 

TABLE_NAME   NUM_ROWS LAST_ANALYZED
---------- ---------- -----------------
T1           10000000 14:07:36 16/01/20
T2           10000000 14:07:36 16/01/20
T3
T4

I also have column statistics on T1 and T2, but no histograms.

break on table_name skip 1
SELECT table_name, column_name, num_distinct, global_stats, histogram, num_buckets, last_analyzed
FROM user_tab_columns where table_name like 'T_' order by 1,2; 

TABLE_NAME COLUMN_NAME  NUM_DISTINCT GLO HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
---------- ------------ ------------ --- --------------- ----------- -----------------
T1         A                   10000 YES NONE                      1 14:06:58 16/01/20
           B                   10000 YES NONE                      1 14:06:58 16/01/20
           C                     100 YES NONE                      1 14:06:58 16/01/20

T2         A                   10000 YES NONE                      1 14:07:11 16/01/20
           B                   10000 YES NONE                      1 14:07:11 16/01/20
           C                     100 YES NONE                      1 14:07:11 16/01/20

T3         A                         NO  NONE
           B                         NO  NONE
           C                         NO  NONE

T4         A                         NO  NONE
           B                         NO  NONE
           C                         NO  NONE

However, I do not have any partition statistics (I have only shown the first and last partition of each table in this report).

break on table_name skip 1
SELECT table_name, partition_position, partition_name, num_rows 
FROM user_tab_partitions WHERE table_name like 'T_' ORDER BY 1,2 nulls first;

TABLE_NAME PARTITION_POSITION PARTITION_NAME         NUM_ROWS LAST_ANALYZED
---------- ------------------ -------------------- ---------- -----------------
T2                          1 T_PART
…
                          100 SYS_P20008

T3                          1 T_PART
…
                          100 SYS_P20107

T4                          1 T_PART
…
                          100 SYS_P20206

Online optimizer statistics gathering only collects statistics at table level but not partition or sub-partition level. Histograms are not collected.

See also Online Statistics Gathering for Bulk Loads in Oracle Database 12c Release 1 (12.1) – Tim Hall

From Oracle 18c, there are two undocumented parameters that modify this behaviour. Both default to false. Interestingly, both are enabled in the Oracle Autonomous Data Warehouse.

If _optimizer_gather_stats_on_load_hist=TRUE histograms are be collected on all columns during online statistics collection.
If _optimizer_gather_stats_on_load_all=TRUE statistics are collected online during every direct-path insert, not just the first one into a segment.

Do I Need Partition Statistics?

Statistics will be collected on partitions that do not have them when the automatic statistics collection job runs in the next database maintenance window. The question is whether to manage without them until then?
"The optimizer will use global or table level statistics if one or more of your queries touches two or more partitions. The optimizer will use partition level statistics if your queries do partition elimination, such that only one partition is necessary to answer each query. If your queries touch two or more partitions the optimizer will use a combination of global and partition level statistics."
– Oracle The Data Warehouse Insider Blog: Managing Optimizer Statistics in an Oracle Database 11g - Maria Colgan
It will depend upon the nature of the SQL in the application. If the optimizer does some partition elimination, and the data is not uniformly distributed across the partitions, then partition statistics are likely to be beneficial. If there is no partition elimination, then you might question whether partitioning (or at least the current partitioning strategy) is appropriate!

What is the Fastest Way to Collect Statistics on Partitioned Tables?

Let's look at how long it takes to insert data into, and then subsequently collect statistics on the tables in my example. This test was run on Oracle 19c on one compute node of a virtualised Exadata X4 machine with 16 CPUs. This table shows elapsed time and the total DB time include all parallel server processes for each operation.

Table Name	Oper-ation	Comment	Option	Serial Insert & Statistics	Parallel Insert & Statistics	Parallel SQL & Statistics	Parallel DML, Insert, Select & Statistics	Parallel DML, SQL & Statistics	Parallel Tables	Parallel Tables & DML	Parallel Tables, DML & Method Opt
			Table	NOPARALLEL	NOPARALLEL	NOPARALLEL	NOPARALLEL	NOPARALLEL	PARALLEL	PARALLEL	PARALLEL
			Insert Hint	blank	PARALLEL(i)	blank	PARALLEL(i)	blank	blank	blank	blank
			Select Hint	blank	PARALLEL(s)	PARALLEL	PARALLEL(s)	PARALLEL	blank	blank	blank
			Parallel DML	DISABLE	DISABLE	DISABLE	ENABLE	ENABLE	DISABLE	ENABLE	ENABLE
			Stats Degree	none	DEFAULT	DEFAULT	DEFAULT	DEFAULT	none	none	none
			Method Opt	none	none	none	none	none	none	none	… FOR COLUMNS SIZE 1 A
T2	Insert	Online Stats Gathering	Elapsed Time (s)	172.46	160.86	121.61	108.29	60.31	194.47	23.57	20.57
T2	Insert	Optimizer Statistics Gathering		82.71	55.59	55.90	-	-	-	-	-
T3	Insert	NO_GATHER_OPTIMIZER_STATS		125.40	156.36	124.18	20.62	29.01	199.20	20.97	21.15
T3	Explicit Stats			122.80	146.25	63.23	15.99	24.88	24.58	24.99	24.62
T4	Insert	NO_GATHER_OPTIMIZER_STATS		123.18	158.15	147.04	20.44	29.91	204.61	20.65	20.60
T4	Incremental Explicit Stats			80.51	104.85	46.05	23.42	23.14	23.21	22.60	23.03
T2	Insert	Online Stats Gathering	DB Time (s)	174	163	169	359	337	248	366	308
T3	Insert	NO_GATHER_OPTIMIZER_STATS		128	193	160	290	211	236	312	326
T3	Explicit Stats			122	146	63		265	305	262	335
T4	Insert	NO_GATHER_OPTIMIZER_STATS		126	194	167	295	205	233	304	295
T4	Incremental Explicit Stats			80	105	2	281	266	300	179	226

It is difficult to determine the actual duration of the OPTIMIZER STATISTICS GATHERING operation, short of measuring the effect of disabling it. The time in the above table has been taken from SQL trace files. That duration is always greater than the amount saved by disabling online statistics gathering with the NO_GATHER_OPTIMIZER_STATS hint. However, the amount of time accounted in Active Session History (ASH) for that line in the execution plan is usually less than the elapsed saving.

Eg. For the sequential insert, 83s was accounted for OPTIMIZER STATISTICS GATHERING in the trace, while ASH showed only 23s of database time for that line of the plan. However, perhaps the only meaningful measurement is that disabling online statistics gathering saved 47s,

DML statements, including insert statements in direct-path, only actually execute in parallel if parallel DML is enabled. Specifying a degree of parallelism on the table, or a parallel hint is not enough. Parallel DML should be enabled.

either at session level

ALTER SESSION ENABLE PARALLEL DML;

or for the individual statement.

insert /*+APPEND ENABLE_PARALLEL_DML*/ into T2 SELECT * from t1;

Specifying parallel insert with a hint, without enabling parallel DML will not improve performance and can actually degrade it.
Specifying parallel query without running the insert in parallel can also degrade performance.

Online statistics will be collected in parallel if

either the table being queried has a degree of parallelism,
or a parallel hint applies to the table being queried, or the entire statement,
or parallel DML has been enabled

Where statistics are collected explicitly (i.e. with a call to DBMS_STATS.GATHER_TABLE_STATS) they are collected in parallel if

either, the DEGREE is specified (I specified a table statistics preference),

EXEC dbms_stats.set_table_prefs(user,'T3','DEGREE','DBMS_STATS.DEFAULT_DEGREE');

or the table has a degree of parallelism.

ALTER TABLE T3 PARALLEL;

Incremental statistics are generally faster to collect because they calculate table-level statistics from partition-level statistics, saving a second pass through the data.
When parallel DML is enabled at session level, I found that the performance of statistics collection also improves.

Conclusion

Overall, the best performance was obtained when the tables were altered to use parallelism, and parallel DML was enabled; then the query, insert and statistics collection are performed in parallel.
However, the improved performance of parallelism comes at a cost. It can be a brutal way of bringing more resource to bear on an activity. A parallel operation can be expected to use more database time across all the parallel server processes than the same operation processed serially. My best results were obtained by activating all of the CPUs on the server without regard for any other activity. Too many concurrent parallel operations have the potential to overload a system. Remember also, that while the parallel attribute remains on the table any subsequent query will also run in parallel.
Suppressing online statistics collection saves total database time whether working in parallel or not. The saving in elapsed time is reduced when the insert and query are running in parallel. The time taken to explicitly collect statistics will exceed that saving because it is doing additional work to collect partition statistics not done during online statistics collection.
Using incremental statistics for partitioned tables will also reduce the total amount of work and database time required to gather statistics, but may not significantly change the elapsed time to collect statistics.
If you need table statistics but can manage without partition statistics until the next maintenance window, then online statistics collection is very effective. However, I think the general case will be to require partition statistics, so you will probably need to explicitly collect statistics instead. If you want histograms, then you will also need to explicitly collect statistics.

The /+Go-Faster/ Oracle Blog

Friday, January 24, 2020

Online Statistics Collection during Bulk Loads on Partitioned Tables

Introduction

A Simple Test

Quirks

What Statistics Are Normally Collected by Online Statistics Gathering?

Do I Need Partition Statistics?

What is the Fastest Way to Collect Statistics on Partitioned Tables?

Conclusion

No comments :

Announcements

My Book

Oracle ACE Director

BlueSky / Twitter

Blog Archive

Tags

Other Oracle Blogs

Other PeopleSoft Blogs

OakTable Network

Oracle ACE Director

©David Kurtz

The /*+Go-Faster*/ Oracle Blog

Friday, January 24, 2020

Online Statistics Collection during Bulk Loads on Partitioned Tables

Introduction

A Simple Test

Quirks

What Statistics Are Normally Collected by Online Statistics Gathering?

Do I Need Partition Statistics?

What is the Fastest Way to Collect Statistics on Partitioned Tables?

Conclusion

No comments :

Announcements

My Book

Oracle ACE Director

BlueSky / Twitter

Blog Archive

Subscribe

Tags

Other Oracle Blogs

Other PeopleSoft Blogs

OakTable Network

Oracle ACE Director

©David Kurtz

The /+Go-Faster/ Oracle Blog