Introduction
One of the enhancements to statistics collection and management in Oracle 12c was the ability of the database will automatically collect statistics during either a create-table-as-select operation or during the initial insert into a freshly created or freshly truncated table, provide that insert is done in direct-path mode (i.e. using the APPEND hint).When that occurs, there is an additional operation in the execution plan; OPTIMIZER STATISTICS GATHERING.
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | | | 495K(100)| |
| 1 | LOAD AS SELECT | | | | | |
| 2 | OPTIMIZER STATISTICS GATHERING | | 70M| 11G| 495K (2)| 00:00:20 |
| 3 | TABLE ACCESS FULL | XXXXXXXXXXXXXXX | 70M| 11G| 495K (2)| 00:00:20 |
----------------------------------------------------------------------------------------------------
The motivation for this blog was encountering a bulk insert into a partitioned table where the statistics gathering operation consumed a very significant amount of time. Partitioning gives you more things to consider.
A Simple Test
I created a simple test that compares the time taken by online statistics collection on partitioned and non-partitioned tables, with the explicit collection of statistics using DBMS_STATS. I have four tables with the same structure.- T1: Not partitioned. Data will be copied from this table to each of the others.
- T2: Partitioned. Online statistics only.
- T3: Partitioned. Explicitly gathered statistics.
- T4: Partitioned. Explicitly gathered incremental statistics.
CREATE TABLE T1 (a number, b varchar2(1000), c number) NOLOGGING;
CREATE TABLE T2 (a number, b varchar2(1000), c number)
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
CREATE TABLE T3 (a number, b varchar2(1000), c number)
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
CREATE TABLE T4 (a number, b varchar2(1000), c number)
PARTITION BY RANGE (a) INTERVAL(100) (PARTITION t_part VALUES less than (101)) NOLOGGING;
I loaded 100 million rows into each in direct-path mode. The partitioned tables end up with 100 partitions, each with 1 million rows. I have also suppressed redo logging during the direct-path insert by creating the tables with the NOLOGGING attribute.
- Online statistics will be collected on tables T1 and T2.
- Online statistics collection will be suppressed on T3 and T4 by using the NO_GATHER_OPTIMIZER_STATISTICS hint. Instead, statistics will be explicitly collected on T3 and T4.
- Incremental Statistics are enabled on T4, they collect partition-level statistics and calculate table-level statistics using synopses.
EXEC dbms_stats.set_table_prefs(user,'T3','INCREMENTAL','FALSE');
EXEC dbms_stats.set_table_prefs(user,'T4','INCREMENTAL','TRUE');
The following set of tests will be run for different combinations of:
- Parallel hint on query, or not
- Parallel hint on insert, or not
- Table parallelism specified
- Parallel DML enabled or disabled at session level
- Column-specific METHOD_OPT table preference specified or not.
TRUNCATE TABLE T2;
TRUNCATE TABLE T3;
EXEC dbms_stats.delete_table_stats(user,'T2');
EXEC dbms_stats.delete_table_stats(user,'T3');
EXEC dbms_stats.delete_table_stats(user,'T4');
INSERT /*+APPEND &inshint*/ into T2 i SELECT * /*+&selhint*/ from t1 s;
INSERT /*+APPEND &inshint NO_GATHER_OPTIMIZER_STATISTICS*/ into T3 i SELECT /*+&selhint*/ * from t1 s;
INSERT /*+APPEND &inshint NO_GATHER_OPTIMIZER_STATISTICS*/ into T4 i SELECT /*+&selhint*/ * from t1 s;
commit;
EXEC dbms_stats.gather_table_stats(user,'T3');
EXEC dbms_stats.gather_table_stats(user,'T4');
Quirks
It was while building this test that I discovered a couple of quirks:- Specifying a column-specific METHOD_OPT table statistics preference disables online statistics collection (bug 30787109) - see Online Statistics Gathering Disabled by Column Specific METHOD_OPT Table Statistics Preference.
- 12c Online Statistics Gathering Not Working For Insert As Select (Doc ID 2328896.1) – unless all columns are inserted into.
What Statistics Are Normally Collected by Online Statistics Gathering?
After just the initial insert, I can see that I have table statistics on T1 and T2, but not on T3 and T4.
"The optimizer will use global or table level statistics if one or more of your queries touches two or more partitions. The optimizer will use partition level statistics if your queries do partition elimination, such that only one partition is necessary to answer each query. If your queries touch two or more partitions the optimizer will use a combination of global and partition level statistics."
– Oracle The Data Warehouse Insider Blog: Managing Optimizer Statistics in an Oracle Database 11g - Maria Colgan
It will depend upon the nature of the SQL in the application. If the optimizer does some partition elimination, and the data is not uniformly distributed across the partitions, then partition statistics are likely to be beneficial. If there is no partition elimination, then you might question whether partitioning (or at least the current partitioning strategy) is appropriate!
SELECT table_name, num_rows from user_tables where table_name LIKE 'T_' order by 1;
TABLE_NAME NUM_ROWS LAST_ANALYZED
---------- ---------- -----------------
T1 10000000 14:07:36 16/01/20
T2 10000000 14:07:36 16/01/20
T3
T4
I also have column statistics on T1 and T2, but no histograms.
break on table_name skip 1
SELECT table_name, column_name, num_distinct, global_stats, histogram, num_buckets, last_analyzed
FROM user_tab_columns where table_name like 'T_' order by 1,2;
TABLE_NAME COLUMN_NAME NUM_DISTINCT GLO HISTOGRAM NUM_BUCKETS LAST_ANALYZED
---------- ------------ ------------ --- --------------- ----------- -----------------
T1 A 10000 YES NONE 1 14:06:58 16/01/20
B 10000 YES NONE 1 14:06:58 16/01/20
C 100 YES NONE 1 14:06:58 16/01/20
T2 A 10000 YES NONE 1 14:07:11 16/01/20
B 10000 YES NONE 1 14:07:11 16/01/20
C 100 YES NONE 1 14:07:11 16/01/20
T3 A NO NONE
B NO NONE
C NO NONE
T4 A NO NONE
B NO NONE
C NO NONE
However, I do not have any partition statistics (I have only shown the first and last partition of each table in this report).break on table_name skip 1
SELECT table_name, partition_position, partition_name, num_rows
FROM user_tab_partitions WHERE table_name like 'T_' ORDER BY 1,2 nulls first;
TABLE_NAME PARTITION_POSITION PARTITION_NAME NUM_ROWS LAST_ANALYZED
---------- ------------------ -------------------- ---------- -----------------
T2 1 T_PART
…
100 SYS_P20008
T3 1 T_PART
…
100 SYS_P20107
T4 1 T_PART
…
100 SYS_P20206
Online optimizer statistics gathering only collects statistics at table level but not partition or sub-partition level. Histograms are not collected.- See also Online Statistics Gathering for Bulk Loads in Oracle Database 12c Release 1 (12.1) – Tim Hall
- If _optimizer_gather_stats_on_load_hist=TRUE histograms are be collected on all columns during online statistics collection.
- If _optimizer_gather_stats_on_load_all=TRUE statistics are collected online during every direct-path insert, not just the first one into a segment.
Do I Need Partition Statistics?
Statistics will be collected on partitions that do not have them when the automatic statistics collection job runs in the next database maintenance window. The question is whether to manage without them until then?"The optimizer will use global or table level statistics if one or more of your queries touches two or more partitions. The optimizer will use partition level statistics if your queries do partition elimination, such that only one partition is necessary to answer each query. If your queries touch two or more partitions the optimizer will use a combination of global and partition level statistics."
– Oracle The Data Warehouse Insider Blog: Managing Optimizer Statistics in an Oracle Database 11g - Maria Colgan
It will depend upon the nature of the SQL in the application. If the optimizer does some partition elimination, and the data is not uniformly distributed across the partitions, then partition statistics are likely to be beneficial. If there is no partition elimination, then you might question whether partitioning (or at least the current partitioning strategy) is appropriate!
What is the Fastest Way to Collect Statistics on Partitioned Tables?
Let's look at how long it takes to insert data into, and then subsequently collect statistics on the tables in my example. This test was run on Oracle 19c on one compute node of a virtualised Exadata X4 machine with 16 CPUs. This table shows elapsed time and the total DB time include all parallel server processes for each operation.
Table Name
|
Oper-ation | Comment | Option | Serial Insert & Statistics | Parallel Insert & Statistics | Parallel SQL & Statistics | Parallel DML, Insert, Select & Statistics | Parallel DML, SQL & Statistics | Parallel Tables | Parallel Tables & DML | Parallel Tables, DML & Method Opt |
Table
|
NO |
NO |
NO |
NO |
NO |
PARALLEL | PARALLEL | PARALLEL | |||
Insert Hint | blank | PARA |
blank | PARA |
blank | blank | blank | blank | |||
Select Hint | blank | PARA |
PARA |
PARA |
PARA |
blank | blank | blank | |||
Parallel DML | DISABLE | DISABLE | DISABLE | ENABLE | ENABLE | DISABLE | ENABLE | ENABLE | |||
Stats Degree | none | DEFAULT | DEFAULT | DEFAULT | DEFAULT | none | none | none | |||
Method Opt | none | none | none | none | none | none | none | … FOR COLUMNS SIZE 1 A | |||
T2 |
Insert
|
Online Stats Gathering
|
Elapsed Time (s)
|
172.46
|
160.86
|
121.61
|
108.29
|
60.31
|
194.47
|
23.57
|
20.57
|
Optimizer Statistics
Gathering
|
82.71
|
55.59
|
55.90
|
-
|
-
|
-
|
-
|
-
|
|||
T3 |
Insert
|
NO_GATHER
|
125.40
|
156.36
|
124.18
|
20.62
|
29.01
|
199.20
|
20.97
|
21.15
|
|
Explicit Stats
|
122.80
|
146.25
|
63.23
|
15.99
|
24.88
|
24.58
|
24.99
|
24.62
|
|||
T4 |
Insert
|
NO_GATHER
|
123.18
|
158.15
|
147.04
|
20.44
|
29.91
|
204.61
|
20.65
|
20.60
|
|
Incremental
Explicit Stats
|
80.51
|
104.85
|
46.05
|
23.42
|
23.14
|
23.21
|
22.60
|
23.03
| |||
T2 |
Insert
|
Online Stats Gathering
|
DB Time (s)
|
174
| 163 | 169 | 359 | 337 | 248 | 366 | 308 |
T3 |
Insert
|
NO_GATHER
|
128
| 193 | 160 | 290 | 211 | 236 | 312 | 326 | |
Explicit Stats
|
122
| 146 | 63 | 265 | 305 | 262 | 335 | ||||
T4 |
Insert
|
NO_GATHER
|
126
| 194 | 167 | 295 | 205 | 233 | 304 | 295 | |
Incremental
Explicit Stats
|
80
| 105 | 2 | 281 | 266 | 300 | 179 | 226 |
- It is difficult to determine the actual duration of the OPTIMIZER STATISTICS GATHERING operation, short of measuring the effect of disabling it. The time in the above table has been taken from SQL trace files. That duration is always greater than the amount saved by disabling online statistics gathering with the NO_GATHER_OPTIMIZER_STATS hint. However, the amount of time accounted in Active Session History (ASH) for that line in the execution plan is usually less than the elapsed saving.
- Eg. For the sequential insert, 83s was accounted for OPTIMIZER STATISTICS GATHERING in the trace, while ASH showed only 23s of database time for that line of the plan. However, perhaps the only meaningful measurement is that disabling online statistics gathering saved 47s,
- DML statements, including insert statements in direct-path, only actually execute in parallel if parallel DML is enabled. Specifying a degree of parallelism on the table, or a parallel hint is not enough. Parallel DML should be enabled.
- either at session level
ALTER SESSION ENABLE PARALLEL DML;
- or for the individual statement.
insert /*+APPEND ENABLE_PARALLEL_DML*/ into T2 SELECT * from t1;
- Specifying parallel insert with a hint, without enabling parallel DML will not improve performance and can actually degrade it.
- Specifying parallel query without running the insert in parallel can also degrade performance.
- Online statistics will be collected in parallel if
- either the table being queried has a degree of parallelism,
- or a parallel hint applies to the table being queried, or the entire statement,
- or parallel DML has been enabled
- Where statistics are collected explicitly (i.e. with a call to DBMS_STATS.GATHER_TABLE_STATS) they are collected in parallel if
- either, the DEGREE is specified (I specified a table statistics preference),
EXEC dbms_stats.set_table_prefs(user,'T3','DEGREE','DBMS_STATS.DEFAULT_DEGREE');
- or the table has a degree of parallelism.
ALTER TABLE T3 PARALLEL;
- Incremental statistics are generally faster to collect because they calculate table-level statistics from partition-level statistics, saving a second pass through the data.
- When parallel DML is enabled at session level, I found that the performance of statistics collection also improves.
Conclusion
Overall, the best performance was obtained when the tables were altered to use parallelism, and parallel DML was enabled; then the query, insert and statistics collection are performed in parallel.However, the improved performance of parallelism comes at a cost. It can be a brutal way of bringing more resource to bear on an activity. A parallel operation can be expected to use more database time across all the parallel server processes than the same operation processed serially. My best results were obtained by activating all of the CPUs on the server without regard for any other activity. Too many concurrent parallel operations have the potential to overload a system. Remember also, that while the parallel attribute remains on the table any subsequent query will also run in parallel.
Suppressing online statistics collection saves total database time whether working in parallel or not. The saving in elapsed time is reduced when the insert and query are running in parallel. The time taken to explicitly collect statistics will exceed that saving because it is doing additional work to collect partition statistics not done during online statistics collection.
Using incremental statistics for partitioned tables will also reduce the total amount of work and database time required to gather statistics, but may not significantly change the elapsed time to collect statistics.
If you need table statistics but can manage without partition statistics until the next maintenance window, then online statistics collection is very effective. However, I think the general case will be to require partition statistics, so you will probably need to explicitly collect statistics instead. If you want histograms, then you will also need to explicitly collect statistics.
No comments :
Post a Comment