Saturday, November 14, 2015

Disappearing Histograms

In general, once you have a histogram on a column you keep it.  However, this is not because DBMS_STATS simply maintains it because it is there, but because it is still appropriate to have it based on the column workload usage.
Since Oracle 10g, the default method for collecting histograms is AUTO, which means that Oracle determines whether to collect a histogram based on data distribution and the workload of the column.
What is less well known (including to me until recently), is that histograms can be removed if there is no column workload to justify them.

This can be an issue when for some reason you have imported statistics on a freshly rebuilt table with no column usage.  I think the first two scenarios are the most likely:
  1. Export statistics.  Reorganize table by recreation.  Import statistics.  
  2. On a test database, import cut down data set, but with full volume production statistics.
  3. Column usage is deleted.
In all cases, the histograms will be removed by the maintenance window statistics job if column usage has not built up again.  This can lead to changes in the execution plan and performance problems.  When this happens in a production system, it is often a transient problem until the column usage builds up and the histograms are usually reinstated when statistics are next collected.
In the third scenario you have the potential to refer back to any statistics history (see Mauro Pagano's blog Histograms Come Histograms Go), but that will not be the case where the original table has been dropped.

Demonstration

I will create a table and populate it.  The data in columns A, B and C is perfectly uniform, only column SKEW is skewed.
drop table t purge
/
create table t
(a number
,b number
,c number
,skew number
)
/
insert into t 
with x as (
select rownum n from dual connect by level <= 10
)
select a.n, b.n, c.n, a.n*b.n*c.n 
from x a, x b, x c
/
Note that I have collected statistics with SKEWONLY. So statistics will be collected if the data distribution is skewed irrespective of the usage of the column.
REM collect stats and histograms on everything
begin
 dbms_stats.gather_table_stats(ownname=>'SCOTT' ,tabname=>'T' ,method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY'
 );
end;
/
REM check for histogram
select table_name, column_name, num_distinct, num_nulls, histogram, num_buckets, sample_size
from   dba_tab_col_statistics
where  owner = 'SCOTT'
and    table_name = 'T'
/
Now I have histograms on all columns, even though the data in columns A, B, and C is not skewed at all but is perfectly uniform,  This shows the value of the column usage feature in limiting the excessive collection of histograms.
Table Column      Num   Num               Num  Sample
Name  Name   Distinct Nulls Histogram Buckets    Size
----- ------ -------- ----- --------- ------- -------                                                 
T     A            10     0 FREQUENCY      10    1000
T     B            10     0 FREQUENCY      10    1000
T     C            10     0 FREQUENCY      10    1000
T     SKEW        120     0 FREQUENCY     120    1000
At this point, there is no column usage. So when I collect statistics the histogram will all disappear.
REM check no column usage
select *
from   sys.col_usage$ u
where  obj# IN(select object_id 
               from  dba_objects 
               WHERE owner = 'SCOTT'
               AND   object_name = 'T'
               AND   object_Type = 'TABLE')
/
REM collect the stats with default col usage
begin
 dbms_stats.gather_table_stats(ownname=>'SCOTT',tabname=>'T');
end;
/
REM check for histogram - should have gone with no col usage
select table_name, column_name, num_distinct, num_nulls, histogram, num_buckets, sample_size
from   dba_tab_col_statistics
where  owner = 'SCOTT'
and    table_name = 'T'
/

Table Column      Num   Num               Num  Sample
Name  Name   Distinct Nulls Histogram Buckets    Size
----- ------ -------- ----- --------- ------- -------
T     A            10     0 NONE            1    1000
T     B            10     0 NONE            1    1000
T     C            10     0 NONE            1    1000
T     SKEW        120     0 NONE            1    1000
Sure enough, there are now no histograms on the columns, and there is just a single bucket on each column. So now I will generate some column usage, I will do some equality queries on the column SKEW, and I will flush the monitoring information to the data dictionary.
set serveroutput off
declare
 z NUMBER;
begin
 for a in 1..10 loop
  for b in 1..10 loop
   for c in 1..10 loop
    EXECUTE IMMEDIATE 'SELECT '||'/*'||(a*b*c)||'*/'||' count(*) FROM t where skew = '||(a*b*c) INTO z;
   end loop;
  end loop;
 end loop;
end;
/
exec dbms_stats.flush_database_monitoring_info;
select *
from sys.col_usage$ u
where obj# IN(select object_id 
 from  dba_objects 
 WHERE owner = 'SCOTT'
  AND   object_name = 'T'
 AND   object_Type = 'TABLE')
/
      OBJ#    INTCOL# EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP
---------- ---------- -------------- -------------- ----------------- ----------- ---------- ---------- ---------
     98167          4              1              0                 0           0          0          0 04-NOV-15
The column usage report tells us that there has been an equality predicate used on the column SKEW.
select dbms_stats.report_col_usage('SCOTT','T') from dual
/
DBMS_STATS.REPORT_COL_USAGE('SCOTT','T')
--------------------------------------------------------------------------------
LEGEND:                                                                                                                                 
.......                                                                                               
                                                
EQ         : Used in single table EQuality predicate
RANGE      : Used in single table RANGE predicate   
LIKE       : Used in single table LIKE predicate    
NULL       : Used in single table is (not) NULL predicate
EQ_JOIN    : Used in EQuality JOIN predicate             
NONEQ_JOIN : Used in NON EQuality JOIN predicate         
FILTER     : Used in single table FILTER predicate       
JOIN       : Used in JOIN predicate                      

DBMS_STATS.REPORT_COL_USAGE('SCOTT','T')                 
-------------------------------------------------------------------------------- 
GROUP_BY   : Used in GROUP BY expression
...............................................................................
###############################################################################
                       COLUMN USAGE REPORT FOR SCOTT.T                                                                       
................................
1. SKEW                                : EQ                                                           
###############################################################################
Now, when I collect statistics with the default METHOD_OPT, I will get a histogram on SKEW,
begin
 dbms_stats.gather_table_stats(ownname=>'SCOTT',tabname=>'T');
end;
/
REM check for histogram - should have gone with no col usage
select table_name, column_name, num_distinct, num_nulls, histogram, num_buckets, sample_size
from dba_tab_col_statistics
where owner = 'SCOTT'
and  table_name = 'T'
/

Table Column      Num   Num               Num  Sample
Name  Name   Distinct Nulls Histogram Buckets    Size
----- ------ -------- ----- --------- ------- -------
T     A            10     0 NONE            1    1000
T     B            10     0 NONE            1    1000
T     C            10     0 NONE            1    1000
T     SKEW        120     0 FREQUENCY     120    1000
Nest, I will delete the column usage and collect the statistics again
REM remove the column usage
exec dbms_stats.reset_col_usage('SCOTT','T');
begin
 dbms_stats.gather_table_stats(ownname=>'SCOTT',tabname=>'T');
end;
/
REM check for histogram - should have gone with no col usage
select table_name, column_name, num_distinct, num_nulls, histogram, num_buckets, sample_size
from dba_tab_col_statistics
where owner = 'SCOTT'
and  table_name = 'T'
/
And now the histogram has gone. QED.

Table Column      Num   Num               Num  Sample
Name  Name   Distinct Nulls Histogram Buckets    Size
----- ------ -------- ----- --------- ------- -------
T     A            10     0 NONE            1    1000
T     B            10     0 NONE            1    1000
T     C            10     0 NONE            1    1000
T     SKEW        120     0 NONE            1    1000

Conclusion

As you add histograms to a table, the time taken to collect statistics and parse statements increases.  I think having a mechanism to remove unwanted histograms is desirable.  However, it has the potential to cause problems, so just be aware of it.
If you are sure that you need histograms on a particular column, then I would suggest creating table preferences for METHOD_OPT and managing those preferences along with the table DDL.

Monday, August 24, 2015

When Is An Idle Wait Not An Idle Wait?

In general, performance tuning is quite straightforward. You work out how much time is spent doing what, and follow the numbers to the things that are taking the most time. If you are fortunate to the have the Diagnostics Pack on Enterprise Edition of Oracle your Active Session History (ASH) will report the active database sessions. If you can measure the overall response time of a batch or application server you have merely to deduct the DB reported in ASH, and the difference is client/middleware time.
While that is an over-simplification, it is substantially true. The rest of this blog is about an exception to the general rule.

A War Story 

I've been working with a system that makes extensive, and probably excessive, use of database links. There is a constellation of 3 Oracle databases and there are database links between them. There is also a soon-to-be legacy DB2 database, and one of the Oracle databases uses Oracle Transparent Gateway to create Heterogeneous Database Links over which the application running on Oracle references data in the DB2 database.
During a load test, nearly all of the 64 application server processes are reported as busy handling requests, yet ASH reports than usually less than 2 database sessions are concurrently busy. You would expect to see 60 application server processes consuming CPU, but CPU does not appear to be as highly utilised as we should expect.
Application Server Activity
Database Activity
An application SQL trace reports that while the majority of the time is indeed spent on client CPU, about 20% is spent on a few SQL statements that reference synonyms that refer to tables or views in a DB2 database. Why didn't these show up in ASH?

A Simple Test Script 

I wrote a simple test script that just executes the SQL statements that refer to the DB2 objects, but it collects AWR snapshots at the beginning and end of the script, and it also enables Oracle session trace.
set verify on head on timi on feedback on lines 200 trimspool on
spool otg_db2

variable b1 VARCHAR2(10)
variable b2 VARCHAR2(10)

BEGIN
 :b1 := '123456789';
 :b2 := '42';
END;

ALTER SESSION SET tracefile_identifier='OTG_DB2';
ALTER SESSION SET current_schema=APPSCHEMA;

EXEC dbms_workload_repository.create_snapshot;
EXEC dbms_application_info.set_action('OTG_DB2');
EXEC dbms_monitor.session_trace_enable;

SELECT …
FROM db2_object1 a
WHERE a.field1 = :b1 AND a.field2 = :b2
/
SELECT …
FROM db2_object2 a
WHERE a.field1 = :b1 AND a.field2 = :b2
/
…

exec dbms_application_info.set_action('NONE');
exec dbms_monitor.session_Trace_disable;
exec dbms_workload_repository.create_snapshot;

select sample_id, sample_time, sql_id, sql_plan_hash_value, sql_plan_line_id, sql_exec_id, event
from v$active_session_history
where action = 'OTG_DB2';

spool off
The first interesting output is that although some of the queries on DB2 objects took several seconds, I didn't get any rows in the query on ASH buffer for this session.
So, now let's look at the trace file. This is the output for the longest SQL statement having profiled it in TKPROF. We can see that nearly all the time is spent on an event called HS message to agent.
SELECT …
FROM db2_object1 a
WHERE a.field1 = :b1 AND a.field2 = :b2

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.04          0          0          1           0
Fetch        1      0.00       2.73          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.00       2.77          0          0          1           0

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 35  

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  Disk file operations I/O                        1        0.00          0.00
  DFS lock handle                                 1        0.00          0.00
  rdbms ipc reply                                 2        0.00          0.00
  HS message to agent                             5        2.73          2.77
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.00          0.00
And this is the corresponding section of the raw trace. The execution took just under 42ms, and then the fetch took 2.73 seconds.
PARSING IN CURSOR #4579254328 len=171 dep=0 uid=35 oct=3 lid=0 tim=14984112395546 hv=1291470523 ad='700010b2204a518' sqlid='xxxxxxxxxxxxx'
SELECT …
FROM db2_object1 a
WHERE a.field1 = :b1 AND a.field2 = :b2
END OF STMT
PARSE #4579254328:c=106,e=172,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=0,tim=14984112395545
WAIT #4579254328: nam='Disk file operations I/O' ela= 23 FileOperation=8 fileno=0 filetype=8 obj#=-1 tim=14984112395633
WAIT #4579254328: nam='DFS lock handle' ela= 5205 type|mode=1146617861 id1=3569868743 id2=0 obj#=-1 tim=14984112401137
WAIT #4579254328: nam='rdbms ipc reply' ela= 75 from_process=46 timeout=900 p3=0 obj#=-1 tim=14984112401291
WAIT #4579254328: nam='rdbms ipc reply' ela= 710 from_process=46 timeout=900 p3=0 obj#=-1 tim=14984112402047
WAIT #4579254328: nam='HS message to agent' ela= 3464 p1=0 p2=0 p3=0 obj#=-1 tim=14984112405590
WAIT #4579254328: nam='HS message to agent' ela= 38459 p1=0 p2=0 p3=0 obj#=-1 tim=14984112444724
WAIT #4579254328: nam='HS message to agent' ela= 17 p1=0 p2=0 p3=0 obj#=-1 tim=14984112444791
EXEC #4579254328:c=932,e=49215,p=0,cr=0,cu=1,mis=1,r=0,dep=0,og=1,plh=0,tim=14984112444865
WAIT #4579254328: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=14984112444958
WAIT #4579254328: nam='HS message to agent' ela= 2730045 p1=0 p2=0 p3=0 obj#=-1 tim=14984115175093
WAIT #4579254328: nam='HS message to agent' ela= 8 p1=0 p2=0 p3=0 obj#=-1 tim=14984115175165
FETCH #4579254328:c=206,e=2730295,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=0,tim=14984115175285
WAIT #4579254328: nam='SQL*Net message from client' ela= 895 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=14984115176264
CLOSE #4579254328:c=10,e=16,dep=0,type=0,tim=14984115176343
Across the whole of my test I have 29 waits on this event totalling 5.3s. Why I am not seeing this event in ASH?
OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        9      0.00       0.00          0          0          0           0
Execute     10      0.00       0.21          0          0          1           3
Fetch        7      0.00       5.10          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       26      0.01       5.32          0          0          1           3

Misses in library cache during parse: 3
Misses in library cache during execute: 4

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       9        0.00          0.00
  SQL*Net message from client                     9        0.00          0.00
  Disk file operations I/O                        9        0.00          0.00
  DFS lock handle                                 1        0.00          0.00
  rdbms ipc reply                                 2        0.00          0.00
  HS message to agent                            29        2.73          5.30
  library cache pin                               2        0.00          0.00
If I generate an ASH report on the two snapshots created in this test and look at the Foreground events I might start to see why. AWR reports 29 waits totalling 5 seconds, so that agrees. However, note that this is at the bottom of the report with 'SQL*Net message from client' and there is no '%DB Time'.
AWR Foreground Events

We (should) know that 'SQL*Net message from client' is an idle event. So we can understand that there are 806 seconds reported on this event in this AWR report because there were 151 sessions connect to the database, most of which were mostly idle for the 6 seconds between the AWR snapshots. That is perfectly normal.
However, HS message to agent is classed as an idle wait, although the user was not idle. During this event the application is not busy, it is waiting for the Oracle database to respond, which is, in turn, waiting for the remote non-Oracle database to respond.
 This is in contrast to Oracle-to-Oracle (homogeneous) database links which report time to 'SQL*Net message from dblink' which is not an idle wait, but part of the 'Network' wait class. I think the difference is an anomaly.
The next question is how much time my application spent on this event. For that I must look at the AWR data, either by generating an AWR report that corresponds to the load test, or directly in dba_hist_system_event.

Waits from AWR snspshots

Conclusion

  1. I am not the first person to call this out, just the latest person to get caught by it. It may be an idle wait, but the user is waiting for the database to respond. 
    Database Link Wait Event Wait Class
    Heterogeneous HS message to agent Idle
    Homogeneous SQL*Net message from dblink Network
  2. I got as far as starting to write an e-mail to a colleague asking for help, but before sending it I realised that I could do more to find out what was happening. Perhaps just the act of trying to explain the problem helped me to think more clearly about it.