Wednesday, October 09, 2019

Purging SQL Statements and Execution Plans from AWR

A previous blog discussed the problem posed by old execution plans that were previously captured by AWR being used by dbms_xplan.display_awr() to describe current executions plans for the same execution plan but displaying the old costs.
Tanel Poder commented that there is a TIMESTAMP column on both V$SQL_PLAN and DBA_HIST_SQL_PLAN so you know when the plans were generated, but this is not used to purge plans from AWR.  It prompted me to look into purging of statements and plans in more detail.
There has been a long-standing problem of old plans not being purged at all from AWR.  There are various notes on the Oracle support site that refer to this behaviour.   However, it appears to have been addressed by at least Oracle 19c. 
There is no snapshot ID (SNAP_ID) on either DBA_HIST_SQLTEXT or DBA_HIST_SQL_PLAN but this column does exist on the underlying tables WRH$_SQLTEXT and WRH$_SQL_PLAN, where it is used to restrict the purge process to rows created before the retention period.  SQL statements and plans collected prior to the AWR retention limit, for which there are no current SQL statistics, are purged.  However, SQL statements are not purged if SQL baselines are present for that snapshot.
Even so, an old plan can still be retained in the AWR, if the same SQL statement is still submitted and the same plan is still generated by the optimiser, and the statement is captured by the AWR snapshot either because it is in the top statements, or the SQL is 'colored'.  dbms_xplan.display_awr() will show the plan with the original costs.  It will not be purged because there are SQL statistics more recent than the AWR retention limit.

Oracle Notes

There are various references on the Oracle support website to the lack of purging resulting in significant growth in the size of the AWR tables, including bugs in old releases.

dbms_workload_repository.purge_sql_details()

SQL statements and plans are automatically purged along with the AWR snapshots when they reach the retention limit.  There is also a procedure for manual purge.  Session trace including bind variable values of the purge procedure reveals the behaviour of the purge process.  It purges WRH$_SQLTEXT, WRH$_SQL_PLAN and WRH$_SQL_BIND_METADATA for SNAP_IDs in the range 0 to one less than the oldest snapshot not beyond the retention limit.  When purging manually, it is also possible to set a maximum number of rows to be purged from each table.
select min(snap_id) 
from   wrm$_snapshot  
where  dbid = :dbid 
and    (cast(SYSTIMESTAMP as TIMESTAMP) - end_interval_time) <= :retention
SQL Statements are not purged if there are either SQL statistics for it in AWR or a SQL Baseline on it.
delete from (
  select *       
  from  wrh$_sqltext  tab  
  where tab.dbid = :dbid 
  and :beg_snap <= tab.snap_id and tab.snap_id < :end_snap 
  and not exists (
    select 1
    from   wri$_sqltext_refcount  refc  
    where  refc.dbid = tab.dbid 
    and    refc.sql_id = tab.sql_id
    and    refc.con_dbid = tab.con_dbid and refc.ref_count > 0) 
  and not exists (
    select 1  
    from   wrh$_sqlstat  s         
    where  s.dbid = :dbid and s.con_dbid = tab.con_dbid 
    and    s.sql_id = tab.sql_id))
where rownum <= :bind1
Execution plans are not purged if there are SQL statistics for it in AWR.  Statistics older than the retention limit will have been purged by this stage.
delete from (
  select *       
  from  wrh$_sql_plan  tab  
  where tab.dbid = :dbid 
  and :beg_snap <= tab.snap_id and tab.snap_id < :end_snap 
  and not exists (
    select 1           
    from   wrh$_sqlstat s          
    where  s.dbid = :dbid and s.con_dbid = tab.con_dbid 
    and    s.sql_id = tab.sql_id))
where rownum <= :bind1
dbms_workload_repository.drop_snapshot_range() behaves similarly but does not need to test for SQL statistics because it is deleting them in that range.

Comparison of Optimizer Cost of Captured Plans with Optimizer Cost of SQL Statistics

It is possible to compare the optimizer cost on SQL statistics with the optimizer cost of the captured plan and see where they have diverged.  In this example, I am looking for an absolute difference in cost of 10 and also of at least 10%.
set lines 140 pages 99
col costpctdiff heading 'Cost|%Diff' format 99999 
col costdiff heading 'Cost|Diff' format 99999999 
col plan_hash_value heading 'SQL Plan|Hash Value'
col child_number heading 'Child|No.' format 9999
col inst_id heading 'Inst|ID' format 999
col hcost heading 'AWR|Cost' format 9999999
col ccost heading 'Cursor|Cost' format 9999999
col htimestamp heading 'AWR|Timestamp'
col ctimestamp heading 'Cursor|Timestamp'
col end_interval_time format a26
col snap_id heading 'Snap|ID' format 99999
col awr_cost format 99999999
col optimizer_Cost heading 'Opt.|Cost' format 99999999
col optimizer_env_hash_value heading 'Opt. Env.|Hash Value'
col num_stats heading 'Num|Stats' format 9999
alter session set nls_date_format = 'hh24:mi:ss dd.mm.yy';
break on plan_hash_value skip 1 on sql_id on dbid
ttitle 'compare AWR/recent plan costs'

with h as ( /*captured plan outside retention limit*/
select  p.dbid, p.sql_id, p.plan_hash_Value, max(cost) cost
,       max(p.timestamp) timestamp
from    dba_hist_sql_plan p 
,       dba_hist_wr_control c
where   p.dbid = c.dbid
and     p.cost>0
and     (p.object_owner != 'SYS' OR p.object_owner IS NULL)  --omit SYS owned objects
and     p.timestamp < sysdate-c.retention 
group by p.dbid, p.sql_id, p.plan_hash_value
), s as ( /*SQL statistics*/
select  t.dbid, t.sql_id, t.plan_hash_value, t.optimizer_env_hash_value
,       t.optimizer_cost
,       MIN(t.snap_id) snap_id
,       MIN(s.end_interval_time) end_interval_time
,       COUNT(*) num_stats
from    dba_hist_snapshot s
,       dba_hist_sqlstat t
where   s.dbid = t.dbid
and     s.snap_id = t.snap_id
and     t.optimizer_cost > 0
GROUP BY t.dbid, t.sql_id, t.plan_hash_value, t.optimizer_env_hash_value
,       t.optimizer_cost
), x as (
Select  NVL(h.dbid,s.dbid) dbid
,       NVL(h.sql_id,s.sql_id) sql_id
,       NVL(h.plan_hash_value,s.plan_hash_value) plan_hash_value
,       h.cost hcost, h.timestamp htimestamp
,       s.snap_id, s.end_interval_time
,       s.optimizer_env_hash_value, s.optimizer_cost
,       s.num_stats
,       s.optimizer_cost-h.cost costdiff
,       100*s.optimizer_cost/NULLIF(h.cost,0) costpctdiff
From    h join s
on      h.plan_hash_value = s.plan_hash_value
and     h.sql_id = s.sql_id
and     h.dbid = s.dbid
), y as (
SELECT  x.*
,       MAX(ABS(costpctdiff)) OVER (PARTITION BY dbid, sql_id, plan_hash_value) maxcostpctdiff
,       MAX(ABS(costdiff)) OVER (PARTITION BY dbid, sql_id, plan_hash_value) maxcostabsdiff
FROM    x
)
SELECT  dbid, sql_id, plan_hash_value, hcost, htimestamp
,       snap_id, end_interval_time, optimizer_env_hash_value, optimizer_cost, num_stats, costdiff, costpctdiff
FROM    y
WHERE   maxcostpctdiff>=10
And     maxcostabsdiff>=10
order by plan_hash_value,sql_id,end_interval_time
/
break on report
ttitle off
We can see that according to some statement statistics, for the same execution plan, costs vary over time.  They can go up or down.  This can be caused by different object statistics or different bind variables.  It is also perfectly possible to have different cursors produce the same execution plan but with a different cost.
                                                       compare AWR/recent plan costs

                           SQL Plan      AWR AWR                 Snap                             Opt. Env.      Opt.   Num      Cost   Cost
      DBID SQL_ID        Hash Value     Cost Timestamp             ID END_INTERVAL_TIME          Hash Value      Cost Stats      Diff  %Diff
---------- ------------- ---------- -------- ----------------- ------ -------------------------- ---------- --------- ----- --------- ------
…
3719366004 1852wpfdvbbdb  428098007     5759 06:04:57 07.01.19  66189 18-JUL-19 01.00.29.158 PM  2275008064     15760    22     10001    274
                                        5759 06:04:57 07.01.19  66191 18-JUL-19 02.00.30.019 PM  2275008064      6434    40       675    112
                                        5759 06:04:57 07.01.19  66196 18-JUL-19 04.30.41.024 PM  2275008064      6435    24       676    112
                                        5759 06:04:57 07.01.19  66197 18-JUL-19 05.00.26.793 PM  2275008064      6448    10       689    112
                                        5759 06:04:57 07.01.19  66217 19-JUL-19 03.00.09.245 AM  2275008064      6444     2       685    112
3719366004 dnxpk0a7tmk2c                5761 14:25:15 07.01.19  66247 19-JUL-19 06.00.59.483 PM  2275008064      4258     2     -1503     74
…
3719366004 882x55gh06a0k 4034560521   321903 09:02:37 01.04.19  66208 18-JUL-19 10.30.45.355 PM  2275008064    471900     6    149997    147
                                      321903 09:02:37 01.04.19  66285 20-JUL-19 01.00.01.061 PM  2275008064    471848     2    149945    147
                                      321903 09:02:37 01.04.19  66326 21-JUL-19 09.30.18.514 AM  2275008064    473557     2    151654    147
…
3719366004 fgyja8zfdf4s6 4057647380      188 22:02:32 11.02.19  66214 19-JUL-19 01.30.10.479 AM    50414560       496     4       308    264

           9ud99dpmkdtq6 4059558786        3 03:56:20 01.11.18  66191 18-JUL-19 02.00.30.019 PM    50414560       170   226       167   5667

           3d6ph39wbq0zq 4078589457    62946 06:23:42 01.05.18  66270 20-JUL-19 05.30.15.218 AM  3570113222    196673     2    133727    312

           9ud99dpmkdtq6 4082397478        3 20:36:53 03.10.18  66189 18-JUL-19 01.00.29.158 PM    50414560      2296   174      2293  76533

           6rkn5ty3mpdg8 4086831979     1096 11:54:01 22.01.19  66201 18-JUL-19 07.00.07.631 PM    50414560       549    16      -547     50
                                        1096 11:54:01 22.01.19  66212 19-JUL-19 12.30.00.506 AM    50414560      6055    16      4959    552
…

Summary

Old execution plans will not be purged from AWR if there are recent SQL statistics for the same SQL ID and execution plan hash value.  You will see the old plan costs if you extract the plan with
dbms_xplan.display_awr().

Is that a problem?
  • This does not have an impact on the behaviour of the database.  The data in AWR tells you what happened in the past.
  • I think it is most important to know how far you can trust display_awr().  The plan can be used as a guide to the path of execution, but not to the amount of work done.  It will show you the correct operations in the execution plan, but costs, bind variables, partition IDs and anything that does not contribute to the plan hash value should be approached with caution.  This is why I often profile DB time for a SQL statement by SQL_PLAN_LINE_ID so I can see how much time was spent in each operation.
Oracle Doc ID 1478615.1 implies that it is permitted to manually purge specific plans.  A subsequent AWR snapshot might then capture a new version of the same execution plan.  However, there are potential problems with this approach.
You may have an example where the old estimates for a plan don't make sense for a recent execution, but if you replace the plan with an updated one to get recent estimates that may not make sense for some activity in the past.
You can have different child cursors with different costs for the same execution plan.  Therefore, it is effectively arbitrary as to which plan would be captured into AWR, so there is no guarantee that you will get a more representative plan.  It is not even certain that costs for one cursor make sense for another.

I am not sure that there is any material benefit that is worth the effort of implementing something to additionally purge old execution plans.  I would welcome comments and opinions on this subject.

No comments :