Tuesday, March 04, 2025

SQL Quarantine Behaviour When the Same SQL Executes in Different Resource Manager Consumer Groups With Different CPU Time Limits

What happens if I have two consumer groups with different CPU time limits, but the same long-running SQLs can be run in either group?

There is a use case for this question.  In my sample resource manager plan for PeopleSoft, there are consumer groups with different maximum CPU time limits.  4 hours for scheduled queries, and 2 hours for SQL*Plus/SQL Developer sessions.  

The tests in this article are based on the examples in Tim Hall's article about SQL Quarantine in 19c.

N.B. SQL Quarantine is an Exadata-only feature.

I have created a simple plan with two consumer groups that have CPU time-outs of 30 and 60 seconds.  There are no group mappings; I will switch groups manually.

BEGIN 
  DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;

  DBMS_RESOURCE_MANAGER.CREATE_PLAN('GFC_TEST_PLAN');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('LOW_LIMITED30_GROUP');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('LOW_LIMITED60_GROUP');

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    'GFC_TEST_PLAN', 'LOW_LIMITED30_GROUP',
    mgmt_p8 => 1,
    switch_group => 'CANCEL_SQL', 
    switch_time => 30, 
    switch_for_call => TRUE);

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    'GFC_TEST_PLAN', 'LOW_LIMITED60_GROUP', 
    mgmt_p8 => 1,
    switch_group => 'CANCEL_SQL', 
    switch_time => 60, 
    switch_for_call => TRUE);

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    'GFC_TEST_PLAN', 'OTHER_GROUPS');

  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;

  DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group(
    grantee_name   => user,
    consumer_group => 'LOW_LIMITED30_GROUP',
    grant_option   => FALSE);

  DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group(
    grantee_name   => user,
    consumer_group => 'LOW_LIMITED60_GROUP',
    grant_option   => FALSE);
END;
/

ALTER SYSTEM SET resource_manager_plan = 'GFC_TEST_PLAN';
I am going to create a PL/SQL function that will burn CPU as it goes around a loop.  It is just like Tim's, except that I specify the number of seconds for which I want mine to run, and it will return the number of times it has looped.
set serveroutput on timi on
CREATE OR REPLACE FUNCTION burn_cpu (p_secs IN NUMBER)
  RETURN NUMBER
AS
  l_start_time DATE;
  l_number     NUMBER := 1;
BEGIN
  l_start_time := SYSDATE;
  LOOP
    EXIT WHEN SYSDATE - l_start_time > (p_secs/86400);
    l_number := l_number + 1;
  END LOOP;
  RETURN l_number;
END;
/
I will start by switching to the consumer group LOW_LIMITED30_GROUP
DECLARE
  l_session v$session%ROWTYPE;
  l_consumer_group VARCHAR2(30):= 'LOW_LIMITED30_GROUP';
--l_consumer_group VARCHAR2(30):= 'LOW_LIMITED60_GROUP';
BEGIN 
  SELECT * INTO l_session FROM v$session WHERE sid = Sys_Context('USERENV', 'SID');
  dbms_output.put_line(l_session.sid||'.'||l_session.serial#||':'||l_session.module||':'||l_session.action||':'||l_session.resource_Consumer_Group);
  DBMS_RESOURCE_MANAGER.SWITCH_CONSUMER_GROUP_FOR_SESS(l_session.sid,l_session.serial#,consumer_group=>l_consumer_group);
  SELECT * INTO l_session FROM v$session WHERE sid = Sys_Context('USERENV', 'SID');
  dbms_output.put_line(l_session.sid||'.'||l_session.serial#||':'||l_session.module||':'||l_session.action||':'||l_session.resource_Consumer_Group);
END;
/
I have patch 30104721 installed on 19c to backport the new parameters in 23c, so I need to enable quarantine capture (see Oracle Doc ID 2635030.1: 19c New Feature SQL Quarantine - How To Stop Automatic SQL Quarantine).
ALTER SESSION SET OPTIMIZER_CAPTURE_SQL_QUARANTINE = TRUE;
ALTER SESSION SET OPTIMIZER_USE_SQL_QUARANTINE = TRUE;
I can run queries that usually run for 45 and 75 seconds, but they will be stopped when they have consumed 30 seconds of CPU time.
SELECT burn_cpu (45) FROM dual
Error report -
ORA-12777: A non-continuable error encountered.  Check the error stack for additional information [40], [pfrruncae_check_async_events], [], [].
ORA-00040: active time limit exceeded - call aborted

Elapsed: 00:00:33.574
SELECT burn_cpu (75) FROM dual
Error report -
ORA-12777: A non-continuable error encountered.  Check the error stack for additional information [40], [pfrruncae_check_async_events], [], [].
ORA-00040: active time limit exceeded - call aborted

Elapsed: 00:00:34.023
After a short period, I have 2 quarantine directives

select signature, name, sql_text, plan_hash_value, cpu_time, origin
 from dba_sql_quarantine
where sql_text like '%burn_cpu%'


             SIGNATURE NAME                                     SQL_TEXT                       PLAN_HASH_VALUE CPU_TIME   ORIGIN           
---------------------- ---------------------------------------- ------------------------------ --------------- ---------- ----------------
  12281544607895693562 SQL_QUARANTINE_anw6gs0gbta7u125daea2     SELECT burn_cpu (75) FROM dual       308129442 30         RESOURCE-MANAGER
   3516266497211383477 SQL_QUARANTINE_31m29rz6y8gpp125daea2     SELECT burn_cpu (45) FROM dual       308129442 30         RESOURCE-MANAGER
Now, the statements are prevented from running immediately.

SELECT burn_cpu (45) FROM dual
Error report -
ORA-56955: quarantined plan used

Elapsed: 00:00:00.414
SELECT burn_cpu (75) FROM dual
Error report -
ORA-56955: quarantined plan used

Elapsed: 00:00:00.416

Next, I will switch the consumer group to LOW_LIMITED60_GROUP.  The CPU limit is now 60 seconds.  Now, neither command returns ORA-56955.

The 45-second process runs to a successful conclusion.

SELECT burn_cpu (45) FROM dual;

BURN_CPU(45)
------------
    63264226

Elapsed: 00:00:45.846

The 75-second process runs for 65 seconds and is then terminated with ORA-00040 because it has reached the time limit, but not because the execution plan was quarantined.

SELECT burn_cpu (75) FROM dual
Error report -
ORA-12777: A non-continuable error encountered.  Check the error stack for additional information [40], [pfrruncae_check_async_events], [], [].
ORA-00040: active time limit exceeded - call aborted

Elapsed: 00:01:05.952

A quarantine directive with a lower CPU_TIME limit than that of the current consumer group is not applied because the statement may run to completion in a time between the lower limit and the higher limit.  Oracle allows the query to execute; it will be aborted when it reaches the higher CPU time.  

After a while, as the documentation indicates, quarantine capture will update the CPU time limit on the existing quarantine definition limit to the higher limit in the current consumer group.

The query in our example runs for 20 minutes only once, and then never again—unless the resource limit increases or the plan changes. If the limit is increased to 25 minutes, then the Resource Manager permits the statement to run again with the quarantined plan. If the statement runs for 23 minutes, which is below the new threshold, then the Resource Manager removes the plan from quarantine. If the statement runs for 26 minutes, which is above the new threshold, the plan remains in quarantine unless the limit is increased.

However, for the statement that runs successfully (for 45 seconds) in the consumer group with the higher limit (60 seconds), I have not seen the database remove the quarantine directive, even if I remove LOW_LIMITED30_GROUP consumer group from all resource plans and drop it from the database.

SIGNATURE           NAME                                     SQL_TEXT                       PLAN_HASH_VALUE CPU_TIME   ORIGIN          
------------------- ---------------------------------------- ------------------------------ --------------- ---------- ----------------
  12281544607895693562 SQL_QUARANTINE_anw6gs0gbta7u125daea2  SELECT burn_cpu (75) FROM dual       308129442 60         RESOURCE-MANAGER
   3516266497211383477 SQL_QUARANTINE_31m29rz6y8gpp125daea2  SELECT burn_cpu (45) FROM dual       308129442 30         RESOURCE-MANAGER

The next time an attempt is made to execute the 75-second statement in either consumer group, it is quarantined as before.

SELECT burn_cpu (75) FROM dual
Error report -
ORA-56955: quarantined plan used

Conclusion

  • SQL Quarantines are not tied to any particular resource plan or consumer group.
  • If the current consumer group doesn't have any timeout set, then any matching SQL quarantine directive is not applied
  • If a statement has a quarantine directive with a higher CPU time limit than the current consumer group, then it is applied immediately.
  • If a statement has a quarantine directive with a lower CPU time limit than the current consumer group, then it will be allowed to execute.  If the runtime then exceeds the consumer group CPU limit, then it will be timed out with ORA-00040.  The CPU time limit on the quarantine directive will be increased to the limit in the current consumer group.  In future, the statement will immediately be prevented from executing with ORA-56955 in both consumer groups.

Therefore, it is safe to allow different consumer groups in which the same long-running SQL statements may be executed to have different timeouts.  The lower quarantine timeout will not apply to executions in consumer groups with higher timeouts.  The SQL quarantine directives will evolve to have higher timeouts as required.

No comments :