Tuesday, January 20, 2009

Detecting and Fixing Row Migration

In my previous posting, I discussed how migrated rows led to latch contention problems on a system. In this entry I will explain how I identified and removed the migrated rows, and set PCTFREE appropriately on each table so that the problem will not recur.

DBMS_STATS only collects the statistics required by the cost-based optimizer; it does not populate the other statistics, including CHAIN_CNT and AVG_SPACE. Instead, you must use the ANALYZE command if you want to know about chained and migrated rows. So I wrote a simple script (available on my website) to work through a set of tables and list the chained or migrated rows into a table (the CHAINED_ROWS table is created by the Oracle-supplied script $ORACLE_HOME/rdbms/admin/utlchain.sql). I sometimes choose to analyse only certain tables. The criteria in the comment are examples of criteria I add to the driving query to restrict the list of tables to be analysed.

BEGIN
FOR x IN (
SELECT owner, table_name, num_rows
FROM   all_tables t
WHERE  1=1
/*-------------------------------------------------------
AND    NOT table_name IN(SELECT DISTINCT table_name FROM chained_rows)
AND    num_rows >= 1000000
AND    num_rows BETWEEN 100000 AND 1000000
/*--------------------------------------------------------*/
AND    temporary = 'N'
) LOOP
DELETE FROM chained_rows
WHERE owner_name = x.owner
AND table_name = x.table_name;

EXECUTE IMMEDIATE 'ANALYZE TABLE '||x.owner||'.'||x.table_name
||' LIST CHAINED ROWS INTO chained_rows';
END LOOP;
END;
/

Having identified the migrated rows, I produced a simple report.

SELECT /*+LEADING(c)*/ c.*, t.num_rows
, c.chained_rows/t.num_rows*100 pct_chained
, t.pct_free, t.pct_used
, 100-FLOOR((100-t.pct_free)*(1-c.chained_rows/t.num_rows)) new_pct_free
FROM (
SELECT owner_name, table_name, COUNT(*) chained_rows
FROM chained_rows c
GROUP BY owner_name, table_name) c
, all_tables t
WHERE t.owner = c.owner_name
AND   t.table_name = c.table_name
AND   t.num_rows > 0
ORDER BY chained_rows desc, 1,2
/

The last column of the report is a calculated new value for PCTFREE. The amount of space in a block that can be used to insert new rows (100-PCTFREE) is scaled back by the proportion of migrated rows.

NB: This approach doesn't take chained rows into account. Chained rows are too big to fit in a single block under any circumstances, as opposed to migrated rows that could have fitted in a block had there been space. However, Oracle counts both types and stores the result in the column CHAIN_CNT.

                                 Chained   Number             %    %   New
Owner    Table Name                 Rows  of Rows %Chained Free Used %Free
-------- -------------------- ---------- -------- -------- ---- ---- -----
MY_APP   MY_TABLE                 239667  1193233     20.1   10   40    29
…
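The NEW_PCT_FREE expression in the report query can be sketched as follows (a hypothetical Python function for illustration, not part of the original script):

```python
import math

# Sketch of the NEW_PCT_FREE calculation from the report query:
# the space available for inserts (100 - PCTFREE) is scaled back by the
# proportion of rows that have migrated.
def new_pct_free(pct_free, chained_rows, num_rows):
    return 100 - math.floor((100 - pct_free) * (1 - chained_rows / num_rows))

# Values from the sample report row above:
print(new_pct_free(10, 239667, 1193233))  # → 29
```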

The script then generates a series of DDL commands for each table. First, the table is rebuilt by moving it into the same tablespace.

I specify PCTFREE as 1. The whole table does not usually need to be rebuilt with the new, higher value of PCTFREE. Most of the rows are fully populated and will generally not grow further because they have already migrated. Setting a high value for PCTFREE would simply leave a lot of free space, and cause Oracle to read more blocks for the same data. Instead, I set a low value for PCTFREE in order to pack the data into the minimum number of blocks.

ALTER TABLE MY_APP.MY_TABLE MOVE TABLESPACE MYAPP_TABLE PCTFREE 1;
…

The new value for PCTFREE is then applied to the table. This approach is not foolproof. It is possible for existing rows that have now been packed into the minimum number of blocks to grow and be migrated. However, I have found that this happens relatively rarely.

If necessary, I also reduce PCTUSED such that the sum of the new PCTFREE and the old PCTUSED does not exceed 90. The old value is shown in the comment for reference.

ALTER TABLE MY_APP.MY_TABLE PCTFREE 29 /*10*/ PCTUSED 40;
…
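The rule for adjusting PCTUSED can be sketched as follows (a hypothetical helper, not from the original script):

```python
# Cap PCTUSED so that the new PCTFREE plus PCTUSED does not exceed 90,
# leaving the old value alone when it is already within the limit.
def adjusted_pct_used(new_pct_free, old_pct_used, limit=90):
    return min(old_pct_used, limit - new_pct_free)

print(adjusted_pct_used(29, 40))  # → 40 (29 + 40 <= 90, so unchanged)
print(adjusted_pct_used(60, 40))  # → 30 (reduced so the sum stays at 90)
```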

Finally, all the indexes on the rebuilt tables need to be rebuilt, because they are invalidated by the table rebuild.

ALTER INDEX IDX_MY_TABLE_BH25150A REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150B REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150C REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150D REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150E REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150F REBUILD TABLESPACE MYAPP_INDEX;

Conclusion
  • The table no longer has migrated rows.
  • The existing data is packed with little free space, minimising the size of the table.
  • Free space will be left in new blocks to allow new rows to expand without migrating.
  • All of the DDL to fix the problem has been built dynamically.
Update 5.11.2009: See also Tanel Poder's blog: Core IT for Geeks and Pros: Detecting and Fixing Row Migration

Monday, December 15, 2008

Row Migration can Aggravate Contention on Cache Buffers Chains Latch

[Chart: AWR Wait Events]
One of my customers has an Oracle-based system with a large number of concurrent users. From time to time, the users would report that the system would 'grind to a halt'. Examination of AWR data showed lots of time spent waiting on latch: cache buffers chains (I discussed how to graph AWR data in Excel in a previous blog entry, which is how I produced this graph of database wait events).

The system had been able to go for weeks without an incident. More recently, as more users were migrated onto the system, we would see a series of days with spikes on this wait event, though they were normally quite short-lived. The incident shown in this chart was unusual in that it lasted most of a day. The chart also shows that a significant amount of time was lost on db file sequential read; this is consistent with blocks being loaded into the buffer cache, requiring access to the cache buffers chains, and hence to the latch that protects those chains.

[Chart: Concurrent Database Sessions]
Using a similar technique, I was able to graph the number of database sessions over time, and I found a similar series of spikes.

By eye I could see that the spikes in the latch contention seemed to correspond to the spikes in the number of concurrent user sessions, and they had roughly the same shape.

I then graphed time waited for this latch against the number of concurrent user sessions and I got a strong, and surprisingly linear, correlation.
[Chart: Cache Buffers Chains Latch Wait -v- Concurrent Sessions]

It is very clear that whenever we had more than about 90 concurrent user sessions, the system also lost time waiting on the cache buffers chains latch. Basically, this application won't scale any further!

So, why was the latch being held for such a long time? The buffer cache is 1Gb (with a block size of 8Kb), so it is large, but not excessively so. At this point somebody (and I can't claim the credit for this) thought to check for migrated rows on the tables involved in the longest-running SQL statements during the periods of high latch contention, and discovered that a number of the most heavily used tables had a significant quantity of migrated rows.

A row migrates when the data is updated and there is no free space left in the data block to store the new data values. Oracle puts the row into a new block, but it does not update the indexes to point to the new block. Instead, it puts a forwarding pointer into the original block. Therefore, reading the data from a migrated row requires visits to two blocks. When retrieving the data into the buffer cache, your process needs to access two buffer blocks, and must acquire the cache buffers chains latch twice. Row migration causes poor performance because Oracle must do twice as much work.
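The effect on logical I/O can be illustrated with a back-of-the-envelope sketch (hypothetical numbers, not measurements from this system):

```python
# Each migrated row costs one extra buffer visit (and hence an extra
# cache buffers chains latch get) to follow the forwarding pointer.
def buffer_visits(rows_fetched, migrated_fraction):
    return rows_fetched * (1 + migrated_fraction)

# With ~20% of rows migrated, fetching 1,000 rows by rowid touches
# roughly 1,200 buffers instead of 1,000.
print(buffer_visits(1000, 0.2))
```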

There are two factors that come together to cause row migration:
  • A process inserts a row, and then subsequently updates columns on that row, often from null to not null values, but sometimes with just more data. In my case, the application was storing XML structures in a clob (and most of the clobs were small enough to be stored in-line with the rest of the row).
  • There is insufficient free space left in the data blocks to allow for the subsequent updates to be stored in the original data block.
In this particular case, I was dealing with a third-party packaged application, so there was no possibility of changing the way the application inserts and updates the data. However, we could, and did, rebuild the objects to eliminate migration of existing rows and set a realistic PCTFREE to reserve free space for future rows.

[Chart: AWR Wait Events, magnified]
This chart shows the same data as the first one, except that I have changed the scale on the y-axis. The tables with the worst row migration were reorganised on the Tuesday night, and the system ran on Wednesday under a normal load without any problem.

Contention on the cache buffers chains latch fell to trivial levels, just 408 seconds in 24 hours. Interestingly, the amount of db file sequential read also fell by 70%. This is due to better cache efficiency: having eliminated much of the row migration, fewer blocks need to be loaded into the cache, and so blocks stay in the buffer cache for longer.

This is a real-life example of:
  • why it is important, sometimes critically so, to set physical attributes on tables properly;
  • why it is necessary to understand how your application is updating the database;
  • and what are the potential implications of not doing so!

Saturday, December 13, 2008

Graphing AWR Data in Excel

I often use data collected by the Oracle Automatic Workload Repository (AWR) to help to diagnose performance problems. However, I often work on performance problems with application teams, rather than the DBAs. It is surprising how often I don't have access to Oracle Enterprise Manager.

Somebody might say that the system was slow at a particular time. I want to get an overview of the database at that time, and I might also want to compare it to another time when the system wasn't slow. Later I may generate AWR reports using particular pairs of snapshots, but I need something to direct me to when an issue occurred, and hence which snapshots to compare.

When dealing with performance problems the question is nearly always the same. "How long did I spend doing what?". Amongst lots of metrics, AWR collects snapshots of how much time system-wide has been spent waiting on each wait event.

[Chart: thumbnail of the AWR wait chart]
To give me an overview of that data, I have created an Excel spreadsheet that will query the AWR repository, retrieve that data for up to a week into an Excel pivot table, and then graph the top wait events. Click on the thumbnail on the left to see an example.
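The pivot that the spreadsheet builds can be sketched in a few lines (hypothetical sample data; the spreadsheet itself queries the AWR repository views):

```python
from collections import defaultdict

# Hypothetical snapshot rows: (snapshot_time, wait_event, seconds waited
# in that interval), i.e. the deltas between successive AWR snapshots.
rows = [
    ("09:00", "latch: cache buffers chains", 120),
    ("09:00", "db file sequential read", 300),
    ("09:15", "latch: cache buffers chains", 450),
    ("09:15", "db file sequential read", 380),
]

# Pivot into {snapshot_time: {event: seconds}} - one chart series per event.
pivot = defaultdict(dict)
for snap, event, secs in rows:
    pivot[snap][event] = secs

print(pivot["09:15"]["latch: cache buffers chains"])  # → 450
```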

For me, this approach has a number of advantages:

  • First, don't drown in numbers. The eye is very good at picking out patterns and irregularities. You can then drill into the numbers behind the chart.
  • The chart provides a quick overview of the behaviour of the database. You can quickly see if there is anything to examine in more detail, or whether the database is exhibiting its normal characteristics.
  • The AWR data is extracted into the spreadsheet, so I can keep a copy of the spreadsheet for reference. I generally like to collect AWR snapshots every 15 minutes. One week of data, at that frequency, turns into about 6Mb of spreadsheet.
  • Excel charts can easily be put into documents, presentations or e-mails. Excel is very flexible, it is easy to format the chart to show only what is important.
  • It's free (that is to say that you won't need to go and buy anything else). All you need is Excel and an Oracle client.
Over time I have produced a variety of spreadsheets that collect all sorts of metrics from the Oracle database and others that collect specific metrics from PeopleSoft applications.

To be fair, AWR's forerunner, Statspack, collects a lot of similar data; the essential difference is that AWR is a licensed option. This technique could also be used to extract data from the Statspack repository, but most of my customers are licensed for AWR.

You can download a sample Excel workbook and query from my website. Go to one of the sheets with data and 'Refresh Data'. The first time, you will probably be prompted to select an ODBC source. You will then be prompted for the login credentials.

In a subsequent blog posting I will discuss a real example of where this technique helped me to resolve a problem.

It is fairly straightforward to create spreadsheets that query databases, but here are some links that should help to get you going if you are not familiar with the process.
  • From the Microsoft website: 'Use Microsoft Query to retrieve external data'. This is based on Excel 2007, but it explains how Excel calls MSQuery, which then calls the ODBC driver. However, I recommend that you do not use the Query Wizard. I prefer to get the query working in SQL*Plus and then copy it into MSQuery.
  • Querying External Data in Excel. This is a straightforward recipe for using MSQuery.
Update 16.6.2010: I have released a new version with an adjustment to the query used in the spreadsheet.  AWR snapshots happen at slightly different times on different RAC nodes, although they have the same SNAP_ID.  This was causing multiple rows for the same snapshot in the pivot tables, and so the charts did not show correctly aggregated wait events across RAC nodes.

Sunday, October 28, 2007

TM locking: Checking for Missing Indexes on Foreign Key Constraints

Recently, I was working on a packaged application purchased from a third-party vendor. It is one of those platform agnostic systems that started life on Microsoft SQL Server, and has been ported to Oracle. I spend a lot of my time working with PeopleSoft, so I had a certain sense of déjà vu. However, this application uses referential integrity.

The application was upgraded, and simultaneously Oracle was upgraded to 10g, and the system then exhibited TM contention. It had probably been suffering from TM contention while running on Oracle 9i, but we hadn't realised because Oracle 9i reports only the generic 'enqueue' wait event.

From 10g, there are no fewer than 208 different enqueue wait events that show the type of lock that the process is waiting for, and sometimes additional information. In my case it was event 175. Events can be listed from v$event_name.

SELECT event#, name FROM v$event_name
WHERE UPPER(name) LIKE 'ENQ: TM%'
/
    EVENT# NAME
---------- --------------------
       175 enq: TM - contention

With a little help from my friends I came to realise that the cause of this contention was that the system had foreign key constraints on columns that were not indexed. Having found one example of this, I realised that I needed a way to check the entire data model. The result was the following SQL and PL/SQL script.

REM fk_index_check.sql
REM 19.10.2007

Uncommenting the following section will produce a test case that should build two indexes.

/*--------------------------------------------------------------
ALTER TABLE EMP_TAB DROP CONSTRAINT MGR_FKEY;
ALTER TABLE EMP_TAB DROP CONSTRAINT DEPT_FKEY;
DROP TABLE Emp_tab;
DROP TABLE DEPT_TAB;

CREATE TABLE Dept_tab (
setid   NUMBER(3),
deptno  NUMBER(3),
dname   VARCHAR2(15),
loc     VARCHAR2(15)
--CONSTRAINT dname_ukey UNIQUE (Dname, Loc),
--CONSTRAINT loc_check1
--CHECK (loc IN ('NEW YORK', 'BOSTON', 'CHICAGO'))
,CONSTRAINT Dept_pkey PRIMARY KEY (setid,deptno)
)
/
CREATE TABLE Emp_tab (
empno    NUMBER(5) CONSTRAINT emp_pkey PRIMARY KEY,
ename    VARCHAR2(15) NOT NULL,
job      VARCHAR2(10),
mgr      NUMBER(5) CONSTRAINT mgr_fkey REFERENCES emp_tab,
hiredate DATE,
sal      NUMBER(7,2),
comm     NUMBER(5,2),
setid    NUMBER(3),
deptno   NUMBER(3) NOT NULL,
CONSTRAINT dept_fkey FOREIGN KEY (setid,deptno)
REFERENCES dept_tab (setid,deptno) ON DELETE CASCADE
)
/
/*------------------------------------------------------------*/
set serveroutput on size 1000000

GFC_FK_INDEX_CHECK is a working storage table that holds the results of the tests on each foreign key.

DROP TABLE gfc_fk_index_check
/
CREATE TABLE gfc_fk_index_check
(owner             VARCHAR2(30) NOT NULL
,table_name        VARCHAR2(30) NOT NULL
,constraint_name   VARCHAR2(30) NOT NULL
,r_owner           VARCHAR2(30) NOT NULL
,r_table_name      VARCHAR2(30) NOT NULL
,r_constraint_name VARCHAR2(30) NOT NULL
,i_index_owner     VARCHAR2(30)
,i_index_name      VARCHAR2(30)
,i_status          VARCHAR2(30) DEFAULT 'UNKNOWN'
,i_column_list     VARCHAR2(300)
,CONSTRAINT gfc_fk_index_check_pk
PRIMARY KEY(table_name, constraint_name)
)
/
TRUNCATE TABLE gfc_fk_index_check
/

First the script populates the working storage table with all the referential integrity constraints that reference a primary key constraint.

INSERT INTO gfc_fk_index_check
(owner, table_name, constraint_name
,r_owner, r_constraint_name, r_table_name)
SELECT c.owner, c.table_name, c.constraint_name
,      c.r_owner, c.r_constraint_name
,      r.table_name r_table_name
FROM   all_constraints c
,      all_constraints r
WHERE  c.constraint_Type = 'R'
AND    r.owner = c.r_owner
AND    r.constraint_name = c.r_constraint_name
AND    r.constraint_Type = 'P'
AND    r.owner = user
/

This PL/SQL routine checks each foreign key constraint in the working storage table. For each constraint, it looks up the referencing columns in ALL_CONS_COLUMNS and builds a dynamic query that selects the owner and name of an index with the same columns in the same positions. The name of that index and the column list are stored in the working storage table. Depending upon how many rows that query returns, a status string is written to the table: No Index, Index Found, or Multiple Indexes.

DECLARE
l_counter     NUMBER;
l_column_list VARCHAR2(200);
l_sql1        VARCHAR2(4000);
l_sql2        VARCHAR2(4000);
l_tmp1        VARCHAR2(20);
l_tmp2        VARCHAR2(20);
l_alias       VARCHAR2(3);
l_oldalias    VARCHAR2(3);
l_index_owner VARCHAR2(30);
l_index_name  VARCHAR2(30);
l_status      VARCHAR2(30);
BEGIN
FOR a IN (SELECT * FROM gfc_fk_index_check) LOOP
 l_counter := 0;
 l_column_list := '';
 l_sql1 := 'SELECT i1.index_owner, i1.index_name';
 l_sql2 := '';
 FOR b IN (SELECT *
        FROM  all_cons_columns c
        WHERE c.owner = a.owner
        AND   c.constraint_name = a.constraint_name
        AND   c.table_name = a.table_name
        ORDER BY position) LOOP
  l_counter := l_counter + 1;
  l_oldalias := l_alias;
  l_alias := ' i'||TO_CHAR(l_counter);
  IF l_counter > 1 THEN
   l_sql1 := l_sql1||', '; 
   l_sql2 := l_sql2
         ||' AND '||l_oldalias||'.index_owner='
                  ||l_alias   ||'.index_owner'
         ||' AND '||l_oldalias||'.index_name='
                  ||l_alias   ||'.index_name'
         ||' AND ';
   l_column_list := l_column_list||',';
  ELSE
   l_sql1 := l_sql1||' FROM ';
   l_sql2 := l_sql2||' WHERE';
  END IF;
  l_sql1 := l_sql1||'all_ind_columns'||l_alias;
  l_sql2 := l_sql2
            ||l_alias||'.TABLE_OWNER='''||b.owner||''''
   ||' AND '||l_alias||'.TABLE_NAME='''||b.table_name||''''
   ||' AND '||l_alias||'.COLUMN_NAME='''||b.column_name||''''
   ||' AND '||l_alias||'.COLUMN_POSITION='''||b.position||'''';
  l_column_list := l_column_list||b.column_name;
 END LOOP;
--   dbms_output.put_line(l_sql1);
--   dbms_output.put_line(l_sql2);
--   dbms_output.put_line(l_column_list);
 l_status := a.i_status;
 l_index_owner := '';
 l_index_name := '';
 BEGIN
  EXECUTE IMMEDIATE l_sql1||l_sql2
               INTO l_index_owner, l_index_name;
  l_status := 'Index Found';
 EXCEPTION
  WHEN NO_DATA_FOUND THEN l_status := 'No Index';
  WHEN TOO_MANY_ROWS THEN l_status := 'Multiple Indexes';
 END;
 UPDATE gfc_fk_index_check
 SET    i_status = l_status
 ,      i_index_owner = l_index_owner
 ,      i_index_name  = l_index_name
 ,      i_column_list = l_column_list
 WHERE  owner = a.owner
 AND    table_name = a.table_name
 AND    constraint_name = a.constraint_name;
END LOOP;
COMMIT;
END;
/
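The matching rule that the dynamic SQL implements can be sketched outside the database (hypothetical data, for illustration only): an index supports a foreign key when the constraint's columns appear as the index's leading columns, in the same order.

```python
# True when fk_columns are the leading columns of index_columns, in order -
# the same test the generated query performs via COLUMN_POSITION.
def index_covers_fk(fk_columns, index_columns):
    return list(index_columns[:len(fk_columns)]) == list(fk_columns)

# EMP_TAB in the test case below has only its primary key index on EMPNO,
# so the DEPT_FKEY constraint on (SETID, DEPTNO) is reported as 'No Index'.
indexes_on_emp_tab = [["EMPNO"]]
fk = ["SETID", "DEPTNO"]
print(any(index_covers_fk(fk, ix) for ix in indexes_on_emp_tab))  # → False
```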

This query produces a simple report on each foreign key constraint.

set lines 90 head on feedback on echo on
column owner             format a20
column table_name        format a30
column constraint_name   format a30
column r_owner           format a20
column r_constraint_name format a30
column r_table_name      format a30
column i_index_owner     format a20
column i_index_name      format a30
column i_status          format a30
column i_column_list     format a80
spool fk_index_check
SELECT g.owner, g.table_name, g.constraint_name
,      g.r_owner, g.r_table_name, g.r_constraint_name
,      g.i_index_owner, g.i_index_name, g.i_status
,      g.i_column_list
FROM   gfc_fk_index_check g
/
spool off

This query is similar to the last, but it produces a report of just the largest tables that lack indexes on foreign key constraints. It shows tables with at least 10,000 rows (according to the CBO statistics), or at least the top 20. These are likely to be the most severe offenders.

spool fk_index_by_size
SELECT * from (
SELECT g.owner, g.table_name, g.constraint_name
,      g.r_owner, g.r_table_name, g.r_constraint_name
,      g.i_index_owner, g.i_index_name, g.i_status
,      /*t.temporary, t.partitioned, */ t.num_rows
,      g.i_column_list
FROM   gfc_fk_index_check g, all_tables t
WHERE  t.table_name = g.table_name
AND    t.owner = g.owner
AND    g.i_status = 'No Index'
ORDER BY num_rows desc
) WHERE rownum <= 20 or num_rows >= 10000
/
spool off

This query generates a script of CREATE INDEX DDL statements that will build the missing indexes. Each index will have the same name as the foreign key constraint to which it relates.

set head off trimout on trimspool on feedback off verify off timi off echo off lines 200
spool fk_index_build.sql
SELECT 'CREATE INDEX '||g.owner||'.'||g.constraint_name
      ||' ON '||g.owner||'.'||g.table_name
      ||' ('||g.i_column_list||');' build_indexes
FROM   gfc_fk_index_check g, all_tables t
WHERE  t.table_name = g.table_name
AND    t.owner = g.owner
AND    g.i_status = 'No Index'
ORDER BY t.num_rows
/
spool off
set lines 90 head on feedback on echo on

The test script correctly reports (in fk_index_check.LST) that there are two foreign keys that require supporting indexes.

OWNER           TABLE_NAME           CONSTRAINT_NAME
--------------- -------------------- --------------------
R_OWNER         R_TABLE_NAME         R_CONSTRAINT_NAME
--------------- -------------------- --------------------
I_INDEX_OWNER   I_INDEX_NAME         I_STATUS
--------------- -------------------- --------------------
I_COLUMN_LIST
---------------------------------------------------------
SYSADM          EMP_TAB              MGR_FKEY
SYSADM          EMP_TAB              EMP_PKEY
                                    No Index
MGR

SYSADM          EMP_TAB              DEPT_FKEY
SYSADM          DEPT_TAB             DEPT_PKEY
                                    No Index
SETID,DEPTNO

It produces another script fk_index_build.sql that will build the missing indexes.

CREATE INDEX SYSADM.MGR_FKEY ON SYSADM.EMP_TAB (MGR);
CREATE INDEX SYSADM.DEPT_FKEY ON SYSADM.EMP_TAB (SETID,DEPTNO);

When I ran this script on my problem application, it identified over 200 missing indexes across 900 foreign key constraints. Since building the indexes on the tables where I had seen TM locking, I have not seen any further TM contention.

The script can be downloaded from the Go-Faster website at http://www2.go-faster.co.uk/scripts.htm#fk_index_check.sql

Caveat: Just because you can index a foreign key doesn't mean that you should. See
http://www.jlcomp.demon.co.uk/faq/fk_ind.html