Monday, December 15, 2008

Row Migration can Aggravate Contention on Cache Buffers Chains Latch

AWR Wait EventsOne of my customers has an Oracle based system with a large number of concurrent users. For time to time, the users would report that the system would 'grind to a halt'. Examination of AWR data showed lots of time spent waiting on latch: cache buffers chains (I discussed how to graph AWR data in Excel in a previous blog entry, which is how I produced this graph of database wait events).

They system had been able to go for weeks without an incident. More recently, as more users have been migrated onto the system, we would a series of days with spikes on this wait event, though they were normally quite short-lived. The incident shown in this chart was unusual in that it lasted most of a day. The chart also shows that a reasonable amount of time was lost on db file sequential read, this is consistent with blocks being loaded into the buffer cache, requiring access to the cache buffer chains, and hence requiring access to the latch that protects these chain.

Concurrent Database SessionsUsing a similar technique I was graph the number of database sessions over time, and I found a similar series of spikes.

By eye I could see that the spikes in the latch contention seemed to correspond to the spikes in the number of concurrent user sessions, and they had roughly the same shape.

I then graphed time waited for this latch against the number of concurrent user sessions and I got a strong, and surprisingly linear, correlation.Cache Buffers Chains Latch Wait -v- Concurrent Sessions

It is very clear that whenever we had more than about 90 concurrent user sessions, the system also lost time waiting on the cache buffers chains latch. Basically, this application won't scale any further!

So, why was the latch being held for such a long time? The buffer cache is 1Gb (with a block size of 8Kb), so it is large, but not excessively so. At this point somebody (and I can't claim the credit for this) thought to check for migrated rows on the tables involved in the longest running SQL statements during the periods of high latch contention, and discovered that a number of the most heavily use tables had a significant quantity of migrated rows.

A row migrates when the data is updated and there is no free space left in the data block to store the new data values. Oracle puts the row into a new block, but it does not update the indexes to point to the new block. Instead, it puts a forwarding pointer into the original block. Therefore, to read the data from a migrated row requires visits to two blocks. When retrieving data into the buffer cache your process need to update two buffer blocks, and must acquire the buffer chains latch twice. Row migration causes poor performance because Oracle must do twice as much work.

There are two factors that come together to cause row migration:
  • A process inserts a row, and then subsequently updates columns on that row, often from null to not null values, but sometimes with just more data. In my case, the application was storing XML structures in a clob (and most of the clobs were small enough to be stored in-line with the rest of the row).
  • There is insufficient free space left in the data blocks to allow for the subsequent updates to be stored in the original data block.
In this particular case, I was dealing with a third-party packaged application. So there was no possibility to change the way the application inserts and updats the data. However, we could and did rebuild the object to eliminate migration of existing rows and set a realistic PCTFREE to reserve free space for future rows.

AWR Wait, magnifiedThis chart shows the same data as the first one, except that I have changed the scale on the y-axis. The tables with the worst row migration were reorganised on the Tuesday night and the system ran on Wednesday under a normal load without any problem.

Contention on the buffer cache chains latch fell to trivial levels, just 408 seconds in 24 hours. Interestingly, the amount of db file sequential read also fell by 70%. This is due to better cache efficiency. Since eliminating much of the row migration, fewer blocks need to be loaded into cache, and so blocks stay in the buffer cache for longer.

This is a real-life example of:
  • why it is important, sometimes critically so, to set physical attributes on tables properly;
  • why it is necessary to understand how your application is updating the database;
  • and what are the potential implications of not doing so!

Saturday, December 13, 2008

Graphing AWR Data in Excel

I often use data collected by the Oracle Automatic Workload Repository (AWR) to help to diagnose performance problems. However, I often work on performance problems with application teams, rather than the DBAs.  It is surprising how often that I don't have access to Oracle Enterprise Manager.

Somebody might say that the system was slow at a particular time. I want to get an overview of the database at that time, and I might also want to compare it to another time when the system wasn't slow. Later I may generate AWR reports using particular pairs of snapshots, but I need something to direct me to when an issue occurred, and hence which snapshots to compare.

When dealing with performance problems the question is nearly always the same. "How long did I spend doing what?". Amongst lots of metrics, AWR collects snapshots of how much time system-wide has been spent waiting on each wait event.

Thumbnail of AWR wait chartTo give me an overview of that data, I have created an excel spreadsheet that will query the AWR repository and retrieve that data for up to a week into am Excel pivot table, and will then graph the top wait events. Click on the thumbnail on the left to see an example.

For me, this approach has a number of advantages:

  • First; don't drown in numbers. The eye is very good at picking out patterns and irregularities. You can then drill into the numbers behind the chart.
  • The chart provides a quick overview of the behaviour of the database. You can quickly see if there is anything to examine in more detail, or whether the database is exhibiting its normal characteristics.
  • The AWR data is extracted into the spreadsheet, so I can keep a copy of the spreadsheet for reference. I generally like to collect AWR snapshots every 15 minutes. One week of data, at that frequency, turns into about 6Mb of spreadsheet.
  • Excel charts can easily be put into documents, presentations or e-mails. Excel is very flexible, it is easy to format the chart to show only what is important.
  • It's free (that is to say that you won't need to go and buy anything else). All you need is Excel and an Oracle client.
Over time I have produced a variety of spreadsheets that collect all sorts of metrics from the Oracle database and others that collect specific metrics from PeopleSoft applications.

To be fair, AWR's forerunner, Statspack collects a lot of similar data to AWR, the essential difference being that AWR is a licensed option. This technique could also be used to extract data from the statspack repository, but most of my customers are licensed for AWR.

You can download a sample Excel workbook and query from my website. Go to one of the sheets with data and 'Refresh Data'. The first time you will probably be prompted to select and ODBC source. You will be prompted for the login credentials.

In a subsequent blog posting I will discuss a real example of where this technique helped me to resolve a problem.

It is fairly straightforward to create spreadsheets that query databases, but here are some links that should help to get you going if you are not familiar with the process.
  • From the Microsoft website: 'Use Microsoft Query to retrieve external data'. This is based on Excel 2007, but it explains how Excel calls MSQuery which then calls the ODBC driver. However, I recommend that you do not use the Query Wizard. I prefer to get the query working in SQL*Plus and then I copy it into MSQuery
  • Querying External Data in Excel. This is a straightforward recipe for using MSQuery.
Update 16.6.2010: I have released a new version with an adjustment to the query used in the spreadsheet.  AWR snapshots happen at slightly different times on different RAC nodes, although they have the same SNAP_ID.  This was causing multiple rows for the same snapshot in the pivot tables, and so the charts did not show correctly aggregated wait events across RAC nodes.