Tuesday, February 13, 2024

Table Clusters: 1. An Alternative to Partitioning? - Introduction & Ancient History

This post is the first part of a series that discusses table clustering in Oracle.
Links will appear as sections are posted.

Introduction

Table clustering and table partitioning are very different technologies.  However, they both create a relationship between the logical value of the data and its physical location.  Similar data values are stored together, and therefore dissimilar data values are kept apart.  

The advantage of storing similar values together is to reduce I/O and improve access time.  However, this series of blogs looks at the characteristic of keeping dissimilar values apart that, as with partitioning, can be harnessed to avoid the need to maintain read consistency during concurrent processing and therefore avoid its overhead.

Partitioning is only available in the Enterprise Edition of Oracle, and then you have to license the partitioning option.  Table clustering is available in all database versions and doesn't require any additional licence.  So you might consider clustering when partitioning is not an option.

Ancient History

The last time I put tables into a cluster was in 2001 on Oracle 7.3.3 (partitioning didn't become available until Oracle 8.0).  Our problem was that multiple instances of the PeopleSoft Global Payroll calculation were concurrently updating different rows in the same data blocks leading the database to generate read consistent copies of each block for each session.  That consumed lots of CPU, required additional space in the buffer cache, generated additional physical reads on the undo segments, and generated additional writes due to delayed block cleanout of dirty data blocks in the buffer cache.  This significantly degraded performance, and very soon overall performance became worse as we increased the number of concurrent processes.

I had the idea of clustering the payroll tables on employee ID.  Thus I could ensure the data for different employees was in different data blocks and the database wouldn't have to do read-consistent recovery on the blocks in those tables.  There might still be some contention on indexes, but this would be less severe on indexes that lead on the cluster key columns because index entries are sorted in key order.

"A table cluster is a group of tables that share common columns and store related data in the same blocks … Because table clusters store related rows of different tables in the same data blocks, properly used table clusters offer the following benefits over non-clustered tables:

see Oracle 19c Database Concepts: Overview of Table Clusters

Table clusters were not fashionable then, and have certainly not become more so since.  Although we all use them every day, the Oracle catalogue has 37 tables in 10 clusters.  In 19c, the C_OBJ# cluster contains 17 tables!  When I proposed table clustering, the Swiss DBA turned to me and said 'If you build a cluster, I am going to a Kloster!' (this pun works in German: a 'Kloster' is a monastery or convent).  This rebuke has stayed with me ever since.

Nonetheless, we rebuilt our result tables in a cluster, and it delivered a performance improvement until the data volumes grew such that suddenly we had multiple data blocks per cluster key, and then the performance was much worse!  Our mistake was not having enough columns in the cluster key, thus illustrating that the choice of cluster keys is very important.

However, that forced the upgrade to Oracle 8i and we started to use table partitioning, such that a partition corresponded to the data processed by each concurrent payroll process.  That approach works very well, certainly better than clustering, for many customers who use this product and are licensed for partitioning.  They could generally scale the number of streams until they fully loaded either the CPU or the disk subsystem. 

Now in 2023, I am looking at another large PeopleSoft HCM implementation using the same calculation engine for absence, but this customer isn't licensed for partitioning, so we are back to table clusters.

Now read on.

No comments :