Monday, November 16, 2020

Retrofitting Partitioning into Existing Applications: Example 2. Payroll

This post is part of a series about the partitioning of database objects.

Partitioning Payroll

Range and List Partitioning brings similar data together and therefore keeps dissimilar data apart. This has implications for read-consistency as well as improving query performance by partition elimination. 
Hash Partitioning spreads rows roughly evenly across a number of partitions. This can be used to mitigate contention problems. It is recommended that the number of hash partitions should be an integral power of 2 (ie. 2, 4, 8, 16 etc.) because the partition is taken from a number of bits from the hash value, and the distribution of data across the partitions works better. 
Payroll calculation involves lots of computation per employee. There isn't much opportunity for database parallelism. The PeopleSoft Global Payroll (GP) calculation process works through employees in a sequential fashion. Each payroll process only consumes a single processor at any one time. In order to bring more resources to bear on the payroll and therefore process it is less time, multiple payroll calculation processes are run concurrently, each one working on a distinct set of data. In GP, the sets of data are ranges of employee IDs. Each set is called a 'stream'. The payroll processes are then configured to process a specific stream. Most of the SQLs therefore have EMPLID BETWEEN criteria.
Typically, large companies run payroll calculation processes several times per pay period. Partly to see what the payroll value is in advance, and partly to see the effect of changes before the final payroll that is actually used to pay employees. Each concurrent payroll calculation process inserts data into result tables, also concurrently. So it is common for data blocks in result tables to contain data from many different streams. When the payroll is recalculated, results from previous payrolls are deleted (by statements such as the one above), also concurrently. You now have different transactions deleting different rows from the same data block. There is never any row-level locking because each row is only in scope for one and only one process. However, each delete transaction comes from a different process that created a different database session, that started at a slightly different time and therefore has a different System Change/Commit Number (SCN). Therefore, each payroll process needs its own read-consistent version of every data block that it reads, recovered back to its own SCN. So if I have 10 streams, I am likely to need 10 copies of every data block of every payroll-related table in the buffer cache. 
 The result is that the payroll runtime degrades very significantly with the number of concurrent processes to the extent that it quickly becomes worse than running a single process because the database
  1. spends a huge amount of time on read-consistency,
  2. is more likely to run out of buffer cache, so blocks are aged out, reloaded, and may have to be recovered back to the desired SCN again. 
However, if one can align the partitioning with the processing, then this behaviour can be eliminated. If the payroll result tables (and some of the other application tables) are each range partitioned on EMPLID such that there is a 1:1 relationship of payroll stream to partition, then this problem does not occur because each stream references a single partition of each table, and each data block will only contain rows for one stream and so can only ever have a single transaction.  Thus there is no requirement to produce a consistent version of a block. The database only needs a single copy of each data block in memory. The result is almost 100% scalability of payroll processing until eventually, the file system cannot cope with the redo generation. 
This approach relies absolutely on the application processing ranges of employees specified with the between criteria, and that criteria mapping to one partition. When implemented the result is single range partition queries.
Both partitioning and application configuration had to change and meet somewhere in the middle. The number of streams is limited by the hardware, usually the number of CPUs. The streams are calculated to be of a size such that all of them take about the same time to process (this is not the same as them being the same number of employees). It is necessary to allow for new employees being given new sequential employee IDs. Therefore there is also a need to periodically rebalance the streams as employees are hired and leave. This becomes an annual process that is combined with archiving. 
Some customers have avoided the annual rebalancing by reversing the sequentially generated employee ID before it is used, but you have to do this when the system is first implemented and only if new employee IDs can be allocated. 
However, this technique depends upon the application. When I looked at PeopleSoft's North American Payroll (which is a completely different product) this approach did not work. It does use multiple concurrent processes, but the employees are group logically by other business attributes. We still see the read-consistency problems, but we can't resolve them with range partitioning. So you see that understanding both partitioning and the application is essential. 


The results of each pay period accumulate overtime. In GP, each pay period has a calendar ID. It is a character string, defined in the application. So the larger payroll result tables can be sub-partitioned on CAL_RUN_ID. 
When I first worked on Global Payroll it was often run on Oracle 8i, where we only had hash sub-partitioning. I can use dbms_utility.get_hash_value() to predict which hash partition a string value falls into (see also from 1999). I could therefore adjust the calendar ID values to manipulate which partition they fall into. 
Today, I list sub-partition the tables on CAL_RUN_ID. Most companies create and follow a naming convention for their calendar IDs, so the list sub-partitions can be created in advance, and it is simply a matter listing the calendar(s) that go into each partition. In some cases, for large companies, I have created a list sub-partition for each pay period.

No comments :