Friday, November 13, 2020

Retrofitting Partitioning into Existing Applications: Example 1. General Ledger

This post is part of a series about the partitioning of database objects.

If you were designing an application to use partitioning, you would write the code to reference the column by which the data was partitioned so that the database does partition elimination.  However, with a pre-existing or 3rd party application you have to look at how the application queries the data and match the partitioning to that.
I am going to look at a number of cases from real-life, and discuss the thought process behind partitioning decisions.  These examples happen to come from PeopleSoft ERP systems, but that does not make them unusual.  PeopleSoft is just another packaged application.  In every case, it is necessary to have some application knowledge when deciding whether and how to introduce partitioning.

General Ledger

GL is an example of where OLTP and DW activities clash on the same table.  GL is a data warehouse of transactional information about a business.  The rationale for partitioning ledger data is a very typical example of partitioning for SQL query performance.
Dimensions Attributes
BUSINESS_UNIT
LEDGER
ACCOUNT
DEPTID
OPERATING_UNIT
PRODUCT
AFFILIATE
CHARTFIELD1/2/3
PROJECT_ID
BOOK_CODE
FISCAL_YEAR/ACCOUNTING_PERIOD
CURRENCY_CD/BASE_CURRENCY
…and others
POSTED_TOTAL_AMT
POSTED_BASE_AMT
POSTED_TRANS_AMT

You can think of it as a star-schema.  The ledger table is the fact table.  Dimensions are generated from standing data in the application. The reports typically slice and dice that data by time, and various dimensions.  The exact dimensions vary from business to business, and from time to time. 

In PeopleSoft, you can optionally configure summary ledger tables that are pre-aggregations of ledger data by a limited set of dimensions.  These are generated by batch processes.  However, it is not a commonly used feature, as it introduces latency between a change being made, and not being able to report on it from the summary ledgers until the refresh process has run.
Business transactions post continuously to the ledger.  Meanwhile, the accountants also want to query ledger data.  Especially at month-end, they want to post adjustments and see the consequences immediately.
Here is a typical query from the PeopleSoft GL Reporting tool (nVision).  The queries vary widely, but some elements (in bold) are always present.
SELECT L.TREE_NODE_NUM,L2.TREE_NODE_NUM,SUM(A.POSTED_TOTAL_AMT)
FROM   PS_LEDGER A
,      PSTREESELECT05 L1
,      PSTREESELECT10 L
,      PSTREESELECT10 L2
WHERE  A.LEDGER='ACTUALS'
AND    A.FISCAL_YEAR=2020
AND    A.ACCOUNTING_PERIOD BETWEEN 1 AND 11
AND    L1.SELECTOR_NUM=30982 AND A.BUSINESS_UNIT=L1.RANGE_FROM_05
AND    L.SELECTOR_NUM=30985 AND A.CHARTFIELD1=L.RANGE_FROM_10
AND    L2.SELECTOR_NUM=30984 AND A.ACCOUNT=L2.RANGE_FROM_10
AND    A.CURRENCY_CD='GBP'
GROUP BY L.TREE_NODE_NUM,L2.TREE_NODE_NUM
  • Queries are always on a particular ledger or group of ledgers.
    • You can have different ledgers for different accounting standards or reporting requirements.
    • Sometimes you can have adjustment ledgers – that are usually much smaller than the actuals ledgers – and they are aggregated with the main ledger.
    • In the latest version of the application, the budget ledger can be stored in the same table rather than a separate table.  Budget data has a different shape to actuals data and is created up to a year earlier.  It is generally much smaller and has a different usage profile.
    • So, there is always an equality criterion or IN-list criterion on LEDGER
  • Queries are always for a particular fiscal year.  This year, last year, sometimes the year before.  Therefore, there is always an equality criterion on FISCAL_YEAR.
  • Queries may be for a particular period, in which case there is a single-period equality criterion.  Alternatively, they are for the year-to-date, in which case there is a BETWEEN 1 AND current period criterion.  Sometimes for a particular quarter.  It is common to see queries on the same year-to-date period in the previous fiscal year.
  • Queries always specify the reporting currency.  Therefore, there is always a criterion on CURRENCY_CD, although many multi-national customers only have single currency ledgers, so the criterion may not be selective.
  • There will be varying criteria on other dimension columns on LEDGER by joining to the PSTREESELECT dimension tables.

What should I partition by?

We have seen the shape of the SQL, we know which columns are candidate partitioning keys because we have seen which columns have criteria.  LEDGER is a candidate. I also profile the data volumes. The following is a typical example.
                                  Cum.
LEDGER          NUM_ROWS      %      %
---------- ------------- ------ ------
XXXXCORE     759,496,900   43.9   43.9
CORE         533,320,425   30.8   74.7
XXXXGAAP     152,563,325    8.8   83.5
GAAP_ADJ      74,371,775    4.3   87.8
ZZZZ_CORE     34,251,514    2.0   89.8
C_XXCORE      29,569,381    1.7   91.5
…
           -------------
sum        1,731,153,467
FISCAL_YEAR is an obvious choice.  
    Fiscal 
      Year      NUM_ROWS      %
---------- ------------- ------
      2016           121
      2017            32
      2018   510,168,673   29.5
      2019   574,615,980   33.2
      2020   646,336,579   37.3
      2021        32,082      
           -------------
sum        1,731,153,467
Most companies have monthly accounting periods (although some use other frequencies).  Then we have 12 accounting periods, plus bought forward (0), carry forward (998), and adjustments (999).
    Fiscal Accounting                     Cum.
      Year     Period   NUM_ROWS      %      %
---------- ---------- ---------- ------ ------
…
      2020          0   66237947    3.8   37.3
                    1   42865339    2.5   33.5
                    2   47042492    2.7   31.0
                    3   53680915    3.1   28.3
                    4   50113011    2.9   25.2
                    5   44700409    2.6   22.3
                    6   54983221    3.2   19.7
                    7   51982401    3.0   16.6
                    8   44851506    2.6   13.6
                    9   56528783    3.3   11.0
                   10   52266343    3.0    7.7
                   11   70541810    4.1    4.7
                   12   10542380     .6     .6
                  999         22     .0     .0
**********            ----------
sum                    646336579
…
CURRENCY_CD is usually not a candidate for most companies because they report in a single currency, so all the rows are the same currency.  But even then, each ledger is a particular currency.  It is usually more effective to partition by LEDGER.
It is very tempting to interval partition on FISCAL_YEAR and then range or list sub-partition on ACCOUNTING PERIOD into 14 partitions each year.  Then Oracle will automatically add the range partitions for each FISCAL_YEAR.
CREATE TABLE ps_ledger (...)
PARTITION BY RANGE (fiscal_year) INTERVAL (1) 
SUBPARTITION BY RANGE (accounting_period) 
SUBPARTITION TEMPLATE
 (SUBPARTITION p00 VALUES LESS THAN (1)
 ,SUBPARTITION p01 VALUES LESS THAN (2)
...
 ,SUBPARTITION p12 VALUES LESS THAN (13)
 ,SUBPARTITION pxx VALUES LESS THAN (MAXVALUE))
(PARTITION VALUES LESS THAN (2019));
However, I would usually counsel against this.  You can only partition in two dimensions, and LEDGER is a very attractive option for partitioning.  Instead, I would consider partitioning in one dimension on the combination of two columns.  I would range partition on the combination of FISCAL_YEAR and ACCOUNTING_PERIOD.  However, if you also wish to introduce materialized views, then bear in mind that Partition Change Tracking does not work with multi-column partitioning (see also Partition Change Tracking During Materialized View Refresh and Query Rewrite).
CREATE TABLE ps_ledger (...)
PARTITION BY RANGE (fiscal_year,accounting_period) 
(PARTITION ledger_2017     VALUES LESS THAN (2018,0)
,PARTITION ledger_2018_bf  VALUES LESS THAN (2018,1)
,PARTITION ledger_2018_p01 VALUES LESS THAN (2018,2)
…
,PARTITION ledger_2021_cf VALUES LESS THAN (2022,0)
);
  • The application never uses ACCOUNTING_PERIOD without also using FISCAL_YEAR.  Sometimes it uses FISCAL_YEAR without ACCOUNTING_PERIOD.
  • Partition elimination does work with multi-column partitions.
    • If you only specify a criterion on FISCAL_YEAR in a query you will still get partition elimination.
    • If you only specify a criterion on ACCOUNTING_PERIOD only you will not get partition elimination.
  • You cannot interval partition on multiple columns.  Therefore, you have to manage the annual addition of new partitions yourself.
  • Also, you cannot get partition change tracking for materialized view refresh to work with multi-column partitioning.
  • This leaves sub-partitioning to be used on a different column.

Should I create a MAXVALUE partition?

  • I deliberately haven't specified a MAXVALUE partition.  There are arguments for and against this.
    • The argument against MAXVALUE it is that you might forget to add the new partition for the new year, and then all the data for the next fiscal year goes into the same partition and over time the performance of the reports gradually decay.  By the time the performance issue is diagnosed several months may have piled up.  Then you need to split the partition into several partitions (or exchange it out, add the new partitions, and reinsert the data).  So not having a MAXVALUE partition forces the annual maintenance activity to be put in the diary, otherwise, the application will error when it tries to insert data for a FISCAL_YEAR for which there is currently no partition.
      • Now budget data is kept in the LEDGER table, you have do this before the budget ledger data is created, which is up to a year ahead of actuals data, so the risk of business interruption is minimal.
    • In favour of a MAXVALUE partition is that it prevents the error from occurring, but risks forgetting or deferring the maintenance for operational reasons.  
    • Of course, a MAXVALUE partition can be added at any time!

Should I Sub-partition?

It depends on the data.
  • The ledger table is a big table, and the LEDGER column is usually a selective low cardinality column.  So, it is a good candidate for sub-partitioning.  A single value list sub-partition for each of the largest actuals and budget ledgers, a default sub-partition for all other values.
  • This is not the case in summary ledger tables that are usually built on a single ledger.  So they are usually range partitioned on FISCAL_YEAR, ACCOUTING_PERIOD, and can then be sub-partitioned on a different dimension column
You can use a template if you want the same sub-partitions for every accounting period.
If you use interval partitioning, you have to use a subpartition template if you want to composite partition.
CREATE TABLE ps_ledger (…)
PARTITION BY RANGE (fiscal_year,accounting_period) INTERVAL (1) 
SUBPARTITION BY LIST (ledger) 
SUBPARTITION TEMPLATE
 (SUBPARTITION l_xxx VALUES LESS THAN ('XXX')
 ,SUBPARTITION l_yyy VALUES LESS THAN ('YYY')
…
,SUBPARTITION VALUES (DEFAULT))
(PARTITION VALUES LESS THAN (2019));
Sometimes, companies change their use of ledgers, in which case the sub-partitions need to reflect that.  You can still use the template to specify whatever is the currently required sub-partitioning.  If you ever recreate the table you end up explicitly specifying sub-partitions for every other partition.  The DDL becomes very verbose.  Although with deferred segment creation it wouldn't really matter if you had empty sub-partitions that had not been physically created for accounting periods where a ledger was not used.  
However, if I want to specify different tablespaces, no free space allowance, compression etc on certain partitions, then I need to use explicit partition and subpartition clauses, or come along afterwards and alter and rebuild them.  
I think explicit partition and subpartition names are administratively helpful when it comes to reporting on partition space usage, and when you archive/purge data by exchanging or dropping a partition.
CREATE TABLE ps_ledger (…)
PARTITION BY RANGE (fiscal_year,accounting_period) 
SUBPARTITION BY LIST (ledger) 
(PARTITION ledger_2018 VALUES LESS THAN (2019,0) PCTFREE 0 COMPRESS
 (SUBPARTITION ledger_2018_xxx      VALUES ('XXX')
 ,SUBPARTITION ledger_2018_yyy      VALUES ('YYY')
 ,SUBPARTITION ledger_2018_z_others VALUES (DEFAULT)
)
,PARTITION ledger_2019_bf VALUES LESS THAN (2019,1) PCTFREE 0 COMPRESS
 (SUBPARTITION ledger_2019_bf_xxx      VALUES ('XXX')
 ,SUBPARTITION ledger_2019_bf_yyy      VALUES ('YYY')
 ,SUBPARTITION ledger_2019_bf_z_others VALUES (DEFAULT)
)
…
;

Indexing

Indexes can be partitioned or not independently of the table.  
  • Local indexes are partitioned in the same way as the table they are built on.  Therefore, there is a 1:1 relationship of table partition/sub-partition to index partition/sub-partition.
  • Global indexes are not partitioned the same way as the table.  You can have
    • Global partitioned indexes
    • Global non-partitioned indexes
Local indexes are easier to build and maintain.  When you do a partition operation on a table partition (add, drop, merge, split or truncate) the same operation is applied to local indexes.  However, if you do an operation on a table partition, any global index will become unusable, unless the DDL is done with the UPDATE INDEXES clause.  Using this option, when you drop a partition, all the corresponding rows are deleted from the index.  The benefit is that the indexes do not become unusable (in which case they would have to be rebuilt), but dropping the table partition takes longer because the rows have to be deleted from the index (effectively a DML operation).
As a general rule, indexes that contain the partitioning key, and at least the first partitioning key column is near the front of the index (I usually reckon in the first three key columns), should be locally partitioned unless there is a reason not to. 
With the general ledger, I tend to create pairs of local indexes that match the reporting analysis criteria.  
  • One of each of the pair of indexes leads on LEDGER, FISCAL_YEAR, ACCOUNTING_PERIOD and then the other dimension columns.  This supports single period queries.
  • The other index leads on LEDGER, FISCAL_YEAR, then the other dimension columns and finally ACCOUNTING_PERIOD is last because we are interested in a range of periods.
To support single period queries To support year-to-date queries
CREATE INDEX psgledger ON ps_ledger
(ledger
,fiscal_year
,accounting_period
,business_unit
,account
,project_id
,book_code
)
LOCAL
CREATE INDEX pshledger ON ps_ledger
(ledger
,fiscal_year
,business_unit
,account
,project_id
,book_code
,accounting_period
)
LOCAL
The unique index on the ledger table does include the partitioning keys.  But FISCAL_YEAR and ACCOUNTING_PERIOD are the last 2 of 25 columns.  This index is really to support queries from the on-line application and batch processes that post to the ledger.  So a query on BUSINESS_UNIT would have to prove every partition.  Therefore, I generally don't partition this index. It would be reasonable to globally partition it on LEDGER only.
CREATE UNIQUE INDEX ps_ledger ON ps_ledger
(business_unit,ledger,account,altacct,deptid
,operating_unit,product,fund_code,class_fld,program_code
,budget_ref,affiliate,affiliate_intra1,affiliate_intra2,chartfield1
,chartfield2,chartfield3,project_id,book_code,gl_adjust_type
,date_code,currency_cd,statistics_code,fiscal_year,accounting_period
)…

Archiving

Taken together, FISCAL_YEAR and ACCOUNTING_PERIODY are effectively a proxy for the date of the accounting period.  So we will add partitions and can compress and later drop them after a period of time.
Once an accounting period has been closed it will not be written to again (or at least not much and not often), so it can then be compressed.  It can't be compressed before because the application is still applying ordinary DML (unless the Advanced Compression option has been licenced).  This applies to both conventional dictionary compression and Hybrid Columnar Compression on Exadata.
Most reports are on current and previous fiscal years.  Earlier years are candidates to be purged or archived by dropping or exchanging partitions.  When partitions are dropped, because you have global indexes, this should be with the UPDATE ALL INDEXES clause
ALTER TABLE ps_ledger DROP PARTITION ledger_2017 UPDATE INDEXES;

No comments :