Friday, February 16, 2024

Table Clusters: 4. Checking the Cluster Key

This post is part of a series that discusses table clustering in Oracle.
  1. Introduction and Ancient History
  2. Cluster & Cluster Key Design Considerations
  3. Populating the Cluster with DBMS_PARALLEL_EXECUTE
  4. Checking the Cluster Key
  5. Using the Cluster Key Index instead of the Primary/Unique Key Index
  6. Testing the Cluster & Conclusion (TL;DR)

This query calculates the frequency of each number of distinct blocks per cluster key.  It uses DBMS_ROWID to get the block number from the ROWID.  The query counts the number of distinct blocks per cluster key, and the number of times that number of blocks per key occurs.

with x as ( --cluster key and rowid of each row
  select emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
  ,      DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) block_no from ps_gp_rslt_abs
), y as ( --count number of rows per cluster key and block number
  select /*+MATERIALIZE*/ emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
  ,      block_no, count(*) num_rows 
  from   x   
  group by emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM, block_no
), z as ( --count number of blocks and rows per cluster key
  select /*+MATERIALIZE*/ emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
  ,      count(distinct block_no) num_blocks, sum(num_rows) num_rows 
  from   y
  group by emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
)
select num_blocks, count(distinct emplid) emplids
,      sum(num_rows) sum_rows
,      median(num_rows) median_rows
,      median(num_rows)/num_blocks median_rows_per_block
from   z
group by num_blocks
order by num_blocks
/

The answer you get depends on the data, so your mileage will vary.  

Initially, I built the cluster with 3 columns in the key.  In my case, 81% of rows were organised such that they have no more than 2 data blocks per cluster key.   

NUM_BLOCKS    EMPLIDS   SUM_ROWS MEDIAN_ROWS MEDIAN_ROWS_PER_BLOCK
---------- ---------- ---------- ----------- ---------------------
         1      69638   46809975          12                    12
         2      47629   78370682          34                    17
         3      12120   14330976          68            22.6666667
         4       4598    4395844          94                  23.5
         5       2376    6941389         124                  24.8
         6        652    2510790         155            25.8333333
         7         27      34527         185            26.4285714
         8         14      12330         217                27.125
         9          9      40633         248            27.5555556
        10          1      14607         279                  27.9
        11          1        310         310            28.1818182
        12          2       2212         310            25.8333333
        13          1       1476         372            28.6153846
        14          1        372         372            26.5714286
I rebuilt the cluster with 7 key columns.  Now no cluster key has more than 7 blocks, most of the keys are in a single block, and 85% are in no more than 2.  Increasing the length of the cluster key also resulted in the table being smaller because each cluster key is only stored once.
NUM_BLOCKS    EMPLIDS   SUM_ROWS MEDIAN_ROWS    MEDIAN_ROWS_PER_BLOCK
---------- ---------- ---------- -----------  ---------------------
         1      74545   71067239          14                    14
         2      52943   57481538          40                    20
         3      13553   11185685          73            24.3333333
         4       4567    8949787         120                    30
         5       1327    3251707         150                    30
         6        144      81977         160            26.6666667
         7          3       1197       204.5            29.2142857
There is now only a small number of employees whose data is spread across many cluster blocks.  They might be slower to access, but I think I have a reasonable balance.

No comments :