- Introduction and Ancient History
- Cluster & Cluster Key Design Considerations
- Populating the Cluster with DBMS_PARALLEL_EXECUTE
- Checking the Cluster Key
- Using the Cluster Key Index instead of the Primary/Unique Key Index
- Testing the Cluster & Conclusion (TL;DR)
This query calculates the frequency of each number of distinct blocks per cluster key. It uses DBMS_ROWID to get the block number from the ROWID. The query counts the number of distinct blocks per cluster key, and the number of times that number of blocks per key occurs.
with x as ( --cluster key and rowid of each row
select emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
, DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) block_no from ps_gp_rslt_abs
), y as ( --count number of rows per cluster key and block number
select /*+MATERIALIZE*/ emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
, block_no, count(*) num_rows
from x
group by emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM, block_no
), z as ( --count number of blocks and rows per cluster key
select /*+MATERIALIZE*/ emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
, count(distinct block_no) num_blocks, sum(num_rows) num_rows
from y
group by emplid, cal_run_id, empl_rcd, gp_paygroup, cal_id, ORIG_CAL_RUN_ID, RSLT_SEG_NUM
)
select num_blocks, count(distinct emplid) emplids
, sum(num_rows) sum_rows
, median(num_rows) median_rows
, median(num_rows)/num_blocks median_rows_per_block
from z
group by num_blocks
order by num_blocks
/
The answer you get depends on the data, so your mileage will vary.
Initially, I built the cluster with 3 columns in the key. In my case, 81% of rows were organised such that they have no more than 2 data blocks per cluster key.
NUM_BLOCKS EMPLIDS SUM_ROWS MEDIAN_ROWS MEDIAN_ROWS_PER_BLOCK
---------- ---------- ---------- ----------- ---------------------
1 69638 46809975 12 12
2 47629 78370682 34 17
3 12120 14330976 68 22.6666667
4 4598 4395844 94 23.5
5 2376 6941389 124 24.8
6 652 2510790 155 25.8333333
7 27 34527 185 26.4285714
8 14 12330 217 27.125
9 9 40633 248 27.5555556
10 1 14607 279 27.9
11 1 310 310 28.1818182
12 2 2212 310 25.8333333
13 1 1476 372 28.6153846
14 1 372 372 26.5714286
I rebuilt the cluster with 7 key columns. Now no cluster key has more than 7 blocks, most of the keys are in a single block, and 85% are in no more than 2. Increasing the length of the cluster key also resulted in the table being smaller because each cluster key is only stored once.NUM_BLOCKS EMPLIDS SUM_ROWS MEDIAN_ROWS MEDIAN_ROWS_PER_BLOCK
---------- ---------- ---------- ----------- ---------------------
1 74545 71067239 14 14
2 52943 57481538 40 20
3 13553 11185685 73 24.3333333
4 4567 8949787 120 30
5 1327 3251707 150 30
6 144 81977 160 26.6666667
7 3 1197 204.5 29.2142857
There is now only a small number of employees whose data is spread across many cluster blocks. They might be slower to access, but I think I have a reasonable balance.
No comments :
Post a Comment