Tuesday, December 19, 2023

Using Attribute Clustering to Improve Compression, Response Time and CPU Consumption: 2. An Example

Attribute Clustering reorders data in a table so that similar data values are clustered together.  This can improve both basic and columnar compression, resulting in better response time and lower CPU consumption.

This is the second of a two-part blog post.

  1. Introduction
  2. Example and Test Results

An Example of Attribute Clustering

This test illustrates the potential benefits of attribute clustering (the scripts are available on GitHub). It simulates the fact table in a data warehouse, or in my use case the General Ledger table in a Financials system. The table will have 20 million rows. Each dimension column will randomly have one of 256 distinct values, padded to 8 characters.  In this case, the distribution of data values is skewed by the square root function.  The alternative commented section produces uniform data. 
create table t0(a varchar2(8 char), b varchar2(8 char), c varchar2(8 char), x number);
truncate table t0;
BEGIN
  FOR i IN 1..2 LOOP
    insert  /*+APPEND PARALLEL*/ into t0
    select  /*+PARALLEL*/
/*--------------------------------------------------------------------------------------------------------------
            rPAD(LPAD(LTRIM(TO_CHAR(FLOOR(dbms_random.value(0,255)),'XX')),2,'0'),8,'X') a
    ,       rPAD(LPAD(LTRIM(TO_CHAR(FLOOR(dbms_random.value(0,255)),'XX')),2,'0'),8,'X') b
    ,       rPAD(LPAD(LTRIM(TO_CHAR(FLOOR(dbms_random.value(0,255)),'XX')),2,'0'),8,'X') c
--------------------------------------------------------------------------------------------------------------*/
            rPAD(LPAD(LTRIM(TO_CHAR(FLOOR(SQRT(dbms_random.value(0,65535))),'XX')),2,'0'),8,'X') a
    ,       rPAD(LPAD(LTRIM(TO_CHAR(FLOOR(SQRT(dbms_random.value(0,65535))),'XX')),2,'0'),8,'X') b
    ,       rPAD(LPAD(LTRIM(TO_CHAR(FLOOR(SQRT(dbms_random.value(0,65535))),'XX')),2,'0'),8,'X') c
--------------------------------------------------------------------------------------------------------------*/
    ,       dbms_random.value(1,1e6)
    from dual connect by level <= 1e7;
    COMMIT;
  end loop;
end;
/
exec dbms_stats.gather_table_stats(user,'T0');
I will create an identical materialized view on that table
create table mv(a varchar2(8 char), b varchar2(8 char), c varchar2(8 char), x number);
create materialized view mv on prebuilt table enable query rewrite as select * from t0;
For each test, I can set different attributes and then fully refresh the materialized view in non-atomic mode.  The various attributes take effect as the materialized view is truncated and repopulated in direct-path mode.
truncate table MV drop storage;
--------------------------------------------------
rem set compression
--------------------------------------------------
--alter materialized view MV nocompress;
--alter materialized view MV compress;
alter materialized view MV compress for query low;
--------------------------------------------------
rem set in memory
--------------------------------------------------
alter table mv inmemory;
--------------------------------------------------
rem set clustering and number of clustering columns
--------------------------------------------------
alter table mv drop clustering;
--alter table mv add clustering by interleaved order (b);
alter table mv add clustering by interleaved order (b, c);
--alter table mv add clustering by interleaved order (b, c, a);
--------------------------------------------------
exec dbms_mview.refresh('MV',atomic_refresh=>FALSE);
exec dbms_inmemory.repopulate(user,'MV');
Then I can see how large the physical and In-Memory segments are.
select * from user_mviews where mview_name = 'MV';
select table_name, tablespace_name, num_rows, blocks, compression, compress_for, inmemory, inmemory_compression 
from user_tables where table_name IN('MV','T0');
select segment_name, segment_type, tablespace_name, bytes/1024/1024 table_MB, blocks, extents, inmemory, inmemory_compression
from user_Segments where segment_name IN('MV','T0');

with x as (
select segment_type, owner, segment_name, inmemory_compression, inmemory_priority
, count(distinct inst_id) instances
, count(distinct segment_type||':'||owner||'.'||segment_name||'.'||partition_name) segments
, sum(inmemory_size)/1024/1024 inmemory_mb, sum(bytes)/1024/1024 tablespace_Mb
from   gv$im_segments i
where segment_name = 'MV'
group by segment_type, owner, segment_name, inmemory_compression, inmemory_priority)
select x.*, inmemory_mb/tablespace_mb*100-100 pct from x
order by owner, segment_type, segment_name
/
I will use a simple test query to see how the performance changes
select b,c, count(a), sum(x) from t0 where b='2AXXXXXX' group by b,c fetch first 10 rows only;
I tested 
  • Uniformly distributed data -v- skewed data
  • Without table compression -v- basic compression -v- Hybrid Columnar Compression (HCC)
  • No attribute clustering -v- interleaved clustering on 1, 2, and 3 columns
Table      Tablespace                                                                                         
Name       Name         NUM_ROWS     BLOCKS COMPRESS COMPRESS_FOR                   INMEMORY INMEMORY_COMPRESS
---------- ---------- ---------- ---------- -------- ------------------------------ -------- -----------------
MV         PSDEFAULT    20000000      60280 ENABLED  QUERY LOW                      ENABLED  FOR QUERY LOW    
T0         PSDEFAULT    20000000     150183 DISABLED                                DISABLED                  

                      Tablespace      Table                                                 
Segment Na Segment Ty Name               MB     BLOCKS    EXTENTS INMEMORY INMEMORY_COMPRESS
---------- ---------- ---------- ---------- ---------- ---------- -------- -----------------
MV         TABLE      PSDEFAULT       472.0      60416        130 ENABLED  FOR QUERY LOW    
T0         TABLE      PSDEFAULT     1,220.0     156160        203 DISABLED                  

                                                                                 In Memory Tablespace           
Segment Ty OWNER    Segment Na INMEMORY_COMPRESS INMEMORY  INSTANCES   SEGMENTS         MB         MB        PCT
---------- -------- ---------- ----------------- -------- ---------- ---------- ---------- ---------- ----------
TABLE      SYSADM   MV         FOR QUERY LOW     NONE              2          1      829.2      936.6 -11.4662752
With query rewrite on the materialized view and the materialized view in the In-Memory store, we see Oracle rewrite the query from the underlying table to the materialized view and then to an in-memory query.
select b,c, sum(x) from t0 where b='2AXXXXXX' group by b,c;

Plan hash value: 389206685
-----------------------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |   182 |  7280 |   876  (31)| 00:00:01 |
|   1 |  HASH GROUP BY                         |      |   182 |  7280 |   876  (31)| 00:00:01 |
|*  2 |   MAT_VIEW REWRITE ACCESS INMEMORY FULL| MV   | 78125 |  3051K|   875  (31)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Test Results

Conclusions

  • Without any table compression, attribute clustering does not affect the size of the table on the tablespace, but the size of the table in the In-Memory store is reduced, and query performance is improved.
  • With either basic or Hybrid Columnar compression, attribute clustering reduces the size of the table both in the tablespace and in the in-memory store.
  • All forms of compression and attribute clustering increase the duration of the materialized view refresh. Degradation of the refresh due to clustering was less severe with HCC than with either no compression or simple compression.
  • I found that query performance degraded when using interleaved clustering in combination with simple compression although this resulted in a smaller in-memory segment than HCC, but performance improved with HCC.
  • Uniform data compressed marginally better than skewed.  Otherwise, they produced very similar results.  
  • You do not have to take advantage of compression on the physical segment to take advantage of the compression in In-Memory, but you may get better performance if you do.
  • With this test data set, optimal performance was achieved when clustering on 2 dimension columns. When clustering on all three columns I obtained worse compression and query performance. This varies with the data. With real-world data, I have had examples with better compression and performance with the maximum of 4 clustering column groups. Generally, the best performance corresponds to the attribute clustering that gives the best columnar compression. This is not always the case for simple compression.

No comments :