Monday, November 26, 2018

Data Warehouse Design: To Index, or Not to Index, that is the question

This post is part of a series that discusses some common issues in data warehouses.

When you query a star schema, you essentially have two choices;
  • bitmap index and star transformation 
  • full scan, bloom filter, and hash join

Star Transformation 

Star transformation was introduced in Oracle 8(see also Oracle Optimizer Blog: Optimizer Transformations: Star Transformation).   A star transformation requires:
  • A bitmap index on each foreign key column(s) on the FACT table
    • Though you don't actually need the foreign key to be defined
    • An index on primary key on the DIMENSION table
  • Set STAR_TRANSFORMATION_ENABLED=TRUE to enable the transformation.
    • Or use the /*+STAR_TRANSFORMATION*/ hint.
Each dimension is queried, and the results iterated through the bitmap index on the corresponding foreign key. The results for each bitmap are merged with a bitmap AND operation, and the result is converted back to rowids on the fact table.
If you were going to try to do this with conventional b-tree indexes, you would need an index for each permutation of foreign key combinations.
For example, this query reports US sales in 1999 by customer state:
SELECT  c.country_name
, u.cust_state_province
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold
from  sales s
,   customers u
,   products p
,  times t
, countries c
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND u.country_id = c.country_id
AND c.country_iso_code = 'US'
AND t.fiscal_year = 1999
GROUP BY c.country_name
, u.cust_state_province
ORDER BY 1,2
/

The execution plan below shows:
  • Line 2: A temporary table that is the combination of countries and customers is created.
  • Line 18: Bitmap look-up of  TIMES dimensions on the SALES_TIME_BIX bitmap index.
  • Line 23: Bitmap look-up of the temporary table against the SALES_CUST_BIX bitmap index
  • Line 13: A bitmap AND is performed of the two bitmap results
  • Line 8: The temporary table is used again to join in the customer and state description.
Plan hash value: 3988189767

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                           |      1 |        |       |  2133 (100)|          |       |       |     45 |00:00:02.74 |   98078 |    195 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION               |                           |      1 |        |       |            |          |       |       |     45 |00:00:02.74 |   98078 |    195 |       |       |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D77CE_A4BC21 |      1 |        |       |            |          |       |       |      0 |00:00:00.08 |    1525 |      0 |  1024 |  1024 |          |
|   3 |    HASH JOIN                             |                           |      1 |   2921 |   111K|   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.05 |    1524 |      0 |  1185K|  1185K|  599K (0)|
|   4 |     TABLE ACCESS FULL                    | COUNTRIES                 |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |      0 |       |       |          |
|   5 |     TABLE ACCESS FULL                    | CUSTOMERS                 |      1 |  55500 |  1138K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |      0 |       |       |          |
|   6 |   SORT GROUP BY                          |                           |      1 |   2359 |   101K|  1715   (1)| 00:00:01 |       |       |     45 |00:00:02.66 |   96552 |    195 |  6144 |  6144 | 6144  (0)|
|   7 |    HASH JOIN                             |                           |      1 |  12057 |   518K|  1713   (1)| 00:00:01 |       |       |    141K|00:00:02.49 |   96552 |    195 |  2391K|  1595K| 2015K (0)|
|   8 |     TABLE ACCESS FULL                    | SYS_TEMP_0FD9D77CE_A4BC21 |      1 |   2921 | 75946 |     6   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |     VIEW                                 | VW_ST_B6F3B15C            |      1 |  12057 |   211K|  1707   (1)| 00:00:01 |       |       |    141K|00:00:02.18 |   96552 |    195 |       |       |          |
|  10 |      NESTED LOOPS                        |                           |      1 |  12057 |   553K|  1684   (1)| 00:00:01 |       |       |    141K|00:00:02.08 |   96552 |    195 |       |       |          |
|  11 |       PARTITION RANGE SUBQUERY           |                           |      1 |  12056 |   294K|   322   (1)| 00:00:01 |KEY(SQ)|KEY(SQ)|    141K|00:00:01.05 |   96102 |      0 |       |       |          |
|  12 |        BITMAP CONVERSION TO ROWIDS       |                           |      5 |  12056 |   294K|   322   (1)| 00:00:01 |       |       |    141K|00:00:00.94 |   96047 |      0 |       |       |          |
|  13 |         BITMAP AND                       |                           |      5 |        |       |            |          |       |       |      5 |00:00:00.87 |   96047 |      0 |       |       |          |
|  14 |          BITMAP MERGE                    |                           |      5 |        |       |            |          |       |       |      5 |00:00:00.02 |    1914 |      0 |  1024K|   512K|40960  (0)|
|  15 |           BITMAP KEY ITERATION           |                           |      5 |        |       |            |          |       |       |    364 |00:00:00.02 |    1914 |      0 |       |       |          |
|  16 |            BUFFER SORT                   |                           |      5 |        |       |            |          |       |       |   1820 |00:00:00.01 |      55 |      0 | 73728 | 73728 |          |
|  17 |             TABLE ACCESS FULL            | TIMES                     |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|  18 |            BITMAP INDEX RANGE SCAN       | SALES_TIME_BIX            |   1820 |        |       |            |          |KEY(SQ)|KEY(SQ)|    364 |00:00:00.01 |    1859 |      0 |       |       |          |
|  19 |          BITMAP MERGE                    |                           |      5 |        |       |            |          |       |       |      5 |00:00:00.84 |   94133 |      0 |  8624K|  1078K|  319K (0)|
|  20 |           BITMAP KEY ITERATION           |                           |      5 |        |       |            |          |       |       |   6566 |00:00:00.81 |   94133 |      0 |       |       |          |
|  21 |            BUFFER SORT                   |                           |      5 |        |       |            |          |       |       |  92600 |00:00:00.14 |       0 |      0 |    28M|  1930K|  928K (0)|
|  22 |             TABLE ACCESS FULL            | SYS_TEMP_0FD9D77CE_A4BC21 |      1 |   2921 | 14605 |     6   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |      0 |       |       |          |
|  23 |            BITMAP INDEX RANGE SCAN       | SALES_CUST_BIX            |  92600 |        |       |            |          |KEY(SQ)|KEY(SQ)|   6566 |00:00:00.54 |   94133 |      0 |       |       |          |
|  24 |       TABLE ACCESS BY USER ROWID         | SALES                     |    141K|      1 |    22 |  1385   (1)| 00:00:01 | ROWID | ROWID |    141K|00:00:00.76 |     450 |    195 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - cbqt star transformation used for this statement
   - rely constraint used for this statement
The temporary table optimisation seen above for the join back to the dimension table was introduced in 10g. It can be suppressed by setting:
  • STAR_TRANSFORMATION_ENABLED=TEMP_DISABLE 
Without this optimisation, Oracle needs to visit both customers and countries table a second time, whereas before we only visited the temporary table twice.
Plan hash value: 2782741738

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name           | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                |      1 |        |       |  2950 (100)|          |       |       |     45 |00:00:02.91 |   99599 |       |       |          |
|   1 |  SORT GROUP BY                     |                |      1 |     45 |  2565 |  2950   (1)| 00:00:01 |       |       |     45 |00:00:02.91 |   99599 |  6144 |  6144 | 6144  (0)|
|*  2 |   HASH JOIN                        |                |      1 |    635 | 36195 |  2949   (1)| 00:00:01 |       |       |    141K|00:00:02.77 |   99599 |    12M|  2994K|   16M (0)|
|   3 |    MERGE JOIN CARTESIAN            |                |      1 |  12057 |   423K|  2533   (1)| 00:00:01 |       |       |    141K|00:00:02.39 |   98078 |       |       |          |
|*  4 |     TABLE ACCESS FULL              | COUNTRIES      |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|   5 |     BUFFER SORT                    |                |      1 |  12057 |   211K|  2531   (1)| 00:00:01 |       |       |    141K|00:00:02.29 |   98076 |  6219K|  1010K| 5527K (0)|
|   6 |      VIEW                          | VW_ST_958ED0D5 |      1 |  12057 |   211K|  2531   (1)| 00:00:01 |       |       |    141K|00:00:02.03 |   98076 |       |       |          |
|   7 |       NESTED LOOPS                 |                |      1 |  12057 |   553K|  2097   (1)| 00:00:01 |       |       |    141K|00:00:01.92 |   98076 |       |       |          |
|   8 |        PARTITION RANGE SUBQUERY    |                |      1 |  12056 |   294K|   734   (1)| 00:00:01 |KEY(SQ)|KEY(SQ)|    141K|00:00:01.16 |   97626 |       |       |          |
|   9 |         BITMAP CONVERSION TO ROWIDS|                |      5 |  12056 |   294K|   734   (1)| 00:00:01 |       |       |    141K|00:00:01.05 |   97571 |       |       |          |
|  10 |          BITMAP AND                |                |      5 |        |       |            |          |       |       |      5 |00:00:00.98 |   97571 |       |       |          |
|  11 |           BITMAP MERGE             |                |      5 |        |       |            |          |       |       |      5 |00:00:00.02 |    1914 |  1024K|   512K|40960  (0)|
|  12 |            BITMAP KEY ITERATION    |                |      5 |        |       |            |          |       |       |    364 |00:00:00.02 |    1914 |       |       |          |
|  13 |             BUFFER SORT            |                |      5 |        |       |            |          |       |       |   1820 |00:00:00.01 |      55 | 73728 | 73728 |          |
|* 14 |              TABLE ACCESS FULL     | TIMES          |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |       |       |          |
|* 15 |             BITMAP INDEX RANGE SCAN| SALES_TIME_BIX |   1820 |        |       |            |          |KEY(SQ)|KEY(SQ)|    364 |00:00:00.01 |    1859 |       |       |          |
|  16 |           BITMAP MERGE             |                |      5 |        |       |            |          |       |       |      5 |00:00:00.96 |   95657 |  8624K|  1078K|  319K (0)|
|  17 |            BITMAP KEY ITERATION    |                |      5 |        |       |            |          |       |       |   6566 |00:00:00.92 |   95657 |       |       |          |
|  18 |             BUFFER SORT            |                |      5 |        |       |            |          |       |       |  92600 |00:00:00.21 |    1524 |    36M|  2148K| 1180K (0)|
|* 19 |              HASH JOIN             |                |      1 |   2921 | 52578 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.04 |    1524 |  1922K|  1922K|  565K (0)|
|* 20 |               TABLE ACCESS FULL    | COUNTRIES      |      1 |      1 |     8 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|  21 |               TABLE ACCESS FULL    | CUSTOMERS      |      1 |  55500 |   541K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |       |       |          |
|* 22 |             BITMAP INDEX RANGE SCAN| SALES_CUST_BIX |  92600 |        |       |            |          |KEY(SQ)|KEY(SQ)|   6566 |00:00:00.58 |   94133 |       |       |          |
|  23 |        TABLE ACCESS BY USER ROWID  | SALES          |    141K|      1 |    22 |  1797   (1)| 00:00:01 | ROWID | ROWID |    141K|00:00:00.48 |     450 |       |       |          |
|  24 |    TABLE ACCESS FULL               | CUSTOMERS      |      1 |  55500 |  1138K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.03 |    1521 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Without Star Transformation: Full Scan-Bloom Filter 

A Bloom filter is a form of bitmap. Each dimension value is hashed and maps to a bit in the map. The fact table is pushed through the same hash, and if the bit in the map was set by hashing the dimensions, then it is a possible match.  It is only a possible match because multiple values might hash to the same value. So, you can get false positives that Oracle then filters out by exactly joining them with a hash-join.  However, this hash is much smaller than if the Bloom filter had not been used.
Bloom Filters were originally introduced into Oracle in 10gR2, but there was no documentation. See also:
Bloom Filters are frequently more effective for larger results set even on conventional systems
Plan hash value: 2588754131

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |      1 |        |       |  1586 (100)|          |       |       |     45 |00:00:00.46 |    2063 |    472 |       |       |          |
|   1 |  SORT GROUP BY                 |           |      1 |     45 |  3105 |  1586   (3)| 00:00:01 |       |       |     45 |00:00:00.46 |    2063 |    472 |  6144 |  6144 | 6144  (0)|
|*  2 |   HASH JOIN                    |           |      1 |  12057 |   812K|  1585   (3)| 00:00:01 |       |       |    141K|00:00:00.31 |    2063 |    472 |  1995K|  1995K| 1655K (0)|
|   3 |    PART JOIN FILTER CREATE     | :BF0000   |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  4 |     TABLE ACCESS FULL          | TIMES     |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  5 |    HASH JOIN                   |           |      1 |  48360 |  2691K|  1568   (3)| 00:00:01 |       |       |    169K|00:00:00.27 |    2007 |    472 |  2391K|  1595K| 1990K (0)|
|*  6 |     HASH JOIN                  |           |      1 |   2921 |   111K|   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.03 |    1524 |      0 |  1236K|  1236K|  728K (0)|
|*  7 |      TABLE ACCESS FULL         | COUNTRIES |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |      0 |       |       |          |
|   8 |      TABLE ACCESS FULL         | CUSTOMERS |      1 |  55500 |  1138K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |      0 |       |       |          |
|   9 |     PARTITION RANGE JOIN-FILTER|           |      1 |    918K|    15M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.15 |     482 |    472 |       |       |          |
|  10 |      TABLE ACCESS FULL         | SALES     |      5 |    918K|    15M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.15 |     482 |    472 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Which is better? 

It depends. Often Bloom filter performance is better, but on queries that return very few values, you might find the star transformation outperforms the Bloom filter.  The Oracle documentation says that star transformation is a cost-based optimisation.  It is only used if the plan is cheaper.  Although I have found some cases where it star transforms to a more expensive plan.
In the next test, I will query the software sales for 1999 by region in a named country, from the largest (USA) to the smallest (New Zealand). I will test with and without STAR_TRANSFORMATION_ENABLED and record the actual runtime and number of buffers for the execution plan.
select  c.country_name
, u.cust_state_province
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold
from  sales s
,   customers u
,   products p
,  times t
, countries c
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND u.country_id = c.country_id
AND c.country_iso_code = '&&iso_country_code'
AND p.prod_category_id = 205
and  t.fiscal_year = 1999
GROUP BY c.country_name
, u.cust_state_province
ORDER BY 1,2
/
The data by country is highly skewed. That will let me compare star transformation -v- Bloom filter for different data volumes. As the number of sales decreases the elapsed time of the star transformation falls below that of the Bloom filter. The faster query is shown in bold.
ISO Country Code
Country Name
Number of Sales in 1999
Number of Sales
Star Transformation
Full Scan-Bloom Filter
A-Time
Buffers
A-Time
Buffers
US
United States of America
2662
526212
1.76
97999
0.55
2068
DE
Germany
561
81978
0.51
45572
0.29
2068
JP
Japan
281
60183
0.16
7118
0.23
2068
GB
United Kingdom
391
58638
0.38
42339
0.25
2068
IT
Italy
257
42570
0.53
43526
0.38
2068
AU
Australia
228
33685
0.14
8113
0.26
2068
FR
France
161
33078
0.3
23401
0.27
2068
SG
Singapore
80
25253
0.16
6928
0.4
2068
CA
Canada
90
22858
0.27
14118
0.32
2068
ES
Spain
85
17136
0.16
14282
0.22
2068
DK
Denmark
89
16651
0.08
5844
0.24
2068
AR
Argentina
3
202
0.07
5724
0.21
2068
BR
Brazil
9
180
0.09
7912
0.26
2068
TR
Turkey
1
168
0.05
4127
0.19
2068
CN
China
4
19
0.1
7265
0.18
2068
PL
Poland
2
18
0.11
7267
0.23
2068
SA
Saudi Arabia
0
7
0.07
4037
0.26
2068
The exact balance point between the two will depend on many factors, the point here is the trend.

The Trouble with Bitmap Indexes 

"There is a locking implication by design in bitmap indexes, the keys point to hundreds of rows and if I lock a single key entry in a bitmap index, I've locked the key entry for hundreds of other rows.
That is WHY bitmaps are not recommended - if you update a bitmap indexed column OR you insert/delete from the table (which will modify the bitmap index), you'll find serialization at best and deadlocks all over the place at worst" (Tom Kyte, 2007).
You may also see the size of the bitmap index grow significantly and quickly.
Jonathan Lewis has written extensively on the subject of Bitmap Indexes on his blog:
So bitmap indexes were really only ever an option where you bulk load the data, and then the data is static while you query it. You might even drop the index prior to loading data and then rebuild it again afterward. They are fast to build because there is no sort involved.

In choosing whether to still use bitmaps indexes instead of Bloom filters, you have to decide whether the improved performance for a probably small number of queries that return a small volume of data is worth the challenges associated with bitmap indexes.  In general, I think it probably isn't.

No comments :