Tuesday, January 28, 2020

Data Warehouse Design: Snowflake Dimensions and Lost Skew Trap

This post is part of a series that discusses some common issues in data warehouses. Originally written in 2018, but I never got round to publishing it.
While I was experimenting with the previous query I noticed that the cost of the execution plans didn't change as I changed the COUNTRY_ISO_CODE, yet the data volumes for different countries are very different.
select c.country_name
,      u.cust_state_province
,      COUNT(*) num_sales
,      SUM(s.amount_sold) total_amount_sold
from   sales s
,      customers u
,      products p
,      times t
,      countries c
WHERE  s.time_id = t.time_id
AND    s.prod_id = p.prod_id
AND    u.cust_id = s.cust_id
AND    u.country_id = c.country_id
AND    c.country_iso_code = '&&iso_country_code'
AND    p.prod_category_id = 205
and    t.fiscal_year = 1999
GROUP BY c.country_name, u.cust_state_province
ORDER BY 1,2
/
Plan hash value: 3095970037
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   Id  | Operation                                 | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT                          |                           |      1 |        |       |  1473 (100)|          |       |       |     45 |00:00:01.77 |     101K|       |       |          |
|     1 |  TEMP TABLE TRANSFORMATION                |                           |      1 |        |       |            |          |       |       |     45 |00:00:01.77 |     101K|       |       |          |
|     2 |   LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D7C68_A4BC21 |      1 |        |       |            |          |       |       |      0 |00:00:00.13 |    1889 |  1024 |  1024 |          |
|  *  3 |    HASH JOIN                              |                           |      1 |   2413 | 94107 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.10 |    1888 |  1185K|  1185K|  639K (0)|
|  *  4 |     TABLE ACCESS FULL                     | COUNTRIES                 |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|     5 |     TABLE ACCESS FULL                     | CUSTOMERS                 |      1 |  55500 |  1138K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |       |       |          |
|     6 |   SORT GROUP BY                           |                           |      1 |   2359 |   101K|  1055   (1)| 00:00:01 |       |       |     45 |00:00:01.65 |   99111 |  6144 |  6144 | 6144  (0)|
|  *  7 |    HASH JOIN                              |                           |      1 |   3597 |   154K|  1054   (1)| 00:00:01 |       |       |  64818 |00:00:01.58 |   99111 |  2391K|  1595K| 2025K (0)|
|     8 |     TABLE ACCESS FULL                     | SYS_TEMP_0FD9D7C68_A4BC21 |      1 |   2413 | 62738 |     5   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
|     9 |     VIEW                                  | VW_ST_C525CEF3            |      1 |   3597 | 64746 |  1048   (1)| 00:00:01 |       |       |  64818 |00:00:01.44 |   99111 |       |       |          |
…
Note:
  • There are 55500 rows on CUSTOMERS
  • There are 23 rows on COUNTRIES
  • Oracle expects 2413 rows on joining those tables
    • 55500÷23= 2413.04, so Oracle assumes the data is evenly distributed between countries, although there are histograms on COUNTRY_ISO_CODE and COUNTRY_ID. 
    • This is sometimes called 'lost skew'. The skew of a dimension does not pass into the cardinality calculation on the fact table.
If I replace the predicate on COUNTRY_ISO_CODE with a predicate on COUNTRY_ID then the estimate of the number of rows from customers is correctly 18520 rows. The cost of the star transformation has gone up from 1473 to 6922.
Plan hash value: 1339390240

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                           |      1 |        |       |  6922 (100)|          |       |       |     45 |00:00:01.50 |   97998 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION                |                           |      1 |        |       |            |          |       |       |     45 |00:00:01.50 |   97998 |       |       |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D7C6A_A4BC21 |      1 |        |       |            |          |       |       |      0 |00:00:00.06 |    1524 |  1024 |  1024 |          |
|   3 |    NESTED LOOPS                           |                           |      1 |  18520 |   651K|   417   (1)| 00:00:01 |       |       |  18520 |00:00:00.04 |    1523 |       |       |          |
|   4 |     TABLE ACCESS BY INDEX ROWID           | COUNTRIES                 |      1 |      1 |    15 |     1   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|   5 |      INDEX UNIQUE SCAN                    | COUNTRIES_PK              |      1 |      1 |       |     0   (0)|          |       |       |      1 |00:00:00.01 |       1 |       |       |          |
|   6 |     TABLE ACCESS FULL                     | CUSTOMERS                 |      1 |  18520 |   379K|   416   (1)| 00:00:01 |       |       |  18520 |00:00:00.03 |    1521 |       |       |          |
|   7 |   SORT GROUP BY                           |                           |      1 |   2359 |   101K|  6505   (1)| 00:00:01 |       |       |     45 |00:00:01.43 |   96473 |  6144 |  6144 | 6144  (0)|
|   8 |    HASH JOIN                              |                           |      1 |  82724 |  3554K|  6499   (1)| 00:00:01 |       |       |  64818 |00:00:01.37 |   96473 |  2391K|  1595K| 2002K (0)|
|   9 |     TABLE ACCESS FULL                     | SYS_TEMP_0FD9D7C6A_A4BC21 |      1 |  18520 |   470K|    25   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
…
In fact, I only get the star transformation if I force the issue with a STAR_TRANSFORMATION hint. Otherwise, I get the full scan plan which is much cheaper, but again the cardinality calculation on CUSTOMERS is correct.
Plan hash value: 3784979335
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name         | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |              |      1 |        |       |  1595 (100)|          |       |       |     45 |00:00:00.54 |    2065 |    472 |       |       |          |
|   1 |  SORT GROUP BY                    |              |      1 |     45 |  3510 |  1595   (3)| 00:00:01 |       |       |     45 |00:00:00.54 |    2065 |    472 |  6144 |  6144 | 6144  (0)|
|   2 |   HASH JOIN                       |              |      1 |  81133 |  6180K|  1589   (3)| 00:00:01 |       |       |  64818 |00:00:00.43 |    2065 |    472 |  2337K|  2200K| 2221K (0)|
|   3 |    TABLE ACCESS FULL              | CUSTOMERS    |      1 |  18520 |   379K|   416   (1)| 00:00:01 |       |       |  18520 |00:00:00.02 |    1521 |      0 |       |       |          |
|   4 |    HASH JOIN                      |              |      1 |  81133 |  4516K|  1172   (3)| 00:00:01 |       |       |    110K|00:00:00.35 |     544 |    472 |  2546K|  2546K| 1610K (0)|
|   5 |     TABLE ACCESS FULL             | PRODUCTS     |      1 |     26 |   208 |     3   (0)| 00:00:01 |       |       |     26 |00:00:00.01 |       4 |      0 |       |       |          |
|   6 |     HASH JOIN                     |              |      1 |    229K|    10M|  1167   (3)| 00:00:01 |       |       |    246K|00:00:00.30 |     539 |    472 |  1133K|  1133K| 1698K (0)|
|   7 |      PART JOIN FILTER CREATE      | :BF0000      |      1 |    364 |  9828 |    17   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      57 |      0 |       |       |          |
|   8 |       NESTED LOOPS                |              |      1 |    364 |  9828 |    17   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      57 |      0 |       |       |          |
|   9 |        TABLE ACCESS BY INDEX ROWID| COUNTRIES    |      1 |      1 |    15 |     1   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |      0 |       |       |          |
|  10 |         INDEX UNIQUE SCAN         | COUNTRIES_PK |      1 |      1 |       |     0   (0)|          |       |       |      1 |00:00:00.01 |       1 |      0 |       |       |          |
|  11 |        TABLE ACCESS FULL          | TIMES        |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|  12 |      PARTITION RANGE JOIN-FILTER  |              |      1 |    918K|    19M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.21 |     482 |    472 |       |       |          |
|  13 |       TABLE ACCESS FULL           | SALES        |      5 |    918K|    19M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.20 |     482 |    472 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

No comments :