Monday, November 19, 2018

Data Warehouse Design Mistakes 3: Date Dimensions Without Date Primary Keys

This post is part of a series that discusses some common issues in data warehouses.

Good Practice 

It is not uncommon to see a time dimension with one row for every day. This approach saves putting functions on the date column of the fact table.  For example, in the Oracle Sales History sample schema:
  • The primary key on the time dimension is a date.
  • There are 37 different attribute columns. This saves coding complex SQL functions to group dates.
Name                                      Null?    Type
----------------------------------------- -------- -----------
TIME_ID                                   NOT NULL DATE
DAY_NAME                                  NOT NULL VARCHAR2(9)
DAY_NUMBER_IN_WEEK                        NOT NULL NUMBER(1)
DAY_NUMBER_IN_MONTH                       NOT NULL NUMBER(2)
CALENDAR_WEEK_NUMBER                      NOT NULL NUMBER(2)
FISCAL_WEEK_NUMBER                        NOT NULL NUMBER(2)
WEEK_ENDING_DAY                           NOT NULL DATE
WEEK_ENDING_DAY_ID                        NOT NULL NUMBER
CALENDAR_MONTH_NUMBER                     NOT NULL NUMBER(2)
FISCAL_MONTH_NUMBER                       NOT NULL NUMBER(2)
CALENDAR_MONTH_DESC                       NOT NULL VARCHAR2(8)
CALENDAR_MONTH_ID                         NOT NULL NUMBER
FISCAL_MONTH_DESC                         NOT NULL VARCHAR2(8)
FISCAL_MONTH_ID                           NOT NULL NUMBER
DAYS_IN_CAL_MONTH                         NOT NULL NUMBER
DAYS_IN_FIS_MONTH                         NOT NULL NUMBER
END_OF_CAL_MONTH                          NOT NULL DATE
END_OF_FIS_MONTH                          NOT NULL DATE
CALENDAR_MONTH_NAME                       NOT NULL VARCHAR2(9)
FISCAL_MONTH_NAME                         NOT NULL VARCHAR2(9)
CALENDAR_QUARTER_DESC                     NOT NULL CHAR(7)
CALENDAR_QUARTER_ID                       NOT NULL NUMBER
FISCAL_QUARTER_DESC                       NOT NULL CHAR(7)
FISCAL_QUARTER_ID                         NOT NULL NUMBER
DAYS_IN_CAL_QUARTER                       NOT NULL NUMBER
DAYS_IN_FIS_QUARTER                       NOT NULL NUMBER
END_OF_CAL_QUARTER                        NOT NULL DATE
END_OF_FIS_QUARTER                        NOT NULL DATE
CALENDAR_QUARTER_NUMBER                   NOT NULL NUMBER(1)
FISCAL_QUARTER_NUMBER                     NOT NULL NUMBER(1)
CALENDAR_YEAR                             NOT NULL NUMBER(4)
CALENDAR_YEAR_ID                          NOT NULL NUMBER
FISCAL_YEAR                               NOT NULL NUMBER(4)
FISCAL_YEAR_ID                            NOT NULL NUMBER
DAYS_IN_CAL_YEAR                          NOT NULL NUMBER
DAYS_IN_FIS_YEAR                          NOT NULL NUMBER
END_OF_CAL_YEAR                           NOT NULL DATE
END_OF_FIS_YEAR                           NOT NULL DATE
I have seen some time dimensions with 200 years of dates (eg 1900 – 2100), this is fine if your queries specify a date range on the dimension, but if you have an unbounded inequality such as simply after a date then you will get a lot of rows from the time dimension. This can also bring challenges, so keep it sensible.

Not Good Practice 

You should represent a date as a date and not as a string or a number.
  • 31st December 2018
  • Not 20181231 
  • Not '20181231' 
If you represent a date as a string you can get miscalculations in the optimizer.
  • For example, the difference between 31st December 2018 and 1st January 2019 should be 1 day.
  • However, if you use the string representation of a day it is 20190101-20181231=8870 
  • This can lead to bad cardinality estimates, and in more complex queries it can cause subsequent bad optimizer decisions later in the query. 
I will create a copy of the TIMES dimension, but with a numeric primary key, and I will add the original date in a new attribute column CALENDAR_DATE.
INSERT /*+APPEND*/ INTO bad_times
( TIME_ID, CALENDAR_DATE, DAY_NAME
...
) 
SELECT
  TO_NUMBER(TO_CHAR(TIME_ID,'YYYYMMDD')) time_id, TIME_ID calendar_date, DAY_NAME
...
FROM times
/
TIME_ID is now a number that contains the date string, and CALENDAR_DATE is the original date value
   TIME_ID CALENDAR_
---------- ---------
  19980101 01-JAN-98
  19980102 02-JAN-98
  19980103 03-JAN-98
  19980104 04-JAN-98
  19980105 05-JAN-98
  19980106 06-JAN-98
  19980107 07-JAN-98
  19980108 08-JAN-98
  19980109 09-JAN-98
  19980110 10-JAN-98
...
I will also create a copy of sales where I have used the same numeric format for TIME_ID. The foreign key between these is still on TIME_ID, but that column is now a number.
INSERT /*+APPEND*/ INTO bad_sales
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
SELECT  prod_id, cust_id, TO_NUMBER(TO_CHAR(time_id,'yyyymmdd')), channel_id, promo_id, quantity_sold, amount_sold
FROM  sales
/
I will demonstrate the problem with a simple monthly sales analysis report for FY 1999.
SELECT  t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold 
FROM  sales s
,   customers u
,   products p
,  times t
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND t.time_id >= TO_DATE('27121998','DDMMYYYY') 
AND  t.time_id <  TO_DATE('27121999','DDMMYYYY')
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/

Plan hash value: 3667272686

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name     | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |          |      1 |        |       |   165 (100)|       |          |       |     13 |00:00:00.28 |     500 |    303 |       |       |          |
|   1 |  SORT GROUP BY                         |          |      1 |    365 | 21170 |   165  (14)| 00:00:01 |       |       |     13 |00:00:00.28 |     500 |    303 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN                            |          |      1 |    365 | 21170 |   164  (13)| 00:00:01 |       |       |    365 |00:00:00.28 |     500 |    303 |  1298K|  1298K| 1527K (0)|
|   3 |    PART JOIN FILTER CREATE             | :BF0000  |      1 |    366 |  8784 |    14   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      14 |      0 |       |       |          |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| TIMES    |      1 |    366 |  8784 |    14   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      14 |      0 |       |       |          |
|*  5 |      INDEX RANGE SCAN                  | TIMES_PK |      1 |    366 |       |     3   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |       3 |      0 |       |       |          |
|   6 |    VIEW                                | VW_GBC_5 |      1 |    366 | 12444 |   150  (14)| 00:00:01 |       |       |    365 |00:00:00.28 |     486 |    303 |       |       |          |
|   7 |     HASH GROUP BY                      |          |      1 |    366 |  4758 |   150  (14)| 00:00:01 |       |       |    365 |00:00:00.28 |     486 |    303 |  1063K|  1063K| 2552K (0)|
|   8 |      PARTITION RANGE AND               |          |      1 |    230K|  2924K|   133   (4)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.21 |     486 |    303 |       |       |          |
|*  9 |       TABLE ACCESS FULL                | SALES    |      5 |    230K|  2924K|   133   (4)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.21 |     486 |    303 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ITEM_1"="T"."TIME_ID")
   5 - access("T"."TIME_ID">=TO_DATE(' 1998-12-27 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "T"."TIME_ID"=TO_DATE(' 1998-12-27 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
The literal predicate on T.TIME_ID is repeated on S.TIME_ID in the predicate section of the execution plan due to the optimizer performing transitive closure.
SELECT  t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold
from  bad_sales s
,   customers u
,   products p
,  bad_times t
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND t.calendar_date >= TO_DATE('27121998','DDMMYYYY') 
AND  t.calendar_date <  TO_DATE('27121999','DDMMYYYY')
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/

Plan hash value: 4232725394

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |           |      1 |        |       |   509 (100)|          |       |       |     13 |00:00:00.15 |     540 |       |       |          |
|   1 |  SORT GROUP BY                   |           |      1 |    366 | 22692 |   509  (19)| 00:00:01 |       |       |     13 |00:00:00.15 |     540 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN                      |           |      1 |    366 | 22692 |   508  (18)| 00:00:01 |       |       |    365 |00:00:00.15 |     540 |  1335K|  1335K| 1207K (0)|
|   3 |    JOIN FILTER CREATE            | :BF0001   |      1 |    366 | 10980 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|   4 |     PART JOIN FILTER CREATE      | :BF0000   |      1 |    366 | 10980 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|*  5 |      TABLE ACCESS FULL           | BAD_TIMES |      1 |    366 | 10980 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|   6 |    VIEW                          | VW_GBC_5  |      1 |   1460 | 46720 |   492  (19)| 00:00:01 |       |       |    370 |00:00:00.15 |     483 |       |       |          |
|   7 |     HASH GROUP BY                |           |      1 |   1460 | 16060 |   492  (19)| 00:00:01 |       |       |    370 |00:00:00.15 |     483 |  1079K|  1079K| 2561K (0)|
|   8 |      JOIN FILTER USE             | :BF0001   |      1 |    918K|  9870K|   423   (6)| 00:00:01 |       |       |    250K|00:00:00.10 |     483 |       |       |          |
|   9 |       PARTITION RANGE JOIN-FILTER|           |      1 |    918K|  9870K|   423   (6)| 00:00:01 |:BF0000|:BF0000|    250K|00:00:00.09 |     483 |       |       |          |
|* 10 |        TABLE ACCESS FULL         | BAD_SALES |      5 |    918K|  9870K|   423   (6)| 00:00:01 |:BF0000|:BF0000|    250K|00:00:00.09 |     483 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
If we do the same query on the new CALENDAR_DATE attribute column on BAD_TIMES dimension we still expect 366 rows from TIMES, but then we expect 918K rows from BAD_SALES instead of 230K rows from SALES.
SELECT  t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold
FROM bad_sales s
,   customers u
,   products p
,  bad_times t
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND t.time_id >= 19982712
AND  t.time_id <  19992712
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/

Plan hash value: 1098183223

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |       |   176 (100)|          |       |       |     13 |00:00:00.16 |     459 |       |       |          |
|   1 |  SORT GROUP BY            |           |      1 |    445 | 24030 |   176  (18)| 00:00:01 |       |       |     13 |00:00:00.16 |     459 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN               |           |      1 |    445 | 24030 |   175  (18)| 00:00:01 |       |       |    365 |00:00:00.16 |     459 |  1355K|  1355K| 1540K (0)|
|   3 |    PART JOIN FILTER CREATE| :BF0000   |      1 |    445 |  9790 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|*  4 |     TABLE ACCESS FULL     | BAD_TIMES |      1 |    445 |  9790 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|   5 |    VIEW                   | VW_GBC_5  |      1 |    470 | 15040 |   159  (19)| 00:00:01 |       |       |    365 |00:00:00.16 |     402 |       |       |          |
|   6 |     HASH GROUP BY         |           |      1 |    470 |  5170 |   159  (19)| 00:00:01 |       |       |    365 |00:00:00.16 |     402 |  1079K|  1079K| 2576K (0)|
|   7 |      PARTITION RANGE AND  |           |      1 |    295K|  3177K|   138   (7)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.10 |     402 |       |       |          |
|*  8 |       TABLE ACCESS FULL   | BAD_SALES |      4 |    295K|  3177K|   138   (7)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.09 |     402 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
But, now if I use the numeric value of the date in the predicate it incorrectly calculates the cardinality of TIMES as 445 instead of 366. The estimate of rows from BAD_SALES is not as far off as the previous query at only 296K rows, but it is still higher than the original 230K.
We see that if you query the BAD_TIMES dimension by an attribute column, then things will mostly work quite well, but if you put a criterion directly on the TIME_ID column, you will get cardinality misestimates. The queries in this example are very simple, but if more tables were joined, cardinality misestimates would have more opportunity to cause plan regressions.

Workaround

If you are stuck with a numeric primary key on your time dimension that you cannot change, you might be able to change the value you actually put into it. If the TIME_ID was the Julian date value all the arithmetic would correctly because there would be no gaps between months and years.

From
To
Difference
Function
Date
31st December 2018
1st January 2019
1

Number
20181231
20190101
8070
TO_CHAR(…,'YYYMMDD')
Julian
2458484
2458485
1
TO_CHAR(….'J')

No comments :