Good Practice
It is not uncommon to see a time dimension with one row for every day. This approach saves putting functions on the date column of the fact table. For example, in the Oracle Sales History sample schema:- The primary key on the time dimension is a date.
- There are 37 different attribute columns. This saves coding complex SQL functions to group dates.
Name Null? Type
----------------------------------------- -------- -----------
TIME_ID NOT NULL DATE
DAY_NAME NOT NULL VARCHAR2(9)
DAY_NUMBER_IN_WEEK NOT NULL NUMBER(1)
DAY_NUMBER_IN_MONTH NOT NULL NUMBER(2)
CALENDAR_WEEK_NUMBER NOT NULL NUMBER(2)
FISCAL_WEEK_NUMBER NOT NULL NUMBER(2)
WEEK_ENDING_DAY NOT NULL DATE
WEEK_ENDING_DAY_ID NOT NULL NUMBER
CALENDAR_MONTH_NUMBER NOT NULL NUMBER(2)
FISCAL_MONTH_NUMBER NOT NULL NUMBER(2)
CALENDAR_MONTH_DESC NOT NULL VARCHAR2(8)
CALENDAR_MONTH_ID NOT NULL NUMBER
FISCAL_MONTH_DESC NOT NULL VARCHAR2(8)
FISCAL_MONTH_ID NOT NULL NUMBER
DAYS_IN_CAL_MONTH NOT NULL NUMBER
DAYS_IN_FIS_MONTH NOT NULL NUMBER
END_OF_CAL_MONTH NOT NULL DATE
END_OF_FIS_MONTH NOT NULL DATE
CALENDAR_MONTH_NAME NOT NULL VARCHAR2(9)
FISCAL_MONTH_NAME NOT NULL VARCHAR2(9)
CALENDAR_QUARTER_DESC NOT NULL CHAR(7)
CALENDAR_QUARTER_ID NOT NULL NUMBER
FISCAL_QUARTER_DESC NOT NULL CHAR(7)
FISCAL_QUARTER_ID NOT NULL NUMBER
DAYS_IN_CAL_QUARTER NOT NULL NUMBER
DAYS_IN_FIS_QUARTER NOT NULL NUMBER
END_OF_CAL_QUARTER NOT NULL DATE
END_OF_FIS_QUARTER NOT NULL DATE
CALENDAR_QUARTER_NUMBER NOT NULL NUMBER(1)
FISCAL_QUARTER_NUMBER NOT NULL NUMBER(1)
CALENDAR_YEAR NOT NULL NUMBER(4)
CALENDAR_YEAR_ID NOT NULL NUMBER
FISCAL_YEAR NOT NULL NUMBER(4)
FISCAL_YEAR_ID NOT NULL NUMBER
DAYS_IN_CAL_YEAR NOT NULL NUMBER
DAYS_IN_FIS_YEAR NOT NULL NUMBER
END_OF_CAL_YEAR NOT NULL DATE
END_OF_FIS_YEAR NOT NULL DATE
I have seen some time dimensions with 200 years of dates (eg 1900 – 2100), this is fine if your queries specify a date range on the dimension, but if you have an unbounded inequality such as simply after a date then you will get a lot of rows from the time dimension. This can also bring challenges, so keep it sensible.Not Good Practice
You should represent a date as a date and not as a string or a number.- 31st December 2018
- Not 20181231
- Not '20181231'
- For example, the difference between 31st December 2018 and 1st January 2019 should be 1 day.
- However, if you use the string representation of a day it is 20190101-20181231=8870
- This can lead to bad cardinality estimates, and in more complex queries it can cause subsequent bad optimizer decisions later in the query.
INSERT /*+APPEND*/ INTO bad_times
( TIME_ID, CALENDAR_DATE, DAY_NAME
...
)
SELECT
TO_NUMBER(TO_CHAR(TIME_ID,'YYYYMMDD')) time_id, TIME_ID calendar_date, DAY_NAME
...
FROM times
/
TIME_ID is now a number that contains the date string, and CALENDAR_DATE is the original date value TIME_ID CALENDAR_
---------- ---------
19980101 01-JAN-98
19980102 02-JAN-98
19980103 03-JAN-98
19980104 04-JAN-98
19980105 05-JAN-98
19980106 06-JAN-98
19980107 07-JAN-98
19980108 08-JAN-98
19980109 09-JAN-98
19980110 10-JAN-98
...
I will also create a copy of sales where I have used the same numeric format for TIME_ID. The foreign key between these is still on TIME_ID, but that column is now a number.INSERT /*+APPEND*/ INTO bad_sales
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
SELECT prod_id, cust_id, TO_NUMBER(TO_CHAR(time_id,'yyyymmdd')), channel_id, promo_id, quantity_sold, amount_sold
FROM sales
/
I will demonstrate the problem with a simple monthly sales analysis report for FY 1999.SELECT t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
, COUNT(*) num_sales
, SUM(s.amount_sold) total_amount_sold
FROM sales s
, customers u
, products p
, times t
WHERE s.time_id = t.time_id
AND s.prod_id = p.prod_id
AND u.cust_id = s.cust_id
AND t.time_id >= TO_DATE('27121998','DDMMYYYY')
AND t.time_id < TO_DATE('27121999','DDMMYYYY')
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/
Plan hash value: 3667272686
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | Pstart| Pstop | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 165 (100)| | | | 13 |00:00:00.28 | 500 | 303 | | | |
| 1 | SORT GROUP BY | | 1 | 365 | 21170 | 165 (14)| 00:00:01 | | | 13 |00:00:00.28 | 500 | 303 | 2048 | 2048 | 2048 (0)|
|* 2 | HASH JOIN | | 1 | 365 | 21170 | 164 (13)| 00:00:01 | | | 365 |00:00:00.28 | 500 | 303 | 1298K| 1298K| 1527K (0)|
| 3 | PART JOIN FILTER CREATE | :BF0000 | 1 | 366 | 8784 | 14 (0)| 00:00:01 | | | 365 |00:00:00.01 | 14 | 0 | | | |
| 4 | TABLE ACCESS BY INDEX ROWID BATCHED| TIMES | 1 | 366 | 8784 | 14 (0)| 00:00:01 | | | 365 |00:00:00.01 | 14 | 0 | | | |
|* 5 | INDEX RANGE SCAN | TIMES_PK | 1 | 366 | | 3 (0)| 00:00:01 | | | 365 |00:00:00.01 | 3 | 0 | | | |
| 6 | VIEW | VW_GBC_5 | 1 | 366 | 12444 | 150 (14)| 00:00:01 | | | 365 |00:00:00.28 | 486 | 303 | | | |
| 7 | HASH GROUP BY | | 1 | 366 | 4758 | 150 (14)| 00:00:01 | | | 365 |00:00:00.28 | 486 | 303 | 1063K| 1063K| 2552K (0)|
| 8 | PARTITION RANGE AND | | 1 | 230K| 2924K| 133 (4)| 00:00:01 |KEY(AP)|KEY(AP)| 247K|00:00:00.21 | 486 | 303 | | | |
|* 9 | TABLE ACCESS FULL | SALES | 5 | 230K| 2924K| 133 (4)| 00:00:01 |KEY(AP)|KEY(AP)| 247K|00:00:00.21 | 486 | 303 | | | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ITEM_1"="T"."TIME_ID")
5 - access("T"."TIME_ID">=TO_DATE(' 1998-12-27 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "T"."TIME_ID"=TO_DATE(' 1998-12-27 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
The literal predicate on T.TIME_ID is repeated on S.TIME_ID in the predicate section of the execution plan due to the optimizer performing transitive closure.SELECT t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
, COUNT(*) num_sales
, SUM(s.amount_sold) total_amount_sold
from bad_sales s
, customers u
, products p
, bad_times t
WHERE s.time_id = t.time_id
AND s.prod_id = p.prod_id
AND u.cust_id = s.cust_id
AND t.calendar_date >= TO_DATE('27121998','DDMMYYYY')
AND t.calendar_date < TO_DATE('27121999','DDMMYYYY')
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/
Plan hash value: 4232725394
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | Pstart| Pstop | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 509 (100)| | | | 13 |00:00:00.15 | 540 | | | |
| 1 | SORT GROUP BY | | 1 | 366 | 22692 | 509 (19)| 00:00:01 | | | 13 |00:00:00.15 | 540 | 2048 | 2048 | 2048 (0)|
|* 2 | HASH JOIN | | 1 | 366 | 22692 | 508 (18)| 00:00:01 | | | 365 |00:00:00.15 | 540 | 1335K| 1335K| 1207K (0)|
| 3 | JOIN FILTER CREATE | :BF0001 | 1 | 366 | 10980 | 16 (0)| 00:00:01 | | | 365 |00:00:00.01 | 56 | | | |
| 4 | PART JOIN FILTER CREATE | :BF0000 | 1 | 366 | 10980 | 16 (0)| 00:00:01 | | | 365 |00:00:00.01 | 56 | | | |
|* 5 | TABLE ACCESS FULL | BAD_TIMES | 1 | 366 | 10980 | 16 (0)| 00:00:01 | | | 365 |00:00:00.01 | 56 | | | |
| 6 | VIEW | VW_GBC_5 | 1 | 1460 | 46720 | 492 (19)| 00:00:01 | | | 370 |00:00:00.15 | 483 | | | |
| 7 | HASH GROUP BY | | 1 | 1460 | 16060 | 492 (19)| 00:00:01 | | | 370 |00:00:00.15 | 483 | 1079K| 1079K| 2561K (0)|
| 8 | JOIN FILTER USE | :BF0001 | 1 | 918K| 9870K| 423 (6)| 00:00:01 | | | 250K|00:00:00.10 | 483 | | | |
| 9 | PARTITION RANGE JOIN-FILTER| | 1 | 918K| 9870K| 423 (6)| 00:00:01 |:BF0000|:BF0000| 250K|00:00:00.09 | 483 | | | |
|* 10 | TABLE ACCESS FULL | BAD_SALES | 5 | 918K| 9870K| 423 (6)| 00:00:01 |:BF0000|:BF0000| 250K|00:00:00.09 | 483 | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
If we do the same query on the new CALENDAR_DATE attribute column on BAD_TIMES dimension we still expect 366 rows from TIMES, but then we expect 918K rows from BAD_SALES instead of 230K rows from SALES.SELECT t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
, COUNT(*) num_sales
, SUM(s.amount_sold) total_amount_sold
FROM bad_sales s
, customers u
, products p
, bad_times t
WHERE s.time_id = t.time_id
AND s.prod_id = p.prod_id
AND u.cust_id = s.cust_id
AND t.time_id >= 19982712
AND t.time_id < 19992712
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/
Plan hash value: 1098183223
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | Pstart| Pstop | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 176 (100)| | | | 13 |00:00:00.16 | 459 | | | |
| 1 | SORT GROUP BY | | 1 | 445 | 24030 | 176 (18)| 00:00:01 | | | 13 |00:00:00.16 | 459 | 2048 | 2048 | 2048 (0)|
|* 2 | HASH JOIN | | 1 | 445 | 24030 | 175 (18)| 00:00:01 | | | 365 |00:00:00.16 | 459 | 1355K| 1355K| 1540K (0)|
| 3 | PART JOIN FILTER CREATE| :BF0000 | 1 | 445 | 9790 | 16 (0)| 00:00:01 | | | 365 |00:00:00.01 | 56 | | | |
|* 4 | TABLE ACCESS FULL | BAD_TIMES | 1 | 445 | 9790 | 16 (0)| 00:00:01 | | | 365 |00:00:00.01 | 56 | | | |
| 5 | VIEW | VW_GBC_5 | 1 | 470 | 15040 | 159 (19)| 00:00:01 | | | 365 |00:00:00.16 | 402 | | | |
| 6 | HASH GROUP BY | | 1 | 470 | 5170 | 159 (19)| 00:00:01 | | | 365 |00:00:00.16 | 402 | 1079K| 1079K| 2576K (0)|
| 7 | PARTITION RANGE AND | | 1 | 295K| 3177K| 138 (7)| 00:00:01 |KEY(AP)|KEY(AP)| 247K|00:00:00.10 | 402 | | | |
|* 8 | TABLE ACCESS FULL | BAD_SALES | 4 | 295K| 3177K| 138 (7)| 00:00:01 |KEY(AP)|KEY(AP)| 247K|00:00:00.09 | 402 | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
But, now if I use the numeric value of the date in the predicate it incorrectly calculates the cardinality of TIMES as 445 instead of 366. The estimate of rows from BAD_SALES is not as far off as the previous query at only 296K rows, but it is still higher than the original 230K.We see that if you query the BAD_TIMES dimension by an attribute column, then things will mostly work quite well, but if you put a criterion directly on the TIME_ID column, you will get cardinality misestimates. The queries in this example are very simple, but if more tables were joined, cardinality misestimates would have more opportunity to cause plan regressions.
Workaround
If you are stuck with a numeric primary key on your time dimension that you cannot change, you might be able to change the value you actually put into it. If the TIME_ID was the Julian date value all the arithmetic would correctly because there would be no gaps between months and years.
From
|
To
|
Difference
|
Function
|
|
Date
|
31st December 2018
|
1st January 2019
|
1
|
|
Number
|
20181231
|
20190101
|
8070
|
TO_CHAR(…,'YYYMMDD')
|
Julian
|
2458484
|
2458485
|
1
|
TO_CHAR(….'J')
|
No comments :
Post a Comment