Monday, November 19, 2018

Data Warehouse Design Mistakes 3: Date Dimensions Without Date Primary Keys

This post is part of a series that discusses some common issues in data warehouses.

Good Practice 

It is not uncommon to see a time dimension with one row for every day. This approach saves putting functions on the date column of the fact table.  For example, in the Oracle Sales History sample schema:
  • The primary key on the time dimension is a date.
  • There are 37 different attribute columns. This saves coding complex SQL functions to group dates.
Name                                      Null?    Type
----------------------------------------- -------- -----------
TIME_ID                                   NOT NULL DATE
DAY_NAME                                  NOT NULL VARCHAR2(9)
DAY_NUMBER_IN_WEEK                        NOT NULL NUMBER(1)
DAY_NUMBER_IN_MONTH                       NOT NULL NUMBER(2)
CALENDAR_WEEK_NUMBER                      NOT NULL NUMBER(2)
FISCAL_WEEK_NUMBER                        NOT NULL NUMBER(2)
WEEK_ENDING_DAY                           NOT NULL DATE
WEEK_ENDING_DAY_ID                        NOT NULL NUMBER
CALENDAR_MONTH_NUMBER                     NOT NULL NUMBER(2)
FISCAL_MONTH_NUMBER                       NOT NULL NUMBER(2)
CALENDAR_MONTH_DESC                       NOT NULL VARCHAR2(8)
CALENDAR_MONTH_ID                         NOT NULL NUMBER
FISCAL_MONTH_DESC                         NOT NULL VARCHAR2(8)
FISCAL_MONTH_ID                           NOT NULL NUMBER
DAYS_IN_CAL_MONTH                         NOT NULL NUMBER
DAYS_IN_FIS_MONTH                         NOT NULL NUMBER
END_OF_CAL_MONTH                          NOT NULL DATE
END_OF_FIS_MONTH                          NOT NULL DATE
CALENDAR_MONTH_NAME                       NOT NULL VARCHAR2(9)
FISCAL_MONTH_NAME                         NOT NULL VARCHAR2(9)
CALENDAR_QUARTER_DESC                     NOT NULL CHAR(7)
CALENDAR_QUARTER_ID                       NOT NULL NUMBER
FISCAL_QUARTER_DESC                       NOT NULL CHAR(7)
FISCAL_QUARTER_ID                         NOT NULL NUMBER
DAYS_IN_CAL_QUARTER                       NOT NULL NUMBER
DAYS_IN_FIS_QUARTER                       NOT NULL NUMBER
END_OF_CAL_QUARTER                        NOT NULL DATE
END_OF_FIS_QUARTER                        NOT NULL DATE
CALENDAR_QUARTER_NUMBER                   NOT NULL NUMBER(1)
FISCAL_QUARTER_NUMBER                     NOT NULL NUMBER(1)
CALENDAR_YEAR                             NOT NULL NUMBER(4)
CALENDAR_YEAR_ID                          NOT NULL NUMBER
FISCAL_YEAR                               NOT NULL NUMBER(4)
FISCAL_YEAR_ID                            NOT NULL NUMBER
DAYS_IN_CAL_YEAR                          NOT NULL NUMBER
DAYS_IN_FIS_YEAR                          NOT NULL NUMBER
END_OF_CAL_YEAR                           NOT NULL DATE
END_OF_FIS_YEAR                           NOT NULL DATE
I have seen some time dimensions with 200 years of dates (eg 1900 – 2100), this is fine if your queries specify a date range on the dimension, but if you have an unbounded inequality such as simply after a date then you will get a lot of rows from the time dimension. This can also bring challenges, so keep it sensible.

Not Good Practice 

You should represent a date as a date and not as a string or a number.
  • 31st December 2018
  • Not 20181231 
  • Not '20181231' 
If you represent a date as a string you can get miscalculations in the optimizer.
  • For example, the difference between 31st December 2018 and 1st January 2019 should be 1 day.
  • However, if you use the string representation of a day it is 20190101-20181231=8870 
  • This can lead to bad cardinality estimates, and in more complex queries it can cause subsequent bad optimizer decisions later in the query. 
I will create a copy of the TIMES dimension, but with a numeric primary key, and I will add the original date in a new attribute column CALENDAR_DATE.
INSERT /*+APPEND*/ INTO bad_times
( TIME_ID, CALENDAR_DATE, DAY_NAME
...
) 
SELECT
  TO_NUMBER(TO_CHAR(TIME_ID,'YYYYMMDD')) time_id, TIME_ID calendar_date, DAY_NAME
...
FROM times
/
TIME_ID is now a number that contains the date string, and CALENDAR_DATE is the original date value
   TIME_ID CALENDAR_
---------- ---------
  19980101 01-JAN-98
  19980102 02-JAN-98
  19980103 03-JAN-98
  19980104 04-JAN-98
  19980105 05-JAN-98
  19980106 06-JAN-98
  19980107 07-JAN-98
  19980108 08-JAN-98
  19980109 09-JAN-98
  19980110 10-JAN-98
...
I will also create a copy of sales where I have used the same numeric format for TIME_ID. The foreign key between these is still on TIME_ID, but that column is now a number.
INSERT /*+APPEND*/ INTO bad_sales
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
SELECT  prod_id, cust_id, TO_NUMBER(TO_CHAR(time_id,'yyyymmdd')), channel_id, promo_id, quantity_sold, amount_sold
FROM  sales
/
I will demonstrate the problem with a simple monthly sales analysis report for FY 1999.
SELECT  t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold 
FROM  sales s
,   customers u
,   products p
,  times t
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND t.time_id >= TO_DATE('27121998','DDMMYYYY') 
AND  t.time_id <  TO_DATE('27121999','DDMMYYYY')
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/

Plan hash value: 3667272686

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name     | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |          |      1 |        |       |   165 (100)|       |          |       |     13 |00:00:00.28 |     500 |    303 |       |       |          |
|   1 |  SORT GROUP BY                         |          |      1 |    365 | 21170 |   165  (14)| 00:00:01 |       |       |     13 |00:00:00.28 |     500 |    303 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN                            |          |      1 |    365 | 21170 |   164  (13)| 00:00:01 |       |       |    365 |00:00:00.28 |     500 |    303 |  1298K|  1298K| 1527K (0)|
|   3 |    PART JOIN FILTER CREATE             | :BF0000  |      1 |    366 |  8784 |    14   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      14 |      0 |       |       |          |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| TIMES    |      1 |    366 |  8784 |    14   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      14 |      0 |       |       |          |
|*  5 |      INDEX RANGE SCAN                  | TIMES_PK |      1 |    366 |       |     3   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |       3 |      0 |       |       |          |
|   6 |    VIEW                                | VW_GBC_5 |      1 |    366 | 12444 |   150  (14)| 00:00:01 |       |       |    365 |00:00:00.28 |     486 |    303 |       |       |          |
|   7 |     HASH GROUP BY                      |          |      1 |    366 |  4758 |   150  (14)| 00:00:01 |       |       |    365 |00:00:00.28 |     486 |    303 |  1063K|  1063K| 2552K (0)|
|   8 |      PARTITION RANGE AND               |          |      1 |    230K|  2924K|   133   (4)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.21 |     486 |    303 |       |       |          |
|*  9 |       TABLE ACCESS FULL                | SALES    |      5 |    230K|  2924K|   133   (4)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.21 |     486 |    303 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ITEM_1"="T"."TIME_ID")
   5 - access("T"."TIME_ID">=TO_DATE(' 1998-12-27 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "T"."TIME_ID"=TO_DATE(' 1998-12-27 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
The literal predicate on T.TIME_ID is repeated on S.TIME_ID in the predicate section of the execution plan due to the optimizer performing transitive closure.
SELECT  t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold
from  bad_sales s
,   customers u
,   products p
,  bad_times t
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND t.calendar_date >= TO_DATE('27121998','DDMMYYYY') 
AND  t.calendar_date <  TO_DATE('27121999','DDMMYYYY')
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/

Plan hash value: 4232725394

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |           |      1 |        |       |   509 (100)|          |       |       |     13 |00:00:00.15 |     540 |       |       |          |
|   1 |  SORT GROUP BY                   |           |      1 |    366 | 22692 |   509  (19)| 00:00:01 |       |       |     13 |00:00:00.15 |     540 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN                      |           |      1 |    366 | 22692 |   508  (18)| 00:00:01 |       |       |    365 |00:00:00.15 |     540 |  1335K|  1335K| 1207K (0)|
|   3 |    JOIN FILTER CREATE            | :BF0001   |      1 |    366 | 10980 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|   4 |     PART JOIN FILTER CREATE      | :BF0000   |      1 |    366 | 10980 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|*  5 |      TABLE ACCESS FULL           | BAD_TIMES |      1 |    366 | 10980 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|   6 |    VIEW                          | VW_GBC_5  |      1 |   1460 | 46720 |   492  (19)| 00:00:01 |       |       |    370 |00:00:00.15 |     483 |       |       |          |
|   7 |     HASH GROUP BY                |           |      1 |   1460 | 16060 |   492  (19)| 00:00:01 |       |       |    370 |00:00:00.15 |     483 |  1079K|  1079K| 2561K (0)|
|   8 |      JOIN FILTER USE             | :BF0001   |      1 |    918K|  9870K|   423   (6)| 00:00:01 |       |       |    250K|00:00:00.10 |     483 |       |       |          |
|   9 |       PARTITION RANGE JOIN-FILTER|           |      1 |    918K|  9870K|   423   (6)| 00:00:01 |:BF0000|:BF0000|    250K|00:00:00.09 |     483 |       |       |          |
|* 10 |        TABLE ACCESS FULL         | BAD_SALES |      5 |    918K|  9870K|   423   (6)| 00:00:01 |:BF0000|:BF0000|    250K|00:00:00.09 |     483 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
If we do the same query on the new CALENDAR_DATE attribute column on BAD_TIMES dimension we still expect 366 rows from TIMES, but then we expect 918K rows from BAD_SALES instead of 230K rows from SALES.
SELECT  t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
,  COUNT(*) num_sales
,  SUM(s.amount_sold) total_amount_sold
FROM bad_sales s
,   customers u
,   products p
,  bad_times t
WHERE  s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND t.time_id >= 19982712
AND  t.time_id <  19992712
GROUP BY t.fiscal_year, t.fiscal_month_id, t.fiscal_month_desc
ORDER BY 1
/

Plan hash value: 1098183223

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |       |   176 (100)|          |       |       |     13 |00:00:00.16 |     459 |       |       |          |
|   1 |  SORT GROUP BY            |           |      1 |    445 | 24030 |   176  (18)| 00:00:01 |       |       |     13 |00:00:00.16 |     459 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN               |           |      1 |    445 | 24030 |   175  (18)| 00:00:01 |       |       |    365 |00:00:00.16 |     459 |  1355K|  1355K| 1540K (0)|
|   3 |    PART JOIN FILTER CREATE| :BF0000   |      1 |    445 |  9790 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|*  4 |     TABLE ACCESS FULL     | BAD_TIMES |      1 |    445 |  9790 |    16   (0)| 00:00:01 |       |       |    365 |00:00:00.01 |      56 |       |       |          |
|   5 |    VIEW                   | VW_GBC_5  |      1 |    470 | 15040 |   159  (19)| 00:00:01 |       |       |    365 |00:00:00.16 |     402 |       |       |          |
|   6 |     HASH GROUP BY         |           |      1 |    470 |  5170 |   159  (19)| 00:00:01 |       |       |    365 |00:00:00.16 |     402 |  1079K|  1079K| 2576K (0)|
|   7 |      PARTITION RANGE AND  |           |      1 |    295K|  3177K|   138   (7)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.10 |     402 |       |       |          |
|*  8 |       TABLE ACCESS FULL   | BAD_SALES |      4 |    295K|  3177K|   138   (7)| 00:00:01 |KEY(AP)|KEY(AP)|    247K|00:00:00.09 |     402 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
But, now if I use the numeric value of the date in the predicate it incorrectly calculates the cardinality of TIMES as 445 instead of 366. The estimate of rows from BAD_SALES is not as far off as the previous query at only 296K rows, but it is still higher than the original 230K.
We see that if you query the BAD_TIMES dimension by an attribute column, then things will mostly work quite well, but if you put a criterion directly on the TIME_ID column, you will get cardinality misestimates. The queries in this example are very simple, but if more tables were joined, cardinality misestimates would have more opportunity to cause plan regressions.

Workaround

If you are stuck with a numeric primary key on your time dimension that you cannot change, you might be able to change the value you actually put into it. If the TIME_ID was the Julian date value all the arithmetic would correctly because there would be no gaps between months and years.

From
To
Difference
Function
Date
31st December 2018
1st January 2019
1

Number
20181231
20190101
8070
TO_CHAR(…,'YYYMMDD')
Julian
2458484
2458485
1
TO_CHAR(….'J')

Thursday, November 15, 2018

Data Warehouse Design Mistakes 2: Effective Dating Dimensions

This post is part of a series that discusses some common issues in data warehouses.

I have seen situations where some dimensions are effective-dated.  That is to say that there are multiple rows in the dimension table for the same main dimension id, but for different date ranges.  At least one of the date columns has to become part of the primary key.
The dimensions in the Sales History sample schema have got effective from and to dates, but these columns are not part of the primary key.  Let's imagine that the products get updated every year and a new version of the product is sold, requiring a new dimension row.

The Wrong Way

I am going to build a copy of the PRODUCTS table with a new effective dated row for each calendar year in which the product was sold.  The structure and indexing of the BAD_PROUCTS dimension table is unchanged, except that the primary key is not just on PROD_ID but now additionally includes PROD_EFF_FROM
CREATE TABLE bad_products (
    prod_id                     NUMBER(6)       NOT NULL,
    prod_name                   VARCHAR2(50)    NOT NULL,
    prod_desc                   VARCHAR2(4000)  NOT NULL,
    prod_subcategory            VARCHAR2(50)    NOT NULL,
    prod_subcategory_id         NUMBER          NOT NULL,
    prod_subcategory_desc       VARCHAR2(2000)  NOT NULL,
    prod_category               VARCHAR2(50)    NOT NULL,
    prod_category_id            NUMBER          NOT NULL,
    prod_category_desc          VARCHAR2(2000)  NOT NULL,
    prod_weight_class           NUMBER(3)       NOT NULL,
    prod_unit_of_measure        VARCHAR2(20)    ,
    prod_pack_size              VARCHAR2(30)    NOT NULL,
    supplier_id                 NUMBER(6)       NOT NULL,
    prod_status                 VARCHAR2(20)    NOT NULL,
    prod_list_price             NUMBER(8,2)     NOT NULL,
    prod_min_price              NUMBER(8,2)     NOT NULL,
    prod_total                  VARCHAR2(13)    NOT NULL,
    prod_total_id               NUMBER          NOT NULL,
    prod_src_id                 NUMBER          ,
    prod_eff_from               DATE            ,
    prod_eff_to                 DATE            ,
    prod_valid                  VARCHAR2(1)     )
/
ALTER TABLE bad_products
ADD CONSTRAINT bad_products_pk PRIMARY KEY (prod_id, prod_eff_from)
/
Now, I will populate BAD_PRODUCTS by copying PRODUCTS.  I will create rows for each product for each year in which it was sold.  I started with 72 rows in PRODUCTS, but I have 275 rows in BAD_PRODUCTS.
INSERT INTO bad_products
WITH s AS (
SELECT s.prod_id
, TRUNC(s.time_id,'YYYY') time_id
from sales s
GROUP BY s.prod_id, TRUNC(s.time_id,'YYYY')
)
SELECT p.prod_id
,p.prod_name
,p.prod_desc||' '||TO_CHAR(s.time_id,'YYYY')
,p.prod_subcategory
,p.prod_subcategory_id
,p.prod_subcategory_desc||' '||TO_CHAR(s.time_id,'YYYY')
,p.prod_category
,p.prod_category_id
,p.prod_category_desc||' '||TO_CHAR(s.time_id,'YYYY')
,p.prod_weight_class
,p.prod_unit_of_measure
,p.prod_pack_size
,p.supplier_id
,p.prod_status
,p.prod_list_price
,p.prod_min_price
,p.prod_total
,p.prod_total_id
,p.prod_src_id
,s.time_id PROD_EFF_FROM
,ADD_MONTHS(s.time_id,12)-1 PROD_EFF_TO
,p.prod_valid
from products p
,    s
where s.prod_id = p.prod_id
/
Finally, I will build the same indexes as on the PRODUCTS table.
CREATE INDEX bad_products_prod_cat_ix ON bad_products (prod_category);
CREATE BITMAP INDEX bad_products_prod_status_bix ON bad_products (prod_status);
CREATE INDEX bad_products_prod_subcat_ix ON bad_products (prod_subcategory);
I can't now create a foreign key for the product on the SALES table.
I can’t use PROD_ID only because that does not match the primary key.
ALTER TABLE sales
  ADD CONSTRAINT sales_bad_product_fk
  FOREIGN KEY (prod_id) REFERENCES bad_products (prod_id)
;
ERROR at line 3:
ORA-02270: no matching unique or primary key for this column-list
I can't build the foreign key on PROD_ID and TIME_ID, because the TIME_IDs are merely inside the effective range but are not on the PRODUCTS table
ALTER TABLE sales
  ADD CONSTRAINT sales_bad_product_fk
  FOREIGN KEY (prod_id, time_id) REFERENCES bad_products (prod_id, prod_eff_from)
;
ERROR at line 2:
ORA-02298: cannot validate (SH.SALES_BAD_PRODUCT_FK) - parent keys not found
When I come to query the data, I have no choice but to code two inequality conditions on the product table with one of them on a column that is not part of the foreign key.
…
from  sales s
,     bad_products p
,     times t
WHERE s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   t.time_id >= p.prod_eff_from 
AND  (t.time_id <= p.prod_eff_to OR p.prod_eff_to IS NULL)
So, even on Oracle 12.2, even if I create the above multi-column constraint NOVALIDATE RELY and set QUERY_REWRITE_INTEGRITY to TRUSTED, I have no chance of being able to eliminate the product table with foreign key join elimination.

The Right Way 

If the product changes and we need a new dimension row for an updated product we will have a new product, with a new product ID, but will with effective dates for that year. Here, I have created an exact copy of the SALES table, called SALES2, but now the product ID is prefixed with the calendar year.
INSERT /*+APPEND*/ INTO sales2
SELECT 
 prod_id+1000*TO_NUMBER(TO_CHAR(time_id,'YYYY')) prod_id
,cust_id
,time_id
,channel_id
,promo_id
,quantity_sold
,amount_sold
FROM sales
/
And similarly, I have created a copy of the PRODUCTS dimension table, called PRODUCTS2 with additional product rows for each year that the product was sold, but again the product ID has been prefixed with the calendar year. The primary key is still just PROD_ID and all the indexes and foreign keys are exactly the same as on the original tables.  Both BAD_PRODUCTS and PRODUCTS2 have the 275 rows.
INSERT INTO products2
WITH s AS (
SELECT s.prod_id
, TRUNC(s.time_id,'YYYY') time_id
FROM sales s
GROUP BY s.prod_id, TRUNC(s.time_id,'YYYY')
)
SELECT p.PROD_ID+(1000*TO_NUMBER(TO_CHAR(s.time_id,'YYYY')))
,p.prod_name
,p.prod_desc||' '||TO_CHAR(s.time_id,'YYYY')
,p.prod_subcategory
,p.prod_subcategory_id
,p.prod_subcategory_desc||' '||TO_CHAR(s.time_id,'YYYY')
,p.prod_category
,p.prod_category_id
,p.prod_category_desc||' '||TO_CHAR(s.time_id,'YYYY')
,p.prod_weight_class
,p.prod_unit_of_measure
,p.prod_pack_size
,p.supplier_id
,p.prod_status
,p.prod_list_price
,p.prod_min_price
,p.prod_total
,p.prod_total_id
,p.prod_src_id
,s.time_id PROD_EFF_FROM
,ADD_MONTHS(s.time_id,12)-1 PROD_EFF_TO
,p.prod_valid
FROM  products p
,     s
WHERE s.prod_id = p.prod_id
/
This is effectively the same query on SALES2 and PRODUCTS2, and it returns the same results. I have kept the inequality criteria on the production effective dates with both SALES and TIMES, although they are not needed here. However, we will see that the way the tables are structured and accessed will have a cost implication.
This time I want to analyse software sales (product category 205) for the USA by fiscal year.
SELECT c.country_name
, p.prod_category_id
, p.prod_category
, t.fiscal_year
, COUNT(*) num_sales
, SUM(s.amount_sold) total_amount_sold
FROM  sales2 s
,     customers u
,     products2 p
,     times t
,     countries c
WHERE s.time_id = t.time_id
AND   s.prod_id = p.prod_id
AND   u.cust_id = s.cust_id
AND   u.country_id = c.country_id
AND   c.country_iso_code = 'US'
AND   p.prod_category_id = 205
AND   t.time_id >= p.prod_eff_from 
AND  (t.time_id <= p.prod_eff_to OR p.prod_eff_to IS NULL)
AND   s.time_id >= p.prod_eff_from 
AND  (s.time_id <= p.prod_eff_to OR p.prod_eff_to IS NULL)
GROUP BY c.country_name
, p.prod_category_id
, p.prod_category
, t.fiscal_year
ORDER BY 1,2
/

With Query Transformation

First the wrong way with the BAD_PROUCTS table
Plan hash value: 2522704620

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                        | Name                         | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                 |                              |      1 |        |       |  2000 (100)|          |       |       |      5 |00:00:06.60 |     305K|       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION                       |                              |      1 |        |       |            |          |       |       |      5 |00:00:06.60 |     305K|       |       |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)        | SYS_TEMP_0FD9D7230_A4BC21    |      1 |        |       |            |          |       |       |      0 |00:00:00.08 |    1525 |  1024 |  1024 |          |
|*  3 |    HASH JOIN                                     |                              |      1 |   2921 | 81788 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.05 |    1524 |  1185K|  1185K|  721K (0)|
|*  4 |     TABLE ACCESS FULL                            | COUNTRIES                    |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|   5 |     TABLE ACCESS FULL                            | CUSTOMERS                    |      1 |  55500 |   541K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |       |       |          |
|   6 |   SORT GROUP BY                                  |                              |      1 |      5 |   495 |  1582   (1)| 00:00:01 |       |       |      5 |00:00:06.52 |     303K|  2048 |  2048 | 2048  (0)|
|*  7 |    HASH JOIN                                     |                              |      1 |  11568 |  1118K|  1580   (1)| 00:00:01 |       |       |    237K|00:00:06.29 |     303K|  2290K|  1666K| 2158K (0)|
|   8 |     TABLE ACCESS FULL                            | SYS_TEMP_0FD9D7230_A4BC21    |      1 |   2921 | 43815 |     4   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
|*  9 |     HASH JOIN                                    |                              |      1 |  11568 |   948K|  1576   (1)| 00:00:01 |       |       |    237K|00:00:06.18 |     303K|  1744K|  1744K| 1704K (0)|
|  10 |      TABLE ACCESS FULL                           | TIMES                        |      1 |   1826 | 21912 |    16   (0)| 00:00:01 |       |       |   1826 |00:00:00.01 |      55 |       |       |          |
|* 11 |      HASH JOIN                                   |                              |      1 |  30764 |  2163K|  1559   (1)| 00:00:01 |       |       |    237K|00:00:06.09 |     303K|  1115K|  1115K| 1352K (0)|
|* 12 |       TABLE ACCESS FULL                          | BAD_PRODUCTS                 |      1 |    102 |  4284 |     5   (0)| 00:00:01 |       |       |    102 |00:00:00.01 |      14 |       |       |          |
|  13 |       VIEW                                       | VW_ST_F7251F01               |      1 |  48359 |  1416K|  1554   (1)| 00:00:01 |       |       |    237K|00:00:05.02 |     303K|       |       |          |
|  14 |        NESTED LOOPS                              |                              |      1 |  48359 |  2361K|  1526   (1)| 00:00:01 |       |       |    237K|00:00:04.82 |     303K|       |       |          |
|  15 |         PARTITION RANGE ALL                      |                              |      1 |  48359 |   991K|   397   (2)| 00:00:01 |     1 |    28 |    237K|00:00:03.36 |     302K|       |       |          |
|  16 |          BITMAP CONVERSION TO ROWIDS             |                              |     28 |  48359 |   991K|   397   (2)| 00:00:01 |       |       |    237K|00:00:03.15 |     302K|       |       |          |
|  17 |           BITMAP AND                             |                              |     28 |        |       |            |          |       |       |     16 |00:00:03.01 |     302K|       |       |          |
|  18 |            BITMAP MERGE                          |                              |     28 |        |       |            |          |       |       |     16 |00:00:00.04 |    1819 |  1024K|   512K|36864  (0)|
|  19 |             BITMAP KEY ITERATION                 |                              |     28 |        |       |            |          |       |       |   1572 |00:00:00.03 |    1819 |       |       |          |
|  20 |              BUFFER SORT                         |                              |     28 |        |       |            |          |       |       |   2856 |00:00:00.01 |       9 | 73728 | 73728 |          |
|* 21 |               TABLE ACCESS BY INDEX ROWID BATCHED| BAD_PRODUCTS                 |      1 |    102 |   816 |    24   (0)| 00:00:01 |       |       |    102 |00:00:00.01 |       9 |       |       |          |
|  22 |                BITMAP CONVERSION TO ROWIDS       |                              |      1 |        |       |            |          |       |       |    275 |00:00:00.01 |       1 |       |       |          |
|  23 |                 BITMAP INDEX FULL SCAN           | BAD_PRODUCTS_PROD_STATUS_BIX |      1 |        |       |            |          |       |       |      1 |00:00:00.01 |       1 |       |       |          |
|* 24 |              BITMAP INDEX RANGE SCAN             | SALES_PROD_BIX               |   2856 |        |       |            |          |     1 |    28 |   1572 |00:00:00.02 |    1810 |       |       |          |
|  25 |            BITMAP MERGE                          |                              |     28 |        |       |            |          |       |       |     16 |00:00:02.97 |     301K|  8316K|  1039K|  317K (0)|
|  26 |             BITMAP KEY ITERATION                 |                              |     28 |        |       |            |          |       |       |  19186 |00:00:02.85 |     301K|       |       |          |
|  27 |              BUFFER SORT                         |                              |     28 |        |       |            |          |       |       |    296K|00:00:00.42 |       0 |    26M|  1871K|  865K (0)|
|  28 |               TABLE ACCESS FULL                  | SYS_TEMP_0FD9D7230_A4BC21    |      5 |   2921 | 14605 |     4   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
|* 29 |              BITMAP INDEX RANGE SCAN             | SALES_CUST_BIX               |    296K|        |       |            |          |     1 |    28 |  19186 |00:00:01.97 |     301K|       |       |          |
|  30 |         TABLE ACCESS BY USER ROWID               | SALES                        |    237K|      1 |    29 |  1157   (1)| 00:00:01 | ROWID | ROWID |    237K|00:00:00.92 |     829 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Predicate Information (identified by operation id):
---------------------------------------------------
…
  11 - access("ITEM_3"="P"."PROD_ID")
       filter(("ITEM_1">="P"."PROD_EFF_FROM" AND ("ITEM_1"<="P"."PROD_EFF_TO" OR "P"."PROD_EFF_TO" IS NULL)))
…
And now the right way with PRODUCTS2. At first glance the plans are not very different, both plans do a star transformation, but the devil is in the detail.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                        | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                 |                           |      1 |        |       |  1207 (100)|          |       |       |      5 |00:00:05.08 |     304K|       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION                       |                           |      1 |        |       |            |          |       |       |      5 |00:00:05.08 |     304K|       |       |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)        | SYS_TEMP_0FD9D7231_A4BC21 |      1 |        |       |            |          |       |       |      0 |00:00:00.07 |    1525 |  1024 |  1024 |          |
|*  3 |    HASH JOIN                                     |                           |      1 |   2921 | 81788 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.05 |    1524 |  1185K|  1185K|  689K (0)|
|*  4 |     TABLE ACCESS FULL                            | COUNTRIES                 |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|   5 |     TABLE ACCESS FULL                            | CUSTOMERS                 |      1 |  55500 |   541K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |       |       |          |
|   6 |   SORT GROUP BY                                  |                           |      1 |      5 |   435 |   789   (1)| 00:00:01 |       |       |      5 |00:00:05.01 |     302K|  2048 |  2048 | 2048  (0)|
|*  7 |    HASH JOIN                                     |                           |      1 |  17849 |  1516K|   787   (1)| 00:00:01 |       |       |    237K|00:00:04.83 |     302K|  2290K|  1666K| 2123K (0)|
|   8 |     TABLE ACCESS FULL                            | SYS_TEMP_0FD9D7231_A4BC21 |      1 |   2921 | 43815 |     4   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
|*  9 |     HASH JOIN                                    |                           |      1 |  17849 |  1255K|   782   (1)| 00:00:01 |       |       |    237K|00:00:04.74 |     302K|  1744K|  1744K| 1704K (0)|
|  10 |      TABLE ACCESS FULL                           | TIMES                     |      1 |   1826 | 21912 |    16   (0)| 00:00:01 |       |       |   1826 |00:00:00.01 |      55 |       |       |          |
|* 11 |      HASH JOIN                                   |                           |      1 |  17849 |  1045K|   766   (1)| 00:00:01 |       |       |    237K|00:00:04.68 |     302K|  1298K|  1298K| 1658K (0)|
|* 12 |       TABLE ACCESS FULL                          | PRODUCTS2                 |      1 |    102 |  2856 |     5   (0)| 00:00:01 |       |       |    102 |00:00:00.01 |      14 |       |       |          |
|  13 |       VIEW                                       | VW_ST_F7251F01            |      1 |  17937 |   560K|   761   (1)| 00:00:01 |       |       |    237K|00:00:04.18 |     302K|       |       |          |
|  14 |        NESTED LOOPS                              |                           |      1 |  17937 |   963K|   733   (1)| 00:00:01 |       |       |    237K|00:00:03.99 |     302K|       |       |          |
|  15 |         PARTITION RANGE ALL                      |                           |      1 |  17936 |   402K|   395   (1)| 00:00:01 |     1 |    28 |    237K|00:00:02.76 |     302K|       |       |          |
|  16 |          BITMAP CONVERSION TO ROWIDS             |                           |     28 |  17936 |   402K|   395   (1)| 00:00:01 |       |       |    237K|00:00:02.60 |     302K|       |       |          |
|  17 |           BITMAP AND                             |                           |     28 |        |       |            |          |       |       |     16 |00:00:02.49 |     302K|       |       |          |
|  18 |            BITMAP MERGE                          |                           |     28 |        |       |            |          |       |       |     16 |00:00:00.03 |    1700 |  1024K|   512K| 9216  (0)|
|  19 |             BITMAP KEY ITERATION                 |                           |     28 |        |       |            |          |       |       |    399 |00:00:00.02 |    1700 |       |       |          |
|  20 |              BUFFER SORT                         |                           |     28 |        |       |            |          |       |       |   2856 |00:00:00.01 |       9 | 73728 | 73728 |          |
|* 21 |               TABLE ACCESS BY INDEX ROWID BATCHED| PRODUCTS2                 |      1 |    102 |  1020 |    24   (0)| 00:00:01 |       |       |    102 |00:00:00.01 |       9 |       |       |          |
|  22 |                BITMAP CONVERSION TO ROWIDS       |                           |      1 |        |       |            |          |       |       |    275 |00:00:00.01 |       1 |       |       |          |
|  23 |                 BITMAP INDEX FULL SCAN           | PRODUCTS2_PROD_STATUS_BIX |      1 |        |       |            |          |       |       |      1 |00:00:00.01 |       1 |       |       |          |
|* 24 |              BITMAP INDEX RANGE SCAN             | SALES2_PROD_BIX           |   2856 |        |       |            |          |     1 |    28 |    399 |00:00:00.01 |    1691 |       |       |          |
|  25 |            BITMAP MERGE                          |                           |     28 |        |       |            |          |       |       |     16 |00:00:02.46 |     300K|  8400K|  1050K|  312K (0)|
|  26 |             BITMAP KEY ITERATION                 |                           |     28 |        |       |            |          |       |       |  19192 |00:00:02.36 |     300K|       |       |          |
|  27 |              BUFFER SORT                         |                           |     28 |        |       |            |          |       |       |    296K|00:00:00.33 |       0 |    26M|  1871K|  865K (0)|
|  28 |               TABLE ACCESS FULL                  | SYS_TEMP_0FD9D7231_A4BC21 |      5 |   2921 | 14605 |     4   (0)| 00:00:01 |       |       |  18520 |00:00:00.01 |       0 |       |       |          |
|* 29 |              BITMAP INDEX RANGE SCAN             | SALES2_CUST_BIX           |    296K|        |       |            |          |     1 |    28 |  19192 |00:00:01.63 |     300K|       |       |          |
|  30 |         TABLE ACCESS BY USER ROWID               | SALES2                    |    237K|      1 |    32 |   366   (1)| 00:00:01 | ROWID | ROWID |    237K|00:00:00.76 |     830 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Predicate Information (identified by operation id):
---------------------------------------------------
…
   9 - access("ITEM_3"="T"."TIME_ID")
  11 - access("ITEM_2"="P"."PROD_ID")
…
The key differences are:
  • At line 24 of both plans. 1572 rows are returned from the SALES_PROD_BIX index because it is returning every effective dated software product, but only 399 rows are returned from SALES2_PROD_BID because the PROD_ID is different for every fiscal year.
  • At line 11, hashing the BAD_PRODUCTS table took 1.07s because it was necessary to evaluate the inequality conditions, whereas it only took 0.50s to hash PRODUCTS2 with a simple equality join. 
  • The optimizer cost with PRODUCTS2 is much lower, and the overall response time is 5.08s instead of 6.60s.

Without Query Transformation 

If I repeat the test without star transformation enabled, I again see differences in the hash joining of the product dimensions. First on BAD_PRODUCTS
Plan hash value: 2324880021

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name         | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |              |      1 |        |       |  1592 (100)|          |       |       |      5 |00:00:01.94 |    5805 |   4128 |       |       |          |
|   1 |  SORT GROUP BY                  |              |      1 |     11 |  1144 |  1592   (3)| 00:00:01 |       |       |      5 |00:00:01.94 |    5805 |   4128 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN                     |              |      1 |   9845 |   999K|  1590   (3)| 00:00:01 |       |       |    237K|00:00:01.74 |    5805 |   4128 |  1744K|  1744K| 1620K (0)|
|   3 |    PART JOIN FILTER CREATE      | :BF0000      |      1 |   1826 | 21912 |    16   (0)| 00:00:01 |       |       |   1826 |00:00:00.01 |      55 |      0 |       |       |          |
|   4 |     TABLE ACCESS FULL           | TIMES        |      1 |   1826 | 21912 |    16   (0)| 00:00:01 |       |       |   1826 |00:00:00.01 |      55 |      0 |       |       |          |
|*  5 |    HASH JOIN                    |              |      1 |  26182 |  2352K|  1574   (3)| 00:00:01 |       |       |    237K|00:00:01.65 |    5700 |   4128 |  1115K|  1115K| 1390K (0)|
|*  6 |     TABLE ACCESS FULL           | BAD_PRODUCTS |      1 |    102 |  4284 |     5   (0)| 00:00:01 |       |       |    102 |00:00:00.01 |      14 |      0 |       |       |          |
|*  7 |     HASH JOIN                   |              |      1 |  48360 |  2361K|  1568   (3)| 00:00:01 |       |       |    526K|00:00:01.07 |    5685 |   4128 |  2290K|  1666K| 2216K (0)|
|*  8 |      HASH JOIN                  |              |      1 |   2921 | 81788 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.04 |    1524 |      0 |  1236K|  1236K|  771K (0)|
|*  9 |       TABLE ACCESS FULL         | COUNTRIES    |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |      0 |       |       |          |
|  10 |       TABLE ACCESS FULL         | CUSTOMERS    |      1 |  55500 |   541K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.02 |    1521 |      0 |       |       |          |
|  11 |      PARTITION RANGE JOIN-FILTER|              |      1 |    918K|    19M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    918K|00:00:00.77 |    4160 |   4128 |       |       |          |
|  12 |       TABLE ACCESS FULL         | SALES        |     20 |    918K|    19M|  1142   (3)| 00:00:01 |:BF0000|:BF0000|    918K|00:00:00.76 |    4160 |   4128 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("S"."TIME_ID"="T"."TIME_ID")
       filter((("T"."TIME_ID"<="P"."PROD_EFF_TO" OR "P"."PROD_EFF_TO" IS NULL) AND "T"."TIME_ID">="P"."PROD_EFF_FROM"))
   5 - access("S"."PROD_ID"="P"."PROD_ID")
       filter(("S"."TIME_ID">="P"."PROD_EFF_FROM" AND ("S"."TIME_ID"<="P"."PROD_EFF_TO" OR "P"."PROD_EFF_TO" IS NULL)))
And then on PRODUCTS2
Plan hash value: 1297664275

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |           |      1 |        |       |   875 (100)|          |       |       |      5 |00:00:00.92 |    3126 |       |       |          |
|   1 |  SORT GROUP BY                  |           |      1 |     11 |  1012 |   875   (4)| 00:00:01 |       |       |      5 |00:00:00.92 |    3126 |  2048 |  2048 | 2048  (0)|
|*  2 |   HASH JOIN                     |           |      1 |  17849 |  1603K|   873   (4)| 00:00:01 |       |       |    237K|00:00:00.75 |    3126 |  1744K|  1744K| 1621K (0)|
|   3 |    PART JOIN FILTER CREATE      | :BF0000   |      1 |   1826 | 21912 |    16   (0)| 00:00:01 |       |       |   1826 |00:00:00.01 |      55 |       |       |          |
|   4 |     TABLE ACCESS FULL           | TIMES     |      1 |   1826 | 21912 |    16   (0)| 00:00:01 |       |       |   1826 |00:00:00.01 |      55 |       |       |          |
|*  5 |    HASH JOIN                    |           |      1 |  17849 |  1394K|   857   (4)| 00:00:01 |       |       |    237K|00:00:00.70 |    3070 |  1298K|  1298K| 1703K (0)|
|*  6 |     TABLE ACCESS FULL           | PRODUCTS2 |      1 |    102 |  2856 |     5   (0)| 00:00:01 |       |       |    102 |00:00:00.01 |      14 |       |       |          |
|*  7 |     HASH JOIN                   |           |      1 |  48360 |  2455K|   852   (4)| 00:00:01 |       |       |    526K|00:00:00.59 |    3055 |  2290K|  1666K| 2216K (0)|
|*  8 |      HASH JOIN                  |           |      1 |   2921 | 81788 |   418   (1)| 00:00:01 |       |       |  18520 |00:00:00.04 |    1524 |  1236K|  1236K|  774K (0)|
|*  9 |       TABLE ACCESS FULL         | COUNTRIES |      1 |      1 |    18 |     2   (0)| 00:00:01 |       |       |      1 |00:00:00.01 |       2 |       |       |          |
|  10 |       TABLE ACCESS FULL         | CUSTOMERS |      1 |  55500 |   541K|   416   (1)| 00:00:01 |       |       |  55500 |00:00:00.03 |    1521 |       |       |          |
|  11 |      PARTITION RANGE JOIN-FILTER|           |      1 |    918K|    21M|   426   (6)| 00:00:01 |:BF0000|:BF0000|    918K|00:00:00.35 |    1530 |       |       |          |
|  12 |       TABLE ACCESS FULL         | SALES2    |     20 |    918K|    21M|   426   (6)| 00:00:01 |:BF0000|:BF0000|    918K|00:00:00.35 |    1530 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  • One difference is where we hash join the product dimension. BAD_PRODUCTS takes 0.58s, PRODUCTS2 takes just 0.11s.
  • The Bloom filter on SALES2 takes just 0.35s, while the filter on SALES takes 0.76s and requires fewer buffers because the PROD_ID equality join is more selective.

Summary 

  • Effective dating leads to inequality conditions between fact and dimension tables.
    • Can't express inequalities in foreign key – it has to be an equality join to primary key/unique key.
    • It is impossible to achieve foreign key join elimination on these dimensions.
    • You have to do more work evaluating the inequality conditions when you eventually hash join these dimensions into the result set. 
    • Bloom Filters only work with equality predicates. They rough match data, but there can be false positive results. The subsequent hash join does an exact match on a reduced volume of data. The Bloom cannot reduce the data on the inequality conditions, so they return more data to the hash operation that then takes longer to execute.
  • Conclusion
    • You should rigorously follow the star schema principles so you can always define a foreign key between facts and dimensions.

Tuesday, November 13, 2018

Data Warehouse Design Mistakes 1: Lack of Foreign Key Constraints

This post is part of a series that discusses some common issues in data warehouses.

What is the Purpose of Foreign Keys?

  • In the context of relational databases, a foreign key is a column, or group of columns, on one table that uniquely identifies a row in another table.  So, a foreign key on a child record uniquely identifies a row on the parent table.
  • If a foreign key constraint is enforced (which they are when the foreign key constraint is enabled, which they are by default) then you cannot insert a child for which you cannot find a parent, and you cannot delete a parent for which children exist.  This guarantees that data is, and remains, referentially integral.
A foreign key constraint can also permit the Oracle optimizer to perform certain optimisations in executing a SQL query.  
  • If you are querying a child and parent table (or if you prefer a fact and dimension table) without referencing any attribute on the parent (or dimension), then the optimizer can rely on the foreign key constraint to know that the parent row will always be present and so omit the parent table from the query.  This is called 'foreign key join elimination', and I will discuss this in more detail below.
I think that documenting relationships between tables through foreign keys helps developers build sensible SQL code that follows the data model, performs better, and is easier to understand and therefore easier to maintain.
If you have foreign key constraints defined in an OLTP system, it is typical to see them enabled exactly because they enforce referential integrity.
In a data warehouse there is less emphasis on revalidating data, but saving effort though join elimination is important.  Unfortunately, it is all too common to see downstream data warehouses without foreign key constraints at all.  It seems to be very easy to find an excuse not to build them. While researching this blog I came across this blog posting that I thought had the ring of truth of about it.

"9 Reasons Why There Are No Foreign Keys in Your Database" (Piotr Kononow)

  1. Performance: degrades DML performance as foreign keys are validated
  2. Legacy data is not referentially integral in the first place.
  3. Full Table Reload.  Should disable, reload, and then re-enable and revalidate constraints.
  4. High-Level Framework doesn't create foreign keys.
  5. Cross-Database relations
  6. Database platform agnosticism (eg. PeopleSoft)
  7. Open for Change
  8. Lazy Architect
  9. Table relationships are not clear/revealed.
I'll add one more unacceptable excuse this list.
  • Extracting referentially integral data from an OLTP system into the data warehouse, so there is no need for more foreign keys to revalidate it there again.

Foreign Key Join Elimination

If you are querying a child and parent table, without referencing any attribute on the parent, then the optimizer can rely on the foreign key constraint to know that the parent row will always be present and so omit the parent table from the query.  
Let's start with a very simple demonstration, on the Sales History demo schema, of a query of the fact table (SALES), and three dimension tables (PRODUCTS, TIMES, CUSTOMERS).  I am referencing attribute columns on the PRODUCT and TIMES tables, but nothing on the CUSTOMERS table other than the primary key column CUST_ID in a join predicate.

Without Foreign Key Constraints

I have disabled the foreign key constraints between these tables to model the situation without foreign key constraints.
ALTER TABLE sales MODIFY CONSTRAINT sales_channel_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_product_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_promo_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_time_fk DISABLE NOVALIDATE;
This is a query for certain products (Electronics) in a single year (1999).
SELECT  p.prod_category
,  t.fiscal_year
,  COUNT(*)
FROM  sales s
, products p
,  times t
, customers c
WHERE  s.time_id = t.time_id
AND  s.prod_id = p.prod_id
AND   t.fiscal_year = 2001
AND c.cust_id = s.cust_id
AND p.prod_category = 'Electronics'
AND p.prod_category = 'Software/Other'
GROUP BY p.prod_category
,  t.fiscal_year
ORDER BY 1
/
Throughout these tests, having set STATISTICS_LEVEL=ALL in order to collect additional runtime statistics, I have will extract execution plans, as follows
select * from table(dbms_xplan.display_cursor(null,null,'ADVANCED +ADAPTIVE PROJECTION +ALLSTATS LAST, IOSTATS'));
You can see that all 4 tables (or indexes on them) referenced in the SQL appear in the execution plan (in bold).
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                      |      1 |        |       |  1167 (100)|          |       |       |      1 |00:00:00.21 |    3386 |    472 |       |       |          |
|   1 |  SORT GROUP BY NOSORT             |                      |      1 |      1 |    44 |  1167   (3)| 00:00:01 |       |       |      1 |00:00:00.21 |    3386 |    472 |       |       |          |
|   2 |   NESTED LOOPS                    |                      |      1 |   2840 |   122K|  1167   (3)| 00:00:01 |       |       |   2840 |00:00:00.21 |    3386 |    472 |       |       |          |
|   3 |    VIEW                           | VW_GBF_35            |      1 |   2840 |   108K|  1167   (3)| 00:00:01 |       |       |   2840 |00:00:00.19 |     544 |    472 |       |       |          |
|   4 |     HASH GROUP BY                 |                      |      1 |   2840 |   138K|  1167   (3)| 00:00:01 |       |       |   2840 |00:00:00.19 |     544 |    472 |  1137K|  1137K| 1403K (0)|
|*  5 |      HASH JOIN                    |                      |      1 |  41362 |  2019K|  1163   (3)| 00:00:01 |       |       |  23678 |00:00:00.17 |     544 |    472 |  1695K|  1695K| 1571K (0)|
|   6 |       PART JOIN FILTER CREATE     | :BF0000              |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  7 |        TABLE ACCESS FULL          | TIMES                |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  8 |       HASH JOIN                   |                      |      1 |    165K|  6156K|  1146   (3)| 00:00:01 |       |       |  26637 |00:00:00.17 |     488 |    472 |  1572K|  1572K| 1390K (0)|
|*  9 |        VIEW                       | index$_join$_002     |      1 |     13 |   273 |     2   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       5 |      0 |       |       |          |
|* 10 |         HASH JOIN                 |                      |      1 |        |       |            |          |       |       |     13 |00:00:00.01 |       5 |      0 |  1355K|  1355K| 1376K (0)|
|* 11 |          INDEX RANGE SCAN         | PRODUCTS_PROD_CAT_IX |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       1 |      0 |       |       |          |
|  12 |          INDEX FAST FULL SCAN     | PRODUCTS_PK          |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     72 |00:00:00.01 |       4 |      0 |       |       |          |
|  13 |        PARTITION RANGE JOIN-FILTER|                      |      1 |    918K|    14M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.13 |     482 |    472 |       |       |          |
|  14 |         TABLE ACCESS FULL         | SALES                |      5 |    918K|    14M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.13 |     482 |    472 |       |       |          |
|* 15 |    INDEX UNIQUE SCAN              | CUSTOMERS_PK         |   2840 |      1 |     5 |     0   (0)|          |       |       |   2840 |00:00:00.01 |    2842 |      0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With Foreign Key Constraints

I have re-enabled the foreign key constraints
ALTER TABLE sales MODIFY CONSTRAINT sales_channel_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_product_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_promo_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_time_fk enable validate;
Now the CUSTOMERS table is no longer referenced due to foreign key join elimination.  The nested loop operation (at line 2 of the previous plan) to join the CUSTOMERS table is no longer required because the query does not need anything from that dimension table.  The foreign key constraint tells the optimizer that there will always be a corresponding row for every SALES record.  Thus not joining to it does not change the result of the query.  There is one less operation requiring PGA memory, and the number of buffers required, and the optimizer cost has also dropped.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                      |      1 |        |       |  1163 (100)|          |       |       |      1 |00:00:00.87 |     544 |    472 |       |       |          |
|   1 |  SORT GROUP BY NOSORT          |                      |      1 |      1 |    45 |  1163   (3)| 00:00:01 |       |       |      1 |00:00:00.87 |     544 |    472 |       |       |          |
|*  2 |   HASH JOIN                    |                      |      1 |  41362 |  1817K|  1163   (3)| 00:00:01 |       |       |  23678 |00:00:00.85 |     544 |    472 |  1695K|  1695K| 1669K (0)|
|   3 |    PART JOIN FILTER CREATE     | :BF0000              |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  4 |     TABLE ACCESS FULL          | TIMES                |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  5 |    HASH JOIN                   |                      |      1 |    165K|  5346K|  1146   (3)| 00:00:01 |       |       |  26637 |00:00:00.79 |     488 |    472 |  1572K|  1572K| 1331K (0)|
|*  6 |     VIEW                       | index$_join$_002     |      1 |     13 |   273 |     2   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       5 |      0 |       |       |          |
|*  7 |      HASH JOIN                 |                      |      1 |        |       |            |          |       |       |     13 |00:00:00.01 |       5 |      0 |  1355K|  1355K| 1377K (0)|
|*  8 |       INDEX RANGE SCAN         | PRODUCTS_PROD_CAT_IX |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       1 |      0 |       |       |          |
|   9 |       INDEX FAST FULL SCAN     | PRODUCTS_PK          |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     72 |00:00:00.01 |       4 |      0 |       |       |          |
|  10 |     PARTITION RANGE JOIN-FILTER|                      |      1 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.43 |     482 |    472 |       |       |          |
|  11 |      TABLE ACCESS FULL         | SALES                |      5 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.25 |     482 |    472 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Note
-----
   - rely constraint used for this statement
From Oracle 12.2, when a constraint has been used to remove a table from an execution plan, you get a note in the execution plan "rely constraint used for this statement".  Though it doesn't tell you which table has been eliminated, nor due to which constraint.
Admittedly, the actual run time of the query has gone up because there is more work for the optimizer to do.  However, this is an unusually small test, so it has swamped the savings in not visiting the CUSTOMERS table.  Foreign key join elimination should generally deliver a performance improvement because the query processes less data.

With Disabled Reliable Constraints

If you don't want the overhead and complexity of enforced foreign key constraints, you can still get the benefit of foreign key join elimination.  It is possible to disable the constraint from SALES to CUSTOMER so that it is not enforced, but it can still tell the optimizer that it can RELY upon the referential integrity of the data as if the constraint were enforced.
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk RELY DISABLE NOVALIDATE;
This behaviour has been available since constraints were introduced in Oracle 8 (c. 1998), but there has been a change in Oracle 12c.  Foreign key join elimination on disabled RELY constraints does not occur if QUERY_REWRITE_INTEGRITY is set to its default value of ENFORCED.  It must be set to either TRUSTED or STALE_TOLERATED.  However, bear in mind that setting it to STALE_TOLERATED also affects how materialized views can be used by query rewrite.
ALTER SESSION SET query_rewrite_integrity = TRUSTED;
And now we are back to the same plan as before without the CUSTOMERS table, and with the "rely constraint used for this statement" note.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                      |      1 |        |       |  1164 (100)|          |       |       |      1 |00:00:01.55 |     544 |    472 |       |       |          |
|   1 |  SORT GROUP BY NOSORT          |                      |      1 |      1 |    45 |  1164   (3)| 00:00:01 |       |       |      1 |00:00:01.55 |     544 |    472 |       |       |          |
|*  2 |   HASH JOIN                    |                      |      1 |  81133 |  3565K|  1164   (3)| 00:00:01 |       |       |    110K|00:00:01.46 |     544 |    472 |  1476K|  1476K| 1528K (0)|
|*  3 |    VIEW                        | index$_join$_002     |      1 |     26 |   546 |     2   (0)| 00:00:01 |       |       |     26 |00:00:00.01 |       5 |      0 |       |       |          |
|*  4 |     HASH JOIN                  |                      |      1 |        |       |            |          |       |       |     26 |00:00:00.01 |       5 |      0 |  1298K|  1298K| 1612K (0)|
|*  5 |      INDEX RANGE SCAN          | PRODUCTS_PROD_CAT_IX |      1 |     26 |   546 |     1   (0)| 00:00:01 |       |       |     26 |00:00:00.01 |       1 |      0 |       |       |          |
|   6 |      INDEX FAST FULL SCAN      | PRODUCTS_PK          |      1 |     26 |   546 |     1   (0)| 00:00:01 |       |       |     72 |00:00:00.01 |       4 |      0 |       |       |          |
|*  7 |    HASH JOIN                   |                      |      1 |    229K|  5369K|  1160   (3)| 00:00:01 |       |       |    246K|00:00:01.00 |     538 |    472 |  1695K|  1695K| 1683K (0)|
|   8 |     PART JOIN FILTER CREATE    | :BF0000              |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  9 |      TABLE ACCESS FULL         | TIMES                |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|  10 |     PARTITION RANGE JOIN-FILTER|                      |      1 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.47 |     482 |    472 |       |       |          |
|  11 |      TABLE ACCESS FULL         | SALES                |      5 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.28 |     482 |    472 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
…
Note
-----
   - rely constraint used for this statement

Multi-Column Foreign Key Join Elimination 

It has always been possible to create foreign keys on multiple columns, sometimes called composite foreign keys.  From Oracle 12.2, it is possible to get join elimination on composite foreign keys. Jonathan Lewis has published a blog with demonstration scripts.
ALTER TABLE child 
add constraint child_fk_parent foreign key (id_g, id_p) 
references parent (id_g, id);
However, he has also found a case where join elimination depends on the order of tables in the from clause – (currently unpublished) bug 22228669.  The workaround is to list parents in the from clause before their children.
The need for single column keys inevitably leads to meaningless keys, often generated from a sequence.  Another of Jonathan's blogs and its comments discusses the pros and cons. 
Updated 19th November: It depends how important foreign key join elimination is to you.  How much use a system makes of this feature will depend upon how it is written.  If you think that you may need to use this feature then, at least until this bug is resolved, I would stick with single column keys in Oracle 12.2.

Summary of Good Practice

  • Primary keys (with unique indexes) on all tables.
  • Define foreign key constraints on all dimension columns of fact tables referencing the primary or unique keys on the dimension tables through equality joins only. 
    • I think that even from Oracle 12.2 I would still avoid creating multi-column primary and foreign keys because there appears to be a bug with this feature.
  • If foreign key constraints are to be enforced, then the key columns should also be indexed avoid TM locking during DML operations.
  • Otherwise, in order to achieve join elimination mark the constraints as reliable:
  • ALTER TABLE … MODIFY CONSTRAINT … RELY NOVALIDATE DISABLE
    • From Oracle 12 you must also set QUERY_REWRITE_INTEGRITY to TRUSTED or STALE_TOLERATED.

Sunday, November 11, 2018

How Not to Build A(n Autonomous) Data Warehouse

My day job involves investigating and resolving performance problems, so I get to see a lot of bad stuff.  Often, these problems have their roots in poor design.  It is not surprising. but is nonetheless disappointing, that when I point this out I am told that the system is either delivered this way by the vendor, or it has already been built and it is too late to change.
In the last couple of years, I have worked on several data warehouse applications that have provided the inspiration for a new presentation that I am giving at the DOAG and UKOUG conferences this year.
The presentation and this series of related blogs have several objectives:
  • Partly, it is an attempt to get some advice out in the hope that some of these poor design decisions are not made in the first place. 
  • I get to have a bit of a rant about how to design a data warehouse properly, and I will feel better at the end it.
  • I get to attend two really good conferences where I will learn (or be reminded of) much.
This particular blog post is an index to various topics within the presentation, that I have turned into separate blog postings.

Oracle Sales History (SH) Sample Schema

I cannot illustrate issues with actual examples from client systems, no matter how well anonymised they are. So, I have recreated some of the most egregious mistakes using the Oracle Sales History example schema, SH.  Everything I am going to show you does have an origin in the real world!  The SH sample schema is a simple and very typical star schema.  It is well designed and implemented, and represents good practice. I encourage you to install and play with it for yourself.
My examples use a single fact table (SALES), that has five dimensions around it (CUSTOMERS, CHANNELS, PRODUCTS, PROMOTIONS and TIMES), although I will only use three. COUNTRIES is a dimension on the CUSTOMERS dimension, sometimes called a 'snowflake'.
In some of my examples, I have deliberately broken the data model in exactly the same way that I have seen it broken in real life. So, if you find yourself saying "nobody would actually do that!", let me assure you that they did!

Common Mistakes in Data Warehouse Design and Build

As each blog post is published, I will add a link to them in this section.

Tuesday, August 21, 2018

Parallel Execution of PL/SQL

A recent experience with text searching showed up some limitations on parallel execution of PL/SQL functions, and how to work around them.

Text Searching can be CPU Intensive?

It started with a requirement to find names that roughly matched a search string. The roughness of the match was determined by calculating the Levenshtein distance between each customer name and the search string. Informally, the Levenshtein distance is a mathematical function that calculates the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. Oracle has implemented it as a function in a delivered package UTL_MATCH.EDIT_DISTANCE().  This function isn't directly relevant to the rest of the discussion about parallelism, except that it is an example of CPU intensive processing that doesn't alter the database where parallel execution may be a legitimate tactic for improving performance.
The examples in this article use the SH schema in the Oracle sample schemas (available on Github).

The following query searches case-insensitively for customer names within a Levenshtein distance of 3 from 'Stephen'. It finds 'Steven' and 'Staphany'.
set autotrace on timi on pages 99 lines 200
with x as (
select c.cust_first_name
,      utl_match.edit_distance(upper(cust_first_name),'STEPHEN') edit_distance
from customers c
)
select * from x
where edit_distance <= 3
/

CUST_FIRST_NAME      EDIT_DISTANCE
-------------------- -------------
Staphany                         3
Steven                           2
Steven                           2
However, to do so Oracle had to full scan the table and execute the function for every row in the table.
Plan hash value: 2008213504
-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  2775 | 19425 |   429   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  2775 | 19425 |   429   (4)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("UTL_MATCH"."EDIT_DISTANCE"(UPPER("CUST_FIRST_NAME"),'STEPHEN')<=3)
Oracle implemented Levenshtein as a C function that is then called from PL/SQL, so it just consumes CPU and doesn't do anything else to the database. You can start to see why the user complained that the query with UTL_MATCH was slow.
However, the first question is how much time do we spend on the full scan and how much time do we spend executing this function?

Follow the Time with Instrumentation 

For test purposes, I am going to build my own packaged functions with session instrumentation. Then I can use Active Session History (ASH) to work out where the time went.
NB: ASH requires a licence for the Diagnostics Pack
  • One function levenshtein() calls to UTL_MATCH.EDIT_DISTANCE().
  • The other dolittle() is a control function that does nothing except instrumentation. It is used to measure the intrusion effect of the instrumentation.
CREATE OR REPLACE package dmk AS
function levenshtein(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER;
function dolittle(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER;
END;
/

CREATE OR REPLACE package body dmk AS
function levenshtein(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER IS
  l_distance INTEGER;
  l_module VARCHAR2(64);
  l_action VARCHAR2(64);
BEGIN
  dbms_application_info.read_module(l_module,l_action); /*save current module/action*/
  dbms_application_info.set_action('levenshtein()'); /*set action for function*/
  l_distance := utl_match.edit_distance(UPPER(p1),UPPER(p2));
  dbms_application_info.set_action(l_action); /*restore previous action*/
  RETURN l_distance;
END levenshtein;

function dolittle(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER IS
  l_distance INTEGER;
  l_module VARCHAR2(64);
  l_action VARCHAR2(64);
BEGIN
  dbms_application_info.read_module(l_module,l_action);
  dbms_application_info.set_action('dolittle()');
  l_distance := 1;
  dbms_application_info.set_action(l_action);
  RETURN l_distance;
END dolittle;
END dmk;
/
Now, I will run a test query that executes UTL_MATCH for each of the 55500 rows in the CUSTOMERS table and executes that all 10 times inside a PL/SQL block.
set timi on autotrace on lines 500 serveroutput on
DECLARE
  l_counter INTEGER := 0;
BEGIN
  dbms_application_info.set_module('DMK LEVENSHTEIN TEST',null);
  FOR j IN 1..10 LOOP
    FOR i IN (
      select sum(dmk.levenshtein(cust_first_name,'STEPHEN')) a
      ,      sum(dmk.dolittle(cust_first_name,'STEPHEN')) b
      from   customers c
    ) LOOP
      l_counter := l_counter+i.b;
    END LOOP;
  END LOOP;
  dbms_application_info.set_module(null,null);
  dbms_output.put_line('Executions: '||l_counter);
END;
/
Executions: 555000

PL/SQL procedure successfully completed.

Elapsed: 00:00:25.08
We can see that the Levenshtein function was executed 555000 times.  Now I will query the ASH for this test and group it by ACTION.
set timi on autotrace off lines 500
column module format a25
column action format a25
select module, action, sql_id, sum(1) ash_Secs
from v$active_Session_History
where module = 'DMK LEVENSHTEIN TEST'
and sample_time >= SYSDATE-1/1440
group by module, action, sql_id
/

MODULE                    ACTION                    SQL_ID          ASH_SECS
------------------------- ------------------------- ------------- ----------
DMK LEVENSHTEIN TEST      dolittle()                7gp0w6qdvxrd2          2
DMK LEVENSHTEIN TEST      levenshtein()             7gp0w6qdvxrd2          3
DMK LEVENSHTEIN TEST                                7gp0w6qdvxrd2         20
The runtime of the Levenshtein function took 3 seconds, and the function that does nothing except instrumentation is 2 seconds, so the overhead of UTL_MATCH is only about 1 second, and there are 20 seconds in the SQL. In this test, the overhead of Levenshtein is low, but it would still be worth doing the full scan in parallel.

I Can't Get No Parallelism 

But there is a problem! I specified parallelism with a hint, but I don't get a parallel plan.
select /*+PARALLEL(C 4)*/ avg(utl_match.edit_distance(UPPER(cust_first_name),'STEPHEN'))
from customers c
/

Plan hash value: 1978308596
---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |     1 |     7 |   115   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE     |           |     1 |     7 |            |          |
|   2 |   SORT AGGREGATE    |           |     1 |     7 |            |          |
|   3 |    TABLE ACCESS FULL| CUSTOMERS | 55500 |   379K|   115   (0)| 00:00:01 |
---------------------------------------------------------------------------------
However, if I use a simple SQL function, then I do get parallelism.
select /*+PARALLEL(C 4)*/ max(cust_first_name)
from customers c
/

Plan hash value: 1221513835
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |           |     1 |     7 |   115   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |           |     1 |     7 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |           |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000  |     1 |     7 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |           |     1 |     7 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |           | 55500 |   379K|   115   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| CUSTOMERS | 55500 |   379K|   115   (0)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 4 because of table property
So, UTL_MATCH.EDIT_FUNCTION must be preventing parallelism in some way. There is a closed Oracle bug 169587070 that details exactly this problem with UTL_MATCH.

PARALLEL_ENABLE and PRAGMA RESTRICT_REFERENCES 

This is also documented default behaviour in PL/SQL. Database VLDB and Partitioning Guide: About Parallel Execution of Functions, Functions in Parallel Queries:
"A user-written function may be executed in parallel in any of the following cases: 
  • If it has been declared with the PARALLEL_ENABLE keyword
  • If it is declared in a package or type and has a PRAGMA RESTRICT_REFERENCES clause that indicates all of WNDS, RNPS, and WNPS 
  • If it is declared with CREATE FUNCTION and the system can analyze the body of the PL/SQL code and determine that the code neither writes to the database nor reads or modifies package variables" 
Oracle only wraps the bodies of their delivered package, but not the package specifications. So, you can see for yourself that the delivered UTL_MATCH package does not contain these specifications.

Workarounds 

There are various workarounds for this.
  • I could add the PARALLEL_ENABLE declarations to the functions in the package specification of the delivered UTL_MATCH package. Although it does work, I would certainly not be happy to alter any delivered Oracle package in any serious database, without approval from Oracle support.
  • Or, I could add the RESTRICT_REFERENCES pragma to the package specification. Again, although this works, it involves altering a delivered package.
  • However, I can wrap the delivered package in my own packaged function with either PARALLEL_ENABLE (my personal preferrance).
CREATE OR REPLACE package dmk AS
FUNCTION levenshtein(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER PARALLEL_ENABLE;
FUNCTION dolittle(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER PARALLEL_ENABLE;
END;
/
  • Or you can use RESTRICT_REFERENCES in the package specification, but you must include the TRUST pragma to over-ride the lack of a pragma definition in the called packaged function.
CREATE OR REPLACE package dmk AS
FUNCTION levenshtein(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER;
PRAGMA RESTRICT_REFERENCES (levenshtein,WNDS,RNDS,WNPS,RNPS,TRUST);
FUNCTION dolittle(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER;
PRAGMA RESTRICT_REFERENCES (do_little,WNDS,RNDS,WNPS,RNPS,TRUST);
END;
/
Now, I get parallel execution of the unaltered delivered UTL_MATCH.EDIT_DISTANCE() function.
select /*+PARALLEL(C 4) FULL(C)*/ sum(dmk.levenshtein(cust_first_name,'STEPHEN')) a
,      sum(dmk.dolittle(cust_first_name,'STEPHEN')) b
from   customers c
/

Plan hash value: 1221513835
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |           |     1 |     7 |   115   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |           |     1 |     7 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |           |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000  |     1 |     7 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |           |     1 |     7 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |           | 55500 |   379K|   115   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| CUSTOMERS | 55500 |   379K|   115   (0)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 4 because of table property

Conclusion: Parallel PL/SQL 

I can now repeat the earlier test, but with a parallel hint, the runtime goes down from 25 to 9 seconds, although about the same amount of database time is recorded by ASH. So parallelism can improve the performance for the end user, but will not reduce the total CPU overhead. If anything is likely to increase overall CPU time.
set timi on lines 500
DECLARE
  l_counter INTEGER := 0;
BEGIN
  dbms_application_info.set_module('DMK LEVENSHTEIN TEST',null);
  FOR j IN 1..10 LOOP
    FOR i IN (
      select /*+PARALLEL(C 4)*/
             sum(dmk.levenshtein(cust_first_name,'STEPHEN')) a
      ,      sum(dmk.dolittle(cust_first_name,'STEPHEN')) b
      from   customers c
    ) LOOP
      l_counter := l_counter+i.b;
    END LOOP;
  END LOOP;
  dbms_application_info.set_module(null,null);
  dbms_output.put_line('Executions: '||l_counter);
END;
/
Executions: 555000

PL/SQL procedure successfully completed.

Elapsed: 00:00:09.34

set timi on autotrace off lines 500
column module format a32
column action format a32
select module, action, sql_id, sum(1) ash_Secs
from v$active_Session_History
where module = 'DMK LEVENSHTEIN TEST'
and sample_time >= SYSDATE-1/1440
group by module, action, sql_id
/

MODULE                           ACTION                           SQL_ID          ASH_SECS
-------------------------------- -------------------------------- ------------- ----------
DMK LEVENSHTEIN TEST                                              3391gqpzu5g2k         21
DMK LEVENSHTEIN TEST             levenshtein()                    3391gqpzu5g2k          5
DMK LEVENSHTEIN TEST             dolittle()                       3391gqpzu5g2k          4
NB: I have not been able to get parallelism to work in a PL/SQL function defined in a WITH clause because you cannot specify PARALLEL_ENABLE, and a pragma can only be specified in a package specification.
WITH
  FUNCTION levenshtein(p1 VARCHAR2, p2 VARCHAR2) RETURN INTEGER PARALLEL_ENABLE IS
  BEGIN
    RETURN utl_match.edit_distance(UPPER(p1),UPPER(p2));
  END;
select /*+PARALLEL(C 4)*/
       sum(levenshtein(cust_first_name,'STEPHEN')) a
from   customers c
/
ORA-06553: PLS-712: illegal option for subprogram LEVENSHTEIN