Tuesday, November 13, 2018

Data Warehouse Design Mistakes 1: Lack of Foreign Key Constraints

This post is part of a series that discusses some common issues in data warehouses.

What is the Purpose of Foreign Keys?

  • In the context of relational databases, a foreign key is a column, or group of columns, on one table that uniquely identifies a row in another table.  So, a foreign key on a child record uniquely identifies a row on the parent table.
  • If a foreign key constraint is enforced (which they are when the foreign key constraint is enabled, which they are by default) then you cannot insert a child for which you cannot find a parent, and you cannot delete a parent for which children exist.  This guarantees that data is, and remains, referentially integral.
A foreign key constraint can also permit the Oracle optimizer to perform certain optimisations in executing a SQL query.  
  • If you are querying a child and parent table (or if you prefer a fact and dimension table) without referencing any attribute on the parent (or dimension), then the optimizer can rely on the foreign key constraint to know that the parent row will always be present and so omit the parent table from the query.  This is called 'foreign key join elimination', and I will discuss this in more detail below.
I think that documenting relationships between tables through foreign keys helps developers build sensible SQL code that follows the data model, performs better, and is easier to understand and therefore easier to maintain.
If you have foreign key constraints defined in an OLTP system, it is typical to see them enabled exactly because they enforce referential integrity.
In a data warehouse there is less emphasis on revalidating data, but saving effort though join elimination is important.  Unfortunately, it is all too common to see downstream data warehouses without foreign key constraints at all.  It seems to be very easy to find an excuse not to build them. While researching this blog I came across this blog posting that I thought had the ring of truth of about it.

"9 Reasons Why There Are No Foreign Keys in Your Database" (Piotr Kononow)

  1. Performance: degrades DML performance as foreign keys are validated
  2. Legacy data is not referentially integral in the first place.
  3. Full Table Reload.  Should disable, reload, and then re-enable and revalidate constraints.
  4. High-Level Framework doesn't create foreign keys.
  5. Cross-Database relations
  6. Database platform agnosticism (eg. PeopleSoft)
  7. Open for Change
  8. Lazy Architect
  9. Table relationships are not clear/revealed.
I'll add one more unacceptable excuse this list.
  • Extracting referentially integral data from an OLTP system into the data warehouse, so there is no need for more foreign keys to revalidate it there again.

Foreign Key Join Elimination

If you are querying a child and parent table, without referencing any attribute on the parent, then the optimizer can rely on the foreign key constraint to know that the parent row will always be present and so omit the parent table from the query.  
Let's start with a very simple demonstration, on the Sales History demo schema, of a query of the fact table (SALES), and three dimension tables (PRODUCTS, TIMES, CUSTOMERS).  I am referencing attribute columns on the PRODUCT and TIMES tables, but nothing on the CUSTOMERS table other than the primary key column CUST_ID in a join predicate.

Without Foreign Key Constraints

I have disabled the foreign key constraints between these tables to model the situation without foreign key constraints.
ALTER TABLE sales MODIFY CONSTRAINT sales_channel_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_product_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_promo_fk DISABLE NOVALIDATE;
ALTER TABLE sales MODIFY CONSTRAINT sales_time_fk DISABLE NOVALIDATE;
This is a query for certain products (Electronics) in a single year (1999).
SELECT  p.prod_category
,  t.fiscal_year
,  COUNT(*)
FROM  sales s
, products p
,  times t
, customers c
WHERE  s.time_id = t.time_id
AND  s.prod_id = p.prod_id
AND   t.fiscal_year = 2001
AND c.cust_id = s.cust_id
AND p.prod_category = 'Electronics'
AND p.prod_category = 'Software/Other'
GROUP BY p.prod_category
,  t.fiscal_year
ORDER BY 1
/
Throughout these tests, having set STATISTICS_LEVEL=ALL in order to collect additional runtime statistics, I have will extract execution plans, as follows
select * from table(dbms_xplan.display_cursor(null,null,'ADVANCED +ADAPTIVE PROJECTION +ALLSTATS LAST, IOSTATS'));
You can see that all 4 tables (or indexes on them) referenced in the SQL appear in the execution plan (in bold).
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                      |      1 |        |       |  1167 (100)|          |       |       |      1 |00:00:00.21 |    3386 |    472 |       |       |          |
|   1 |  SORT GROUP BY NOSORT             |                      |      1 |      1 |    44 |  1167   (3)| 00:00:01 |       |       |      1 |00:00:00.21 |    3386 |    472 |       |       |          |
|   2 |   NESTED LOOPS                    |                      |      1 |   2840 |   122K|  1167   (3)| 00:00:01 |       |       |   2840 |00:00:00.21 |    3386 |    472 |       |       |          |
|   3 |    VIEW                           | VW_GBF_35            |      1 |   2840 |   108K|  1167   (3)| 00:00:01 |       |       |   2840 |00:00:00.19 |     544 |    472 |       |       |          |
|   4 |     HASH GROUP BY                 |                      |      1 |   2840 |   138K|  1167   (3)| 00:00:01 |       |       |   2840 |00:00:00.19 |     544 |    472 |  1137K|  1137K| 1403K (0)|
|*  5 |      HASH JOIN                    |                      |      1 |  41362 |  2019K|  1163   (3)| 00:00:01 |       |       |  23678 |00:00:00.17 |     544 |    472 |  1695K|  1695K| 1571K (0)|
|   6 |       PART JOIN FILTER CREATE     | :BF0000              |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  7 |        TABLE ACCESS FULL          | TIMES                |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  8 |       HASH JOIN                   |                      |      1 |    165K|  6156K|  1146   (3)| 00:00:01 |       |       |  26637 |00:00:00.17 |     488 |    472 |  1572K|  1572K| 1390K (0)|
|*  9 |        VIEW                       | index$_join$_002     |      1 |     13 |   273 |     2   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       5 |      0 |       |       |          |
|* 10 |         HASH JOIN                 |                      |      1 |        |       |            |          |       |       |     13 |00:00:00.01 |       5 |      0 |  1355K|  1355K| 1376K (0)|
|* 11 |          INDEX RANGE SCAN         | PRODUCTS_PROD_CAT_IX |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       1 |      0 |       |       |          |
|  12 |          INDEX FAST FULL SCAN     | PRODUCTS_PK          |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     72 |00:00:00.01 |       4 |      0 |       |       |          |
|  13 |        PARTITION RANGE JOIN-FILTER|                      |      1 |    918K|    14M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.13 |     482 |    472 |       |       |          |
|  14 |         TABLE ACCESS FULL         | SALES                |      5 |    918K|    14M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.13 |     482 |    472 |       |       |          |
|* 15 |    INDEX UNIQUE SCAN              | CUSTOMERS_PK         |   2840 |      1 |     5 |     0   (0)|          |       |       |   2840 |00:00:00.01 |    2842 |      0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With Foreign Key Constraints

I have re-enabled the foreign key constraints
ALTER TABLE sales MODIFY CONSTRAINT sales_channel_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_product_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_promo_fk enable validate;
ALTER TABLE sales MODIFY CONSTRAINT sales_time_fk enable validate;
Now the CUSTOMERS table is no longer referenced due to foreign key join elimination.  The nested loop operation (at line 2 of the previous plan) to join the CUSTOMERS table is no longer required because the query does not need anything from that dimension table.  The foreign key constraint tells the optimizer that there will always be a corresponding row for every SALES record.  Thus not joining to it does not change the result of the query.  There is one less operation requiring PGA memory, and the number of buffers required, and the optimizer cost has also dropped.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                      |      1 |        |       |  1163 (100)|          |       |       |      1 |00:00:00.87 |     544 |    472 |       |       |          |
|   1 |  SORT GROUP BY NOSORT          |                      |      1 |      1 |    45 |  1163   (3)| 00:00:01 |       |       |      1 |00:00:00.87 |     544 |    472 |       |       |          |
|*  2 |   HASH JOIN                    |                      |      1 |  41362 |  1817K|  1163   (3)| 00:00:01 |       |       |  23678 |00:00:00.85 |     544 |    472 |  1695K|  1695K| 1669K (0)|
|   3 |    PART JOIN FILTER CREATE     | :BF0000              |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  4 |     TABLE ACCESS FULL          | TIMES                |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  5 |    HASH JOIN                   |                      |      1 |    165K|  5346K|  1146   (3)| 00:00:01 |       |       |  26637 |00:00:00.79 |     488 |    472 |  1572K|  1572K| 1331K (0)|
|*  6 |     VIEW                       | index$_join$_002     |      1 |     13 |   273 |     2   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       5 |      0 |       |       |          |
|*  7 |      HASH JOIN                 |                      |      1 |        |       |            |          |       |       |     13 |00:00:00.01 |       5 |      0 |  1355K|  1355K| 1377K (0)|
|*  8 |       INDEX RANGE SCAN         | PRODUCTS_PROD_CAT_IX |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     13 |00:00:00.01 |       1 |      0 |       |       |          |
|   9 |       INDEX FAST FULL SCAN     | PRODUCTS_PK          |      1 |     13 |   273 |     1   (0)| 00:00:01 |       |       |     72 |00:00:00.01 |       4 |      0 |       |       |          |
|  10 |     PARTITION RANGE JOIN-FILTER|                      |      1 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.43 |     482 |    472 |       |       |          |
|  11 |      TABLE ACCESS FULL         | SALES                |      5 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.25 |     482 |    472 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
…
Note
-----
   - rely constraint used for this statement
From Oracle 12.2, when a constraint has been used to remove a table from an execution plan, you get a note in the execution plan "rely constraint used for this statement".  Though it doesn't tell you which table has been eliminated, nor due to which constraint.
Admittedly, the actual run time of the query has gone up because there is more work for the optimizer to do.  However, this is an unusually small test, so it has swamped the savings in not visiting the CUSTOMERS table.  Foreign key join elimination should generally deliver a performance improvement because the query processes less data.

With Disabled Reliable Constraints

If you don't want the overhead and complexity of enforced foreign key constraints, you can still get the benefit of foreign key join elimination.  It is possible to disable the constraint from SALES to CUSTOMER so that it is not enforced, but it can still tell the optimizer that it can RELY upon the referential integrity of the data as if the constraint were enforced.
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk RELY DISABLE NOVALIDATE;
This behaviour has been available since constraints were introduced in Oracle 8 (c. 1998), but there has been a change in Oracle 12c.  Foreign key join elimination on disabled RELY constraints does not occur if QUERY_REWRITE_INTEGRITY is set to its default value of ENFORCED.  It must be set to either TRUSTED or STALE_TOLERATED.  However, bear in mind that setting it to STALE_TOLERATED also affects how materialized views can be used by query rewrite.
ALTER SESSION SET query_rewrite_integrity = TRUSTED;
And now we are back to the same plan as before without the CUSTOMERS table, and with the "rely constraint used for this statement" note.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                 | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                      |      1 |        |       |  1164 (100)|          |       |       |      1 |00:00:01.55 |     544 |    472 |       |       |          |
|   1 |  SORT GROUP BY NOSORT          |                      |      1 |      1 |    45 |  1164   (3)| 00:00:01 |       |       |      1 |00:00:01.55 |     544 |    472 |       |       |          |
|*  2 |   HASH JOIN                    |                      |      1 |  81133 |  3565K|  1164   (3)| 00:00:01 |       |       |    110K|00:00:01.46 |     544 |    472 |  1476K|  1476K| 1528K (0)|
|*  3 |    VIEW                        | index$_join$_002     |      1 |     26 |   546 |     2   (0)| 00:00:01 |       |       |     26 |00:00:00.01 |       5 |      0 |       |       |          |
|*  4 |     HASH JOIN                  |                      |      1 |        |       |            |          |       |       |     26 |00:00:00.01 |       5 |      0 |  1298K|  1298K| 1612K (0)|
|*  5 |      INDEX RANGE SCAN          | PRODUCTS_PROD_CAT_IX |      1 |     26 |   546 |     1   (0)| 00:00:01 |       |       |     26 |00:00:00.01 |       1 |      0 |       |       |          |
|   6 |      INDEX FAST FULL SCAN      | PRODUCTS_PK          |      1 |     26 |   546 |     1   (0)| 00:00:01 |       |       |     72 |00:00:00.01 |       4 |      0 |       |       |          |
|*  7 |    HASH JOIN                   |                      |      1 |    229K|  5369K|  1160   (3)| 00:00:01 |       |       |    246K|00:00:01.00 |     538 |    472 |  1695K|  1695K| 1683K (0)|
|   8 |     PART JOIN FILTER CREATE    | :BF0000              |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|*  9 |      TABLE ACCESS FULL         | TIMES                |      1 |    364 |  4368 |    16   (0)| 00:00:01 |       |       |    364 |00:00:00.01 |      55 |      0 |       |       |          |
|  10 |     PARTITION RANGE JOIN-FILTER|                      |      1 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.47 |     482 |    472 |       |       |          |
|  11 |      TABLE ACCESS FULL         | SALES                |      5 |    918K|    10M|  1136   (2)| 00:00:01 |:BF0000|:BF0000|    296K|00:00:00.28 |     482 |    472 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
…
Note
-----
   - rely constraint used for this statement

Multi-Column Foreign Key Join Elimination 

It has always been possible to create foreign keys on multiple columns, sometimes called composite foreign keys.  From Oracle 12.2, it is possible to get join elimination on composite foreign keys. Jonathan Lewis has published a blog with demonstration scripts.
ALTER TABLE child 
add constraint child_fk_parent foreign key (id_g, id_p) 
references parent (id_g, id);
However, he has also found a case where join elimination depends on the order of tables in the from clause – (currently unpublished) bug 22228669.  The workaround is to list parents in the from clause before their children.
The need for single column keys inevitably leads to meaningless keys, often generated from a sequence.  Another of Jonathan's blogs and its comments discusses the pros and cons. 
Updated 19th November: It depends how important foreign key join elimination is to you.  How much use a system makes of this feature will depend upon how it is written.  If you think that you may need to use this feature then, at least until this bug is resolved, I would stick with single column keys in Oracle 12.2.

Summary of Good Practice

  • Primary keys (with unique indexes) on all tables.
  • Define foreign key constraints on all dimension columns of fact tables referencing the primary or unique keys on the dimension tables through equality joins only. 
    • I think that even from Oracle 12.2 I would still avoid creating multi-column primary and foreign keys because there appears to be a bug with this feature.
  • If foreign key constraints are to be enforced, then the key columns should also be indexed avoid TM locking during DML operations.
  • Otherwise, in order to achieve join elimination mark the constraints as reliable:
  • ALTER TABLE … MODIFY CONSTRAINT … RELY NOVALIDATE DISABLE
    • From Oracle 12 you must also set QUERY_REWRITE_INTEGRITY to TRUSTED or STALE_TOLERATED.

2 comments :

Unknown said...

Thanks for blogging about this David.

What I'm struggling with is that I can't think of a genuine reason why you would ever be querying a parent table without referencing any attributes on it. So, in your example, the inclusion of the customer table in the FROM and the WHERE clauses isn't necessary. Am I missing something?

Patrick Hurley
Twitter: @phurley

David Kurtz said...

Thank you, Patrick. It is a thought provoking question. Yes, you would expect developers not to include unnecessary tables. However, I think it depends on how the SQL is written.
For example, if you created reporting views to pre-join the tables but didn't filter or report on one of the tables in the view then you would get join elimination.
So, I think there are cases where this feature is of value, but some systems will make more use of it than others.