Monday, February 04, 2019

Regular Expression Functions are Considered To Be Non-Deterministic from Oracle 12.2

You cannot put non-deterministic functions into virtual columns or function-based index.  Regular expressions have always been sensitive to National Language Support (NLS) settings because what is a character is different in different languages.  However, from Oracle 12.2. regular expressions are considered to be non-deterministic because Oracle has fixed bug 20804063.

Background 

A developer came to me with the following query, complaining it was slow.
SELECT …
FROM   …
AND    LOWER(REGEXP_REPLACE(ASCIISTR(my_column), '[^a-zA-Z0-9]')) ='2732018'
…
Obviously, functions on columns in predicates prevent the use of indexes, so I suggested creating and indexing a virtual column (slightly more elegant and almost the same as a function-based index but without needing the function in the code to match the index).
This works fine in Oracle 12.1, but they came back saying it didn't in 12.2.  Note the error messages in bold below.
DROP TABLE t PURGE
/
CREATE TABLE t AS 
SELECT TO_CHAR(sysdate-level,'YYYY/DDD') a
FROM   dual
CONNECT BY LEVEL < 1000
/

CREATE INDEX t1 on t (REGEXP_REPLACE(a,'[^a-zA-Z0-9]'))
/
ORA-01743: only pure functions can be indexed

DROP INDEX t1
/
ALTER TABLE t 
ADD b AS (REGEXP_REPLACE(a,'[^a-zA-Z0-9]')) VIRTUAL
/
ORA-54002: only pure functions can be specified in a virtual column expression
I eventually came across this question and answer by Connor McDonald on AskTom: Adding constraint with REGEXP_REPLACE fails that explains that Oracle has fixed bug 20804063 in 12.2.  This fix affects virtual columns, function-based indexes (because they also create virtual columns), and constraints.
There are a number of workarounds, but none are completely satisfactory.

Workaround 1: Deterministic Function 

As Connor suggests, you can work around this unwanted behaviour by creating your own PL/SQL function and telling the database it is deterministic. This is not simply lying to the database.  Not all regular expressions are actually NLS sensitive, it depends on what you are doing.  However, it does appear that the validation is applied regardless. If the function is NLS sensitive that might have unwanted consequences, including incorrect query results. You would have to decide whether you can live with the risk depending on what is actually in the column and function in question.
CREATE OR REPLACE FUNCTION my_regexp_replace(p1 VARCHAR2, p2 VARCHAR2) 
RETURN varchar2 
DETERMINISTIC IS
BEGIN
RETURN REGEXP_REPLACE(p1, p2);
END;
/
show error

CREATE INDEX t1 ON t (my_regexp_replace(a,'[^a-zA-Z0-9]'))
/
DROP INDEX t1
/
ALTER TABLE t 
ADD b AS (my_REGEXP_REPLACE(a,'[^a-zA-Z0-9]')) VIRTUAL
/
CREATE INDEX t1 ON t (b)
/

Workaround 2: Refresh on Commit Materialized View 

Another possible workaround would be a materialized view that refreshes on commit, with the expression creating an additional column that is then indexed.
CREATE TABLE t AS 
SELECT rownum n
,      TO_CHAR(sysdate-level,'YYYY/DDD') a
FROM  dual
CONNECT BY LEVEL < 1000
/
ALTER TABLE t ADD CONSTRAINT t_pk PRIMARY KEY (n)
/
CREATE MATERIALIZED VIEW LOG ON t
/

CREATE MATERIALIZED VIEW T1_MV 
REFRESH ON COMMIT 
FAST 
WITH PRIMARY KEY 
ENABLE QUERY REWRITE AS 
SELECT t.*
,REGEXP_REPLACE(a,'[^a-zA-Z0-9]') b
FROM t
/

CREATE INDEX t1_mv_b ON t1_mv(b);

INSERT INTO t
SELECT ROWNUM+1000 n
,      TO_CHAR(sysdate-level-1000,'YYYY/DDD') a
FROM  dual
CONNECT BY LEVEL < 1000
/
commit
/

set autotrace on
SELECT * FROM t
WHERE REGEXP_REPLACE(a,'[^a-zA-Z0-9]') like '201720%';
And Oracle does indeed rewrite the query to use the materialised view and then uses the index on the materialized view.
Plan hash value: 3543552962

----------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |         |     1 |    21 |     3   (0)| 00:00:01 |
|   1 |  MAT_VIEW REWRITE ACCESS BY INDEX ROWID BATCHED| T1_MV   |     1 |    21 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                             | T1_MV_B |     2 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
However, this solution has the same weakness of being dependent on NLS settings – it is just that there is no validation to stop you! There is the further threat that the refresh on commit could become a performance problem if there is intense DDL on the underlying table.

Workaround 3: Use a Non-NLS Dependent Function 

Ideally, it would be better to use a non-NLS dependent function instead of a REGEXP% function. However, this may be easier said than done.  Regular expressions solve problems that are hard to do with basic string handling functions and may require a PL/SQL function anyway to hold procedural code.  Although in this relatively simple example it is possible to remove the unwanted characters with the translate and replace functions. There is no NLS restriction here.
SELECT a
,      REGEXP_REPLACE(a,'[^0-9]')
,      REPLACE(TRANSLATE(a,'/',' '),' ','')
FROM   t
WHERE rownum <= 10
/

A        REGEXP_REPLACE(A,'[^0-9]')       REPLACE(TRANSLATE(A,'/',''),'','
-------- -------------------------------- --------------------------------
2019/031 2019031                          2019031
2019/030 2019030                          2019030
2019/029 2019029                          2019029
2019/028 2019028                          2019028
2019/027 2019027                          2019027
2019/026 2019026                          2019026
2019/025 2019025                          2019025
2019/024 2019024                          2019024
2019/023 2019023                          2019023
2019/022 2019022                          2019022
…
And you can put this into a function-based index or virtual column thus
DROP INDEX t1 
/ 
ALTER TABLE t ADD b AS (REPLACE(TRANSLATE(a,'/',' '),' ','')) VIRTUAL 
/ 
CREATE INDEX t1 ON t (b) 
/