Tuesday, April 06, 2021

Spatial Data 6: Text Searching Areas by their Name, and the Names of Parent Areas

This blog is part of a series about my first steps in using Spatial Data in the Oracle database.  I am using the GPS data from my cycling activities collected by Strava. All of my files are available on GitHub.

Now I have loaded all the areas, I want to be able to search for them by name.  I am going to create an Oracle Text Index, but I need to index more than just the name of each area.  I must index the full hierarchy of each area so I can search on combinations of names in different types of areas.  For example, I might search for a village and county (e.g. Streatley and Berkshire), to distinguish it from a village of the same name in a different county (e.g. Streatley in Bedfordshire).

I can generate the full hierarchy of an area with a PL/SQL function (strava_pkg.name_heirarchy_fn) by navigating up the linked list and discarding repeated names.  I could make that available in a virtual column.  However, I cannot build a text index on a function or a virtual column.  

Text Index Option 1: Store Hierarchy on Table, and Create a Multi-Column Text Index

I could store the hierarchy of an area on the my_areas table, and generate the area from PL/SQL function strava_pkg. name_heirarchy_fn.

DECLARE
  l_clob CLOB;
  l_my_areas my_areas%ROWTYPE;
BEGIN
  select m.*
  into   l_my_areas
  FROM   my_areas m
  WHERE  area_code = 'CPC'
  And    area_number = '40307';

  dbms_output.put_line(strava_pkg.name_heirarchy_fn(l_my_areas.area_code,l_my_areas.area_number));
  dbms_output.put_line(strava_pkg.name_heirarchy_fn(l_my_areas.parent_area_code,l_my_areas.parent_area_number));
END;
/

If I pass the code and number for a particular area, I can get its full hierarchy including its name.  I can see the parish of Streatley, is in the Unitary Authority of West Berkshire, which is in England, and England is in the United Kingdom.  If I pass the code and number of its parent, I just get the hierarchy up to its parent.  

Streatley, West Berkshire, England, United Kingdom
West Berkshire, England, United Kingdom

I can store the hierarchy on my_areas, though I have to store results on a temporary table, rather than update it directly.  Otherwise, I get a mutation error.

ALTER TABLE my_areas add name_heirarchy VARCHAR(4000)
/
CREATE GLOBAL TEMPORARY TABLE my_areas_temp ON COMMIT PRESERVE ROWS AS 
SELECT area_code, area_number, strava_pkg.name_heirarchy_fn(parent_area_code,parent_area_number) name_heirarchy
FROM my_areas WHERE parent_area_code IS NOT NULL AND parent_area_number IS NOT NULL
/
MERGE INTO my_areas u 
USING (SELECT * FROM my_areas_temp) s
ON (u.area_code = s.area_code AND u.area_number = s.area_number)
WHEN MATCHED THEN UPDATE
SET u.name_heirarchy = s.name_heirarchy
/

Then I can create a multi-column text index on the name

begin
 ctx_ddl.create_preference('my_areas_lexer', 'BASIC_LEXER');  
 ctx_ddl.set_attribute('my_areas_lexer', 'mixed_case', 'NO'); 
 ctx_ddl.create_preference('my_areas_datastore', 'MULTI_COLUMN_DATASTORE'); 
 ctx_ddl.set_attribute('my_areas_datastore', 'columns', 'name, name_heirarchy'); 
end;
/
CREATE INDEX my_areas_name_txtidx ON my_areas (name) INDEXTYPE IS ctxsys.context 
PARAMETERS ('datastore my_areas_datastore lexer my_areas_lexer sync(on commit)');

The index will sync if I have cause to update the hierarchy.

Text Index Option 2: Index a user_datastore based on the result of a PL/SQL function

Alternatively, I can build a text index on a combination of data from various sources by creating a PL/SQL procedure that combines the data and returns the string to be indexed.  

I have created a procedure (strava_pkg.name_heirarchy_txtidx) that returns a string containing the hierarchy of a given area, and then I will create a text index on that.  The format of the parameters must be exactly as follows: 

  • The rowid of the row being indexed is passed to the procedure; 
  • The string to be indexed is passed back as a CLOB parameter.
See also: Oracle Text Indexing Elements: USER_DATASTORE Attributes

…
PROCEDURE name_heirarchy_txtidx
(p_rowid in rowid
,p_dataout IN OUT NOCOPY CLOB
) IS
  l_count INTEGER := 0;
BEGIN
  FOR i IN (
    SELECT area_code, area_number, name, matchable
    FROM   my_areas m
    START WITH rowid = p_rowid
    CONNECT BY NOCYCLE prior m.parent_area_code   = m.area_code
                   AND prior m.parent_area_number = m.area_number
  ) LOOP
    IF i.matchable >= 1 THEN
      l_count := l_count + 1;
      IF l_count > 1 THEN
        p_dataout := p_dataout ||', '|| i.name;
      ELSE
        p_dataout := i.name;
      END IF;
    END IF;
  END LOOP;
END name_heirarchy_txtidx;
…

As an example, if I pass a particular rowid to the procedure, I obtain the full hierarchy of areas as before.

set serveroutput on
DECLARE
  l_rowid ROWID;
  l_clob CLOB;
BEGIN
  select rowid
  into   l_rowid
  FROM   my_areas m
  WHERE  area_code = 'CPC'
  And    area_number = '40307';

  strava_pkg.name_heirarchy_txtidx(l_rowid, l_clob);
  dbms_output.put_line(l_clob);
END;
/

Streatley, West Berkshire, England, United Kingdom

PL/SQL procedure successfully completed.

The procedure is referenced as an attribute to a user datastore, I can then build a text index on the user datastore.

BEGIN
  ctx_ddl.create_preference('my_areas_lexer', 'BASIC_LEXER');  
  ctx_ddl.set_attribute('my_areas_lexer', 'mixed_case', 'NO'); 
  ctx_ddl.create_preference('my_areas_datastore', 'user_datastore'); 
  ctx_ddl.set_attribute('my_areas_datastore', 'procedure', 'strava_pkg.name_heirarchy_txtidx'); 
  ctx_ddl.set_attribute('my_areas_datastore', 'output_type', 'CLOB');
END;
/

CREATE INDEX my_areas_name_txtidx on my_areas (name) INDEXTYPE IS ctxsys.context 
PARAMETERS ('datastore my_areas_datastore lexer my_areas_lexer');
I have not been able to combine a multi-column datastore with a user datastore.

Text Search examples

Both options produce an index that I can use in the same way.  I can search for a particular name, for example, the village of Streatley.

SELECT score(1), area_Code, area_number, name, suffix, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'streatley',1)>0
/

I get the two Streatleys, one in Berkshire, and the other in Bedfordshire.  

  SCORE(1) AREA AREA_NUMBER NAME                 SUFFIX     NAME_HEIRARCHY
---------- ---- ----------- -------------------- ---------- ------------------------------------------------------------
        16 CPC        41076 Streatley            CP         Streatley, Central Bedfordshire, England, United Kingdom
        16 CPC        40307 Streatley            CP         Streatley, West Berkshire, England, United Kingdom

As I have indexed the full hierarchy, I can be more precise and search for both the village and the county, even though they are two different columns in the my_areas table.

SELECT score(1), area_Code, area_number, name, suffix, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'streatley and berks%',1)>0
/

Now I just get one result.  The Streatley in Berkshire.

  SCORE(1) AREA AREA_NUMBER NAME                 SUFFIX     NAME_HEIRARCHY
---------- ---- ----------- -------------------- ---------- ------------------------------------------------------------
        11 CPC        40307 Streatley            CP         Streatley, West Berkshire, England, United Kingdom

Searching For the Top of Hierarchies

My search query works satisfactorily if my search identifies areas with no children, but supposing I search for something higher up the hierarchy, like Berkshire?  

SELECT score(1), area_Code, area_number, name, suffix, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'berkshire',1)>0
/

I get 184 areas, of different types within the areas called Berkshire, because the name of the parent area appears in the hierarchy of all the children and so is returned by the text index.

           Area        Area
  SCORE(1) Code      Number NAME                      SUFFIX     NAME_HEIRARCHY
---------- ---- ----------- ------------------------- ---------- -----------------------------------------------------------
        11 UTA       101678 Windsor and Maidenhead    (B)        Windsor and Maidenhead, Berkshire, England, United Kingdom
        11 UTA       101680 Wokingham                 (B)        Wokingham, Berkshire, England, United Kingdom
        11 UTA       101681 Reading                   (B)        Reading, Berkshire, England, United Kingdom
        11 UTA       101685 West Berkshire                       West Berkshire, England, United Kingdom
        11 UTW        40258 Norreys                   Ward       Norreys, Wokingham, Berkshire, England, United Kingdom
        11 UTW        40261 Barkham                   Ward       Barkham, Wokingham, Berkshire, England, United Kingdom
…

However, I am just interested in the highest points on each part of the hierarchy I have identified.  So, I exclude any result where its parent is also in the result set.

WITH x AS (
SELECT area_code, area_number, parent_area_code, parent_area_number, name, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'berkshire',1)>0
) SELECT * FROM x WHERE NOT EXISTS (
  SELECT 'x' FROM x x1
  WHERE  x1.area_code = x.parent_area_code
  AND    x1.area_number = x.parent_area_number
)
/

In this case, I still get two results because the boundaries of the unitary authority of West Berkshire are not entirely within the ceremonial county of Berkshire (some parts of Hungerford and Lambourne were exchanged with Wiltshire in 1990), hence I could not make Berkshire the parent of West Berkshire.

Area        Area
Code      Number      SCORE NAME_HEIRARCHY
---- ----------- ---------- ------------------------------------------------------------
UTA       101685         11 West Berkshire, England, United Kingdom
CCTY           7         11 Berkshire, England, United Kingdom

Text Searching for Activities that pass through Areas

It is a simple extension to join the pre-processed areas through which activities pass to the areas found by the text search, and then exclude areas whose parent was also found in the same activity.

WITH x AS (
SELECT aa.activity_id, m.area_code, m.area_number, m.parent_area_code, m.parent_area_number, m.name, m.name_heirarchy
FROM   my_areas m, activity_areas aa
WHERE  m.area_Code = aa.area_code
AND    m.area_number = aa.area_number
AND    CONTAINS(name,'berkshire',1)>0
) 
SELECT a.activity_id, a.activity_date, a.activity_name, a.activity_type, a.distance_km
,      x.area_Code, x.area_number, x.name, x.name_heirarchy
FROM   x, activities a
WHERE  x.activity_id = a.activity_id
AND    a.activity_date between TO_DATE('01022019','DDMMYYYY') and TO_DATE('28022019','DDMMYYYY')
AND NOT EXISTS (
  SELECT 'x' FROM x x1
  WHERE  x1.area_code = x.parent_area_code
  AND    x1.area_number = x.parent_area_number
  AND    x1.activity_id = x.activity_id)
ORDER BY a.activity_date
/

Now I can see the rides in Berkshire in February 2019.  I get two rows returned for the ride that was in both Berkshire and West Berkshire.  

  Activity Activity                                                Activity Distance Area   Area
        ID Date      ACTIVITY_NAME                                 Type         (km) Code Number NAME            NAME_HEIRARCHY
---------- --------- --------------------------------------------- -------- -------- ---- ------ --------------- -------------------------
2156308823 17-FEB-19 MV - Aldworth, CLCTC Aldworth-Reading         Ride       120.86 CCTY      7 Berkshire       England, United Kingdom
2156308823 17-FEB-19 MV - Aldworth, CLCTC Aldworth-Reading         Ride       120.86 UTA  101685 West Berkshire  England, United Kingdom
2172794879 24-FEB-19 MV - Maidenhead                               Ride        48.14 CCTY      7 Berkshire       England, United Kingdom
2173048214 24-FEB-19 CLCTC: Maidenhead - Turville Heath            Ride        53.15 CCTY      7 Berkshire       England, United Kingdom
2173048406 24-FEB-19 Maidenhead - Burnham Beeches - West Drayton   Ride        27.92 CCTY      7 Berkshire       England, United Kingdom
…

References

I found these references useful while creating the Text index: