Tuesday, April 06, 2021

Spatial Data 6: Text Searching Areas by their Name, and the Names of Parent Areas

This blog is part of a series about my first steps in using Spatial Data in the Oracle database.  I am using the GPS data from my cycling activities collected by Strava. All of my files are available on GitHub.

Now I have loaded all the areas, I want to be able to search for them by name.  I am going to create an Oracle Text Index, but I need to index more than just the name of each area.  I must index the full hierarchy of each area so I can search on combinations of names in different types of areas.  For example, I might search for a village and county (e.g. Streatley and Berkshire), to distinguish it from a village of the same name in a different county (e.g. Streatley in Bedfordshire).

I can generate the full hierarchy of an area with a PL/SQL function (strava_pkg.name_heirarchy_fn) by navigating up the linked list and discarding repeated names.  I could make that available in a virtual column.  However, I cannot build a text index on a function or a virtual column.  

Text Index Option 1: Store Hierarchy on Table, and Create a Multi-Column Text Index

I could store the hierarchy of an area on the my_areas table, and generate the area from PL/SQL function strava_pkg. name_heirarchy_fn.

DECLARE
  l_clob CLOB;
  l_my_areas my_areas%ROWTYPE;
BEGIN
  select m.*
  into   l_my_areas
  FROM   my_areas m
  WHERE  area_code = 'CPC'
  And    area_number = '40307';

  dbms_output.put_line(strava_pkg.name_heirarchy_fn(l_my_areas.area_code,l_my_areas.area_number));
  dbms_output.put_line(strava_pkg.name_heirarchy_fn(l_my_areas.parent_area_code,l_my_areas.parent_area_number));
END;
/

If I pass the code and number for a particular area, I can get its full hierarchy including its name.  I can see the parish of Streatley, is in the Unitary Authority of West Berkshire, which is in England, and England is in the United Kingdom.  If I pass the code and number of its parent, I just get the hierarchy up to its parent.  

Streatley, West Berkshire, England, United Kingdom
West Berkshire, England, United Kingdom

I can store the hierarchy on my_areas, though I have to store results on a temporary table, rather than update it directly.  Otherwise, I get a mutation error.

ALTER TABLE my_areas add name_heirarchy VARCHAR(4000)
/
CREATE GLOBAL TEMPORARY TABLE my_areas_temp ON COMMIT PRESERVE ROWS AS 
SELECT area_code, area_number, strava_pkg.name_heirarchy_fn(parent_area_code,parent_area_number) name_heirarchy
FROM my_areas WHERE parent_area_code IS NOT NULL AND parent_area_number IS NOT NULL
/
MERGE INTO my_areas u 
USING (SELECT * FROM my_areas_temp) s
ON (u.area_code = s.area_code AND u.area_number = s.area_number)
WHEN MATCHED THEN UPDATE
SET u.name_heirarchy = s.name_heirarchy
/

Then I can create a multi-column text index on the name

begin
 ctx_ddl.create_preference('my_areas_lexer', 'BASIC_LEXER');  
 ctx_ddl.set_attribute('my_areas_lexer', 'mixed_case', 'NO'); 
 ctx_ddl.create_preference('my_areas_datastore', 'MULTI_COLUMN_DATASTORE'); 
 ctx_ddl.set_attribute('my_areas_datastore', 'columns', 'name, name_heirarchy'); 
end;
/
CREATE INDEX my_areas_name_txtidx ON my_areas (name) INDEXTYPE IS ctxsys.context 
PARAMETERS ('datastore my_areas_datastore lexer my_areas_lexer sync(on commit)');

The index will sync if I have cause to update the hierarchy.

Text Index Option 2: Index a user_datastore based on the result of a PL/SQL function

Alternatively, I can build a text index on a combination of data from various sources by creating a PL/SQL procedure that combines the data and returns the string to be indexed.  

I have created a procedure (strava_pkg.name_heirarchy_txtidx) that returns a string containing the hierarchy of a given area, and then I will create a text index on that.  The format of the parameters must be exactly as follows: 

  • The rowid of the row being indexed is passed to the procedure; 
  • The string to be indexed is passed back as a CLOB parameter.
See also: Oracle Text Indexing Elements: USER_DATASTORE Attributes

…
PROCEDURE name_heirarchy_txtidx
(p_rowid in rowid
,p_dataout IN OUT NOCOPY CLOB
) IS
  l_count INTEGER := 0;
BEGIN
  FOR i IN (
    SELECT area_code, area_number, name, matchable
    FROM   my_areas m
    START WITH rowid = p_rowid
    CONNECT BY NOCYCLE prior m.parent_area_code   = m.area_code
                   AND prior m.parent_area_number = m.area_number
  ) LOOP
    IF i.matchable >= 1 THEN
      l_count := l_count + 1;
      IF l_count > 1 THEN
        p_dataout := p_dataout ||', '|| i.name;
      ELSE
        p_dataout := i.name;
      END IF;
    END IF;
  END LOOP;
END name_heirarchy_txtidx;
…

As an example, if I pass a particular rowid to the procedure, I obtain the full hierarchy of areas as before.

set serveroutput on
DECLARE
  l_rowid ROWID;
  l_clob CLOB;
BEGIN
  select rowid
  into   l_rowid
  FROM   my_areas m
  WHERE  area_code = 'CPC'
  And    area_number = '40307';

  strava_pkg.name_heirarchy_txtidx(l_rowid, l_clob);
  dbms_output.put_line(l_clob);
END;
/

Streatley, West Berkshire, England, United Kingdom

PL/SQL procedure successfully completed.

The procedure is referenced as an attribute to a user datastore, I can then build a text index on the user datastore.

BEGIN
  ctx_ddl.create_preference('my_areas_lexer', 'BASIC_LEXER');  
  ctx_ddl.set_attribute('my_areas_lexer', 'mixed_case', 'NO'); 
  ctx_ddl.create_preference('my_areas_datastore', 'user_datastore'); 
  ctx_ddl.set_attribute('my_areas_datastore', 'procedure', 'strava_pkg.name_heirarchy_txtidx'); 
  ctx_ddl.set_attribute('my_areas_datastore', 'output_type', 'CLOB');
END;
/

CREATE INDEX my_areas_name_txtidx on my_areas (name) INDEXTYPE IS ctxsys.context 
PARAMETERS ('datastore my_areas_datastore lexer my_areas_lexer');
I have not been able to combine a multi-column datastore with a user datastore.

Text Search examples

Both options produce an index that I can use in the same way.  I can search for a particular name, for example, the village of Streatley.

SELECT score(1), area_Code, area_number, name, suffix, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'streatley',1)>0
/

I get the two Streatleys, one in Berkshire, and the other in Bedfordshire.  

  SCORE(1) AREA AREA_NUMBER NAME                 SUFFIX     NAME_HEIRARCHY
---------- ---- ----------- -------------------- ---------- ------------------------------------------------------------
        16 CPC        41076 Streatley            CP         Streatley, Central Bedfordshire, England, United Kingdom
        16 CPC        40307 Streatley            CP         Streatley, West Berkshire, England, United Kingdom

As I have indexed the full hierarchy, I can be more precise and search for both the village and the county, even though they are two different rows in the my_areas table.

SELECT score(1), area_Code, area_number, name, suffix, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'streatley and berks%',1)>0
/

Now I just get one result.  The Streatley in Berkshire.

  SCORE(1) AREA AREA_NUMBER NAME                 SUFFIX     NAME_HEIRARCHY
---------- ---- ----------- -------------------- ---------- ------------------------------------------------------------
        11 CPC        40307 Streatley            CP         Streatley, West Berkshire, England, United Kingdom

Searching For the Top of Hierarchies

My search query works satisfactorily if my search identifies areas with no children, but supposing I search for something higher up the hierarchy, like Berkshire?  

SELECT score(1), area_Code, area_number, name, suffix, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'berkshire',1)>0
/

I get 184 areas, of different types within the areas called Berkshire, because the name of the parent area appears in the hierarchy of all the children and so is returned by the text index.

           Area        Area
  SCORE(1) Code      Number NAME                      SUFFIX     NAME_HEIRARCHY
---------- ---- ----------- ------------------------- ---------- -----------------------------------------------------------
        11 UTA       101678 Windsor and Maidenhead    (B)        Windsor and Maidenhead, Berkshire, England, United Kingdom
        11 UTA       101680 Wokingham                 (B)        Wokingham, Berkshire, England, United Kingdom
        11 UTA       101681 Reading                   (B)        Reading, Berkshire, England, United Kingdom
        11 UTA       101685 West Berkshire                       West Berkshire, England, United Kingdom
        11 UTW        40258 Norreys                   Ward       Norreys, Wokingham, Berkshire, England, United Kingdom
        11 UTW        40261 Barkham                   Ward       Barkham, Wokingham, Berkshire, England, United Kingdom
…

However, I am just interested in the highest points on each part of the hierarchy I have identified.  So, I exclude any result where its parent is also in the result set.

WITH x AS (
SELECT area_code, area_number, parent_area_code, parent_area_number, name, name_heirarchy
FROM   my_areas m
WHERE  CONTAINS(name,'berkshire',1)>0
) SELECT * FROM x WHERE NOT EXISTS (
  SELECT 'x' FROM x x1
  WHERE  x1.area_code = x.parent_area_code
  AND    x1.area_number = x.parent_area_number
)
/

In this case, I still get two results because the boundaries of the unitary authority of West Berkshire are not entirely within the ceremonial county of Berkshire (some parts of Hungerford and Lambourne were exchanged with Wiltshire in 1990), hence I could not make Berkshire the parent of West Berkshire.

Area        Area
Code      Number      SCORE NAME_HEIRARCHY
---- ----------- ---------- ------------------------------------------------------------
UTA       101685         11 West Berkshire, England, United Kingdom
CCTY           7         11 Berkshire, England, United Kingdom

Text Searching for Activities that pass through Areas

It is a simple extension to join the pre-processed areas through which activities pass to the areas found by the text search, and then exclude areas whose parent was also found in the same activity.

WITH x AS (
SELECT aa.activity_id, m.area_code, m.area_number, m.parent_area_code, m.parent_area_number, m.name, m.name_heirarchy
FROM   my_areas m, activity_areas aa
WHERE  m.area_Code = aa.area_code
AND    m.area_number = aa.area_number
AND    CONTAINS(name,'berkshire',1)>0
) 
SELECT a.activity_id, a.activity_date, a.activity_name, a.activity_type, a.distance_km
,      x.area_Code, x.area_number, x.name, x.name_heirarchy
FROM   x, activities a
WHERE  x.activity_id = a.activity_id
AND    a.activity_date between TO_DATE('01022019','DDMMYYYY') and TO_DATE('28022019','DDMMYYYY')
AND NOT EXISTS (
  SELECT 'x' FROM x x1
  WHERE  x1.area_code = x.parent_area_code
  AND    x1.area_number = x.parent_area_number
  AND    x1.activity_id = x.activity_id)
ORDER BY a.activity_date
/

Now I can see the rides in Berkshire in February 2019.  I get two rows returned for the ride that was in both Berkshire and West Berkshire.  

  Activity Activity                                                Activity Distance Area   Area
        ID Date      ACTIVITY_NAME                                 Type         (km) Code Number NAME            NAME_HEIRARCHY
---------- --------- --------------------------------------------- -------- -------- ---- ------ --------------- -------------------------
2156308823 17-FEB-19 MV - Aldworth, CLCTC Aldworth-Reading         Ride       120.86 CCTY      7 Berkshire       England, United Kingdom
2156308823 17-FEB-19 MV - Aldworth, CLCTC Aldworth-Reading         Ride       120.86 UTA  101685 West Berkshire  England, United Kingdom
2172794879 24-FEB-19 MV - Maidenhead                               Ride        48.14 CCTY      7 Berkshire       England, United Kingdom
2173048214 24-FEB-19 CLCTC: Maidenhead - Turville Heath            Ride        53.15 CCTY      7 Berkshire       England, United Kingdom
2173048406 24-FEB-19 Maidenhead - Burnham Beeches - West Drayton   Ride        27.92 CCTY      7 Berkshire       England, United Kingdom
…

References

I found these references useful while creating the Text index:

Monday, March 29, 2021

Spatial Data 5: Searching For Geometries That Intersect Other Geometries

This blog is part of a series about my first steps in using Spatial Data in the Oracle database.  I am using the GPS data from my cycling activities collected by Strava. All of my files are available on GitHub.

I have loaded basic data for all countries, and detailed data for the UK and other countries where I have recorded activities.  The next step is to determine which activities pass through which areas.  Generically, the question is simply whether one geometry intersects with another.   I can test this in SQL with the sdo_geom.relate() function.

WHERE SDO_GEOM.RELATE(a.geom,'anyinteract',m.geom) = 'TRUE'

However, working out whether an activity, with several thousand points, is within an area defined with several thousand points can be CPU intensive and time-consuming.  Larger areas such as UK counties average over 20,000 points. 

I have 60,000 defined areas, of which, over 20,000 of which are for the UK.  I have 2700 activities recorded on Strava, with an average of 2700 points, but some have over 10,000 points.  It isn't viable to compare every activity with every area.  Comparing these large geometries can take a significant time, too long to do the spatial queries every time I want to interrogate the data, and too long for an on-line application.

Pre-processing Geometry Intersections

However, the data, once loaded is static.  Definitions of areas can change, but it is rare.  Activities do not change.  Therefore, I have decided to pre-process the data to produce a table of matching activities and areas.

CREATE TABLE activity_areas
(activity_id NUMBER NOT NULL
,area_code   VARCHAR2(4) NOT NULL
,area_number NUMBER NOT NULL
,geom_length NUMBER
,CONSTRAINT ACTIVITY_AREAS_PK PRIMARY KEY (activity_id, area_code, area_number)
,CONSTRAINT ACTIVITY_AREAS_FK FOREIGN KEY (activity_id) REFERENCES ACTIVITIES (activity_id)
,CONSTRAINT ACTIVITY_AREAS_FK2 FOREIGN KEY (area_code, area_number) 
                       REFERENCES MY_AREAS (area_code, area_number)
);

Recursive Search

I have written the search as a PL/SQL procedure to search areas that match a particular activity.

  • I pass the ID of the activity to be processed to the procedure.
  • I can specify the area code and number, or the parent area code and number, at which to search through the areas.  I usually leave them to default to null so the search starts with areas at the root of the hierarchy that therefore have no parents (i.e. sovereign countries).  
  • The procedure then calls itself recursively for each area that it finds matches the activity, to search its children.  This way, I limit the total number of comparisons required.  
  • For every area and activity, I have calculated the minimum bounding rectangle using sdo_geom.sdo_mbr() and stored it in another geometry column on the same row.  This geometry contains just 5 points (the last point is the same as the first to close the rectangle).  I can compare two rectangles very quickly, and if they don't intersect overlap then there is no need to see if the actual geometries overlap.  This approach filters out geometries that cannot match, so that fewer geometries then have to be compared in full, thus significantly improving the performance of the search.
AND SDO_GEOM.RELATE(a.mbr,'anyinteract',m.mbr) = 'TRUE'

  • I have found that it is necessary to have the MBR comparison earlier in the predicate clauses than the GEOM comparison.

…
PROCEDURE activity_area_search
(p_activity_id INTEGER
,p_area_code   my_areas.area_code%TYPE DEFAULT NULL
,p_area_number my_areas.area_number%TYPE DEFAULT NULL
,p_query_type VARCHAR2 DEFAULT 'P'
,p_level INTEGER DEFAULT 0
) IS
BEGIN
  FOR i IN(
   SELECT m.*
   ,      CASE WHEN m.geom_27700 IS NOT NULL THEN sdo_geom.sdo_length(SDO_GEOM.sdo_intersection(m.geom_27700,a.geom_27700,5), unit=>'unit=km') 
               WHEN m.geom       IS NOT NULL THEN sdo_geom.sdo_length(SDO_GEOM.sdo_intersection(m.geom,a.geom,5), unit=>'unit=km') 
          END geom_length
   ,      (SELECT MIN(m2.area_level) FROM my_areas m2 
           WHERE  m2.parent_area_code = m.area_code AND m2.parent_area_number = m.area_number) min_child_level
   FROM   my_areas m
   ,      activities a
   WHERE  (  (p_query_type = 'P' AND parent_area_code = p_area_code AND parent_area_number = p_area_number) 
          OR (p_query_type = 'A' AND area_code        = p_area_code AND area_number        = p_area_number)
	  OR (p_query_type = 'A' AND p_area_number IS NULL          AND area_code          =  p_area_code)
          OR (p_area_code IS NULL AND p_area_number IS NULL AND parent_area_code IS NULL AND parent_area_number IS NULL))
   AND    a.activity_id = p_activity_id
   and    SDO_GEOM.RELATE(a.mbr,'anyinteract',m.mbr) = 'TRUE'
   and    SDO_GEOM.RELATE(a.geom,'anyinteract',m.geom) = 'TRUE'
  ) LOOP
    IF i.area_level>0 OR i.num_children IS NULL THEN
      BEGIN
        INSERT INTO activity_areas
        (activity_id, area_code, area_number, geom_length)
        VALUES
        (p_activity_id, i.area_code, i.area_number, i.geom_length);
      EXCEPTION
        WHEN dup_val_on_index THEN
          UPDATE activity_areas
          SET    geom_length = i.geom_length
          WHERE  activity_id = p_activity_id
          AND    area_code = i.area_code
          AND    area_number = i.area_number;
      END;
    END IF;
  
    IF i.num_children > 0 THEN
      strava_pkg.activity_area_search(p_activity_id, i.area_code, i.area_number, 'P', p_level+1);
    END IF;
  END LOOP;

END activity_area_search;
…

The search can process a single activity by calling the procedure.  An activity that found just 10 areas, took just 6 seconds to process.  However, it does not scale linearly.  Activities that have over 100 areas can take at least 6 minutes.
SQL> exec strava_pkg.activity_area_search(4372796838);
Searching 4372796838:-
Found SOV-1159320701:United Kingdom,    2.895 km
.Searching 4372796838:SOV-1159320701
.Found GEOU-1159320743:England,    2.851 km
..Searching 4372796838:GEOU-1159320743
..Found GLA-117537:Greater London,    2.851 km
...Searching 4372796838:GLA-117537
...Found LBO-50724:City of Westminster,    1.732 km
....Searching 4372796838:LBO-50724
....Found LBW-117484:Abbey Road,    1.435 km
....Found LBW-50639:Maida Vale,    0.298 km
....Done 4372796838:LBO-50724:    0.415 secs).
...Found LBO-50632:Camden,    1.119 km
....Searching 4372796838:LBO-50632
....Found LBW-117286:Kilburn,    0.273 km
....Found LBW-117288:Swiss Cottage,    1.033 km
....Found LBW-117287:West Hampstead,    0.084 km
....Done 4372796838:LBO-50632:    0.521 secs).
...Done 4372796838:GLA-117537:    3.368 secs).
..Done 4372796838:GEOU-1159320743:    4.372 secs).
.Done 4372796838:SOV-1159320701:    4.750 secs).
Done 4372796838:-:    5.532 secs).

PL/SQL procedure successfully completed.
Since I load Strava activities from the bulk download, I also process them in bulk.
--process unmatched activities
set pages 99 lines 180 timi on serveroutput on
column activity_name format a60
BEGIN 
  FOR i IN (
    SELECT a.activity_id, activity_date, activity_name
    ,      distance_km, num_pts, ROUND(num_pts/NULLIF(distance_km,0),0) ppkm
    FROM   activities a
    WHERE  activity_id NOT IN (SELECT DISTINCT activity_id FROM activity_areas)
    AND    num_pts>0
  ) LOOP
    strava_pkg.activity_area_search(i.activity_id);
    commit;
  END LOOP;
END;
/
Matching 2,700 activities produced 71,628 rows on activity_areas for 5,620 distinct areas. In the next article, I will demonstrate how to text search the areas for matching activities.

Sunday, March 21, 2021

Spatial Data 4: Obtaining Geographical Data

This blog is part of a series about my first steps in using Spatial Data in the Oracle database.  I am using the GPS data from my cycling activities collected by Strava. All of my files are available on GitHub.

The next stage is to use my Strava data as a resource for ride planning.  For example, I want to go for a ride in the Chilterns this weekend, I want to look at previous rides in the Chilterns to see where I have gone.  This presents a number of challenges that I will cover over the next few blogs.

  • I need a working definition of the Chilterns.  
  • I need to identify which activities entered the area defined as being the Chilterns.  

More generically, I might be interested in any area in any country.  I need to be able to search for areas by name, then identify the activities that passed through these areas.

Geographical Areas

The world is divided up into 206 sovereignties (including independent and leased areas), and those are then divided down.  Let's take the United Kingdom as an example:

United Kingdom

.England
.Northern Ireland
.Scotland
.Wales

.Guernsey
..Alderney
..Guernsey
..Herm
..Sark
.Isle of Man
.Jersey

.Anguilla
.Bermuda
.Cayman Islands
.Dhekelia Sovereign Base Area
.Falkland Islands
.Gibraltar
.British Indian Ocean Territory
..Diego Garcia Naval Support Facility
.Montserrat
.Pitcairn Islands
.South Georgia and the Islands
..South Georgia
..South Sandwich Islands
.Saint Helena
..Ascension
..Saint Helena
..Tristan da Cunha
.Turks and Caicos Islands
.British Virgin Islands
 Akrotiri Sovereign Base Area

  • The United Kingdom consists of the 4 'home' countries.
    • These are divided down into counties, authorities, districts, boroughs, wards and parishes.
  • Guernsey, Jersey and the Isle of Man are "Crown Dependencies".
  • There are 14 dependent territories
    • Some of these are broken down further into separate islands.

I need enough areas to allow me to effectively search areas by name and then determine which activities are in which areas.

To return to the original question, the Chiltern Hills are not a government administrative area but are designated as an Area of Outstanding Natural Beauty (AONB).  As, they are a useful shorthand to describe some areas where I regularly cycle, so I have included them in the heirarchy.

Loading Spatial Data from Esri Shapefile

Lots of geographical data is publically available from a variety of organisations and governments in the form of shapefiles.  This is "Esri's somewhat open, hybrid vector data format using SHP, SHX and DBF files. Originally invented in the early 1990s, it is still commonly used as a widely supported interchange format".  Oracle provides a java shapefile converter that transforms shapefiles into database tables.

See also: 

Shapefiles are zip archives that contain a number of files, including but not limited to the following, but containing at least always the first three:

  • .shp - the main file that contains the geometry itself,
  • .shx - an index file,
  • .dbf - a DBase file containing other attributes to describe the spatial data.  When you load the shapefile, the DBF file is loaded into all the other columns in the table.  This file can be opened with Microsoft Excel so you can see the data,
  • .prj - contains the projection description of the data,
  • .csv -the same data as in the .dbf file, but as a comma-separated data file,
  • .cfg - the code page of the data in the .dbf file.

A little searching with Google turned up a number of useful sources of publically available spatial data (although most of it requires to be licenced for commercial use).

Most of the shapefiles provide data in latitude/longitude in WGS84 that corresponds to SRID 4326.  However, the data from the UK government and the Ordnance Survey uses the British National Grid (BNG) GCS_OSGB_1936.  This corresponds to SRID 27700 (see Convert GPX Track to a Spatial Line Geometry).

By default, the shapefile converter creates geometries in the coordinate system provided by the shapefile.  It is possible to specify a different coordinate system at load time, however, converting the data significantly slows the load process (my experience is that it increases load duration by about a factor of approximately 5).  

The spatial data is loaded into a geometry column called geom by default.  However, the column name can be specified.

Later when it comes to comparing spatial data, you can only compare geometries that have the same SRID.  Therefore, it is important to know the coordinate system of the data with which you are dealing.  My convention is to put WGS84 (SRID 4326) data into columns call geom, and to put British National Grid into columns called geom_27700. I load data in the coordinate system of the shapefile.  Later on, I may add additional columns and copy and convert the data.

I have written a simple shell script (load_shapes.sh) to call the java shapefile converter, including controlling the SRID and the name of the table and the geometry column.

#load_shapes.sh
…
function shp_load {
  echo $0:$*
…
cd $dir
  pwd
  export clpath=$ORACLE_HOME/suptools/tfa/release/tfa_home/jlib/ojdbc5.jar:$ORACLE_HOME/md/jlib/sdoutl.jar:$ORACLE_HOME/md/jlib/sdoapi.jar
  echodo "java -cp $clpath oracle.spatial.util.SampleShapefileToJGeomFeature -h oracle-database.local -p 1521 -sn oracle_pdb -u strava -d strava -t $table -f $base -r $srid -g ${col}"
}

clear
#set -x

shp_load /tmp/strava/ne_10m_admin_0_sovereignty.shp
shp_load /tmp/strava/ne_10m_admin_0_map_units
shp_load /tmp/strava/ne_10m_admin_0_map_subunits
…

I can now load each shapefile into a separate table.  

Merging Shapefile Data into a Single Set of Data

The various tables created by loading shapefiles will each have their own structures determined by what was put into the shapefile. Ultimately, I am going to load them all into a single table with which I will work.  

Areas have a hierarchy and that is represented in this table by the linked list of area code and number to parent area code and number.  Foreign key constraints ensure the parent values are valid.  There are also check constraints to prevent an area from being its own parent.

REM my_areas_ddl.sql
…
CREATE TABLE my_areas
(area_Code varchar2(4) NOT NULL
,area_number integer NOT NULL
,uqid varchar2(20) NOT NULL
…
,area_level integer NOT NULL 
,parent_area_code varchar2(4)
,parent_area_number integer
,parent_uqid varchar2(20)
,name varchar2(40)
,suffix varchar2(20)
,iso_code3 varchar2(3)
…
,num_children integer
,matchable integer default 1
…
,geom mdsys.sdo_geometry
,geom_27700 mdsys.sdo_geometry
,mbr mdsys.sdo_geometry
,constraint my_areas_pk primary key (area_code, area_number)
,constraint my_areas_uqid unique (uqid)
,constraint my_areas_rfk_area_code foreign key (parent_area_code, parent_area_number) references my_areas (area_code, area_number)
,constraint my_areas_rfk_uqid foreign key (parent_uqid) references my_areas (uqid)
,constraint my_areas_fk_area_code foreign key (area_code) references my_area_codes (area_code)
,constraint my_areas_check_parent_area_code CHECK (area_code != parent_area_code OR area_number != parent_area_number) 
,constraint my_areas_check_parent_uqid CHECK (uqid != parent_uqid)
)
/
--alter table my_areas modify matchable default 1;
Alter table my_areas add constraint my_areas_uq_iso_code3 unique (iso_code3);
Create index my_areas_rfk_uqid on my_areas(parent_uqid);
Create index my_areas_rfk_area_code on my_areas (parent_area_code, parent_area_number);

I have created scripts to populate data in the my_areas table from the Natural Earth data, and from the data for each country.  Different scripts are needed for each shapefile.

  • load_countries.sql - to load Natural Earth data
  • load_uk.sql - to load Ordnance Survey data of Great Britain.  This includes some DML to work out which wards and parishes are in which districts and boroughs and update the hierarchy accordingly.
  • load_XXX.sql, - load administrative areas for a country where XXX is the 3-letter ISO code for that country (eg. load_FRA.sql for France).
  • fix_names.sql - to simplify names stripping common suffixes such as a county, district, authority, ward etc.
    • fix_my_areas.sql - script to collect statistics, count children for each area, look for areas that children of another area with the same name, simplify areas with more than 10,000 points. 

Thursday, February 18, 2021

Spatial Data 3. Analyse a track in proximity to a GPS route

This blog is part of a series about my first steps using Spatial Data in the Oracle database.  I am using the GPS data for my cycling activities collected by Strava.  All of my files are available on GitHub.

Swain's Lane, Highgate
Now I have loaded some data, I am going to start to do something useful with it.  I go out on my bike most mornings, and I usually ride up Swain's Lane in Highgate three times.  How long did each one take?  Over time, have I got faster or slower?

I need a definition of Swain's Lane that I can compare to.  I will start by drawing a route with my favourite GPS software.  A route is just a sequence of route points.  I can then export that as a GPX file.

<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="ViewRanger - //www.viewranger.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">
  <rte>
    <name><![CDATA[Swain's World]]></name>
    <rtept lat="51.569613039632" lon="-0.14770468632509"></rtept>
    <rtept lat="51.569407978151" lon="-0.14832964102552"></rtept>
    <rtept lat="51.567090552402" lon="-0.14674177328872"></rtept>
    <rtept lat="51.567080548869" lon="-0.14592101733016"></rtept>
    <rtept lat="51.569618041121" lon="-0.14773419062425"></rtept>
  </rte>
</gpx>

Geometries Table

I will load the GPX route into a table much as I did with the track files. 
drop table my_geometries purge;

createg table my_geometries
(geom_id    NUMBER NOT NULL 
,descr      VARCHAR2(64)
,gpx        XMLTYPE
,geom       mdsys.sdo_geometry
,geom_27700 mdsys.sdo_geometry
,mbr        mdsys.sdo_geometry
,constraint my_geometries_pk PRIMARY KEY (geom_id)
)
XMLTYPE COLUMN gpx STORE AS SECUREFILE BINARY XML (CACHE DISABLE STORAGE IN ROW)
/
The difference is that I have a series of route points instead of track points, so the paths in extract() and extractvalue() are slightly different.
delete from my_geometries where geom_id = 2;
INSERT INTO my_geometries (geom_id, descr, gpx) 
VALUES (2,'Swains World Route', XMLTYPE(strava_pkg.getClobDocument('STRAVA','swainsworldroute.gpx')));

UPDATE my_geometries
SET geom = mdsys.sdo_geometry(2002,4326,null,mdsys.sdo_elem_info_array(1,2,1),
cast(multiset(
  select CASE n.rn WHEN 1 THEN pt.lng WHEN 2 THEN pt.lat END ord
  from (
    SELECT /*+MATERIALIZE*/ rownum rn
    ,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'rtept/@lon')) as lng
    ,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'rtept/@lat')) as lat
    FROM   my_geometries g,
           TABLE(XMLSEQUENCE(extract(g.gpx,'/gpx/rte/rtept','xmlns="http://www.topografix.com/GPX/1/1"'))) t
    where g.geom_id = 2
    ) pt,
    (select 1 rn from dual union all select 2 from dual) n
	order by pt.rn, n.rn
  ) AS mdsys.sdo_ordinate_array))
WHERE gpx IS NOT NULL
AND   geom IS NULL
/
UPDATE my_geometries
SET mbr = sdo_geom.sdo_mbr(geom)
,   geom_27700 = sdo_cs.transform(geom,27700)
/

Commit;
Set pages 99 lines 180 
Select geom_id, descr, gpx, geom 
from my_geometries
where geom_id = 2;

   GEOM_ID DESCR
---------- ----------------------------------------------------------------
GPX
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GEOM(SDO_GTYPE, SDO_SRID, SDO_POINT(X, Y, Z), SDO_ELEM_INFO, SDO_ORDINATES)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         2 Swains World Route
<?xml version="1.0" encoding="US-ASCII"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="ViewRanger - //www.viewranger.com" xml
SDO_GEOMETRY(2002, 4326, NULL, SDO_ELEM_INFO_ARRAY(1, 2, 1), SDO_ORDINATE_ARRAY(-.14651114, 51.5670769, -.14649237, 51.567298, -.1465782, 51.567563, -.14680618, 51.5680165, -.14697
516, 51.5682533, -.14754379, 51.5688701, -.14807219, 51.5694887))
I am going to build spatial indexes on the geometry columns, so I need to define the upper and lower bound values on the coordinates.
delete from user_sdo_geom_metadata where table_name = 'MY_GEOMETRIES';
insert into user_sdo_geom_metadata (table_name,column_name,diminfo,srid)
values ( 
  'MY_GEOMETRIES' , 'GEOM_27700',
  sdo_dim_array(
    sdo_dim_element('Easting',-1000000,1500000,0.05), 
    sdo_dim_element('Northing', -500000,2000000,0.05)),
  27700);
insert into user_sdo_geom_metadata (table_name,column_name,diminfo,srid)
values ( 
  'MY_GEOMETRIES' , 'GEOM',
  sdo_dim_array(
    sdo_dim_element('Longitude',-180,180,0.05), 
    sdo_dim_element('Latgitude',-90,90,0.05)),
  4326);
insert into user_sdo_geom_metadata (table_name,column_name,diminfo,srid)
values ( 
  'MY_GEOMETRIES' , 'MBR',
  sdo_dim_array(
    sdo_dim_element('Longitude',-180,180,0.05), 
    sdo_dim_element('Latgitude',-90,90,0.05)),
  4326);
commit;

CREATE INDEX my_geometries_geom ON my_geometries (geom) INDEXTYPE IS MDSYS.SPATIAL_INDEX_v2;
CREATE INDEX my_geometries_geom_27700 ON my_geometries (geom_27700) INDEXTYPE IS MDSYS.SPATIAL_INDEX_v2;
CREATE INDEX my_geometries_mbr ON my_geometries (mbr) INDEXTYPE IS MDSYS.SPATIAL_INDEX_v2;

Compare Geometries

Now I can compare my Swain's Lane geometry to my activity geometries.  Let's start by looking for rides in December 2020 that went up Swain's Lane
Column activity_id heading 'Activity|ID'
Column activity_name format a30
Column geom_relate heading 'geom|relate' format a6
With a as (
SELECT a.activity_id, a.activity_date, a.activity_name
,      SDO_GEOM.RELATE(a.geom,'anyinteract',g.geom,25) geom_relate
FROM   activities a
,      my_geometries g
WHERE  a.activity_type = 'Ride'
--And    a.activity_id IN(4468006769)
And    a.activity_date >= TO_DATE('01122020','DDMMYYYY')
and    g.geom_id = 2 /*Swains World Route*/
)
Select *
From   a
Where  geom_relate = 'TRUE'
Order by activity_date
/
Where there is a relation between the two geometries then I have a hit.
  Activity                                                    geom
        ID ACTIVITY_DATE       ACTIVITY_NAME                  relate
---------- ------------------- ------------------------------ ------
4419821750 08:44:45 02.12.2020 Loop                           TRUE
4428307816 10:49:25 04.12.2020 Loop                           TRUE
4431920358 09:41:13 05.12.2020 Loop                           TRUE
…
4528825613 09:39:38 28.12.2020 Loop                           TRUE
4534027888 11:29:45 29.12.2020 Loop                           TRUE
4538488655 09:57:55 30.12.2020 Loop                           TRUE

25 rows selected.

Analyse Individual Efforts

Now I want to analyse each of my trips up Swain's Lane on a particular day.  I am going to work with the GPX rather than the spatial geometry because I am interested also in time, elevation and heart rate data that is not stored in the spatial geometry.
Also, you can't use analytic functions on spatial geometries.
with x as (
SELECT activity_id
,      TO_DATE(EXTRACTVALUE(VALUE(t), 'trkpt/time'),'YYYY-MM-DD"T"HH24:MI:SS"Z"') time
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lat')) lat
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lon')) lng
FROM   activities a,
       TABLE(XMLSEQUENCE(extract(a.gpx,'/gpx/trk/trkseg/trkpt','xmlns="http://www.topografix.com/GPX/1/1"'))) t
WHERE  a.activity_id IN(4468006769)
), y as (
select x.*, strava_pkg.make_point(lng,lat) loc
from x
)
select lag(loc,1) over (partition by activity_id order by time) last_loc
from   y
/

select lag(loc,1) over (partition by activity_id order by time) last_loc
           *
ERROR at line 13:
ORA-22901: cannot compare VARRAY or LOB attributes of an object type
Instead, I will have to apply analytic functions to the values extracted from the GPX and then create a spatial point.  Thus I will be able to calculate the length of each individual trip by aggregating the distance between each pair of points.
The following query splits out each trip up Swain's Lane in a particular activity and shows the distance, duration, and metrics about elevation, gradient, and heart rate. 
alter session set statistics_level=ALL;
alter session set nls_date_Format = 'hh24:mi:ss dd.mm.yyyy';
break on activity_id skip 1
compute sum of sum_dist on activity_id
compute sum of num_pt on activity_id
compute sum of sum_secs on activity_id
Set lines 180 pages 50 timi on
Column activity_id heading 'Activity|ID'
Column activity_name format a15
column time format a20
column lat format 999.99999999
column lng format 999.99999999
column ele format 9999.9
column hr format 999
column sdo_relate format a10
column num_pts heading 'Num|Pts' format 99999
column sum_dist heading 'Dist.|(km)' format 999.999
column sum_secs heading 'Secs' format 9999
column avg_speed heading 'Avg|Speed|(kmph)' format 99.9
column ele_gain heading 'Ele|Gain|(m)' format 9999.9
column ele_loss heading 'Ele|Loss|(m)' format 9999.9
column avg_grade heading 'Avg|Grade|%' format 99.9
column min_ele heading 'Min|Ele|(m)' format 999.9
column max_ele heading 'Max|Ele|(m)' format 999.9
column avg_hr heading 'Avg|HR' format 999
column max_hr heading 'Max|HR' format 999
WITH geo as ( /*route geometry to compare to*/
select /*MATERIALIZE*/ g.*, 25 tol
,      sdo_geom.sdo_length(geom, unit=>'unit=m') geom_length
from   my_geometries g
where  geom_id = 2 /*Swains World Route*/
), a as ( /*extract all points in activity*/
SELECT a.activity_id, g.geom g_geom, g.tol, g.geom_length
,      TO_DATE(EXTRACTVALUE(VALUE(t), 'trkpt/time'),'YYYY-MM-DD"T"HH24:MI:SS"Z"') time
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lat')) lat
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lon')) lng
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/ele')) ele
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/extensions/gpxtpx:TrackPointExtension/gpxtpx:hr'
       ,'xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"')) hr
FROM   activities a,
       geo g,
       TABLE(XMLSEQUENCE(extract(a.gpx,'/gpx/trk/trkseg/trkpt'
       ,'xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"'))) t
Where  a.activity_id IN(4468006769)
and    SDO_GEOM.RELATE(a.geom,'anyinteract',g.geom,g.tol) = 'TRUE' /*activity has relation to reference geometry*/
), b as ( /*smooth elevation*/
Select a.*
,      avg(ele) over (partition by activity_id order by time rows between 2 preceding and 2 following) avg_ele
From   a
), c as ( /*last point*/
Select b.*
,      row_number() over (partition by activity_id order by time) seq
,      lag(time,1) over (partition by activity_id order by time) last_time
,      lag(lat,1) over (partition by activity_id order by time) last_lat
,      lag(lng,1) over (partition by activity_id order by time) last_lng
--,      lag(ele,1) over (partition by activity_id order by time) last_ele
,      lag(avg_ele,1) over (partition by activity_id order by time) last_avg_ele
From   b
), d as ( /*make points*/
SELECT c.* 
,      strava_pkg.make_point(lng,lat) loc
,      strava_pkg.make_point(last_lng,last_lat) last_loc
FROM   c
), e as ( /*determine whether point is inside the polygon*/
select d.*
,      86400*(time-last_time) secs
,      avg_ele-last_avg_ele ele_diff
,      sdo_geom.sdo_distance(loc,last_loc,0.05,'unit=m') dist
,      SDO_GEOM.RELATE(loc,'anyinteract', g_geom, tol) sdo_relate
FROM   d
), f as (
select e.*
,      CASE WHEN sdo_relate != lag(sdo_relate,1) over (partition by activity_id order by time) THEN 1 END sdo_diff
from   e
), g as (
select f.*
,      SUM(sdo_diff) over (partition by activity_id order by time range between unbounded preceding and current row) sdo_seq
from f
where  sdo_relate = 'TRUE'
)
select activity_id, min(time), max(time)
, sum(dist)/1000 sum_dist
, sum(secs) sum_secs
, 3.6*sum(dist)/sum(secs) avg_speed
, sum(greatest(0,ele_diff)) ele_gain
, sum(least(0,ele_diff)) ele_loss
, 100*sum(ele_diff*dist)/sum(dist*dist) avg_grade
, min(ele) min_ele
, max(ele) max_ele
, sum(hr*secs)/sum(secs) avg_Hr
, max(hr) max_hr
, count(*) num_pts
from   g
group by activity_id, sdo_seq, g.geom_length
having sum(dist)>= g.geom_length/2 /*make sure line we find is longer than half route to prevent fragmentation*/
order by 2
/
select * from table(dbms_xplan.display_cursor(null,null,'ADVANCED +IOSTATS -PROJECTION +ADAPTIVE'))
/
4a_1swains.sql
  • In subquery a, I compare the geometry of the activity with the geometry of Swain's Lane using sdo_geom.relate() to confirm that the activity includes Swain's Lane, but then I extract all the points in the activity GPX.
  • GPS is optimised for horizontal accuracy.  Even so, the tolerance for determining whether the track is close to the route has to be set to 25m to allow for noise in the data (Swain's Lane is tree-lined, and has walls on both sides, that both attenuate the GPS signal).  GPS elevation data is notorious for being noisy even under good conditions; you can see this in the variation of height gained on each ascent.  Sub-query b calculates an average elevation across 5 track points (up to +/-2 points).  
  • I need to compare each point in the track to each previous point so I can do some calculations and determine when the track comes into proximity with the Swain's Lane route, subquery c uses analytic functions to determine the previous point.  It is not possible to apply the analytic function to a geometry.
  • Subquery e determines whether a track point is in proximity to the route.  The tolerance, 25m, is set in subquery geo.  Then subquery f flags where the track point is in proximity to the route and the previous one was not.  Finally, subquery g maintains a running total of the number of times the track has gone close enough to the route.  That becomes a sequence number for each ascent of Swain's Lane by which I can group the subsequent analytics.
                                                                     Avg     Ele     Ele   Avg    Min    Max
  Activity                                            Dist.        Speed    Gain    Loss Grade    Ele    Ele  Avg  Max   Num
        ID MIN(TIME)           MAX(TIME)               (km)  Secs (kmph)     (m)     (m)     %    (m)    (m)   HR   HR   Pts
---------- ------------------- ------------------- -------- ----- ------ ------- ------- ----- ------ ------ ---- ---- -----
4468006769 14:55:51 13.12.2020 14:58:17 13.12.2020     .372   147    9.1    36.1      .0   8.6   86.8  122.7  141  153   147
           15:08:13 13.12.2020 15:10:28 13.12.2020     .374   136    9.9    36.2      .0   8.4   86.8  122.8  147  155   136
           15:22:49 13.12.2020 15:25:18 13.12.2020     .369   150    8.9    36.2      .0   8.2   86.8  122.7  147  155   150
**********                                         -------- -----
sum                                                   1.116   433
On my laptop, this query takes about 10s, of which about 8s is spent on the window sort for the analytic functions, and 2s is spent working out whether the track points are in proximity to the route.
Plan hash value: 3042349692

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name             | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                  |      2 |        |       |       |  5147 (100)|          |      6 |00:00:20.94 |     392 |
|   1 |  SORT ORDER BY                           |                  |      2 |      1 |   104 |       |  5147   (1)| 00:00:01 |      6 |00:00:20.94 |     392 |
|*  2 |   FILTER                                 |                  |      2 |        |       |       |            |          |      6 |00:00:20.94 |     392 |
|   3 |    HASH GROUP BY                         |                  |      2 |      1 |   104 |       |  5147   (1)| 00:00:01 |      6 |00:00:20.94 |     392 |
|   4 |     VIEW                                 |                  |      2 |   8168 |   829K|       |  5144   (1)| 00:00:01 |    866 |00:00:20.93 |     392 |
|   5 |      WINDOW SORT                         |                  |      2 |   8168 |    16M|    21M|  5144   (1)| 00:00:01 |    866 |00:00:20.93 |     392 |
|*  6 |       VIEW                               |                  |      2 |   8168 |    16M|       |  1569   (1)| 00:00:01 |    866 |00:00:20.93 |     392 |
|   7 |        WINDOW SORT                       |                  |      2 |   8168 |  1403K|  1688K|  1569   (1)| 00:00:01 |  10208 |00:00:09.63 |     392 |
|   8 |         VIEW                             |                  |      2 |   8168 |  1403K|       |  1252   (1)| 00:00:01 |  10208 |00:00:06.22 |     392 |
|   9 |          WINDOW SORT                     |                  |      2 |   8168 |  1021K|  1248K|  1252   (1)| 00:00:01 |  10208 |00:00:06.20 |     392 |
|  10 |           VIEW                           |                  |      2 |   8168 |  1021K|       |  1016   (1)| 00:00:01 |  10208 |00:00:06.05 |     392 |
|  11 |            WINDOW SORT                   |                  |      2 |   8168 |  4546K|  5040K|  1016   (1)| 00:00:01 |  10208 |00:00:00.76 |     392 |
|  12 |             NESTED LOOPS                 |                  |      2 |   8168 |  4546K|       |    31   (0)| 00:00:01 |  10208 |00:00:00.41 |     392 |
|  13 |              NESTED LOOPS                |                  |      2 |      1 |   560 |       |     2   (0)| 00:00:01 |      2 |00:00:00.03 |     104 |
|  14 |               TABLE ACCESS BY INDEX ROWID| MY_GEOMETRIES    |      2 |      1 |   112 |       |     1   (0)| 00:00:01 |      2 |00:00:00.01 |       4 |
|* 15 |                INDEX UNIQUE SCAN         | MY_GEOMETRIES_PK |      2 |      1 |       |       |     0   (0)|          |      2 |00:00:00.01 |       2 |
|* 16 |               TABLE ACCESS BY INDEX ROWID| ACTIVITIES       |      2 |      1 |   448 |       |     1   (0)| 00:00:01 |      2 |00:00:00.03 |     100 |
|* 17 |                INDEX UNIQUE SCAN         | ACTIVITIES_PK    |      2 |      1 |       |       |     0   (0)|          |      2 |00:00:00.01 |       4 |
|  18 |              XPATH EVALUATION            |                  |      2 |        |       |       |            |          |  10208 |00:00:00.37 |     288 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(SUM("DIST")>="G"."GEOM_LENGTH"/2)
   6 - filter("SDO_RELATE"='TRUE')
  15 - access("GEOM_ID"=2)
  16 - filter(("A"."ACTIVITY_TYPE"='Ride' AND "SDO_GEOM"."RELATE"("A"."GEOM",'anyinteract',"G"."GEOM",25)='TRUE'))
  17 - access("A"."ACTIVITY_ID"=4468006769)
I can apply this approach to all my trips up Swain's Lane.  However, I have logged 1115 ascents, and if I attempt to process them in a single SQL query I will have to do some very large window sorts that will spill out of memory (at least they will on my machine).  Instead, it is faster to process each activity separately in a PL/SQL loop (see 4b_allswains2.sql).
I now have a table containing all of my ascents of Swain's Lane and I can see if I am getting faster or slower.  I simply dumped the data into Excel with SQL Developer.  
Unfortunately, I have discovered that I am not going faster!

Friday, February 12, 2021

Spatial Data 2: Convert GPX Track to a Spatial Line Geometry

This blog is part of a series about my first steps using Spatial Data in the Oracle database.  I am using the GPS data for my cycling activities collected by Strava.  All of my files are available on GitHub.

Having loaded my GPS tracks from GPX files into an XML type column, the next stage is to extract the track points and create a spatial geometry column.  

Defining Spatial Geometries

Spatial objects are generically referred to as geometries.  When you define one, you have to specify what kind of geometry it is, and what coordinate system you are using. Later when you compare geometries to each other they have to use the same coordinate system. Otherwise, Oracle will raise an error.  Fortunately, Oracle can convert between coordinate systems.

Various coordinate systems are used for geographical data, they are given EPSG Geodetic Parameter Dataset codes.  Oracle supports various coordinate systems.  As well as older definitions, it also has current definitions where the ESPG code matches the Spatial Reference ID (SDO_SRID).  They can be queried from SDO_COORD_REF_SYS.

I will use two different coordinate systems during this series of blogs

Set lines 150 pages 99
Column coord_ref_sys_name format a35
Column legacy_cs_bounds format a110
select srid, coord_ref_sys_name, coord_ref_sys_kind, legacy_cs_bounds 
from SDO_COORD_REF_SYS where srid IN(4326, 27700)
/
      SRID COORD_REF_SYS_NAME                  COORD_REF_SYS_KIND
---------- ----------------------------------- ------------------------
LEGACY_CS_BOUNDS(SDO_GTYPE, SDO_SRID, SDO_POINT(X, Y, Z), SDO_ELEM_INFO, SDO_ORDINATES)
--------------------------------------------------------------------------------------------------------------
      4326 WGS 84                              GEOGRAPHIC2D
SDO_GEOMETRY(2003, 4326, NULL, SDO_ELEM_INFO_ARRAY(1, 1003, 3), SDO_ORDINATE_ARRAY(-180, -90, 180, 90))

     27700 OSGB 1936 / British National Grid   PROJECTED

  • "The World Geodetic System (WGS) is a standard for use in cartography, geodesy, and satellite navigation including GPS". The latest revision is WGS 84 (also known as WGS 1984, EPSG:4326). It is the reference coordinate system used by the Global Positioning System (GPS).  Where I am dealing with longitude and latitude, specified in degrees, especially from GPS data, I need to tell Oracle that it is WGS84 by specifying SDO_SRID of 4326.
  • Later on, I will also be using data for Great Britain available from the Ordnance Survey that uses the Ordnance Survey National Grid (also known as British National Grid) reference system.  That requires SDO_SRID to be set to 27700.

See also:

Creating Spatial Points

I have found it useful to create a packaged function to convert longitude and latitude to a spatial data point.  It is a useful shorthand that I use in various places.

create or replace package body strava_pkg as 
k_module  CONSTANT VARCHAR2(48) := $$PLSQL_UNIT;
…
----------------------------------------------------------------------------------------------------
function make_point 
(longitude in number
,latitude  in number)
return sdo_geometry deterministic is
  l_module VARCHAR2(64);
  l_action VARCHAR2(64);
begin
  dbms_application_info.read_module(module_name=>l_module
                                   ,action_name=>l_action);
  dbms_application_info.set_module(module_name=>k_module
                                  ,action_name=>'make_point');

  if longitude is not null and latitude is not null then
    return
      sdo_geometry (
        2001, 4326,
        sdo_point_type (longitude, latitude, null),
        null, null
      );
  else
    return null;
  end if;

  dbms_application_info.set_module(module_name=>l_module
                                  ,action_name=>l_action);
end make_point;
----------------------------------------------------------------------------------------------------
END strava_pkg;
/

strava_pkg.sql

There are two parameters to SDO_GEOMETRY that I always have to specify.

  • The first parameter, SDO_GTYPE, describes the natures of the spatial geometry being defined.  Here it is 2001.  The 2 indicates that it is a 2-dimensional geometry, and the 1 indicates that it is a single point.  See SDO_GEOMETRY Object Type
  • The second parameter, SDO_SRID, defines the coordinate system that I discussed above.  4326 indicates that I am working with longitude and latitude.

XML Namespace

GPS data is often held in GPX or GPS Exchange Format.  This is an XML schema.  GPX has been the de-facto XML standard for the lightweight interchange of GPS data since the initial GPX 1.0 release in 2002.  The GPX 1.1 schema was released in 2004 (see https://www.topografix.com/gpx.asp).  

Garmin has created an extension schema that holds additional athlete training information such as heart rate.

I can extract individual track points from a GPX with SQL using the extract() and extractvalue() functions.  However, I have GPX tracks that use both versions of the Topographix GPX schema (it depends on upon which piece of software emitted the GPX file), and some that also use the Garmin extensions.  

Therefore, I need to register all three schemas with Oracle.  I can download the schema files with wget.

cd /tmp/strava
wget http://www.topografix.com/GPX/1/0/gpx.xsd --output-document=gpx0.xsd
wget http://www.topografix.com/GPX/1/1/gpx.xsd
wget https://www8.garmin.com/xmlschemas/TrackPointExtensionv1.xsd

Then I can register the files 

delete from plan_table WHERE statement_id = 'XSD';
insert into plan_table (statement_id, plan_id, object_name, object_alias)
values ('XSD', 1, 'gpx0.xsd', 'http://www.topografix.com/GPX/1/0/gpx.xsd');
insert into plan_table (statement_id, plan_id, object_name, object_alias)
values ('XSD', 2, 'gpx.xsd', 'http://www.topografix.com/GPX/1/1/gpx.xsd');
insert into plan_table (statement_id, plan_id, object_name, object_alias)
values ('XSD', 3, 'TrackPointExtensionv1.xsd', 'https://www8.garmin.com/xmlschemas/TrackPointExtensionv1.xsd');

DECLARE
  xmlSchema xmlType;
  res       boolean;
BEGIN
  FOR i IN (
    SELECT object_alias schemaURL
    ,      object_name  schemaDoc
    FROM   plan_table
    WHERE  statement_id = 'XSD'
    ORDER BY plan_id
  ) LOOP
    --read xsd file
    xmlSchema := XMLTYPE(getCLOBDocument('STRAVA',i.schemaDoc,'AL32UTF8'));
    --if already exists delete XSD
    if (dbms_xdb.existsResource(i.schemaDoc)) then
        dbms_xdb.deleteResource(i.schemaDoc);
    end if;
    --create resource from XSD
    res := dbms_xdb.createResource(i.schemaDoc,xmlSchema);

    -- Delete existing  schema
    dbms_xmlschema.deleteSchema(
      i.schemaURL
    );
    -- Now reregister the schema
    dbms_xmlschema.registerSchema(
      i.schemaURL,
      xmlSchema,
      TRUE,TRUE,FALSE,FALSE
    );
  END LOOP;
End;
/
3a_register_xml_schema.sql

Then I can query the registered schemas.

Set pages 99 lines 160
Column schema_url format a60
Column qual_schema_url format a105
select schema_url, local, hier_type, binary, qual_schema_url
from user_xml_schemas
/

SCHEMA_URL                                                   LOC HIER_TYPE   BIN
------------------------------------------------------------ --- ----------- ---
QUAL_SCHEMA_URL
---------------------------------------------------------------------------------------------------------
https://www8.garmin.com/xmlschemas/TrackPointExtensionv1.xsd YES CONTENTS    NO
http://xmlns.oracle.com/xdb/schemas/STRAVA/https://www8.garmin.com/xmlschemas/TrackPointExtensionv1.xsd

http://www.topografix.com/GPX/1/0/gpx.xsd                    YES CONTENTS    NO
http://xmlns.oracle.com/xdb/schemas/STRAVA/www.topografix.com/GPX/1/0/gpx.xsd

http://www.topografix.com/GPX/1/1/gpx.xsd                    YES CONTENTS    NO
http://xmlns.oracle.com/xdb/schemas/STRAVA/www.topografix.com/GPX/1/1/gpx.xsd

Extracting GPS Track Points from GPX

A GPS track is a list of points specifying at least time, longitude, latitude and often elevation.  I can extract all the points in a GPX as a set of rows.  However, I must specify the correct namespace for the specific GPX.

Column time_string format a20
SELECT g.activity_id
,      EXTRACTVALUE(VALUE(t), 'trkpt/time') time_string
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lat')) lat
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lon')) lng
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/ele')) ele
,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/extensions/gpxtpx:TrackPointExtension/gpxtpx:hr'
       ,'xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"')) hr
 FROM activities g,
      TABLE(XMLSEQUENCE(extract(g.gpx,'/gpx/trk/trkseg/trkpt'
      ,'xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"'
      ))) t
Where  activity_id IN(4468006769)
And rownum <= 10
/

  Activity
        ID TIME_STRING                    LAT           LNG     ELE   HR
---------- -------------------- ------------- ------------- ------- ----
4468006769 2020-12-13T14:31:13Z   51.52963800    -.18753600    30.6   57
           2020-12-13T14:31:14Z   51.52963500    -.18753400    30.6   57
           2020-12-13T14:31:15Z   51.52964100    -.18753100    30.6   57
           2020-12-13T14:31:16Z   51.52964000    -.18752900    30.6   57
           2020-12-13T14:31:17Z   51.52963600    -.18752700    30.6   57
           2020-12-13T14:31:18Z   51.52963200    -.18752700    30.6   57
           2020-12-13T14:31:19Z   51.52962900    -.18752800    30.6   57
           2020-12-13T14:31:20Z   51.52962800    -.18752800    30.6   57
           2020-12-13T14:31:21Z   51.52962800    -.18752900    30.6   57
           2020-12-13T14:31:22Z   51.52962800    -.18753000    30.6   57

I can use this approach to extract all the points from a GPS track and create a spatial line geometry.  I have put the whole process into a packaged procedure strava_pkg.load_activity.

First I need to work out which version of the Topographix schema is in use.  So I can try extracting the creator name with each and see which is not null.

…
IF l_num_rows > 0 THEN
  UPDATE activities
  SET    gpx = XMLTYPE(l_gpx), geom = null, geom_27700 = null, num_pts = 0, xmlns = NULL
  WHERE  activity_id = p_activity_id
  RETURNING extractvalue(gpx,'/gpx/@version', 'xmlns="http://www.topografix.com/GPX/1/0"') 
  ,         extractvalue(gpx,'/gpx/@version', 'xmlns="http://www.topografix.com/GPX/1/1"') 
  INTO      l_xmlns0, l_xmlns1;
  l_num_rows := SQL%rowcount;
END IF;
…

Now I can extract all the points in a GPX as a set of rows and put them into a spatial geometry.  I turn each row with two coordinates into two rows with one point each.  Note that longitude is listed before latitude for each point.  I convert the rows into a list using multiset() and finally cast that as a spatial ordinate array. 

Note that the SDO_GTYPE is 2002 (rather than 2001) because it is a line (rather than a single point) on a two-dimensional coordinate system.

  BEGIN
    UPDATE activities a
    SET geom = mdsys.sdo_geometry(2002,4326,null,mdsys.sdo_elem_info_array(1,2,1),
    cast(multiset(
      select CASE n.rn WHEN 1 THEN pt.lng WHEN 2 THEN pt.lat END ord
      from (
        SELECT rownum rn
        ,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lon')) as lng
        ,      TO_NUMBER(EXTRACTVALUE(VALUE(t), 'trkpt/@lat')) as lat
        FROM   TABLE(XMLSEQUENCE(extract(a.gpx,'/gpx/trk/trkseg/trkpt', 'xmlns="http://www.topografix.com/GPX/1/1"'))) t
        ) pt,
        (select 1 rn from dual union all select 2 from dual) n
	    order by pt.rn, n.rn
      ) AS mdsys.sdo_ordinate_array))
    , xmlns = 'xmlns="http://www.topografix.com/GPX/1/1"'
    WHERE  a.gpx IS NOT NULL
    And    activity_id = p_activity_id;
    l_num_rows := SQL%rowcount;
  EXCEPTION
    WHEN e_13034 OR e_29877 THEN 
	  dbms_output.put_line('Exception:'||sqlerrm);
	  l_num_rows := 0;
  END;

I have found it helpful to simplify the line geometry with sdo_util.simplify(). It removes some of the noise in the GPS data and has resolved problems with calculating the length of lines that intersect with areas.

  BEGIN
    UPDATE activities 
    SET    geom = sdo_util.simplify(geom,1)
    WHERE  geom IS NOT NULL
    And    activity_id = p_activity_id;
    l_num_rows := SQL%rowcount;
  EXCEPTION
    WHEN e_13034 THEN 
	  dbms_output.put_line('Exception:'||sqlerrm);
  END;

There are a few other fields I also update at this point.  You will see me use them later.

  • NUM_PTS is the number of points in the line geometry.  
  • GEOM_27700 is the result of converting the line to British National Grid reference coordinates.  This helps when comparing it to British boundary data obtained from the Ordnance Survey or other government agencies.
  • MBR is the minimum bounding rectangle for the line.  This is generated to enable me to improve the performance of some spatial queries.  I have found some of the spatial operators to calculate intersections between geometries are quite slow and CPU intensive when applied to GPS tracks and boundary data that both have lots of points.  SDO_GEOM.SDO_MBR simply returns 4 ordinates that define the bounding rectangle.  This can be used to roughly match geometries that might match before doing a proper match.

  UPDATE activities 
  SET    num_pts = SDO_UTIL.GETNUMVERTICES(geom)
  ,      geom_27700 = sdo_cs.transform(geom,27700)
  ,      mbr = sdo_geom.sdo_mbr(geom)
  WHERE  geom IS NOT NULL
  And    activity_id = p_activity_id
  RETURNING num_pts INTO l_num_pts;
  dbms_output.put_line('Activity ID:'||p_activity_id||', '||l_num_pts||' points');
…

Now I can load each GPX and process it into a spatial geometry in one step.  I can process all of the activities in a simple loop.

set serveroutput on timi on
exec strava_pkg.load_activity(4468006769);
Loading Activity: 4468006769
ACTIVITIES/4468006769.gpx.gz - 1286238 bytes
xmlns 1=StravaGPX Android
Activity ID:4468006769, 998 points

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.41
Now my Strava activities are all in spatial geometries and I can start to do some spatial processing.