Monday, March 29, 2021

Spatial Data 5: Searching For Geometries That Intersect Other Geometries

This blog is part of a series about my first steps in using Spatial Data in the Oracle database.  I am using the GPS data from my cycling activities collected by Strava. All of my files are available on GitHub.

I have loaded basic data for all countries, and detailed data for the UK and other countries where I have recorded activities.  The next step is to determine which activities pass through which areas.  Generically, the question is simply whether one geometry intersects with another.   I can test this in SQL with the sdo_geom.relate() function.

WHERE SDO_GEOM.RELATE(a.geom,'anyinteract',m.geom) = 'TRUE'

However, working out whether an activity, with several thousand points, is within an area defined with several thousand points can be CPU intensive and time-consuming.  Larger areas such as UK counties average over 20,000 points. 

I have 60,000 defined areas, of which, over 20,000 of which are for the UK.  I have 2700 activities recorded on Strava, with an average of 2700 points, but some have over 10,000 points.  It isn't viable to compare every activity with every area.  Comparing these large geometries can take a significant time, too long to do the spatial queries every time I want to interrogate the data, and too long for an on-line application.

Pre-processing Geometry Intersections

However, the data, once loaded is static.  Definitions of areas can change, but it is rare.  Activities do not change.  Therefore, I have decided to pre-process the data to produce a table of matching activities and areas.

CREATE TABLE activity_areas
(activity_id NUMBER NOT NULL
,area_code   VARCHAR2(4) NOT NULL
,area_number NUMBER NOT NULL
,geom_length NUMBER
,CONSTRAINT ACTIVITY_AREAS_PK PRIMARY KEY (activity_id, area_code, area_number)
,CONSTRAINT ACTIVITY_AREAS_FK FOREIGN KEY (activity_id) REFERENCES ACTIVITIES (activity_id)
,CONSTRAINT ACTIVITY_AREAS_FK2 FOREIGN KEY (area_code, area_number) 
                       REFERENCES MY_AREAS (area_code, area_number)
);

Recursive Search

I have written the search as a PL/SQL procedure to search areas that match a particular activity.

  • I pass the ID of the activity to be processed to the procedure.
  • I can specify the area code and number, or the parent area code and number, at which to search through the areas.  I usually leave them to default to null so the search starts with areas at the root of the hierarchy that therefore have no parents (i.e. sovereign countries).  
  • The procedure then calls itself recursively for each area that it finds matches the activity, to search its children.  This way, I limit the total number of comparisons required.  
  • For every area and activity, I have calculated the minimum bounding rectangle using sdo_geom.sdo_mbr() and stored it in another geometry column on the same row.  This geometry contains just 5 points (the last point is the same as the first to close the rectangle).  I can compare two rectangles very quickly, and if they don't intersect overlap then there is no need to see if the actual geometries overlap.  This approach filters out geometries that cannot match, so that fewer geometries then have to be compared in full, thus significantly improving the performance of the search.
AND SDO_GEOM.RELATE(a.mbr,'anyinteract',m.mbr) = 'TRUE'

  • I have found that it is necessary to have the MBR comparison earlier in the predicate clauses than the GEOM comparison.

…
PROCEDURE activity_area_search
(p_activity_id INTEGER
,p_area_code   my_areas.area_code%TYPE DEFAULT NULL
,p_area_number my_areas.area_number%TYPE DEFAULT NULL
,p_query_type VARCHAR2 DEFAULT 'P'
,p_level INTEGER DEFAULT 0
) IS
BEGIN
  FOR i IN(
   SELECT m.*
   ,      CASE WHEN m.geom_27700 IS NOT NULL THEN sdo_geom.sdo_length(SDO_GEOM.sdo_intersection(m.geom_27700,a.geom_27700,5), unit=>'unit=km') 
               WHEN m.geom       IS NOT NULL THEN sdo_geom.sdo_length(SDO_GEOM.sdo_intersection(m.geom,a.geom,5), unit=>'unit=km') 
          END geom_length
   ,      (SELECT MIN(m2.area_level) FROM my_areas m2 
           WHERE  m2.parent_area_code = m.area_code AND m2.parent_area_number = m.area_number) min_child_level
   FROM   my_areas m
   ,      activities a
   WHERE  (  (p_query_type = 'P' AND parent_area_code = p_area_code AND parent_area_number = p_area_number) 
          OR (p_query_type = 'A' AND area_code        = p_area_code AND area_number        = p_area_number)
	  OR (p_query_type = 'A' AND p_area_number IS NULL          AND area_code          =  p_area_code)
          OR (p_area_code IS NULL AND p_area_number IS NULL AND parent_area_code IS NULL AND parent_area_number IS NULL))
   AND    a.activity_id = p_activity_id
   and    SDO_GEOM.RELATE(a.mbr,'anyinteract',m.mbr) = 'TRUE'
   and    SDO_GEOM.RELATE(a.geom,'anyinteract',m.geom) = 'TRUE'
  ) LOOP
    IF i.area_level>0 OR i.num_children IS NULL THEN
      BEGIN
        INSERT INTO activity_areas
        (activity_id, area_code, area_number, geom_length)
        VALUES
        (p_activity_id, i.area_code, i.area_number, i.geom_length);
      EXCEPTION
        WHEN dup_val_on_index THEN
          UPDATE activity_areas
          SET    geom_length = i.geom_length
          WHERE  activity_id = p_activity_id
          AND    area_code = i.area_code
          AND    area_number = i.area_number;
      END;
    END IF;
  
    IF i.num_children > 0 THEN
      strava_pkg.activity_area_search(p_activity_id, i.area_code, i.area_number, 'P', p_level+1);
    END IF;
  END LOOP;

END activity_area_search;
…

The search can process a single activity by calling the procedure.  An activity that found just 10 areas, took just 6 seconds to process.  However, it does not scale linearly.  Activities that have over 100 areas can take at least 6 minutes.
SQL> exec strava_pkg.activity_area_search(4372796838);
Searching 4372796838:-
Found SOV-1159320701:United Kingdom,    2.895 km
.Searching 4372796838:SOV-1159320701
.Found GEOU-1159320743:England,    2.851 km
..Searching 4372796838:GEOU-1159320743
..Found GLA-117537:Greater London,    2.851 km
...Searching 4372796838:GLA-117537
...Found LBO-50724:City of Westminster,    1.732 km
....Searching 4372796838:LBO-50724
....Found LBW-117484:Abbey Road,    1.435 km
....Found LBW-50639:Maida Vale,    0.298 km
....Done 4372796838:LBO-50724:    0.415 secs).
...Found LBO-50632:Camden,    1.119 km
....Searching 4372796838:LBO-50632
....Found LBW-117286:Kilburn,    0.273 km
....Found LBW-117288:Swiss Cottage,    1.033 km
....Found LBW-117287:West Hampstead,    0.084 km
....Done 4372796838:LBO-50632:    0.521 secs).
...Done 4372796838:GLA-117537:    3.368 secs).
..Done 4372796838:GEOU-1159320743:    4.372 secs).
.Done 4372796838:SOV-1159320701:    4.750 secs).
Done 4372796838:-:    5.532 secs).

PL/SQL procedure successfully completed.
Since I load Strava activities from the bulk download, I also process them in bulk.
--process unmatched activities
set pages 99 lines 180 timi on serveroutput on
column activity_name format a60
BEGIN 
  FOR i IN (
    SELECT a.activity_id, activity_date, activity_name
    ,      distance_km, num_pts, ROUND(num_pts/NULLIF(distance_km,0),0) ppkm
    FROM   activities a
    WHERE  activity_id NOT IN (SELECT DISTINCT activity_id FROM activity_areas)
    AND    num_pts>0
  ) LOOP
    strava_pkg.activity_area_search(i.activity_id);
    commit;
  END LOOP;
END;
/
Matching 2,700 activities produced 71,628 rows on activity_areas for 5,620 distinct areas. In the next article, I will demonstrate how to text search the areas for matching activities.

Sunday, March 21, 2021

Spatial Data 4: Obtaining Geographical Data

This blog is part of a series about my first steps in using Spatial Data in the Oracle database.  I am using the GPS data from my cycling activities collected by Strava. All of my files are available on GitHub.

The next stage is to use my Strava data as a resource for ride planning.  For example, I want to go for a ride in the Chilterns this weekend, I want to look at previous rides in the Chilterns to see where I have gone.  This presents a number of challenges that I will cover over the next few blogs.

  • I need a working definition of the Chilterns.  
  • I need to identify which activities entered the area defined as being the Chilterns.  

More generically, I might be interested in any area in any country.  I need to be able to search for areas by name, then identify the activities that passed through these areas.

Geographical Areas

The world is divided up into 206 sovereignties (including independent and leased areas), and those are then divided down.  Let's take the United Kingdom as an example:

United Kingdom

.England
.Northern Ireland
.Scotland
.Wales

.Guernsey
..Alderney
..Guernsey
..Herm
..Sark
.Isle of Man
.Jersey

.Anguilla
.Bermuda
.Cayman Islands
.Dhekelia Sovereign Base Area
.Falkland Islands
.Gibraltar
.British Indian Ocean Territory
..Diego Garcia Naval Support Facility
.Montserrat
.Pitcairn Islands
.South Georgia and the Islands
..South Georgia
..South Sandwich Islands
.Saint Helena
..Ascension
..Saint Helena
..Tristan da Cunha
.Turks and Caicos Islands
.British Virgin Islands
 Akrotiri Sovereign Base Area

  • The United Kingdom consists of the 4 'home' countries.
    • These are divided down into counties, authorities, districts, boroughs, wards and parishes.
  • Guernsey, Jersey and the Isle of Man are "Crown Dependencies".
  • There are 14 dependent territories
    • Some of these are broken down further into separate islands.

I need enough areas to allow me to effectively search areas by name and then determine which activities are in which areas.

To return to the original question, the Chiltern Hills are not a government administrative area but are designated as an Area of Outstanding Natural Beauty (AONB).  As, they are a useful shorthand to describe some areas where I regularly cycle, so I have included them in the heirarchy.

Loading Spatial Data from Esri Shapefile

Lots of geographical data is publically available from a variety of organisations and governments in the form of shapefiles.  This is "Esri's somewhat open, hybrid vector data format using SHP, SHX and DBF files. Originally invented in the early 1990s, it is still commonly used as a widely supported interchange format".  Oracle provides a java shapefile converter that transforms shapefiles into database tables.

See also: 

Shapefiles are zip archives that contain a number of files, including but not limited to the following, but containing at least always the first three:

  • .shp - the main file that contains the geometry itself,
  • .shx - an index file,
  • .dbf - a DBase file containing other attributes to describe the spatial data.  When you load the shapefile, the DBF file is loaded into all the other columns in the table.  This file can be opened with Microsoft Excel so you can see the data,
  • .prj - contains the projection description of the data,
  • .csv -the same data as in the .dbf file, but as a comma-separated data file,
  • .cpg - the code page of the data in the .dbf file.

A little searching with Google turned up a number of useful sources of publically available spatial data (although most of it requires to be licenced for commercial use).

Most of the shapefiles provide data in latitude/longitude in WGS84 that corresponds to SRID 4326.  However, the data from the UK government and the Ordnance Survey uses the British National Grid (BNG) GCS_OSGB_1936.  This corresponds to SRID 27700 (see Convert GPX Track to a Spatial Line Geometry).

By default, the shapefile converter creates geometries in the coordinate system provided by the shapefile.  It is possible to specify a different coordinate system at load time, however, converting the data significantly slows the load process (my experience is that it increases load duration by about a factor of approximately 5).  

The spatial data is loaded into a geometry column called geom by default.  However, the column name can be specified.

Later when it comes to comparing spatial data, you can only compare geometries that have the same SRID.  Therefore, it is important to know the coordinate system of the data with which you are dealing.  My convention is to put WGS84 (SRID 4326) data into columns call geom, and to put British National Grid into columns called geom_27700. I load data in the coordinate system of the shapefile.  Later on, I may add additional columns and copy and convert the data.

I have written a simple shell script (load_shapes.sh) to call the java shapefile converter, including controlling the SRID and the name of the table and the geometry column.

#load_shapes.sh
…
function shp_load {
  echo $0:$*
…
cd $dir
  pwd
  export clpath=$ORACLE_HOME/suptools/tfa/release/tfa_home/jlib/ojdbc5.jar:$ORACLE_HOME/md/jlib/sdoutl.jar:$ORACLE_HOME/md/jlib/sdoapi.jar
  echodo "java -cp $clpath oracle.spatial.util.SampleShapefileToJGeomFeature -h oracle-database.local -p 1521 -sn oracle_pdb -u strava -d strava -t $table -f $base -r $srid -g ${col}"
}

clear
#set -x

shp_load /tmp/strava/ne_10m_admin_0_sovereignty.shp
shp_load /tmp/strava/ne_10m_admin_0_map_units
shp_load /tmp/strava/ne_10m_admin_0_map_subunits
…

I can now load each shapefile into a separate table.  

Merging Shapefile Data into a Single Set of Data

The various tables created by loading shapefiles will each have their own structures determined by what was put into the shapefile. Ultimately, I am going to load them all into a single table with which I will work.  

Areas have a hierarchy and that is represented in this table by the linked list of area code and number to parent area code and number.  Foreign key constraints ensure the parent values are valid.  There are also check constraints to prevent an area from being its own parent.

REM my_areas_ddl.sql
…
CREATE TABLE my_areas
(area_Code varchar2(4) NOT NULL
,area_number integer NOT NULL
,uqid varchar2(20) NOT NULL
…
,area_level integer NOT NULL 
,parent_area_code varchar2(4)
,parent_area_number integer
,parent_uqid varchar2(20)
,name varchar2(40)
,suffix varchar2(20)
,iso_code3 varchar2(3)
…
,num_children integer
,matchable integer default 1
…
,geom mdsys.sdo_geometry
,geom_27700 mdsys.sdo_geometry
,mbr mdsys.sdo_geometry
,constraint my_areas_pk primary key (area_code, area_number)
,constraint my_areas_uqid unique (uqid)
,constraint my_areas_rfk_area_code foreign key (parent_area_code, parent_area_number) references my_areas (area_code, area_number)
,constraint my_areas_rfk_uqid foreign key (parent_uqid) references my_areas (uqid)
,constraint my_areas_fk_area_code foreign key (area_code) references my_area_codes (area_code)
,constraint my_areas_check_parent_area_code CHECK (area_code != parent_area_code OR area_number != parent_area_number) 
,constraint my_areas_check_parent_uqid CHECK (uqid != parent_uqid)
)
/
--alter table my_areas modify matchable default 1;
Alter table my_areas add constraint my_areas_uq_iso_code3 unique (iso_code3);
Create index my_areas_rfk_uqid on my_areas(parent_uqid);
Create index my_areas_rfk_area_code on my_areas (parent_area_code, parent_area_number);

I have created scripts to populate data in the my_areas table from the Natural Earth data, and from the data for each country.  Different scripts are needed for each shapefile.

  • load_countries.sql - to load Natural Earth data
  • load_uk.sql - to load Ordnance Survey data of Great Britain.  This includes some DML to work out which wards and parishes are in which districts and boroughs and update the hierarchy accordingly.
  • load_XXX.sql, - load administrative areas for a country where XXX is the 3-letter ISO code for that country (eg. load_FRA.sql for France).
  • fix_names.sql - to simplify names stripping common suffixes such as a county, district, authority, ward etc.
    • fix_my_areas.sql - script to collect statistics, count children for each area, look for areas that children of another area with the same name, simplify areas with more than 10,000 points.