Wednesday, February 03, 2021

Spatial Data 1: Loading GPX data into XML data types

This post is part of a series about my first steps using Spatial Data in the Oracle database.  I am using the GPS data for my cycling activities collected by Strava.

These posts show only extracts of the scripts I have written.  The full files are available on GitHub.

Upload and Expand Strava Bulk Export

Strava will bulk export all your data to a zipped folder.  It contains various CSV files.  I am interested in activities.csv, which contains a row for each activity with various pieces of data, including the name of the data file that can be found in the /activities directory.  That file will usually be a .gpx file, or it may be gzip-compressed as a .gpx.gz file.  GPX is an XML format that contains sets of longitude/latitude coordinates and may contain other attributes.

The first job is to upload the Strava export .zip file to somewhere accessible to the database server (in my case /vagrant) and to expand it (to /tmp/strava/).

cd /vagrant
mkdir /tmp/strava
unzip /vagrant/export_1679301.zip -d /tmp/strava
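
A quick count of the file types in the expanded export shows what has to be dealt with later.  This is only a sketch: count_by_ext is a name made up for this illustration, and it assumes the export was expanded to /tmp/strava as above.

```shell
# Tally the files in a directory by extension (.gpx, .gpx.gz, .fit.gz, other).
# count_by_ext is a hypothetical helper name used only for this illustration.
count_by_ext() {
  dir=$1
  gpx=0; gpxgz=0; fitgz=0; other=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue          # skip sub-directories and an unmatched glob
    case "$f" in
      *.gpx.gz) gpxgz=$((gpxgz + 1)) ;;
      *.fit.gz) fitgz=$((fitgz + 1)) ;;
      *.gpx)    gpx=$((gpx + 1)) ;;
      *)        other=$((other + 1)) ;;
    esac
  done
  echo "gpx=$gpx gpx.gz=$gpxgz fit.gz=$fitgz other=$other"
}

# Report on the expanded export, if it is present.
if [ -d /tmp/strava/activities ]; then
  count_by_ext /tmp/strava/activities
fi
```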

Create Strava Schema 

I need to create a new database schema to hold the various objects I will create, and I have to give it certain privileges.
connect / as sysdba
create user strava identified by strava;
grant connect, resource to strava;
grant create view to strava;
grant select_catalog_role to strava;
grant XDBADMIN to STRAVA;
grant alter session to STRAVA;
alter user strava quota unlimited on users;
alter user strava default tablespace users;

GRANT CREATE ANY DIRECTORY TO strava;
CREATE OR REPLACE DIRECTORY strava as '/tmp/strava';
CREATE OR REPLACE DIRECTORY activities as '/tmp/strava/activities';
CREATE OR REPLACE DIRECTORY exec_dir AS '/usr/bin';

GRANT READ, EXECUTE ON DIRECTORY exec_dir TO strava;
GRANT READ, EXECUTE ON DIRECTORY strava TO strava;
GRANT READ ON DIRECTORY activities TO strava;
  • I need to create database directories for both the CSV files in /tmp/strava and the various GPX files in the /tmp/strava/activities sub-directory.  I will need read privilege on both directories, and also execute privilege on the strava directory so that I can use a pre-processor script.
  • The exec_dir directory points to /usr/bin where the zip executables are located.  I need read and execute privilege on this so I can read directly from zipped files.
  • XDBADMIN: "Allows the grantee to register an XML schema globally, as opposed to registering it for use or access only by its owner. It also lets the grantee bypass access control list (ACL) checks when accessing Oracle XML DB Repository".

Import CSV file via an External Table

I will start by creating an external table to read the Strava activities.csv file, and then copy it into a database table.  This file is a simple comma-separated values (CSV) file.  The activity date, name and description are enclosed in double-quotes.
The first problem that I encountered was that some of the descriptions I typed into Strava contain newline characters and the external table interprets them as the end of the record even though these characters are inside the double-quotes.
4380927517,"23 Nov 2020, 18:03:54",Zwift Crash Recovery,Virtual Ride,"Zwift Crash Recovery
1. recover fit file per https://zwiftinsider.com/retrieve-lost-ride/, 
2. fix corrupt .fit file with https://www.fitfiletools.com",1648,13.48,,false,Other,activities/4682540615.gpx.gz,,10.0,1648.0,1648.0,13480.2001953125,13.199999809265137,
8.179733276367188,91.0,36.20000076293945,12.600000381469727,69.5999984741211,7.099999904632568,0.40652215480804443,,,84.0,62.1943244934082,
,,,150.66201782226562,276.8444519042969,,,,,,,,,,,,158.0,1649.0,,,0.0,,1.0,,,,,,,,,,,,,,,,4907360.0,,,,,,,,,,,
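
One way to check whether an export is affected is to count the lines that contain an odd number of double quotes, because such a line opens or closes a quoted field that spans more than one line.  This is a rough sketch rather than a full CSV parser: it assumes that any quote inside a field is doubled (so it does not change the parity), and count_odd_quote_lines is a hypothetical name.

```shell
# Count the lines of a CSV file that contain an odd number of double quotes.
count_odd_quote_lines() {
  # With '"' as the field separator, a non-empty line with q quotes has q+1
  # fields, so an even field count means an odd number of quotes.
  awk -F'"' 'NF > 0 && NF % 2 == 0 { n++ } END { print n + 0 }' "$1"
}

# Check the Strava file if it is present.
if [ -f /tmp/strava/activities.csv ]; then
  count_odd_quote_lines /tmp/strava/activities.csv
fi
```

A result of 0 suggests that every record sits on a single line; anything else means some records will be split by the external table unless they are pre-processed.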
As Chris Saxon points out on AskTom, it is necessary to pre-process the records to replace the newline characters with something else.  I found this awk script that does so, put it into a shell script nlfix.sh, made it executable and invoked it as a pre-processor in the external table definition.
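#nlfix.sh
#RS='"' splits the input on double quotes, so every even-numbered record lies inside a quoted field
/usr/bin/gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' "$@"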
  • Note the full path for gawk is specified.
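
The effect of the pre-processor can be seen on a small sample.  The sketch below is a variant of the same idea that avoids gawk's RT variable so that it also runs with other awk implementations.  It is an illustration only, not the script used by the external table, and join_quoted_newlines is a hypothetical name.

```shell
# Remove newlines that fall inside double-quoted fields.
# Splitting records on '"' means every even-numbered record is the inside
# of a quoted field, which is the only place newlines are removed.
join_quoted_newlines() {
  awk -v RS='"' '
    NR > 1      { printf "\"" }        # put back the quote that ended the previous record
    NR % 2 == 0 { gsub(/\n/, "") }     # inside quotes: drop embedded newlines
                { printf "%s", $0 }
  ' "$@"
}

# Sample: one header line and one activity whose description spans two lines.
printf '%s\n' 'id,desc,secs' '1,"line one' 'line two",1648' | join_quoted_newlines
```

The two physical lines of the description come out as a single record.  As in nlfix.sh, the newline is replaced with nothing, so the text on either side of it is simply joined.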
A database directory is needed for the location of the pre-processor scripts and it is necessary to grant read and execute privileges on it.  I simply put the pre-processor in the same directory as the CSV file so I could use the same strava directory I created earlier.
GRANT READ, EXECUTE ON DIRECTORY strava TO strava;
Now I can define an external table that will read the activities.csv file. 
CREATE TABLE strava.activities_ext
(Activity_ID NUMBER
,Activity_Date DATE
,Activity_Name VARCHAR2(100)
,Activity_Type VARCHAR2(15)
,Activity_Description VARCHAR2(200)
,Elapsed_Time NUMBER
,Distance_km NUMBER
…)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
 DEFAULT DIRECTORY strava
 ACCESS PARAMETERS 
 (RECORDS DELIMITED BY newline 
  SKIP 1
  DISABLE_DIRECTORY_LINK_CHECK
  PREPROCESSOR strava:'nlfix.sh' 
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' RTRIM
  MISSING FIELD VALUES ARE NULL
  REJECT ROWS WITH ALL NULL FIELDS
  NULLIF = BLANKS
(Activity_ID,Activity_Date date "DD Mon yyyy,HH24:mi:ss"
,Activity_Name,Activity_Type,Activity_Description
,Elapsed_Time,Distance_km
…))
LOCATION ('activities.csv')
) REJECT LIMIT 5
/
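
The field list in the access parameters is positional, so it has to follow the order of the columns in the CSV header.  A numbered list of the header names makes that easier to check.  This is a sketch that assumes no header name contains a comma; list_csv_header is a hypothetical name.

```shell
# Print the column names in the first line of a CSV file, one per line, numbered.
# Assumes that no header name contains a comma.
list_csv_header() {
  head -n 1 "$1" | awk -F, '{ for (i = 1; i <= NF; i++) printf "%d %s\n", i, $i }'
}

# List the Strava columns if the file is present.
if [ -f /tmp/strava/activities.csv ]; then
  list_csv_header /tmp/strava/activities.csv
fi
```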

Import Activities

Now I can simply copy from the external table to a regular table.  I have omitted a lot of columns that Strava does not populate (at least not in my export) but that appear in the CSV file.
rem 1b_create_activities_ext.sql
spool 1b_create_activities_ext 

CREATE TABLE strava.activities AS
select ACTIVITY_ID,ACTIVITY_DATE,ACTIVITY_NAME,ACTIVITY_TYPE,ACTIVITY_DESCRIPTION,
ELAPSED_TIME,DISTANCE_KM,RELATIVE_EFFORT,COMMUTE_CHAR,ACTIVITY_GEAR,
FILENAME,
ATHLETE_WEIGHT,BIKE_WEIGHT,ELAPSED_TIME2,MOVING_TIME,DISTANCE_M,MAX_SPEED,AVERAGE_SPEED,
ELEVATION_GAIN,ELEVATION_LOSS,ELEVATION_LOW,ELEVATION_HIGH,MAX_GRADE,AVERAGE_GRADE,
--AVERAGE_POSITIVE_GRADE,AVERAGE_NEGATIVE_GRADE,
MAX_CADENCE,AVERAGE_CADENCE,
--MAX_HEART_RATE,
AVERAGE_HEART_RATE,
--MAX_WATTS,
AVERAGE_WATTS,CALORIES,
--MAX_TEMPERATURE,AVERAGE_TEMPERATURE,
RELATIVE_EFFORT2,
TOTAL_WORK,
--NUMBER_OF_RUNS,
--UPHILL_TIME,DOWNHILL_TIME,OTHER_TIME,
PERCEIVED_EXERTION,
--TYPE,
--START_TIME,
WEIGHTED_AVERAGE_POWER,POWER_COUNT,
PREFER_PERCEIVED_EXERTION,PERCEIVED_RELATIVE_EFFORT,
COMMUTE,
--TOTAL_WEIGHT_LIFTED,
FROM_UPLOAD,
GRADE_ADJUSTED_DISTANCE,
--WEATHER_OBSERVATION_TIME,WEATHER_CONDITION,
--WEATHER_TEMPERATURE,APPARENT_TEMPERATURE,
--DEWPOINT,HUMIDITY,WEATHER_PRESSURE,
--WIND_SPEED,WIND_GUST,WIND_BEARING,
--PRECIPITATION_INTENSITY,
--SUNRISE_TIME,SUNSET_TIME,MOON_PHASE,
BIKE
--GEAR,
--PRECIPITATION_PROBABILITY,PRECIPITATION_TYPE,
--CLOUD_COVER,WEATHER_VISIBILITY,UV_INDEX,WEATHER_OZONE,
--JUMP_COUNT,TOTAL_GRIT,AVG_FLOW,
--FLAGGED
FROM strava.activities_ext
/

ALTER TABLE activities ADD CONSTRAINT activities_pk PRIMARY KEY (activity_id);
…
ALTER TABLE activities ADD (gpx XMLTYPE) XMLTYPE COLUMN gpx STORE AS SECUREFILE BINARY XML (CACHE DISABLE STORAGE IN ROW);
ALTER TABLE activities ADD (geom mdsys.sdo_geometry);
ALTER TABLE activities ADD (geom_27700 mdsys.sdo_geometry);
ALTER TABLE activities ADD (mbr mdsys.sdo_geometry);
ALTER TABLE activities ADD (xmlns VARCHAR2(128));
ALTER TABLE activities ADD (num_pts INTEGER DEFAULT 0);

Spool off
  • I have specified a primary key on activity_id and made a number of other columns not nullable.
  • I have added a new XMLTYPE column GPX into which I will load the GPS data in the .gpx files.  

FIT files

Some applications, such as Garmin and Rouvy, generate compressed .fit files, and Strava exports them as .fit.gz files (apparently when it cannot convert them, although it does convert the .fit files from Zwift to .gpx).  These are binary files.  Since I only have a few of them, I converted them to .gpx files using GPSBabel on my laptop and then re-uploaded the .gpx files.
for %i in (*.fit.gz) do "C:\Program Files\GnuWin\bin\gzip" -fd %i
for %i in (*.fit) do "C:\Program Files (x86)\GPSBabel\GPSBabel.exe" -i garmin_fit -f "%i" -o gpx -F "%~ni".gpx
I then update the file name in the activities table.
UPDATE activities
SET filename = REPLACE(filename,'.fit.gz','.gpx')
WHERE filename like '%.fit.gz'
/
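
The same conversion could be run on a Linux host instead of a Windows laptop.  The sketch below assumes that GPSBabel is installed there and on the PATH, which is not something I have set up for this post; it uses the same GPSBabel options as the commands above, and fit_to_gpx is a hypothetical name.

```shell
# Convert every .fit or .fit.gz file in a directory to .gpx with GPSBabel.
# GPSBABEL can be overridden, for example with "echo" for a dry run.
fit_to_gpx() {
  dir=$1
  for f in "$dir"/*.fit.gz; do
    if [ -f "$f" ]; then
      gzip -fd "$f"                  # replaces x.fit.gz with x.fit
    fi
  done
  for f in "$dir"/*.fit; do
    [ -f "$f" ] || continue          # nothing to convert
    "${GPSBABEL:-gpsbabel}" -i garmin_fit -f "$f" -o gpx -F "${f%.fit}.gpx"
  done
}
```

Setting GPSBABEL=echo before calling fit_to_gpx prints the GPSBabel arguments instead of running the conversion, which is a cheap way to check the file names first.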

Compress GPX files (optional)

Some of the GPX files in the Strava export are compressed and some are not.  There is no obvious reason why.  To minimise the space I can gzip the GPX files.
gzip -9v /tmp/strava/activities/*.gpx
If I do compress any .gpx files, then I also need to update the file names in the activities table.
UPDATE activities
SET filename = filename||'.gz'
WHERE filename LIKE '%.gpx'
/
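
Since each .gz file is later decompressed inside the database, it may be worth testing the archives on the file system first.  A minimal sketch using gzip's own integrity test; bad_gz_files is a hypothetical name.

```shell
# List the .gz files in a directory that fail gzip's integrity test.
bad_gz_files() {
  for f in "$1"/*.gz; do
    [ -f "$f" ] || continue                    # unmatched glob: nothing to test
    gzip -t "$f" 2>/dev/null || echo "$f"      # print only the files that fail
  done
}

# Test the Strava activity files if they are present.
if [ -d /tmp/strava/activities ]; then
  bad_gz_files /tmp/strava/activities
fi
```

No output means that every archive passed the test.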

Load the GPX files into the XML data type

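The next stage is to load each of the GPX files into the activities table.  The packaged function getClobDocument opens a file through a database directory, decompresses it if the name ends in .gz, and returns the contents as a CLOB.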
create or replace package body strava_pkg as 
k_module      CONSTANT VARCHAR2(48) := $$PLSQL_UNIT;
…
----------------------------------------------------------------------------------------------------
function getClobDocument
(p_directory IN VARCHAR2
,p_filename  IN VARCHAR2
,p_charset   IN VARCHAR2 DEFAULT NULL
) return        CLOB deterministic
is
  l_module VARCHAR2(64); 
  l_action VARCHAR2(64);

  v_filename      VARCHAR2(128);
  v_directory     VARCHAR2(128);
  v_file          bfile;
  v_unzipped      blob := empty_blob();

  v_Content       CLOB := ' ';
  v_src_offset    number := 1 ;
  v_dst_offset    number := 1 ;
  v_charset_id    number := 0;
  v_lang_ctx      number := DBMS_LOB.default_lang_ctx;
  v_warning       number;

  e_22288 EXCEPTION; --file or LOB operation FILEOPEN failed
  PRAGMA EXCEPTION_INIT(e_22288, -22288);
BEGIN
  dbms_application_info.read_module(module_name=>l_module
                                   ,action_name=>l_action);
  dbms_application_info.set_module(module_name=>k_module
                                  ,action_name=>'getClobDocument');

  IF p_charset IS NOT NULL THEN
    v_charset_id := NLS_CHARSET_ID(p_charset);
  END IF;

  v_filename  := REGEXP_SUBSTR(p_filename,'[^\/]+',1,2);
  v_directory := REGEXP_SUBSTR(p_filename,'[^\/]+',1,1);

  IF v_directory IS NOT NULL and v_filename IS NULL THEN /*only one path component, so it is actually a filename*/
    v_filename := v_directory; 
    v_directory := '';
  END IF;

  IF p_directory IS NOT NULL THEN
    v_directory := p_directory;
  END IF;

  v_File := bfilename(UPPER(v_directory),v_filename);

  BEGIN
    DBMS_LOB.fileopen(v_File, DBMS_LOB.file_readonly);
  exception 
    when VALUE_ERROR OR e_22288 then
      dbms_output.put_line('Can''t open:'||v_directory||'/'||v_filename||' - '||v_dst_offset||' bytes');
      v_content := '';
      dbms_application_info.set_module(module_name=>l_module
                                      ,action_name=>l_action);
      return v_content;
  END;

  IF v_filename LIKE '%.gz' THEN
    v_unzipped := utl_compress.lz_uncompress(v_file);
    dbms_lob.converttoclob(
      dest_lob     => v_content,
      src_blob     => v_unzipped,
      amount       => DBMS_LOB.LOBMAXSIZE, 
      dest_offset  => v_dst_offset,
      src_offset   => v_src_offset,
      blob_csid    => v_charset_id, --honour p_charset, as in the uncompressed branch (0 = default)
      lang_context => v_lang_ctx,
      warning      => v_warning);
  ELSE --ELSIF v_filename LIKE '%.g__' THEN
    DBMS_LOB.LOADCLOBFROMFILE(v_Content, 
      Src_bfile    => v_File,
      amount       => DBMS_LOB.LOBMAXSIZE, 
      src_offset   => v_src_offset, 
      dest_offset  => v_dst_offset,
      bfile_csid   => v_charset_id, 
      lang_context => v_lang_ctx,
      warning => v_warning);
  END IF;

  dbms_output.put_line(v_directory||'/'||v_filename||' - '||v_dst_offset||' bytes');
  DBMS_LOB.fileclose(v_File);

  dbms_application_info.set_module(module_name=>l_module
                                  ,action_name=>l_action);

  return v_Content;
exception when others then
  dbms_output.put_line(v_directory||'/'||v_filename||' - '||v_dst_offset||' bytes');
  IF DBMS_LOB.fileisopen(v_File) = 1 THEN --only close the file if it was actually opened
    DBMS_LOB.fileclose(v_File);
  END IF;
  dbms_application_info.set_module(module_name=>l_module
                                  ,action_name=>l_action);
  raise;
end getClobDocument;
----------------------------------------------------------------------------------------------------
…
END strava_pkg;
/
I can simply query the contents of the uncompressed GPX file in SQL by calling the function.  In this case, the gzipped .gpx file is 65 KB but decompresses to 1.2 MB.
Set long 1000 lines 200 pages 99 serveroutput on
Column filename  format a30
Column gpx format a100
select activity_id, filename
, strava_pkg.getClobDocument('',filename) gpx
from activities
where filename like '%.gpx%'
And activity_id = 4468006769
order by 1
/

ACTIVITY_ID FILENAME                       GPX
----------- ------------------------------ ----------------------------------------------------------------------------------------------------
 4468006769 activities/4468006769.gpx.gz   <?xml version="1.0" encoding="UTF-8"?>
                                           <gpx creator="StravaGPX Android" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLoc
                                           ation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin
                                           .com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.gar
                                           min.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd
                                           " version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlsch
                                           emas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3">
                                            <metadata>
                                             <time>2020-12-13T14:31:13Z</time>
                                            </metadata>
                                            <trk>
                                             <name>Loop</name>
                                             <type>1</type>
                                             <trkseg>
                                              <trkpt lat="51.5296380" lon="-0.1875360">
                                               <ele>30.6</ele>
                                               <time>2020-12-13T14:31:13Z</time>
                                               <extensions>
                                                <gpxtpx:TrackPointExtension>
                                                 <gpxtpx:hr>57</gpxtpx:hr>
                                                </gpxtpx:TrackPointExtension>
                                               </extensions>
                                              </trkpt>
…

activities/4468006769.gpx.gz - 1286238
Elapsed: 00:00:00.14
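
The compressed and uncompressed sizes can also be compared on the file system without loading anything.  A small sketch; gz_sizes is a hypothetical name.

```shell
# Print the compressed and uncompressed size of a gzip file in bytes.
gz_sizes() {
  packed=$(wc -c < "$1")               # size of the file on disk
  unpacked=$(gzip -dc "$1" | wc -c)    # size of the decompressed stream
  # $(( )) strips the padding that some wc implementations add
  echo "$1: $((packed)) bytes compressed, $((unpacked)) bytes uncompressed"
}

# Report on the activity queried above, if the file is present.
if [ -f /tmp/strava/activities/4468006769.gpx.gz ]; then
  gz_sizes /tmp/strava/activities/4468006769.gpx.gz
fi
```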
I can load the .gpx files into the GPX column of the activities table with a simple update statement.  The CLOB returned from the function is converted to XML with the XMLTYPE constructor.
UPDATE activities
SET gpx = XMLTYPE(strava_pkg.getClobDocument('ACTIVITIES',filename))
WHERE filename like '%.gpx%'
/
I can now query back the same GPX from the database.
Set long 1100 lines 200 pages 99 serveroutput on
select activity_id, filename, gpx
from activities
where filename like '%.gpx%'
And activity_id = 4468006769
order by 1
/

ACTIVITY_ID FILENAME                       GPX
----------- ------------------------------ ----------------------------------------------------------------------------------------------------
 4468006769 activities/4468006769.gpx.gz   <?xml version="1.0" encoding="US-ASCII"?>
                                           <gpx creator="StravaGPX Android" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLoc
                                           ation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin
                                           .com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.gar
                                           min.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd
                                           " version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlsch
                                           emas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3">
                                             <metadata>
                                               <time>2020-12-13T14:31:13Z</time>
                                             </metadata>
                                             <trk>
                                               <name>Loop</name>
                                               <type>1</type>
                                               <trkseg>
                                                 <trkpt lat="51.5296380" lon="-0.1875360">
                                                   <ele>30.6</ele>
                                                   <time>2020-12-13T14:31:13Z</time>
                                                   <extensions>
                                                     <gpxtpx:TrackPointExtension>
                                                       <gpxtpx:hr>57</gpxtpx:hr>
                                                     </gpxtpx:TrackPointExtension>
                                                   </extensions>
                                                 </trkpt>
                                                 <trkpt lat="51.5296350" lon="-0.1875340">
…
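
As a cross-check on what has been loaded, the track points in a file can be counted outside the database.  This is a sketch that simply counts occurrences of the opening trkpt tag in a plain or gzipped GPX file; count_trkpts is a hypothetical name.

```shell
# Count the <trkpt> elements in a GPX file, which may be gzip-compressed.
count_trkpts() {
  if [ "${1%.gz}" != "$1" ]; then
    # name ends in .gz: decompress to a pipe first
    n=$(gzip -dc "$1" | grep -o '<trkpt' | wc -l)
  else
    n=$(grep -o '<trkpt' "$1" | wc -l)
  fi
  echo $((n))                          # $(( )) strips any padding from wc
}

# Count the points in the activity queried above, if the file is present.
if [ -f /tmp/strava/activities/4468006769.gpx.gz ]; then
  count_trkpts /tmp/strava/activities/4468006769.gpx.gz
fi
```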
