This blog is part of a series about my first steps using Spatial Data in the Oracle database. I am using the GPS data for my cycling activities collected by Strava. All of my files are available on GitHub.
In these posts, I have only shown extracts of some of the scripts I have written. The full files are available on GitHub.
Upload and Expand Strava Bulk Export
Strava will bulk export all your data to a zipped folder. It contains various CSV files. I am interested in activities.csv, which contains a row for each activity with various pieces of data, including the name of the data file that can be found in the /activities directory. That file will usually be a .gpx file, or it may be gzipped as a .gpx.gz file. GPX is an XML schema that contains sets of longitude/latitude coordinates and may contain other attributes.
The first job is to upload the Strava export .zip file to somewhere accessible to the database server (in my case /vagrant) and to expand it (to /tmp/strava/).
cd /vagrant
mkdir /tmp/strava
unzip /vagrant/export_1679301.zip -d /tmp/strava
Create Strava Schema
I need to create a new database schema to hold the various objects I will create, and I have to give it certain privileges.
connect / as sysdba
create user strava identified by strava;
grant connect, resource to strava;
grant create view to strava;
grant select_catalog_role to strava;
grant XDBADMIN to STRAVA;
grant alter session to STRAVA;
alter user strava quota unlimited on users;
alter user strava default tablespace users;
GRANT CREATE ANY DIRECTORY TO strava;
CREATE OR REPLACE DIRECTORY strava as '/tmp/strava';
CREATE OR REPLACE DIRECTORY activities as '/tmp/strava/activities';
CREATE OR REPLACE DIRECTORY exec_dir AS '/usr/bin';
GRANT READ, EXECUTE ON DIRECTORY exec_dir TO strava;
GRANT READ, EXECUTE ON DIRECTORY strava TO strava;
GRANT READ ON DIRECTORY activities TO strava;
- I need to create database directories for both the CSV files in /tmp/strava and the various GPX files in the /tmp/strava/activities sub-directory. I will need read privilege on both directories, and also execute privilege on the strava directory so that I can use a pre-processor script.
- The exec_dir directory points to /usr/bin where the zip executables are located. I need read and execute privilege on this so I can read directly from zipped files.
- XDBADMIN: "Allows the grantee to register an XML schema globally, as opposed to registering it for use or access only by its owner. It also lets the grantee bypass access control list (ACL) checks when accessing Oracle XML DB Repository".
Import CSV file via an External Table
I will start by creating an external table to read the Strava activities.csv file, and then copy it into a database table. This file is a simple comma-separated values file. The activity date, name and description are enclosed in double-quotes.
The first problem that I encountered was that some of the descriptions I typed into Strava contain newline characters and the external table interprets them as the end of the record even though these characters are inside the double-quotes.
4380927517,"23 Nov 2020, 18:03:54",Zwift Crash Recovery,Virtual Ride,"Zwift Crash Recovery
1. recover fit file per https://zwiftinsider.com/retrieve-lost-ride/,
2. fix corrupt .fit file with https://www.fitfiletools.com",1648,13.48,,false,Other,activities/4682540615.gpx.gz,,10.0,1648.0,1648.0,13480.2001953125,13.199999809265137,
8.179733276367188,91.0,36.20000076293945,12.600000381469727,69.5999984741211,7.099999904632568,0.40652215480804443,,,84.0,62.1943244934082,
,,,150.66201782226562,276.8444519042969,,,,,,,,,,,,158.0,1649.0,,,0.0,,1.0,,,,,,,,,,,,,,,,4907360.0,,,,,,,,,,,
As Chris Saxon points out on AskTom, it is necessary to pre-process the records to replace the newline characters with something else. I found this awk script to process the records. It splits the input on double-quotes, so every even-numbered chunk is the inside of a quoted field, and it removes the newlines from those chunks only. I put it into a shell script nlfix.sh, made it executable and invoked it as a pre-processor in the external table definition.
#nlfix.sh
/usr/bin/gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' "$@"
- Note the full path for gawk is specified.
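RT is a gawk extension, so the one-liner above will not run under other awk implementations. For experimenting on a sample file where only a POSIX awk is available, the same idea can be sketched differently: keep appending physical lines to a buffer until the running count of double-quotes is even, then emit the buffer as one record. This is only an illustration under that assumption, not the pre-processor script used here, and the function name join_quoted_newlines is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: join physical lines that belong to one quoted CSV field.
# Reads CSV on stdin, writes the repaired CSV on stdout.
join_quoted_newlines() {
  awk '{
    # gsub replaces each quote with itself and returns how many it found
    quotes += gsub(/"/, "\"")
    buf = buf $0
    # an even running total means every quoted field has been closed
    if (quotes % 2 == 0) { print buf; buf = ""; quotes = 0 }
  }
  END { if (buf != "") print buf }   # flush an unterminated record as-is
  '
}

# Example: the second physical line is part of the quoted description
printf '1,"line one\nline two",10\n2,"single line",20\n' | join_quoted_newlines
```

As with the gawk version, the embedded newline is simply removed, so the two halves of the description are concatenated into one field.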
A database directory is needed for the location of the pre-processor scripts and it is necessary to grant read and execute privileges on it. I simply put the pre-processor in the same directory as the CSV file so I could use the same strava directory I created earlier.
GRANT READ, EXECUTE ON DIRECTORY strava TO strava;
Now I can define an external table that will read the activities.csv file.
CREATE TABLE strava.activities_ext
(Activity_ID NUMBER
,Activity_Date DATE
,Activity_Name VARCHAR2(100)
,Activity_Type VARCHAR2(15)
,Activity_Description VARCHAR2(200)
,Elapsed_Time NUMBER
,Distance_km NUMBER
…)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
DEFAULT DIRECTORY strava
ACCESS PARAMETERS
(RECORDS DELIMITED BY newline
SKIP 1
DISABLE_DIRECTORY_LINK_CHECK
PREPROCESSOR strava:'nlfix.sh'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' RTRIM
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
NULLIF = BLANKS
(Activity_ID,Activity_Date date "DD Mon yyyy,HH24:mi:ss"
,Activity_Name,Activity_Type,Activity_Description
,Elapsed_Time,Distance_km
…))
LOCATION ('activities.csv')
) REJECT LIMIT 5
/
- I have included all of the columns in the CSV whether I need them later or not.
- I have specified a date format picture string for the activity date.
- See also Reading the Active Session History Compressed Export File in eDB360/SQLd360 as an External Table.
Import Activities
Now I can simply copy from the external table to a regular table. I have omitted a lot of columns that Strava does not populate (at least not in my export) but that appear in the CSV file.
rem 1b_create_activities_ext.sql
spool 1b_create_activities_ext
CREATE TABLE strava.activities AS
select ACTIVITY_ID,ACTIVITY_DATE,ACTIVITY_NAME,ACTIVITY_TYPE,ACTIVITY_DESCRIPTION,
ELAPSED_TIME,DISTANCE_KM,RELATIVE_EFFORT,COMMUTE_CHAR,ACTIVITY_GEAR,
FILENAME,
ATHLETE_WEIGHT,BIKE_WEIGHT,ELAPSED_TIME2,MOVING_TIME,DISTANCE_M,MAX_SPEED,AVERAGE_SPEED,
ELEVATION_GAIN,ELEVATION_LOSS,ELEVATION_LOW,ELEVATION_HIGH,MAX_GRADE,AVERAGE_GRADE,
--AVERAGE_POSITIVE_GRADE,AVERAGE_NEGATIVE_GRADE,
MAX_CADENCE,AVERAGE_CADENCE,
--MAX_HEART_RATE,
AVERAGE_HEART_RATE,
--MAX_WATTS,
AVERAGE_WATTS,CALORIES,
--MAX_TEMPERATURE,AVERAGE_TEMPERATURE,
RELATIVE_EFFORT2,
TOTAL_WORK,
--NUMBER_OF_RUNS,
--UPHILL_TIME,DOWNHILL_TIME,OTHER_TIME,
PERCEIVED_EXERTION,
--TYPE,
--START_TIME,
WEIGHTED_AVERAGE_POWER,POWER_COUNT,
PREFER_PERCEIVED_EXERTION,PERCEIVED_RELATIVE_EFFORT,
COMMUTE,
--TOTAL_WEIGHT_LIFTED,
FROM_UPLOAD,
GRADE_ADJUSTED_DISTANCE,
--WEATHER_OBSERVATION_TIME,WEATHER_CONDITION,
--WEATHER_TEMPERATURE,APPARENT_TEMPERATURE,
--DEWPOINT,HUMIDITY,WEATHER_PRESSURE,
--WIND_SPEED,WIND_GUST,WIND_BEARING,
--PRECIPITATION_INTENSITY,
--SUNRISE_TIME,SUNSET_TIME,MOON_PHASE,
BIKE
--GEAR,
--PRECIPITATION_PROBABILITY,PRECIPITATION_TYPE,
--CLOUD_COVER,WEATHER_VISIBILITY,UV_INDEX,WEATHER_OZONE,
--JUMP_COUNT,TOTAL_GRIT,AVG_FLOW,
--FLAGGED
FROM strava.activities_ext
/
ALTER TABLE activities ADD CONSTRAINT activities_pk PRIMARY KEY (activity_id);
…
ALTER TABLE activities ADD (gpx XMLTYPE) XMLTYPE COLUMN gpx STORE AS SECUREFILE BINARY XML (CACHE DISABLE STORAGE IN ROW);
ALTER TABLE activities ADD (geom mdsys.sdo_geometry);
ALTER TABLE activities ADD (geom_27700 mdsys.sdo_geometry);
ALTER TABLE activities ADD (mbr mdsys.sdo_geometry);
ALTER TABLE activities ADD (xmlns VARCHAR2(128));
ALTER TABLE activities ADD (num_pts INTEGER DEFAULT 0);
Spool off
- I have specified a primary key on activity_id and made a number of other columns not nullable.
- I have added a new XMLTYPE column GPX into which I will load the GPS data in the .gpx files.
FIT files
Some applications, such as Garmin and Rouvy, generate compressed .fit files, and Strava exports them unchanged (apparently when it can't convert them, although it can convert the .fit files from Zwift to .gpx). These are binary files, and since I only have a few of them, I have converted them to .gpx files using GPSBabel on my laptop, and then I re-uploaded the .gpx files.
for %i in (*.fit.gz) do "C:\Program Files\GnuWin\bin\gzip" -fd %i
for %i in (*.fit) do "C:\Program Files (x86)\GPSBabel\GPSBabel.exe" -i garmin_fit -f "%i" -o gpx -F "%~ni".gpx
I then update the file name in the activities table.
UPDATE activities
SET filename = REPLACE(filename,'.fit.gz','.gpx')
WHERE filename like '%.fit.gz'
/
Compress GPX files (optional)
Some of the GPX files in the Strava export are compressed and some are not. There is no obvious reason why. To minimise space, I can gzip the uncompressed GPX files.
gzip -9v /tmp/strava/activities/*.gpx
If I do compress any .gpx files, then I also need to update the file names in the activities table.
UPDATE activities
Set filename = filename||'.gz'
Where filename like '%.gpx'
/
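Before running the update, it can be worth checking from the shell what is actually in the directory. The following is a minimal sketch, assuming the export was expanded to /tmp/strava/activities as above; the function name gpx_summary is hypothetical. It counts plain and gzipped GPX files, and uses gzip -t to test the integrity of each compressed file.

```shell
#!/bin/sh
# Hypothetical helper: summarise the GPX files in one directory.
gpx_summary() {
  dir=$1
  plain=0; zipped=0; bad=0
  for f in "$dir"/*.gpx "$dir"/*.gpx.gz; do
    [ -f "$f" ] || continue          # skip unmatched glob patterns
    case $f in
      *.gpx.gz)
        zipped=$((zipped + 1))
        # gzip -t tests integrity without writing any output
        gzip -t "$f" 2>/dev/null || bad=$((bad + 1))
        ;;
      *.gpx)
        plain=$((plain + 1))
        ;;
    esac
  done
  echo "plain=$plain gzipped=$zipped corrupt=$bad"
}

# Example (path taken from the steps above)
gpx_summary /tmp/strava/activities
```

After compressing everything, plain should be 0, which matches the state the update statement assumes.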
Load the GPX files into the XML data type
The next stage is to load each of the GPX files into the activities table.
- A GPX file can be read directly into a CLOB with dbms_lob.loadclobfromfile(). I have written a PL/SQL packaged function to read the contents of a file and return it as a CLOB.
- Based on Simon Greener's Spatial DB Advisor Blog: Loading and Processing GPX 1.1 files using Oracle XMLDB.
- If the filename ends in .gz then I assume the file is GZIPped and I first uncompress it into a BLOB with utl_compress.lz_uncompress(), and then I can convert the BLOB to a CLOB.
create or replace package body strava_pkg as
k_module CONSTANT VARCHAR2(48) := $$PLSQL_UNIT;
…
----------------------------------------------------------------------------------------------------
function getClobDocument
(p_directory IN VARCHAR2
,p_filename IN VARCHAR2
,p_charset IN VARCHAR2 DEFAULT NULL
) return CLOB deterministic
is
l_module VARCHAR2(64);
l_action VARCHAR2(64);
v_filename VARCHAR2(128);
v_directory VARCHAR2(128);
v_file bfile;
v_unzipped blob := empty_blob();
v_Content CLOB := ' ';
v_src_offset number := 1 ;
v_dst_offset number := 1 ;
v_charset_id number := 0;
v_lang_ctx number := DBMS_LOB.default_lang_ctx;
v_warning number;
e_22288 EXCEPTION; --file or LOB operation FILEOPEN failed
PRAGMA EXCEPTION_INIT(e_22288, -22288);
BEGIN
dbms_application_info.read_module(module_name=>l_module
,action_name=>l_action);
dbms_application_info.set_module(module_name=>k_module
,action_name=>'getClobDocument');
IF p_charset IS NOT NULL THEN
v_charset_id := NLS_CHARSET_ID(p_charset);
END IF;
v_filename := REGEXP_SUBSTR(p_filename,'[^\/]+',1,2);
v_directory := REGEXP_SUBSTR(p_filename,'[^\/]+',1,1);
IF v_directory IS NOT NULL and v_filename IS NULL THEN /*if there is only one path element then it is actually a filename*/
v_filename := v_directory;
v_directory := '';
END IF;
IF p_directory IS NOT NULL THEN
v_directory := p_directory;
END IF;
v_File := bfilename(UPPER(v_directory),v_filename);
BEGIN
DBMS_LOB.fileopen(v_File, DBMS_LOB.file_readonly);
exception
when VALUE_ERROR OR e_22288 then
dbms_output.put_line('Can''t open:'||v_directory||'/'||v_filename||' - '||v_dst_offset||' bytes');
v_content := '';
dbms_application_info.set_module(module_name=>l_module
,action_name=>l_action);
return v_content;
END;
IF v_filename LIKE '%.gz' THEN
v_unzipped := utl_compress.lz_uncompress(v_file);
dbms_lob.converttoclob(
dest_lob => v_content,
src_blob => v_unzipped,
amount => DBMS_LOB.LOBMAXSIZE,
dest_offset => v_dst_offset,
src_offset => v_src_offset,
blob_csid => dbms_lob.default_csid,
lang_context => v_lang_ctx,
warning => v_warning);
ELSE --ELSIF v_filename LIKE '%.g__' THEN
DBMS_LOB.LOADCLOBFROMFILE(v_Content,
Src_bfile => v_File,
amount => DBMS_LOB.LOBMAXSIZE,
src_offset => v_src_offset,
dest_offset => v_dst_offset,
bfile_csid => v_charset_id,
lang_context => v_lang_ctx,
warning => v_warning);
END IF;
dbms_output.put_line(v_directory||'/'||v_filename||' - '||v_dst_offset||' bytes');
DBMS_LOB.fileclose(v_File);
dbms_application_info.set_module(module_name=>l_module
,action_name=>l_action);
return v_Content;
exception when others then
dbms_output.put_line(v_directory||'/'||v_filename||' - '||v_dst_offset||' bytes');
DBMS_LOB.fileclose(v_File);
dbms_application_info.set_module(module_name=>l_module
,action_name=>l_action);
raise;
end getClobDocument;
----------------------------------------------------------------------------------------------------
…
END strava_pkg;
/
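The way getClobDocument derives the directory object from the file name is easy to miss in the listing. The Strava export stores names such as activities/4468006769.gpx.gz. The two REGEXP_SUBSTR calls take the first path element as the directory and the second as the file, a single element is treated as a bare file name, and a non-null p_directory overrides whatever was derived. The shell sketch below mimics that logic purely as an illustration. The function name is hypothetical, and unlike the regular expression it does not skip a leading slash.

```shell
#!/bin/sh
# Hypothetical illustration of the directory/file split in getClobDocument.
# $1 = file name as stored in ACTIVITIES.FILENAME, $2 = optional directory override
split_activity_path() {
  path=$1
  override=${2:-}
  case $path in
    */*) dir=${path%%/*}            # first path element
         rest=${path#*/}
         file=${rest%%/*} ;;        # second path element
    *)   dir=''                     # only one element: it is the file name
         file=$path ;;
  esac
  if [ -n "$override" ]; then
    dir=$override
  fi
  # BFILENAME is called with UPPER(v_directory), so upper-case it here too
  dir=$(printf '%s' "$dir" | tr '[:lower:]' '[:upper:]')
  printf '%s %s\n' "$dir" "$file"
}

split_activity_path activities/4468006769.gpx.gz
split_activity_path 4468006769.gpx.gz activities
```

Both calls resolve to the ACTIVITIES directory object and the same file. This is why the function can be called with an empty string (NULL in Oracle) as the directory when the file name already carries the activities/ prefix.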
I can simply query the contents of the uncompressed GPX file in SQL by calling the function. In this case, the gzipped .gpx file is 65 KB but decompresses to 1.2 MB.
Set long 1000 lines 200 pages 99 serveroutput on
Column filename format a30
Column gpx format a100
select activity_id, filename
, strava_pkg.getClobDocument('',filename) gpx
from activities
where filename like '%.gpx%'
And activity_id = 4468006769
order by 1
/
ACTIVITY_ID FILENAME GPX
----------- ------------------------------ ----------------------------------------------------------------------------------------------------
4468006769 activities/4468006769.gpx.gz <?xml version="1.0" encoding="UTF-8"?>
<gpx creator="StravaGPX Android" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLoc
ation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin
.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.gar
min.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd
" version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlsch
emas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3">
<metadata>
<time>2020-12-13T14:31:13Z</time>
</metadata>
<trk>
<name>Loop</name>
<type>1</type>
<trkseg>
<trkpt lat="51.5296380" lon="-0.1875360">
<ele>30.6</ele>
<time>2020-12-13T14:31:13Z</time>
<extensions>
<gpxtpx:TrackPointExtension>
<gpxtpx:hr>57</gpxtpx:hr>
</gpxtpx:TrackPointExtension>
</extensions>
</trkpt>
…
activities/4468006769.gpx.gz - 1286238
Elapsed: 00:00:00.14
I can load the .gpx files into the GPX column of the activities table with a simple update statement. The CLOB returned from the function is converted to XML with the XMLTYPE constructor.
UPDATE activities
SET gpx = XMLTYPE(strava_pkg.getClobDocument('ACTIVITIES',filename))
WHERE filename like '%.gpx%'
/
I can now query back the same GPX from the database.
Set long 1100 lines 200 pages 99 serveroutput on
select activity_id, filename, gpx
from activities
where filename like '%.gpx%'
And activity_id = 4468006769
order by 1
/
ACTIVITY_ID FILENAME GPX
----------- ------------------------------ ----------------------------------------------------------------------------------------------------
4468006769 activities/4468006769.gpx.gz <?xml version="1.0" encoding="US-ASCII"?>
<gpx creator="StravaGPX Android" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLoc
ation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin
.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.gar
min.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd
" version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlsch
emas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3">
<metadata>
<time>2020-12-13T14:31:13Z</time>
</metadata>
<trk>
<name>Loop</name>
<type>1</type>
<trkseg>
<trkpt lat="51.5296380" lon="-0.1875360">
<ele>30.6</ele>
<time>2020-12-13T14:31:13Z</time>
<extensions>
<gpxtpx:TrackPointExtension>
<gpxtpx:hr>57</gpxtpx:hr>
</gpxtpx:TrackPointExtension>
</extensions>
</trkpt>
<trkpt lat="51.5296350" lon="-0.1875340">
…