Loading and analyzing GTFS Schedule data with DuckDB.
This project provides an empty DuckDB database that contains the necessary tables and their relations for GTFS Schedule datasets. It can be found at database/gtfs.duckdb.
The SQL script that created the DuckDB database can be found at scripts/create_gtfs_database.sql. It's directly derived from the official GTFS Schedule Field Definitions docs.
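As a rough illustration of what such a table definition looks like, here is a minimal sketch of two GTFS Schedule tables; this is not the exact DDL from the script, which covers all tables, fields, and relations:

```sql
-- Minimal sketch of two GTFS Schedule tables, with a foreign key
-- linking routes to agency. See scripts/create_gtfs_database.sql
-- for the full schema.
CREATE TABLE agency (
    agency_id       VARCHAR PRIMARY KEY,
    agency_name     VARCHAR NOT NULL,
    agency_url      VARCHAR NOT NULL,
    agency_timezone VARCHAR NOT NULL
);

CREATE TABLE routes (
    route_id         VARCHAR PRIMARY KEY,
    agency_id        VARCHAR,
    route_short_name VARCHAR,
    route_long_name  VARCHAR,
    route_type       INTEGER NOT NULL,
    FOREIGN KEY (agency_id) REFERENCES agency (agency_id)
);
```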
The Entity Relationship Diagram looks like this:
To show how the GTFS Schedule database can be used, this repo contains an example of transforming and loading a public dataset.
There's a German dataset that contains the full train and local transit data for Germany. It contains more than 32 million stop times, 1.6 million trips, and 663 thousand stops.
First, you need to clone this git repo locally. Then you can run the scripts/providers/gtfs-de/full/download_data.sh script to download the data to the source-data/providers/gtfs-de/full directory.
After that, you can load the data into a copy of the pre-existing GTFS database with the following command:
scripts/providers/gtfs-de/full/create_database.sh
The database will be written to exported-data/providers/gtfs-de/full/data.duckdb.
Once this is done, you can open the DuckDB CLI to query the data:
duckdb exported-data/providers/gtfs-de/full/data.duckdb
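For example, using the standard GTFS table and column names from the schema, a first exploration in the CLI could look like this:

```sql
-- Row counts for some of the core tables.
SELECT count(*) AS stop_time_count FROM stop_times;
SELECT count(*) AS trip_count      FROM trips;

-- The ten routes with the most trips.
SELECT r.route_short_name, count(*) AS trip_count
FROM trips AS t
JOIN routes AS r ON r.route_id = t.route_id
GROUP BY r.route_short_name
ORDER BY trip_count DESC
LIMIT 10;
```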
DuckDB can also be used to create a lightweight "view database" that contains only references to the actual data, hosted on object storage like S3 or R2.
The process to create such a "view database" consists of four main steps (a minimal SQL sketch follows the list):
- Exporting the data from a database as Parquet files.
- Uploading the exported Parquet files to an object storage service.
- Creating the database file containing the views to the object storage files.
- Uploading the new database file to object storage as well, so that it can be used with DuckDB's ATTACH statement.
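As a rough sketch of what the first and third steps look like in DuckDB SQL (the file name and bucket URL below are placeholders, not the paths used by the scripts in this repo):

```sql
-- Step 1: export each table to a Parquet file (shown here for one table).
COPY stop_times TO 'stop_times.parquet' (FORMAT PARQUET);

-- Step 3: in a new, empty database, create one view per table that reads
-- the uploaded Parquet file directly from object storage.
INSTALL httpfs;
LOAD httpfs;
CREATE VIEW stop_times AS
SELECT * FROM read_parquet('https://my-bucket.example.com/gtfs/stop_times.parquet');
```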
With our example dataset, once the basic database has been created as described above, you can do the following:
- Run the scripts/providers/gtfs-de/full/export_data.sh script. This will export the data as one Parquet file per table.
- Upload the Parquet files to an object storage service.
- Edit the scripts/providers/gtfs-de/full/create_view_database.sh script to match the location of your object storage uploads, and run the script.
- Upload the resulting database file exported-data/providers/gtfs-de/full/database.duckdb to your object storage as well.
The resulting DuckDB database file will be around 250 kB in size; it contains no actual data, only views that reference the data on object storage.
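Any DuckDB client with the httpfs extension can then attach this remote file read-only and query the views; the URL below is a placeholder for wherever you uploaded the database file:

```sql
INSTALL httpfs;
LOAD httpfs;

-- Attach the remote "view database"; the actual rows are fetched from the
-- Parquet files on object storage that its views point to.
ATTACH 'https://my-bucket.example.com/gtfs/database.duckdb' AS gtfs (READ_ONLY);

SELECT count(*) FROM gtfs.stops;
```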
You can access the generated example database via SQL Workbench.