gtfs-realtime-capsule is a command-line tool that scrapes, normalizes, and archives real-time public transit data. Inspired by the BusObservatory API by Prof. Anthony Townsend of the Jacobs Urban Tech Hub at Cornell Tech, this project aims to make it easy for anyone to help archive real-time transit data and analytics and to make them publicly available through a distributed database.
- 📖 Table of Contents
- ❔ About
- 🚀 Getting Started
- 🔨 Development
- Support
- ✨ Roadmap
- Contributing
- 👤 Authors
- 🤝 Credits
- 💛 Support
- ⚖️ Disclaimer
- 📃 License
GTFS Realtime is an extension of the GTFS format that allows transit agencies to share live updates about their services, including delays, vehicle locations, and service disruptions. Google Maps uses it for real-time updates to transit schedules, such as cancellations and changes in arrival and departure times. Raw GTFS Realtime data parsed from New York City's ACE subway lines feed is available at example_mta_ace_subway.txt.
The dataset is rich, but the feeds are ephemeral: each update overwrites the feed with current transit data, so historical data is not available. GTFS-Realtime-Capsule solves this problem by scraping, normalizing, and archiving feeds as they update.
- Scraping: Collect live transit data from various sources.
- Normalization: Standardize data formats for consistency and ease of use.
- Archiving: Store historical data for future analysis and reporting.
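Because each feed update overwrites the previous one, the archiving idea can be illustrated with a small sketch that stores a snapshot only when its bytes differ from the last one seen. This is not the project's implementation: the function names, the `raw/<feed>/<date>/<time>.pb` key layout, and the in-memory `store` dictionary (standing in for an S3 bucket) are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


def archive_key(feed_id: str, fetched_at: datetime) -> str:
    """Build a time-partitioned key for one raw snapshot (hypothetical layout)."""
    ts = fetched_at.astimezone(timezone.utc)
    return f"raw/{feed_id}/{ts:%Y/%m/%d}/{ts:%H%M%S}.pb"


def archive_if_changed(
    feed_id: str,
    fetch: Callable[[], bytes],   # hypothetical: returns the current raw feed bytes
    store: Dict[str, bytes],      # stand-in for an object store such as S3
    last_digest: Optional[str],
    now: datetime,
) -> str:
    """Fetch one snapshot and store it only if it differs from the previous one.

    Returns the SHA-256 digest of the snapshot just fetched, so the caller
    can pass it back in as last_digest on the next poll.
    """
    payload = fetch()
    digest = hashlib.sha256(payload).hexdigest()
    if digest != last_digest:
        store[archive_key(feed_id, now)] = payload
    return digest
```

Calling `archive_if_changed` on a fixed interval, and feeding each returned digest into the next call, keeps one stored object per distinct feed state rather than one per poll.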
- Ensure you have Git installed on your machine.
- Create an Amazon S3 bucket and save your public and secret keys.
- Download and install Docker.
- Download and install Docker Compose.
- Request an API key from Mobility Database.
- Install make.
- Open your command prompt or terminal.
- Clone the repository:
git clone https://github.com/tsdataclinic/gtfs-realtime-capsule.git
- Navigate to the project directory and ensure you're on the main branch:
cd gtfs-realtime-capsule
git branch # Ensure you're on the main branch
- Start the Docker daemon.
Update the global config file.
- Open config/global_config.json in your file editor.
- Update the s3_bucket.uri field to the URI of your Amazon S3 bucket.
- Update the s3_bucket.public_key field to the public key of your Amazon S3 bucket.
- Update the s3_bucket.secret_key field to the secret key of your Amazon S3 bucket.
- Update the mobilitydatabase.token field to the API key you requested from Mobility Database.
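Before starting the containers, it can help to confirm that all four fields are filled in. The sketch below assumes the dotted names map to nested JSON objects (for example, `s3_bucket.uri` being the `uri` key inside an `s3_bucket` object); the actual file layout may differ, and `missing_fields` and `check_config_file` are hypothetical helpers, not part of the project.

```python
import json
from typing import Any, Dict, List

# Field names taken from the steps above; nesting by "." is an assumption.
REQUIRED_FIELDS = [
    "s3_bucket.uri",
    "s3_bucket.public_key",
    "s3_bucket.secret_key",
    "mobilitydatabase.token",
]


def missing_fields(config: Dict[str, Any]) -> List[str]:
    """Return the dotted field names that are absent or empty in the config."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node: Any = config
        for part in dotted.split("."):
            # Walk one level down; anything that is not a dict ends the walk.
            node = node.get(part) if isinstance(node, dict) else None
        if not node:
            missing.append(dotted)
    return missing


def check_config_file(path: str = "config/global_config.json") -> List[str]:
    """Load the config file and report which required fields still need values."""
    with open(path, encoding="utf-8") as fh:
        return missing_fields(json.load(fh))
```

An empty list from `check_config_file()` means every required field has a non-empty value; it does not verify that the credentials themselves are valid.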
The recommended way to run the project is with Docker Compose. In this example, we scrape the GTFS Realtime feed for New York City's ACE subway lines. The feed ID is mdb-1630. Feed metadata is provided by Mobility Database.
- Open your command prompt or terminal.
- Ensure you are in the project root directory.
pwd # make sure you are at repo root directory
- Generate the docker-compose.yml for the mdb-1630 GTFS Realtime feed.
make local-prod-generate-compose FEEDS="mdb-1630" # space separated list of feed ids
- Start the application.
make local-prod-run # this will start containers defined in docker-compose.yml
Check that the scraper and normalizer are running correctly. Check the Amazon S3 bucket for new files. TODO: Expand here on how to ensure everything is working as intended.
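One way to check the bucket for new files is to look for objects modified in the last few minutes. The sketch below is only an illustration: `recent_keys` is a hypothetical helper, the five-minute window is an arbitrary choice, and the input is assumed to be a list of dicts with `Key` and `LastModified` entries, which is the shape boto3's `list_objects_v2` returns under `Contents`.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List


def recent_keys(
    objects: Iterable[Dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # arbitrary freshness window
) -> List[str]:
    """Return the keys of objects modified within max_age of now, newest first."""
    cutoff = now - max_age
    fresh = [obj for obj in objects if obj["LastModified"] >= cutoff]
    fresh.sort(key=lambda obj: obj["LastModified"], reverse=True)
    return [obj["Key"] for obj in fresh]


# Possible usage with boto3 (third-party; the bucket name is a placeholder):
#
#   import boto3
#   response = boto3.client("s3").list_objects_v2(Bucket="your-bucket-name")
#   keys = recent_keys(response.get("Contents", []), datetime.now(timezone.utc))
#   print(keys or "no new files in the last 5 minutes")
```

Note that `list_objects_v2` returns at most 1,000 objects per call, so a bucket with a long history would need pagination or a key prefix to make this check meaningful.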
The GTFS Realtime data exchange format is based on Protocol Buffers.
TODO: Expand on the architecture of the project.
- Two components
  - Scraper
    - Read protobufs from the transit agency API endpoint
    - Save raw protobufs to the S3 bucket
  - Normalizer
    - Parse raw protobufs from the S3 bucket
    - Convert data to parquet format
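To make the normalizer step more concrete, the sketch below flattens trip updates from an already-parsed feed into one row per stop time update, the kind of tabular shape that a library such as pyarrow could then write to parquet. It is an assumption-laden illustration, not the project's schema: the input is taken to be a dict shaped like a GTFS Realtime `FeedMessage` with camelCase keys (as protobuf's JSON mapping produces), and the output column names and the `trip_update_rows` helper are made up for this example.

```python
from typing import Any, Dict, Iterator, Optional


def _to_int(value: Any) -> Optional[int]:
    """Protobuf's JSON mapping encodes 64-bit integers as strings; accept both."""
    return None if value is None else int(value)


def trip_update_rows(feed: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per stop time update in a FeedMessage-shaped dict."""
    feed_ts = _to_int(feed.get("header", {}).get("timestamp"))
    for entity in feed.get("entity", []):
        trip_update = entity.get("tripUpdate")
        if trip_update is None:
            continue  # vehicle positions and alerts are skipped in this sketch
        trip = trip_update.get("trip", {})
        for stu in trip_update.get("stopTimeUpdate", []):
            yield {
                "feed_timestamp": feed_ts,
                "entity_id": entity.get("id"),
                "trip_id": trip.get("tripId"),
                "route_id": trip.get("routeId"),
                "stop_id": stu.get("stopId"),
                "arrival_time": _to_int(stu.get("arrival", {}).get("time")),
                "departure_time": _to_int(stu.get("departure", {}).get("time")),
            }
```

Keeping the feed timestamp on every row means each archived row records which snapshot it came from, which matters when the same trip appears in many successive snapshots.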
How to start local development Docker container
How to run scraper and normalizer locally
Instructions for implementing your custom feeds. How to implement your feeds
Common Errors: Address common queries and troubleshooting tips.
For other questions or support, please create an issue.
Improve documentation.
We welcome contributions! Please see our contributing guidelines for more information.
This project was created by Two Sigma Data Clinic volunteers.
TODO: Fill this out with third party software used, and prior work done. Reference TS Data Clinic.
TODO: Do we need a legal disclaimer here?
This project is licensed under the Apache License - see the LICENSE file for details.