Real-time TTC data ETL pipeline using Apache Airflow, PostgreSQL, and MinIO.
This project ingests real-time GTFS data from TTC's API, stores it in MinIO, and processes it into PostgreSQL for analysis. The pipeline joins real-time data with static GTFS data to provide comprehensive transit information.
- Data Source: TTC GTFS-RT API
- Storage: MinIO (object storage)
- Database: PostgreSQL
- Orchestration: Apache Airflow
- Containerization: Docker & Docker Compose
track-ttc/
├── dags/ # Airflow DAGs
├── plugins/ # Custom Airflow components
├── src/ # ETL business logic
├── sql/ # Database scripts
├── data/ # Static GTFS data
├── docker/ # Docker configurations
├── config/ # Configuration files
└── tests/ # Unit and integration tests
- Docker & Docker Compose
- Python 3.9+
- Git
- Clone the repository
    git clone <repository-url>
    cd track-ttc
- Set up environment
    cp .env.example .env
    # Edit .env with your configuration
- Start services
    docker-compose up -d
- Access services
    - Airflow: http://localhost:8080
    - MinIO Console: http://localhost:9001
    - PostgreSQL: localhost:5432
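Containers can take a little while to start accepting connections after `docker-compose up -d` returns. As a rough sketch (not part of the repository), the following standard-library script checks whether anything is listening on the three ports listed above. The helper name `is_listening` is made up for this example, and an open port only shows that a process is listening, not that the service is fully initialized.

```python
import socket

# Host/port pairs taken from the service list above.
SERVICES = {
    "Airflow": ("localhost", 8080),
    "MinIO Console": ("localhost", 9001),
    "PostgreSQL": ("localhost", 5432),
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        status = "up" if is_listening(host, port) else "not reachable"
        print(f"{name:<14} {host}:{port} {status}")
```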
Required environment variables in .env:
POSTGRES_USER=postgres
POSTGRES_PASSWORD=SimplePassword
POSTGRES_DB=trackTTC
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
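For illustration, here is one way Python code could read these variables and fail early when one is missing. This is a sketch under assumptions: the function names `load_settings` and `postgres_url` are hypothetical, the host and port defaults are taken from the service list in this README, and the variables are assumed to be present in the process environment already (Python does not read `.env` files on its own).

```python
import os
from urllib.parse import quote

# Variable names as listed in this README.
REQUIRED = (
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "MINIO_ACCESS_KEY",
    "MINIO_SECRET_KEY",
)


def load_settings(env=None) -> dict:
    """Collect the required variables, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


def postgres_url(settings: dict, host: str = "localhost", port: int = 5432) -> str:
    """Build a PostgreSQL connection URI; host and port are assumed defaults."""
    user = quote(settings["POSTGRES_USER"], safe="")
    password = quote(settings["POSTGRES_PASSWORD"], safe="")
    database = quote(settings["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Percent-encoding the user, password, and database name keeps the URI valid if a value contains characters such as `@` or `/`.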
python -m pytest tests/
python test.py
- Extract: Fetch GTFS-RT data from TTC API
- Load: Store raw data in MinIO
- Transform: Process and join with static GTFS data
- Load: Insert processed data into PostgreSQL
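The Transform step above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes the real-time feed has already been decoded into dictionaries carrying a `trip_id`, and that the join runs through the standard GTFS files `trips.txt` (keyed by `trip_id`) and `routes.txt` (keyed by `route_id`). The function names and all sample values are made up.

```python
import csv
import io


def index_by(csv_text: str, key: str) -> dict:
    """Parse GTFS CSV text and index its rows by the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def enrich(positions: list, trips: dict, routes: dict) -> list:
    """Left-join decoded vehicle positions with static trips and routes.

    Positions whose trip_id is unknown are kept, with None in the static fields.
    """
    enriched = []
    for pos in positions:
        trip = trips.get(pos["trip_id"], {})
        route = routes.get(trip.get("route_id", ""), {})
        enriched.append({
            **pos,
            "route_id": trip.get("route_id"),
            "route_long_name": route.get("route_long_name"),
            "trip_headsign": trip.get("trip_headsign"),
        })
    return enriched


# Made-up sample data in the shape of GTFS routes.txt and trips.txt.
ROUTES_TXT = "route_id,route_short_name,route_long_name\nR1,1,Example Line\n"
TRIPS_TXT = "route_id,service_id,trip_id,trip_headsign\nR1,S1,T100,Example Terminal\n"

positions = [
    {"vehicle_id": "V1", "trip_id": "T100", "latitude": 43.65, "longitude": -79.38},
    {"vehicle_id": "V2", "trip_id": "T999", "latitude": 43.70, "longitude": -79.40},
]

rows = enrich(positions, index_by(TRIPS_TXT, "trip_id"), index_by(ROUTES_TXT, "route_id"))
for row in rows:
    print(row["vehicle_id"], row["route_long_name"], row["trip_headsign"])
```

Keeping unmatched vehicles (a left join) avoids silently dropping real-time records when the static GTFS data is out of date. Whether the actual pipeline does this is not stated here.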
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License