A comprehensive data platform for the Collaboratory for Equity in Water Allocation (COEQWAL) project for California water management scenario presentation, analysis, and review.
The data platform code repository and documentation is organized into four sections(data, etl, database, api). Each section has its own series of (often nested) READMEs for details.
The AWS infrastructure reference, INFRASTRUCTURE.md, lives in the private coeqwal-private-docs repository (https://github.com/berkeley-gif/coeqwal-private-docs). A schematic of that infrastructure can be found at https://dev.coeqwal.org/aws_architecture.html.
Directory of reference files related to project data.
Active scenarios (72): s0011, s0020, s0021, s0023, s0024, s0025, s0026, s0027, s0028, s0030, s0031, s0032, s0033, s0035, s0036, s0037, s0039, s0040, s0041, s0042, s0044, s0045, s0046, s0047, s0048, s0049, s0050, s0051, s0056, s0057, s0058, s0059, s0060, s0062, s0063, s0065, s0067, s0068, s0069, s0071, s0072, s0073, s0074, s0075, s0076, s0077, s0078, s0079, s0080, s0081, s0082, s0083, s0084, s0085, s0087, s0088, s0089, s0091, s0092, s0093, s0094, s0095, s0096, s0097, s0098, s0099, s0100, s0101, s0102, s0103, s0104, s0105
Last refreshed 2026-05-22 from https://api.coeqwal.org/api/scenarios. Regenerate with python etl/ingestion/tools/refresh_active_scenarios.py.
The block above is regenerated by hand. To check the public API right now, without a local DB connection:
curl -s https://api.coeqwal.org/api/scenarios \
| python3 -c "import json,sys; d=json.load(sys.stdin); codes=sorted(s['short_code'] for s in d); print(f'{len(codes)} active scenarios on the public API, range {codes[0]} to {codes[-1]}.')"Prints the live count and the lowest / highest short_code from the production API.
Development status. The ETL is still under iterative development, but provides a fully-functioning scenario and tier loading process. Development
process has been to iteratively improve the process with each successive batch loaded. Next steps are to unify the pipeline steps (etl/run_full_pipeline.py, in progress), unify and streamline the verification, and check the (currently rough) calculations. It is important to know that I typically only had time to do a quick first pass on the calculations, copied from the Water Allocation Modeling Team's Jupyter notebook as best I could given time constraints. Future iterations can harden these. I also accepted inefficiencies when working alone that could be improved for a more developer-friendly process.
The ETL has two parallel pipelines: the scenario model-run pipeline and the tier data pipeline. The full developer runbook (every command, the signal that each step worked, and troubleshooting) lives in etl/README.md. This section is the conceptual map.
Handles CalSim scenario model run data end to end:
scan→download→promote→AWS Batch→audit→run_all (statistics)→verify→activate
It moves in two stages:
- Stage 1 (
scanthrough the post-Batchaudit) pulls each scenario's model run ZIP from the COEQWAL shared Google Drive, stages it to S3, and extracts the SV input and DV output DSS files to CSV. - Stage 2 (
run_all→activate) reads those CSVs back from S3, computes derived statistics, and writes them to PostgreSQL. An activation step flips the scenarios'is_activerecord in the database, signalling to the API to deliver it.
Today each arrow is a manual step the developer runs in sequence from Cloud9.
Stage 1 products land in S3 (under s3://coeqwal-model-run/scenario/<id>/):
- The original CalSim ZIP (
run/) - The extracted DV and SV CSVs (
csv/) - The trend report reference CSV when present (
verify/) - Two JSON receipts at the scenario root:
ingest_record.json(ingestion contract: declared basenames, SHA-256 hashes, provenance) andextract_record.json(Batch run record with the validation result inlined)
Stage 2 products land in PostgreSQL as statistics rows: reservoir storage, urban delivery, ag demand and shortage, MI contractor reliability, environmental flow alteration, refuge delivery, delta salinity / NDO / X2.
Website: Stage 1's S3 products feed Get Data (download links served by the coeqwalPresignDownload Lambda), and Stage 2's statistics feed Data in Depth (via the FastAPI service).
Two pieces of automation are designed and on the way to production-ready:
- A unified driver (
etl/run_full_pipeline.py) that wraps every stage into one resumable command. Its current state and caveats are documented in the ETL runbook. - An EventBridge
Batch -> run_alltrigger so statistics fire automatically on extraction completion.
Both are tracked in docs/TEAM_RUNBOOK.md.
Team-delivered tier outcome CSVs loaded straight into PostgreSQL, independent of the scenario model-run pipeline. The developer drops correctly-formatted CSVs into etl/tier_data/staging/, the loader generates an idempotent UPSERT, and psql applies it to the tier_result and tier_location_result tables. No S3 or Batch involvement. See the tier-data runbook for the commands.
Products: Tier result rows in the database. Website: Tier visualization tools.
Every stage of both pipelines leaves a receipt, and each pipeline has verifiers that answer a specific question:
- Did the DSS file convert to CSV correctly? The Batch container runs
validate_csvs.pyagainst the modeling team's trend report reference CSV. Automatic, on every Batch run. - Did the statistics load into Postgres correctly?
verify_all_sections.pyrecomputes the headline statistics from the reference CSVs and compares against the database (developer diagnostic, experimental). - Does the public API return those same numbers?
verify_api.pycomparesapi.coeqwal.orgresponses against direct database queries. - For tier data, does the database match what the team handed us?
verify_tiers.pycomparestier_resultrows against the staging CSVs the team delivered.
What to read after each run, what each console summary and status value means, and which artifact to open when something is flagged all live in etl/verification/README.md. The end-to-end artifact map (what lands where, and which receipts are in S3, on local disk, or tracked in git) is in the ETL runbook.
PostgreSQL + PostGIS on RDS. A highly-normalized schema of ~96 tables organized conceptually into layers.
Layers:
- 00-09 Foundational and reference data: versioning, lookups, network, entities, variables, assumptions and operations, scenarios, hydroclimate, themes, tier locations.
- 10+ Derived results: tier results, statistics, period summaries.
Source-of-truth artifacts:
- ERD:
database/schema/COEQWAL_SCENARIOS_DB_ERD.md - Latest monthly audit:
audits/monthly_<timestamp>/report.md(gitignored, regenerable)
Audit chain (each tool answers a different question):
| Question | Tool |
|---|---|
| Full monthly audit: content + verification + health + cost | python database/audit/run_monthly_audit.py |
| Is the DB shaped correctly? | database/run_audit.sh, verify_erd_against_audit.py, per-layer 09_verify_level*.sql |
| Are layers 00-08 correct? | database/scripts/export_layer_tables.py + diff vs database/seed_tables/ |
| Are computed results correct? | etl/statistics/verify_all_sections.py (CSV -> DB), etl/statistics/verify_api.py (DB -> API) |
After a monthly audit, what to read:
| Run | After a run, what do I read? |
|---|---|
Monthly database auditdatabase/audit/run_monthly_audit.py |
Console: Ends with a MONTHLY AUDIT COMPLETE block naming the output directory and the report filename.Digest: audits/monthly_<ts>/report.md for the top-level summary (row counts, ERD diff, audit-field checks). Drill into layer_exports/ or results_samples/ only if a section is flagged. |
Tech: PostgreSQL, PostGIS, psql, aws_s3 extension for S3-side loads.
See database/README.md for the full schema reference, audit guide, and developer onboarding.
The public-facing surface at https://api.coeqwal.org. Two pieces:
| Piece | What it does | Source |
|---|---|---|
| FastAPI service on ECS Fargate | Serves the rest of the API. Statistics, tiers, verification status, etc. Async Python with automatic OpenAPI docs at /docs |
api/coeqwal-api/ |
coeqwalPresignDownload Lambda |
Lists scenarios in S3 and presigns download URLs. Backs GET /scenario and GET /download through API Gateway v2 |
api/lambda/coeqwalPresignDownload/ |
Endpoints:
- Production: https://api.coeqwal.org/api
- Interactive docs: https://api.coeqwal.org/docs
Request flow (FastAPI service):
Request -> Uvicorn -> FastAPI -> Pydantic (validates) -> asyncpg (queries DB) -> Response
Tech: FastAPI, Pydantic, asyncpg, Uvicorn. Containerized and deployed to ECS Fargate via GitHub Actions.
See api/coeqwal-api/README.md for endpoints, filtering options, and local development. See api/lambda/coeqwalPresignDownload/README.md for the download Lambda.
What runs where, at a glance:
| Service | Role |
|---|---|
| RDS PostgreSQL | Managed Postgres with PostGIS. Production database |
| ECS Fargate | Runs the FastAPI container (Docker -> ECR -> ECS) |
S3 (coeqwal-model-run) |
Model run ZIPs and extracted CSVs |
Lambda (coeqwalEtlTrigger) |
S3 PUT -> Batch SubmitJob for the extraction pipeline |
Lambda (coeqwalPresignDownload) |
Presigned-URL downloads for the website |
Lambda (coeqwal-database-audit) |
Scheduled DB audit |
| AWS Batch on Fargate Spot | Runs the DSS-to-CSV extraction container (coeqwal-dss-queue, job def coeqwal-dss-jobdef:3) |
| ECR | Docker image registry (coeqwal-etl:latest) |
| API Gateway (HTTP API v2) | Routes GET /scenario and GET /download to the presign Lambda |
| Route 53 | DNS routing to api.coeqwal.org |
Cloud9 (coeqwal-db-admin) |
Dev environment with the credentials and connection strings for running ETL and DB ops |
| GitHub Actions | CI for the FastAPI service (api/coeqwal-api/) and the ETL container (etl/batch-container/) |
Detailed runbook (AWS account IDs, ARNs, cost levers, IAM policies, deprecation notes) lives in INFRASTRUCTURE.md in the private coeqwal-private-docs repository (https://github.com/berkeley-gif/coeqwal-private-docs).
If the EC2 backing your Cloud9 instance is running low on disk space (common after OS updates, package installs, or prolonged operation):
Check current disk usage:
df -hShows disk usage for all mounted filesystems in human-readable units (GB/MB). The / (root) filesystem is the one most likely to fill up. Look for Use% approaching 100%.
Clean the DNF package cache:
sudo dnf clean allDNF (the package manager on Amazon Linux 2023 / RHEL-based systems) caches downloaded packages and metadata on disk after installation. Over time this cache can grow significantly. dnf clean all removes all cached package data, repo metadata, and headers. Safe to run at any time. Packages are re-downloaded from the repo on the next dnf operation.
Run df -h again to confirm space was reclaimed.
If dnf clean all doesn't reclaim much:
Manually remove the DNF cache directory and check root disk usage:
sudo rm -rf /var/cache/dnf/*
df -h /Trim the system journal:
systemd accumulates log journal files under /var/log/journal/. Check how much space the journal is using, then vacuum it down to 50 MB:
sudo journalctl --disk-usage
sudo journalctl --vacuum-size=50M
df -h /--disk-usage reports total journal size. --vacuum-size=50M deletes the oldest journal files until the total size is at or below 50 MB. Safe to run. It only removes old log history, not active logs.
If space is still low, check Docker (~1.2 GB):
Docker can accumulate disk usage from stopped containers, dangling images, unused volumes, and build cache. See a breakdown of what Docker is holding:
docker system dfThis shows how much space is used by images, containers, volumes, and build cache, and how much is "reclaimable" (unused).
Safely remove everything unused:
docker container prune -f # removes all stopped containers
docker image prune -a -f # removes all images not used by a running container
docker volume prune -f # removes all volumes not attached to a container
df -h /These commands only delete objects that are not currently in use. Running containers, their images, and attached volumes are untouched. The -f flag skips the confirmation prompt.
When picking up a partially-finished thread (geometry refactor, gw/sw reconciliation, TAIESM1 ingest, master crosswalk reconciliation, end-to-end pipeline automation, etc.), start at the team runbook. It indexes each open thread with current state, files touched, and a "Next steps" block.
docs/TEAM_RUNBOOK.md- active threads, rolled-back roadmap entries (R1, R2), and conventions for picking work back updocs/statistics_roadmap.md- statistics ETL roadmap (connection unification, atomic transactions, verification streamlining, reference-directory clarity)
See LICENSE for details.