Skip to content

berkeley-gif/coeqwal-data-platform

Repository files navigation

COEQWAL Data Platform

A comprehensive data platform for the Collaboratory for Equity in Water Allocation (COEQWAL) project for California water management scenario presentation, analysis, and review.

Code

The data platform code repository and documentation is organized into four sections(data, etl, database, api). Each section has its own series of (often nested) READMEs for details.

AWS infrastructure

The AWS infrastructure reference, INFRASTRUCTURE.md, lives in the private coeqwal-private-docs repository (https://github.com/berkeley-gif/coeqwal-private-docs). A schematic of that infrastructure can be found at https://dev.coeqwal.org/aws_architecture.html.

Directory of reference files related to project data.

Active scenarios (72): s0011, s0020, s0021, s0023, s0024, s0025, s0026, s0027, s0028, s0030, s0031, s0032, s0033, s0035, s0036, s0037, s0039, s0040, s0041, s0042, s0044, s0045, s0046, s0047, s0048, s0049, s0050, s0051, s0056, s0057, s0058, s0059, s0060, s0062, s0063, s0065, s0067, s0068, s0069, s0071, s0072, s0073, s0074, s0075, s0076, s0077, s0078, s0079, s0080, s0081, s0082, s0083, s0084, s0085, s0087, s0088, s0089, s0091, s0092, s0093, s0094, s0095, s0096, s0097, s0098, s0099, s0100, s0101, s0102, s0103, s0104, s0105

Last refreshed 2026-05-22 from https://api.coeqwal.org/api/scenarios. Regenerate with python etl/ingestion/tools/refresh_active_scenarios.py.

Confirm live scenario coverage

The block above is regenerated by hand. To check the public API right now, without a local DB connection:

curl -s https://api.coeqwal.org/api/scenarios \
  | python3 -c "import json,sys; d=json.load(sys.stdin); codes=sorted(s['short_code'] for s in d); print(f'{len(codes)} active scenarios on the public API, range {codes[0]} to {codes[-1]}.')"

Prints the live count and the lowest / highest short_code from the production API.

Development status. The ETL is still under iterative development, but provides a fully-functioning scenario and tier loading process. Development process has been to iteratively improve the process with each successive batch loaded. Next steps are to unify the pipeline steps (etl/run_full_pipeline.py, in progress), unify and streamline the verification, and check the (currently rough) calculations. It is important to know that I typically only had time to do a quick first pass on the calculations, copied from the Water Allocation Modeling Team's Jupyter notebook as best I could given time constraints. Future iterations can harden these. I also accepted inefficiencies when working alone that could be improved for a more developer-friendly process.

The ETL has two parallel pipelines: the scenario model-run pipeline and the tier data pipeline. The full developer runbook (every command, the signal that each step worked, and troubleshooting) lives in etl/README.md. This section is the conceptual map.

Scenario model-run pipeline

Handles CalSim scenario model run data end to end:

scandownloadpromoteAWS Batchauditrun_all (statistics)verifyactivate

It moves in two stages:

  • Stage 1 (scan through the post-Batch audit) pulls each scenario's model run ZIP from the COEQWAL shared Google Drive, stages it to S3, and extracts the SV input and DV output DSS files to CSV.
  • Stage 2 (run_allactivate) reads those CSVs back from S3, computes derived statistics, and writes them to PostgreSQL. An activation step flips the scenarios' is_active record in the database, signalling to the API to deliver it.

Today each arrow is a manual step the developer runs in sequence from Cloud9.

Stage 1 products land in S3 (under s3://coeqwal-model-run/scenario/<id>/):

  • The original CalSim ZIP (run/)
  • The extracted DV and SV CSVs (csv/)
  • The trend report reference CSV when present (verify/)
  • Two JSON receipts at the scenario root: ingest_record.json (ingestion contract: declared basenames, SHA-256 hashes, provenance) and extract_record.json (Batch run record with the validation result inlined)

Stage 2 products land in PostgreSQL as statistics rows: reservoir storage, urban delivery, ag demand and shortage, MI contractor reliability, environmental flow alteration, refuge delivery, delta salinity / NDO / X2.

Website: Stage 1's S3 products feed Get Data (download links served by the coeqwalPresignDownload Lambda), and Stage 2's statistics feed Data in Depth (via the FastAPI service).

Two pieces of automation are designed and on the way to production-ready:

  • A unified driver (etl/run_full_pipeline.py) that wraps every stage into one resumable command. Its current state and caveats are documented in the ETL runbook.
  • An EventBridge Batch -> run_all trigger so statistics fire automatically on extraction completion.

Both are tracked in docs/TEAM_RUNBOOK.md.

Tier data pipeline

Team-delivered tier outcome CSVs loaded straight into PostgreSQL, independent of the scenario model-run pipeline. The developer drops correctly-formatted CSVs into etl/tier_data/staging/, the loader generates an idempotent UPSERT, and psql applies it to the tier_result and tier_location_result tables. No S3 or Batch involvement. See the tier-data runbook for the commands.

Products: Tier result rows in the database. Website: Tier visualization tools.

Verifications and audits

Every stage of both pipelines leaves a receipt, and each pipeline has verifiers that answer a specific question:

  1. Did the DSS file convert to CSV correctly? The Batch container runs validate_csvs.py against the modeling team's trend report reference CSV. Automatic, on every Batch run.
  2. Did the statistics load into Postgres correctly? verify_all_sections.py recomputes the headline statistics from the reference CSVs and compares against the database (developer diagnostic, experimental).
  3. Does the public API return those same numbers? verify_api.py compares api.coeqwal.org responses against direct database queries.
  4. For tier data, does the database match what the team handed us? verify_tiers.py compares tier_result rows against the staging CSVs the team delivered.

What to read after each run, what each console summary and status value means, and which artifact to open when something is flagged all live in etl/verification/README.md. The end-to-end artifact map (what lands where, and which receipts are in S3, on local disk, or tracked in git) is in the ETL runbook.

PostgreSQL + PostGIS on RDS. A highly-normalized schema of ~96 tables organized conceptually into layers.

Layers:

  • 00-09 Foundational and reference data: versioning, lookups, network, entities, variables, assumptions and operations, scenarios, hydroclimate, themes, tier locations.
  • 10+ Derived results: tier results, statistics, period summaries.

Source-of-truth artifacts:

Audit chain (each tool answers a different question):

Question Tool
Full monthly audit: content + verification + health + cost python database/audit/run_monthly_audit.py
Is the DB shaped correctly? database/run_audit.sh, verify_erd_against_audit.py, per-layer 09_verify_level*.sql
Are layers 00-08 correct? database/scripts/export_layer_tables.py + diff vs database/seed_tables/
Are computed results correct? etl/statistics/verify_all_sections.py (CSV -> DB), etl/statistics/verify_api.py (DB -> API)

After a monthly audit, what to read:

Run After a run, what do I read?
Monthly database audit
database/audit/run_monthly_audit.py
Console: Ends with a MONTHLY AUDIT COMPLETE block naming the output directory and the report filename.

Digest: audits/monthly_<ts>/report.md for the top-level summary (row counts, ERD diff, audit-field checks). Drill into layer_exports/ or results_samples/ only if a section is flagged.

Tech: PostgreSQL, PostGIS, psql, aws_s3 extension for S3-side loads.

See database/README.md for the full schema reference, audit guide, and developer onboarding.

The public-facing surface at https://api.coeqwal.org. Two pieces:

Piece What it does Source
FastAPI service on ECS Fargate Serves the rest of the API. Statistics, tiers, verification status, etc. Async Python with automatic OpenAPI docs at /docs api/coeqwal-api/
coeqwalPresignDownload Lambda Lists scenarios in S3 and presigns download URLs. Backs GET /scenario and GET /download through API Gateway v2 api/lambda/coeqwalPresignDownload/

Endpoints:

Request flow (FastAPI service):

Request -> Uvicorn -> FastAPI -> Pydantic (validates) -> asyncpg (queries DB) -> Response

Tech: FastAPI, Pydantic, asyncpg, Uvicorn. Containerized and deployed to ECS Fargate via GitHub Actions.

See api/coeqwal-api/README.md for endpoints, filtering options, and local development. See api/lambda/coeqwalPresignDownload/README.md for the download Lambda.

AWS infrastructure

What runs where, at a glance:

Service Role
RDS PostgreSQL Managed Postgres with PostGIS. Production database
ECS Fargate Runs the FastAPI container (Docker -> ECR -> ECS)
S3 (coeqwal-model-run) Model run ZIPs and extracted CSVs
Lambda (coeqwalEtlTrigger) S3 PUT -> Batch SubmitJob for the extraction pipeline
Lambda (coeqwalPresignDownload) Presigned-URL downloads for the website
Lambda (coeqwal-database-audit) Scheduled DB audit
AWS Batch on Fargate Spot Runs the DSS-to-CSV extraction container (coeqwal-dss-queue, job def coeqwal-dss-jobdef:3)
ECR Docker image registry (coeqwal-etl:latest)
API Gateway (HTTP API v2) Routes GET /scenario and GET /download to the presign Lambda
Route 53 DNS routing to api.coeqwal.org
Cloud9 (coeqwal-db-admin) Dev environment with the credentials and connection strings for running ETL and DB ops
GitHub Actions CI for the FastAPI service (api/coeqwal-api/) and the ETL container (etl/batch-container/)

Detailed runbook (AWS account IDs, ARNs, cost levers, IAM policies, deprecation notes) lives in INFRASTRUCTURE.md in the private coeqwal-private-docs repository (https://github.com/berkeley-gif/coeqwal-private-docs).

Reclaiming disk space on the Cloud9 / EC2 instance

If the EC2 backing your Cloud9 instance is running low on disk space (common after OS updates, package installs, or prolonged operation):

Check current disk usage:

df -h

Shows disk usage for all mounted filesystems in human-readable units (GB/MB). The / (root) filesystem is the one most likely to fill up. Look for Use% approaching 100%.

Clean the DNF package cache:

sudo dnf clean all

DNF (the package manager on Amazon Linux 2023 / RHEL-based systems) caches downloaded packages and metadata on disk after installation. Over time this cache can grow significantly. dnf clean all removes all cached package data, repo metadata, and headers. Safe to run at any time. Packages are re-downloaded from the repo on the next dnf operation.

Run df -h again to confirm space was reclaimed.

If dnf clean all doesn't reclaim much:

Manually remove the DNF cache directory and check root disk usage:

sudo rm -rf /var/cache/dnf/*
df -h /

Trim the system journal:

systemd accumulates log journal files under /var/log/journal/. Check how much space the journal is using, then vacuum it down to 50 MB:

sudo journalctl --disk-usage
sudo journalctl --vacuum-size=50M
df -h /

--disk-usage reports total journal size. --vacuum-size=50M deletes the oldest journal files until the total size is at or below 50 MB. Safe to run. It only removes old log history, not active logs.

If space is still low, check Docker (~1.2 GB):

Docker can accumulate disk usage from stopped containers, dangling images, unused volumes, and build cache. See a breakdown of what Docker is holding:

docker system df

This shows how much space is used by images, containers, volumes, and build cache, and how much is "reclaimable" (unused).

Safely remove everything unused:

docker container prune -f   # removes all stopped containers
docker image prune -a -f    # removes all images not used by a running container
docker volume prune -f      # removes all volumes not attached to a container
df -h /

These commands only delete objects that are not currently in use. Running containers, their images, and attached volumes are untouched. The -f flag skips the confirmation prompt.

Open threads and roadmap

When picking up a partially-finished thread (geometry refactor, gw/sw reconciliation, TAIESM1 ingest, master crosswalk reconciliation, end-to-end pipeline automation, etc.), start at the team runbook. It indexes each open thread with current state, files touched, and a "Next steps" block.

  • docs/TEAM_RUNBOOK.md - active threads, rolled-back roadmap entries (R1, R2), and conventions for picking work back up
  • docs/statistics_roadmap.md - statistics ETL roadmap (connection unification, atomic transactions, verification streamlining, reference-directory clarity)

License

See LICENSE for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors