Restructure into omp/mpi/cuda backends + correctness fixes + build/CI scaffolding by nikbott · Pull Request #2 · nikbott/amr

nikbott · 2026-05-29T13:04:51Z

Summary

This branch reimplements the AMR engine as three parallel backends with a uniform layout and adds build/test/doc scaffolding. It was restructured in parallel with the ERAD-SP line on main, and the two ended up with different directory layouts (this one: omp/ mpi/ cuda/; main: sequential/ openmp/ python/ results/ scripts/). I'm opening this for review and discussion on how to reconcile the two, not to clobber main: nothing here overwrites the ERAD-SP work, and the pre-ERAD line is preserved in the archive/pre-restructure-main tag.

What's here

Backends (uniform core/physics/tree/viz/tests/main per dir)

omp/ — OpenMP backend (evolved from the flat cpp/ sequential sources)
mpi/ — new distributed backend: ghost exchange (MPI_Alltoallv), Z-curve repartition (MPI_Exscan), ghost-aware 2:1 balance, verify_global
cuda/ — Thrust-based, modular split around tree.cuh
common/core.hpp — shared Morton + strong types (__host__ __device__, BMI2 host fast-path), de-duplicating ~160 lines across backends

Correctness fixes

fix(mpi): coarsen() deadlocked at ≥2 ranks — empty-partition ranks returned before the all_reduce_or/update_partition_map collectives. Now all ranks reach them.
fix(cuda/tree): wrap kernel syncs in CHECK_CUDA (was silently swallowing errors).
perf(python): balance rewritten with bisect to match the C++ lower_bound, with a numba-free regression test (geometric 2:1 check).

Build / tooling

CMake AMR_BUILD_MPI / AMR_BUILD_CUDA flags; cuda/Makefile with --extended-lambda; Catch2 found-or-fetched.
.clang-format, .pre-commit-config.yaml, REFERENCES.md, docs/CI_FUTURE.md, top-level README.md.
benchmarks/: fresh OMP strong-scaling data + the historical ERAD-SP-era data/scripts (marked historical).

Verification (this host)

OpenMP tests: 50038 assertions / 8 cases pass
MPI tests: full suite passes at 1/2/4 ranks (50034 assertions)
CUDA: compiles + links clean (no GPU available here to run)
Python reference: Morton round-trip + balance invariants pass
CMake: all 6 targets build; ctest AllTests + MpiTests pass

All commits SSH-signed.

For discussion

The two restructures conflict at the directory level, so a clean merge needs a layout decision (keep omp/mpi/cuda, fold in ERAD-SP's run_experiments.sh/results, etc.). Happy to drive whichever direction we agree on.

🤖 Generated with Claude Code

Six cudaDeviceSynchronize() calls in cuda/tree.cuh previously swallowed runtime errors (k_mark_refine, k_scatter_refine x2, k_mark_coarsen, k_scatter_coarsen, k_check_balance). Adds CHECK_CUDA() and CHECK_CUDA_LAUNCH() macros that throw std::runtime_error on failure; the macros will migrate to common/cuda_check.hpp in Stage 2.

coarsen() returned early on n==0, skipping all_reduce_or and update_partition_map (both collectives: MPI_Allreduce/Allgather), which deadlocked every non-empty rank at np>=2. The fix guards only local marking/compaction with n>0 so all ranks reach the collectives. Verified: [coarsen] and the full suite pass at np 1/2/4 (previously a hang at np>=2).
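The shape of this fix can be sketched without MPI. In the sketch below the collective is replaced by a plain callable so it compiles anywhere; `coarsen_buggy`, `coarsen_fixed`, `compact_local`, and the "0 means marked for removal" convention are all hypothetical stand-ins, not the code in mpi/.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

namespace sketch {

// Stand-in for a collective such as an MPI_Allreduce with logical OR.
// In real MPI every rank must call it, otherwise the callers block forever.
using AllReduceOr = std::function<bool(bool)>;

// Hypothetical local step: drop entries equal to 0 ("marked for removal").
// Returns true if the local leaf list changed.
inline bool compact_local(std::vector<int>& leaves) {
    const std::size_t before = leaves.size();
    leaves.erase(std::remove(leaves.begin(), leaves.end(), 0), leaves.end());
    return leaves.size() != before;
}

// Buggy shape: a rank with no leaves returns before the collective.
inline bool coarsen_buggy(std::vector<int>& leaves, const AllReduceOr& all_reduce_or) {
    assert(all_reduce_or);
    if (leaves.empty()) return false;  // this rank never reaches the collective
    const bool changed = compact_local(leaves);
    return all_reduce_or(changed);
}

// Fixed shape: only the local work is guarded; the collective is unconditional.
inline bool coarsen_fixed(std::vector<int>& leaves, const AllReduceOr& all_reduce_or) {
    assert(all_reduce_or);
    bool changed = false;
    if (!leaves.empty()) changed = compact_local(leaves);
    return all_reduce_or(changed);  // reached by every rank, empty or not
}

}  // namespace sketch
```

Passing a counting lambda as the callable makes the difference visible: with an empty leaf list the buggy variant makes zero collective calls and the fixed variant makes exactly one.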

The reference balance rebuilt an O(N) dict and walked coarse levels every ripple pass. Replace with bisect_right on the sorted leaf-code array (greatest code <= neighbour, then range containment) — O(log N) per query, mirroring the C++ backends' std::lower_bound so Python stays a faithful parity oracle. Verified byte-identical (codes, levels) to the prior implementation across 32 cases (2D/3D x 4 oracles x 4 depths).
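The lookup described above (greatest code <= neighbour, then range containment) might look like the following on the C++ side. This is a minimal sketch under stated assumptions: a 2D tree, a fixed maximum depth `kMaxLevel`, and leaves stored as (code, level) where a leaf at level L covers 4^(kMaxLevel - L) consecutive finest-level codes. It uses `std::upper_bound`, the direct analogue of `bisect_right`; the names `Leaf`, `span`, and `find_leaf` are illustrative and not taken from the backends.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

constexpr int kDim = 2;       // assumed: quadtree
constexpr int kMaxLevel = 4;  // assumed maximum refinement depth

struct Leaf {
    std::uint64_t code;  // Morton code of the leaf's first finest-level cell
    int level;           // 0 = root, kMaxLevel = finest
};

// Number of finest-level codes covered by a leaf at `level`.
inline std::uint64_t span(int level) {
    assert(level >= 0 && level <= kMaxLevel);
    return std::uint64_t{1} << (kDim * (kMaxLevel - level));
}

// Index of the leaf whose code range contains `query`, or -1 if none does.
// `leaves` must be sorted by code; cost is O(log N) per query.
inline std::ptrdiff_t find_leaf(const std::vector<Leaf>& leaves, std::uint64_t query) {
    // First leaf with code > query; the candidate is the one just before it.
    auto it = std::upper_bound(
        leaves.begin(), leaves.end(), query,
        [](std::uint64_t q, const Leaf& leaf) { return q < leaf.code; });
    if (it == leaves.begin()) return -1;  // every leaf starts after query
    --it;                                 // greatest code <= query
    if (query - it->code >= span(it->level)) return -1;  // outside its range
    return it - leaves.begin();
}

}  // namespace sketch
```

With leaves `{0,2},{16,2},{32,2},{48,2},{64,1},{128,1},{192,1}`, a query for code 20 lands on index 1 and a query for code 100 lands on index 4.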

MortonCode, Coordinate, the morton:: encode/decode/SWAR ops and the geometric constants were triplicated across omp/ mpi/ cuda/. Hoist them into one header-only common/core.hpp: constexpr, __host__ __device__ via an AMR_HD macro, BMI2 fast path guarded to host-x86 (off the CUDA device pass), and both array-returning (host) and out-param (device) decode overloads so no call sites change. Backends keep their specifics (OMP Uninit + exclusive_scan, MPI collectives, CUDA Thrust glue) and now include ../common/core.hpp. Verified: omp 50038 assertions, mpi np4 50034, cuda main+tests compile.
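For readers unfamiliar with the SWAR encode/decode mentioned here, the following is a minimal 2D sketch of the technique. It is not the contents of common/core.hpp: the `SKETCH_HD` macro only imitates the role described for AMR_HD, the function names are made up, and the BMI2 fast path and strong types are left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Expands to CUDA qualifiers under nvcc, to nothing on a plain host compiler.
#if defined(__CUDACC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

namespace sketch {

// Spread the low 32 bits of x so that bit i moves to bit 2*i.
SKETCH_HD constexpr std::uint64_t spread_bits(std::uint64_t x) {
    x &= 0x00000000FFFFFFFFull;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
    x = (x | (x << 8)) & 0x00FF00FF00FF00FFull;
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x << 2)) & 0x3333333333333333ull;
    x = (x | (x << 1)) & 0x5555555555555555ull;
    return x;
}

// Inverse of spread_bits: gather the even-position bits into the low 32 bits.
SKETCH_HD constexpr std::uint64_t compact_bits(std::uint64_t x) {
    x &= 0x5555555555555555ull;
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return x;
}

// x occupies the even bits of the code, y the odd bits.
SKETCH_HD constexpr std::uint64_t encode2d(std::uint32_t x, std::uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}

// Host-side, array-returning decode: {x, y}.
constexpr std::array<std::uint32_t, 2> decode2d(std::uint64_t code) {
    return {static_cast<std::uint32_t>(compact_bits(code)),
            static_cast<std::uint32_t>(compact_bits(code >> 1))};
}

// x = 0b011, y = 0b101 interleave to 0b100111.
static_assert(encode2d(3, 5) == 39, "interleave check");

// Runtime round-trip check for arbitrary coordinates.
inline void check_round_trip(std::uint32_t x, std::uint32_t y) {
    const auto xy = decode2d(encode2d(x, y));
    assert(xy[0] == x && xy[1] == y);
}

}  // namespace sketch
```

Because everything is `constexpr`, the interleave can be checked at compile time, which is one reason a shared header-only definition is convenient for all three backends.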

Root CMake now builds all three C++ backends behind opt-in flags: find_package(MPI) drives the mpi/ targets (ctest runs them on 4 ranks); enable_language(CUDA) + AMR_CUDA_ARCHITECTURES drives the cuda/ targets with --extended-lambda baked in. Catch2 is found-or-fetched (prefers a system install, no network when present). Verified: configure + build of all 6 targets; ctest AllTests + MpiTests pass (CudaTests registered, needs a GPU to run).

nikbott added 30 commits December 6, 2025 22:16

cpp sequential impl

7a67315

cpp implementation overhaul

eacef55

remove dead code

acc3787

docs: add REFERENCES.md with citation keys

1f4d323

Citation keys for [BurstWG2011], [Holke2018], [CDK2019], [Doerfler1996], [BBHK2011], [Karypis_ParMETIS], plus OpenMP/MPI/CUDA spec references. Cross-references code/REFERENCES.md (DIC side).

chore(lint): add clang-format config

6bda7aa

Google-derived; 4-space indent, 100-column limit, include grouping (system / third-party / project), C++20 concepts handling.

chore(hooks): add pre-commit config

a0c91b4

clang-format, cmake-format, ruff for python/, codespell, and a local hook that refuses to commit perf.data / SVGs / cuda binaries.

build(make): add Makefile with WITH_MPI/WITH_CUDA/WITH_SANITIZE flags

6185880

Targets: ci-local, lint, format, configure, build, test, test-python, parity, container-cpu, container-gpu, slurm-strong, figs.

docs(ci): add CI_FUTURE.md drop-in workflows

f7ab4b4

GitHub Actions + GitLab CI YAML mapping to make targets, including the self-hosted GPU runner sketch.

Merge branch 'chore/dev-infra'

9007a45

Merge branch 'fix/cuda-error-checks'

9a76c55

chore(gitignore): ignore build binaries, viz output, perf data, edito…

53d1474

…r config Compiled backend executables (amr/test), CMake build trees, generated *.svg/*.vtu output, perf.data, and .vscode/ should never be tracked.

refactor(omp): extract OpenMP backend into omp/

f9aef3d

Uniform core/physics/tree/viz/tests/main layout, replacing the flat cpp/ sequential sources. Catch2 test suite in omp/tests.cpp.

feat(mpi): add distributed MPI backend with ghost exchange

d0e1031

New backend: linear Morton tree partitioned across ranks via MPI_Exscan offsets and MPI_Alltoallv repartition; ghost-aware 2:1 balance with fetch_ghosts; verify_global invariant check. No equivalent existed before.
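To make the offset arithmetic concrete, here is a serial sketch of what an exclusive prefix sum over per-rank leaf counts yields, plus one possible ownership rule for a repartition. It simulates all ranks in one process and makes no MPI calls. The near-equal contiguous split in `owner_of` is an assumption for illustration; the backend's actual partition policy is not described in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

namespace sketch {

// counts[r] = leaves currently held by rank r.
// offsets[r] = global index of rank r's first leaf along the Z-curve.
// This is the value MPI_Exscan with MPI_SUM delivers to rank r; in MPI the
// result on rank 0 is undefined, here it is simply 0.
inline std::vector<std::uint64_t> exclusive_offsets(const std::vector<std::uint64_t>& counts) {
    std::vector<std::uint64_t> offsets(counts.size());
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), std::uint64_t{0});
    return offsets;
}

// Destination rank for global leaf index g when `total` leaves are split into
// `ranks` contiguous chunks whose sizes differ by at most one (assumed policy).
inline int owner_of(std::uint64_t g, std::uint64_t total, int ranks) {
    assert(ranks > 0 && g < total);
    const std::uint64_t n = static_cast<std::uint64_t>(ranks);
    const std::uint64_t base = total / n;   // minimum chunk size
    const std::uint64_t extra = total % n;  // first `extra` ranks get base + 1
    const std::uint64_t cut = extra * (base + 1);
    if (g < cut) return static_cast<int>(g / (base + 1));
    return static_cast<int>(extra + (g - cut) / base);
}

}  // namespace sketch
```

Given its offset, a rank can compute the owner of each local leaf as `owner_of(offset + i, total, ranks)`, and from that the per-destination send counts an all-to-all exchange needs.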

refactor(cuda): split backend into core/physics/viz/tests/main modules

666d3fb

Completes the cuda/ backend around the already-committed tree.cuh: Thrust-based scan/transform with persistent double buffering.

refactor(cpp): remove flat sequential sources superseded by backends

8a7a639

cpp/ is replaced by the omp/ backend (runs serially at 1 thread).

build(cmake): add top-level CMakeLists for OpenMP backend

beac96b

Release-default, LTO, optional sanitizers, Catch2 via FetchContent; builds amr + amr_tests from omp/. MPI/CUDA feature flags come in Stage 2.

Merge branch 'refactor/backend-layout'

b45f635

docs(omp): import parallelization guide from origin/main fork

65c42c9

docs(cuda): import walkthrough and Colab notebook from origin/main fork

e00ee4c

chore(benchmarks): import historical benchmark data from origin/main …

ab8f5da

…fork OpenMP thread sweeps and CUDA Colab/GPU-server runs behind the early speedup figures. Reference data; see benchmarks/README.md for provenance.

chore(benchmarks): import benchmark and plot scripts from origin/main…

d969735

… fork Target the old flat layout; kept as reference, superseded by the Stage 2 SLURM/CMake harness. README documents provenance and caveats.

Merge branch 'chore/port-origin-main-artifacts'

2f13433

Merge branch 'fix/mpi-coarsen-deadlock'

955dbc5

build(cuda): add Makefile with nvcc --extended-lambda

b6b86db

Thrust device lambdas in tree.cuh need --extended-lambda; without it the build fails. Mirrors mpi/Makefile. ARCH overridable (default sm_70).

docs: add top-level README with architecture and build/test matrix

3053541

Documents the linear Morton-tree design, the four backends in the new omp/mpi/cuda/python layout, verified per-backend build/test commands, and a fresh OpenMP strong-scaling sample.

docs: mark ported guides historical; fix stale paths

bdddf17

WALKTHROUGH.md describes origin/main's hand-rolled CUDA (tree_kernels.cu), superseded by the Thrust cuda/tree.cuh — banner says so and the dead /home/ronan file:// links are replaced. PARALLELIZATION.md gets a layout note.

chore(benchmarks): add fresh OMP scaling data; clarify historical vs …

4e585d6

…current Strong-scaling of the restructured omp/ backend (1-32 threads, identical 89.3M-leaf result = parallel determinism). README now separates this fresh data from the historical origin/main scripts/data.

Merge branch 'docs/sync-to-backend-layout'

81b4a64

nikbott added 7 commits May 29, 2026 08:44

test(python): add numba-free balance/Morton regression tests

4cd8301

Geometric face-adjacency 2:1 check (gold standard), sorted+unique invariant, balance fixed-point, and Morton round-trip. Runnable without numba/GPU so it works as the CI parity net.
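A geometric face-adjacency 2:1 check can be stated independently of the tree code, which is what makes it useful as an oracle. The sketch below is a brute-force O(N^2) C++ rendering of the idea for 2D cells, not a port of the Python test: the `Cell` layout, the `kMaxLevel` constant, and the function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

namespace sketch {

constexpr int kMaxLevel = 3;  // assumed: finest grid is 8 x 8

struct Cell {
    std::uint32_t x, y;  // lower corner in finest-level units
    int level;           // 0 = root
};

// Edge length of a cell at `level`, in finest-level units.
inline std::uint32_t side(int level) {
    assert(level >= 0 && level <= kMaxLevel);
    return 1u << (kMaxLevel - level);
}

// True when the intervals [a0, a0+al) and [b0, b0+bl) overlap with positive length.
inline bool overlaps(std::uint32_t a0, std::uint32_t al, std::uint32_t b0, std::uint32_t bl) {
    return a0 < b0 + bl && b0 < a0 + al;
}

// Two cells share a face if they touch along one axis and overlap along the other.
// Cells that meet only at a corner do not count.
inline bool share_face(const Cell& a, const Cell& b) {
    const std::uint32_t sa = side(a.level), sb = side(b.level);
    const bool touch_x = (a.x + sa == b.x) || (b.x + sb == a.x);
    const bool touch_y = (a.y + sa == b.y) || (b.y + sb == a.y);
    return (touch_x && overlaps(a.y, sa, b.y, sb)) ||
           (touch_y && overlaps(a.x, sa, b.x, sb));
}

// 2:1 balance: face neighbours differ by at most one level.
inline bool is_two_to_one_balanced(const std::vector<Cell>& cells) {
    for (std::size_t i = 0; i < cells.size(); ++i) {
        for (std::size_t j = i + 1; j < cells.size(); ++j) {
            if (share_face(cells[i], cells[j]) &&
                std::abs(cells[i].level - cells[j].level) > 1) {
                return false;
            }
        }
    }
    return true;
}

}  // namespace sketch
```

The quadratic pair scan is deliberate: it shares no lookup logic with the balance routine under test, so a bug in the search cannot hide a bug in the balance.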

Merge branch 'refactor/python-balance-bisect'

a95c526

Merge branch 'refactor/common-core'

fa3c76e

Merge branch 'build/cmake-backends'

86d6d61

nikbott wants to merge 37 commits into main from cpp-sequential
