Restructure into omp/mpi/cuda backends + correctness fixes + build/CI scaffolding #2

Open
nikbott wants to merge 37 commits into main from cpp-sequential

Owner

nikbott commented May 29, 2026

Summary

This branch reimplements the AMR engine as three uniform, parallel backends and adds build/test/doc scaffolding. It was developed in parallel with the ERAD-SP line on main, and the two adopted different directory layouts (this one: omp/ mpi/ cuda/; main: sequential/ openmp/ python/ results/ scripts/). I'm opening this for review and discussion of how to reconcile the two, not as a clobber: nothing here overwrites the ERAD-SP work, and the pre-ERAD line is preserved in the archive/pre-restructure-main tag.

What's here

Backends (uniform core/physics/tree/viz/tests/main per dir)

  • omp/: OpenMP backend (evolved from the flat cpp/ sequential sources)
  • mpi/: new distributed backend with ghost exchange (MPI_Alltoallv), Z-curve repartition (MPI_Exscan), ghost-aware 2:1 balance, and verify_global
  • cuda/: Thrust-based, modular split around tree.cuh
  • common/core.hpp: shared Morton + strong types (__host__ __device__, BMI2 host fast-path), de-duplicating ~160 lines across backends
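
For reviewers less familiar with the Morton (Z-order) codes that common/core.hpp shares across the backends, here is a minimal Python sketch of 2D bit interleaving with a round-trip check. It is an illustration only: the function names (`part1by1`, `compact1by1`, `encode2d`, `decode2d`) and the 16-bits-per-axis limit are assumptions made for this example, not the actual API or bit layout in common/core.hpp.

```python
def part1by1(x: int) -> int:
    """Spread the low 16 bits of x so a zero bit sits between each pair of bits."""
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def compact1by1(x: int) -> int:
    """Inverse of part1by1: gather every other bit back into the low 16 bits."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x


def encode2d(x: int, y: int) -> int:
    """Interleave x into the even bits and y into the odd bits."""
    return part1by1(x) | (part1by1(y) << 1)


def decode2d(code: int) -> tuple:
    """Recover (x, y) from an interleaved code."""
    return compact1by1(code), compact1by1(code >> 1)


if __name__ == "__main__":
    # Round-trip check over a small grid (hypothetical test range).
    ok = all(decode2d(encode2d(x, y)) == (x, y) for x in range(64) for y in range(64))
    print("round trip ok:", ok)
```

Sorting cells by such a code yields the Z-curve ordering that a linear tree can then partition and search.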

Correctness fixes

  • fix(mpi): coarsen() deadlocked at ≥2 ranks — empty-partition ranks returned before the all_reduce_or/update_partition_map collectives. Now all ranks reach them.
  • fix(cuda/tree): wrap kernel syncs in CHECK_CUDA (was silently swallowing errors).
  • perf(python): balance rewritten with bisect to match the C++ lower_bound, with a numba-free regression test (geometric 2:1 check).

Build / tooling

  • CMake AMR_BUILD_MPI / AMR_BUILD_CUDA flags; cuda/Makefile with --extended-lambda; Catch2 found-or-fetched.
  • .clang-format, .pre-commit-config.yaml, REFERENCES.md, docs/CI_FUTURE.md, top-level README.md.
  • benchmarks/: fresh OMP strong-scaling data + the historical ERAD-SP-era data/scripts (marked historical).

Verification (this host)

  • OpenMP tests: 50038 assertions / 8 cases pass
  • MPI tests: full suite passes at 1/2/4 ranks (50034 assertions)
  • CUDA: compiles + links clean (no GPU available here to run)
  • Python reference: Morton round-trip + balance invariants pass
  • CMake: all 6 targets build; ctest AllTests + MpiTests pass

All commits SSH-signed.

For discussion

The two restructures conflict at the directory level, so a clean merge needs a layout decision (keep omp/mpi/cuda, fold in ERAD-SP's run_experiments.sh/results, etc.). Happy to drive whichever direction we agree on.

🤖 Generated with Claude Code

Citation keys for [BurstWG2011], [Holke2018], [CDK2019], [Doerfler1996],
[BBHK2011], [Karypis_ParMETIS], plus OpenMP/MPI/CUDA spec references.
Cross-references code/REFERENCES.md (DIC side).

Google-derived; 4-space indent, 100-column limit, include grouping
(system / third-party / project), C++20 concepts handling.

clang-format, cmake-format, ruff for python/, codespell, and a local
hook that refuses to commit perf.data / SVGs / cuda binaries.

Targets: ci-local, lint, format, configure, build, test, test-python,
parity, container-cpu, container-gpu, slurm-strong, figs.

GitHub Actions + GitLab CI YAML mapping to make targets, including the
self-hosted GPU runner sketch.

Six cudaDeviceSynchronize() calls in cuda/tree.cuh previously swallowed
runtime errors (k_mark_refine, k_scatter_refine x2, k_mark_coarsen,
k_scatter_coarsen, k_check_balance). Adds CHECK_CUDA() and
CHECK_CUDA_LAUNCH() macros that throw std::runtime_error on failure;
the macros will migrate to common/cuda_check.hpp in Stage 2.
…r config

Compiled backend executables (amr/test), CMake build trees, generated
*.svg/*.vtu output, perf.data, and .vscode/ should never be tracked.

Uniform core/physics/tree/viz/tests/main layout, replacing the flat
cpp/ sequential sources. Catch2 test suite in omp/tests.cpp.

New backend: linear Morton tree partitioned across ranks via MPI_Exscan
offsets and MPI_Alltoallv repartition; ghost-aware 2:1 balance with
fetch_ghosts; verify_global invariant check. No equivalent existed before.
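
As a rough illustration of how exclusive-scan offsets can drive a repartition, the following serial Python sketch computes each rank's global starting index and the per-destination send counts that an MPI_Alltoallv call would need. It simulates the ranks in a single process and is not MPI code. The helper names (`exclusive_scan`, `repartition_send_counts`) are hypothetical, and the even split of the Morton-ordered leaf array is an assumption, since the commit does not state the backend's actual target policy.

```python
def exclusive_scan(counts):
    """Serial stand-in for MPI_Exscan with MPI_SUM.

    Entry r is the sum of counts[0..r-1]. Real MPI leaves rank 0's
    receive buffer undefined; this sketch uses 0 for it.
    """
    out, running = [], 0
    for c in counts:
        out.append(running)
        running += c
    return out


def repartition_send_counts(counts):
    """Return send[src][dst]: how many leaves rank src ships to rank dst.

    Assumption: the target partition is an even split of the global,
    Morton-ordered leaf array.
    """
    nranks = len(counts)
    total = sum(counts)
    offsets = exclusive_scan(counts)
    # Rank r should end up owning global indices [bounds[r], bounds[r + 1]).
    bounds = [r * total // nranks for r in range(nranks + 1)]
    send = [[0] * nranks for _ in range(nranks)]
    for src in range(nranks):
        lo, hi = offsets[src], offsets[src] + counts[src]
        for dst in range(nranks):
            overlap = min(hi, bounds[dst + 1]) - max(lo, bounds[dst])
            send[src][dst] = max(0, overlap)
    return send


if __name__ == "__main__":
    # Hypothetical imbalance: rank 1 holds no leaves at all.
    for row in repartition_send_counts([10, 0, 2]):
        print(row)
```

Because every rank's slice is contiguous in the global ordering, the leaves stay sorted along the Z-curve after the exchange without a global sort.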
Completes the cuda/ backend around the already-committed tree.cuh:
Thrust-based scan/transform with persistent double buffering.

cpp/ is replaced by the omp/ backend (runs serially at 1 thread).

Release-default, LTO, optional sanitizers, Catch2 via FetchContent;
builds amr + amr_tests from omp/. MPI/CUDA feature flags come in Stage 2.
…fork

OpenMP thread sweeps and CUDA Colab/GPU-server runs behind the early
speedup figures. Reference data; see benchmarks/README.md for provenance.
… fork

Target the old flat layout; kept as reference, superseded by the Stage 2
SLURM/CMake harness. README documents provenance and caveats.

coarsen() returned early on n==0, skipping all_reduce_or and
update_partition_map (both collectives: MPI_Allreduce/MPI_Allgather), which
deadlocked every non-empty rank at np>=2. Guard only local marking/compaction
with n>0 so all ranks reach the collectives.

Verified: [coarsen] and full suite pass at np 1/2/4 (was a hang at np>=2).
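
The failure mode is easy to reproduce outside MPI. The sketch below simulates ranks with Python threads and uses a `threading.Barrier` as a stand-in for the collective; it is not the backend's code, and `run_coarsen` and its parameters are hypothetical names. A timeout turns what would be a hang under MPI into a detectable broken barrier.

```python
import threading


def run_coarsen(local_sizes, guard_only_local_work, timeout=0.2):
    """Simulate one coarsen() call per rank; return which ranks passed the collective.

    guard_only_local_work=False models the early return on n == 0.
    guard_only_local_work=True models guarding only the local work with n > 0.
    """
    nranks = len(local_sizes)
    collective = threading.Barrier(nranks)  # stand-in for an MPI collective
    passed = [False] * nranks

    def coarsen(rank):
        n = local_sizes[rank]
        if n == 0 and not guard_only_local_work:
            return  # buggy path: this rank never enters the collective
        if n > 0:
            pass  # local marking/compaction would happen here
        try:
            collective.wait(timeout=timeout)
            passed[rank] = True
        except threading.BrokenBarrierError:
            pass  # under real MPI this rank would block forever

    threads = [threading.Thread(target=coarsen, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return passed


if __name__ == "__main__":
    print("early return:", run_coarsen([5, 0], guard_only_local_work=False))
    print("guarded work:", run_coarsen([5, 0], guard_only_local_work=True))
```

With the early return, the non-empty rank waits for a partner that never arrives; with the guard moved to the local work only, every rank reaches the collective.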
Thrust device lambdas in tree.cuh need --extended-lambda; without it the
build errors. Mirrors mpi/Makefile. ARCH overridable (default sm_70).

Documents the linear Morton-tree design, the four backends in the new
omp/mpi/cuda/python layout, verified per-backend build/test commands, and
a fresh OpenMP strong-scaling sample.

WALKTHROUGH.md describes origin/main's hand-rolled CUDA (tree_kernels.cu),
superseded by the Thrust cuda/tree.cuh; the banner says so, and the dead
/home/ronan file:// links are replaced. PARALLELIZATION.md gets a layout note.
…current

Strong-scaling of the restructured omp/ backend (1-32 threads; the identical
89.3M-leaf result across the sweep indicates parallel determinism). README now
separates this fresh data from the historical origin/main scripts/data.

nikbott added 7 commits May 29, 2026 08:44

The reference balance rebuilt an O(N) dict and walked coarse levels every
ripple pass. Replace with bisect_right on the sorted leaf-code array
(greatest code <= neighbour, then range containment), which is O(log N) per
query and mirrors the C++ backends' std::lower_bound, so Python stays a
faithful parity oracle.

Verified byte-identical (codes, levels) to the prior implementation across
32 cases (2D/3D x 4 oracles x 4 depths).
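
The lookup described above (greatest code <= neighbour, then range containment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: it assumes each leaf is stored by the Morton code of its first finest-level descendant, that a level-l leaf covers 2^(DIM * (MAX_LEVEL - l)) consecutive codes, and the names `MAX_LEVEL`, `DIM`, `span`, and `find_leaf` are hypothetical.

```python
from bisect import bisect_right

MAX_LEVEL = 2  # hypothetical finest level for this example
DIM = 2        # 2D quadtree


def span(level):
    """Number of finest-level codes covered by one leaf at the given level."""
    return 1 << (DIM * (MAX_LEVEL - level))


def find_leaf(codes, levels, query):
    """Return the index of the leaf whose code range contains query, else None.

    codes must be sorted ascending; levels[i] is the level of codes[i].
    """
    i = bisect_right(codes, query) - 1  # greatest code <= query
    if i < 0:
        return None
    if query < codes[i] + span(levels[i]):  # range containment
        return i
    return None


if __name__ == "__main__":
    # One coarse leaf, four fine leaves, one coarse leaf; codes 12..15 are absent.
    codes = [0, 4, 5, 6, 7, 8]
    levels = [1, 2, 2, 2, 2, 1]
    for q in (3, 6, 11, 13):
        print(q, "->", find_leaf(codes, levels, q))
```

Each query costs one binary search plus a constant-time containment test, which is where the O(log N) bound comes from.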
Geometric face-adjacency 2:1 check (gold standard), sorted+unique
invariant, balance fixed-point, and Morton round-trip. Runnable without
numba/GPU so it works as the CI parity net.
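
To make the geometric face-adjacency check concrete, here is a brute-force 2D sketch in plain Python. It is an assumption-laden illustration rather than the test in this branch: leaves are given as hypothetical `(x, y, level)` tuples with the lower-left corner in finest-grid units, and `cell_box`, `face_adjacent`, and `is_balanced` are names invented for this example. The O(N^2) pair scan is deliberately naive.

```python
def cell_box(x, y, level, max_level):
    """Axis-aligned box (x0, y0, x1, y1) of a leaf, in finest-grid units."""
    size = 1 << (max_level - level)
    return x, y, x + size, y + size


def face_adjacent(a, b):
    """True if boxes a and b share an edge segment of positive length."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    touch_x = (ax1 == bx0 or bx1 == ax0) and min(ay1, by1) > max(ay0, by0)
    touch_y = (ay1 == by0 or by1 == ay0) and min(ax1, bx1) > max(ax0, bx0)
    return touch_x or touch_y


def is_balanced(leaves, max_level):
    """2:1 check: face-adjacent leaves may differ by at most one level."""
    boxes = [cell_box(x, y, lvl, max_level) for x, y, lvl in leaves]
    for i in range(len(leaves)):
        for j in range(i + 1, len(leaves)):
            if abs(leaves[i][2] - leaves[j][2]) > 1 and face_adjacent(boxes[i], boxes[j]):
                return False
    return True


if __name__ == "__main__":
    balanced = [(0, 0, 1), (2, 0, 1), (0, 2, 1), (2, 2, 2), (3, 2, 2), (2, 3, 2), (3, 3, 2)]
    print(is_balanced(balanced, max_level=2))
    print(is_balanced([(0, 0, 1), (4, 0, 3)], max_level=3))
```

Because it looks only at geometry and never at the tree's own neighbour lookup, a check of this kind can catch bugs in the lookup itself.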
MortonCode, Coordinate, the morton:: encode/decode/SWAR ops and the
geometric constants were triplicated across omp/ mpi/ cuda/. Hoist them
into one header-only common/core.hpp: constexpr, __host__ __device__ via
an AMR_HD macro, BMI2 fast path guarded to host-x86 (off the CUDA device
pass), and both array-returning (host) and out-param (device) decode
overloads so no call sites change.

Backends keep their specifics (OMP Uninit + exclusive_scan, MPI
collectives, CUDA Thrust glue) and now include ../common/core.hpp.

Verified: omp 50038 assertions, mpi np4 50034, cuda main+tests compile.

Root CMake now builds all three C++ backends behind opt-in flags:
find_package(MPI) drives the mpi/ targets (ctest runs them on 4 ranks);
enable_language(CUDA) + AMR_CUDA_ARCHITECTURES drives the cuda/ targets
with --extended-lambda baked in. Catch2 is found-or-fetched (prefers a
system install, no network when present).

Verified: configure + build of all 6 targets; ctest AllTests + MpiTests
pass (CudaTests registered, needs a GPU to run).