Restructure into omp/mpi/cuda backends + correctness fixes + build/CI scaffolding#2
Open
nikbott wants to merge 37 commits into
Citation keys for [BurstWG2011], [Holke2018], [CDK2019], [Doerfler1996], [BBHK2011], [Karypis_ParMETIS], plus OpenMP/MPI/CUDA spec references. Cross-references code/REFERENCES.md (DIC side).
Google-derived; 4-space indent, 100-column limit, include grouping (system / third-party / project), C++20 concepts handling.
clang-format, cmake-format, ruff for python/, codespell, and a local hook that refuses to commit perf.data / SVGs / cuda binaries.
Targets: ci-local, lint, format, configure, build, test, test-python, parity, container-cpu, container-gpu, slurm-strong, figs.
GitHub Actions + GitLab CI YAML mapping to make targets, including the self-hosted GPU runner sketch.
Six cudaDeviceSynchronize() calls in cuda/tree.cuh previously swallowed runtime errors (k_mark_refine, k_scatter_refine x2, k_mark_coarsen, k_scatter_coarsen, k_check_balance). Adds CHECK_CUDA() and CHECK_CUDA_LAUNCH() macros that throw std::runtime_error on failure; the macros will migrate to common/cuda_check.hpp in Stage 2.
…r config Compiled backend executables (amr/test), CMake build trees, generated *.svg/*.vtu output, perf.data, and .vscode/ should never be tracked.
Uniform core/physics/tree/viz/tests/main layout, replacing the flat cpp/ sequential sources. Catch2 test suite in omp/tests.cpp.
New backend: linear Morton tree partitioned across ranks via MPI_Exscan offsets and MPI_Alltoallv repartition; ghost-aware 2:1 balance with fetch_ghosts; verify_global invariant check. No equivalent existed before.
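The offset step above can be illustrated without MPI. The sketch below is a serial stand-in, not the backend's code: `exscan_offsets` computes what `MPI_Exscan` with `MPI_SUM` would hand each rank as the global index of its first leaf (MPI leaves rank 0's result undefined, so 0 is used here), and `target_rank` is a hypothetical even block split that decides which rank should own a given global leaf index after repartition. Both function names and the split policy are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

namespace sketch {

// offsets[r] = counts[0] + ... + counts[r-1]: the global index of rank r's
// first leaf along the Morton curve. Serial stand-in for MPI_Exscan(MPI_SUM).
inline std::vector<std::uint64_t> exscan_offsets(const std::vector<std::uint64_t>& counts) {
    std::vector<std::uint64_t> offsets(counts.size());
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), std::uint64_t{0});
    return offsets;
}

// Hypothetical even block split: the first (total % nranks) ranks own one
// extra leaf. Returns the rank that should own global leaf index g.
inline std::size_t target_rank(std::uint64_t g, std::uint64_t total, std::size_t nranks) {
    assert(nranks > 0 && g < total);
    const std::uint64_t base = total / nranks;   // leaves every rank gets
    const std::uint64_t extra = total % nranks;  // ranks that get one more
    const std::uint64_t cut = extra * (base + 1);
    if (g < cut) {
        return static_cast<std::size_t>(g / (base + 1));
    }
    // base > 0 here: if base were 0, then total == extra == cut and g < cut.
    return static_cast<std::size_t>(extra + (g - cut) / base);
}

}  // namespace sketch
```

With local counts {5, 0, 3, 2} the offsets are {0, 5, 5, 8}; each rank can then map its leaves to destination ranks and fill the send counts for an all-to-all exchange.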
Completes the cuda/ backend around the already-committed tree.cuh: Thrust-based scan/transform with persistent double buffering.
cpp/ is replaced by the omp/ backend (runs serially at 1 thread).
Release-default, LTO, optional sanitizers, Catch2 via FetchContent; builds amr + amr_tests from omp/. MPI/CUDA feature flags come in Stage 2.
…fork OpenMP thread sweeps and CUDA Colab/GPU-server runs behind the early speedup figures. Reference data; see benchmarks/README.md for provenance.
… fork Target the old flat layout; kept as reference, superseded by the Stage 2 SLURM/CMake harness. README documents provenance and caveats.
coarsen() returned early on n==0, skipping all_reduce_or and update_partition_map (collectives built on MPI_Allreduce/MPI_Allgather), so every non-empty rank deadlocked at np>=2. The n>0 guard now covers only local marking/compaction, so all ranks reach the collectives. Verified: [coarsen] and full suite pass at np 1/2/4 (was a hang at np>=2).
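The shape of that fix can be sketched without MPI. In the illustration below the two collectives are passed in as callbacks so the control flow is visible. The signature and the "something changed" placeholder are assumptions and do not reproduce the real coarsen(). The point is that the n > 0 guard wraps only the local work, so a rank with an empty partition still makes the same sequence of collective calls as every other rank.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

namespace sketch {

// Stand-ins for the collectives. In an MPI program every rank must call
// them the same number of times, or the other ranks block forever.
using AllReduceOr = std::function<bool(bool)>;
using UpdatePartitionMap = std::function<void()>;

inline bool coarsen(std::size_t n_local,
                    const AllReduceOr& all_reduce_or,
                    const UpdatePartitionMap& update_partition_map) {
    assert(all_reduce_or && update_partition_map);

    bool changed_local = false;
    if (n_local > 0) {
        // Local marking and compaction would go here (placeholder).
        changed_local = true;
    }
    // No early return above: ranks with n_local == 0 fall through and
    // still take part in both collectives.
    const bool changed_global = all_reduce_or(changed_local);
    update_partition_map();
    return changed_global;
}

}  // namespace sketch
```

The buggy variant corresponds to putting `if (n_local == 0) return false;` at the top, which skips both calls on empty ranks.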
Thrust device lambdas in tree.cuh need --extended-lambda; without it the build errors. Mirrors mpi/Makefile. ARCH overridable (default sm_70).
Documents the linear Morton-tree design, the four backends in the new omp/mpi/cuda/python layout, verified per-backend build/test commands, and a fresh OpenMP strong-scaling sample.
WALKTHROUGH.md describes origin/main's hand-rolled CUDA (tree_kernels.cu), superseded by the Thrust cuda/tree.cuh — banner says so and the dead /home/ronan file:// links are replaced. PARALLELIZATION.md gets a layout note.
…current Strong-scaling of the restructured omp/ backend (1-32 threads, identical 89.3M-leaf result = parallel determinism). README now separates this fresh data from the historical origin/main scripts/data.
The reference balance rebuilt an O(N) dict and walked coarse levels every ripple pass. Replace with bisect_right on the sorted leaf-code array (greatest code <= neighbour, then range containment) — O(log N) per query, mirroring the C++ backends' std::lower_bound so Python stays a faithful parity oracle. Verified byte-identical (codes, levels) to the prior implementation across 32 cases (2D/3D x 4 oracles x 4 depths).
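The lookup described above (greatest leaf code <= the neighbour's code, then range containment) can be sketched in C++ as follows. This is an illustration under stated assumptions, not the backends' code: leaves are assumed to store their anchor code at finest-level resolution, so a leaf at level l covers 2^(dim * (max_level - l)) consecutive codes, and `find_leaf`, `span` and their parameters are hypothetical names. `std::upper_bound` is used here because it is the direct counterpart of Python's `bisect_right`; how the real backends apply `std::lower_bound` is not shown in this PR text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

namespace sketch {

// Number of finest-level codes covered by a leaf at `level` (assumed layout).
constexpr std::uint64_t span(int level, int max_level, int dim = 2) {
    return std::uint64_t{1} << (dim * (max_level - level));
}

// `codes` is sorted ascending and unique; levels[i] is the level of codes[i].
// Returns the index of the leaf whose code range contains `query`, if any.
inline std::optional<std::size_t> find_leaf(const std::vector<std::uint64_t>& codes,
                                            const std::vector<int>& levels,
                                            std::uint64_t query,
                                            int max_level) {
    assert(codes.size() == levels.size());
    // First element strictly greater than query; the one before it is the
    // greatest code <= query.
    const auto it = std::upper_bound(codes.begin(), codes.end(), query);
    if (it == codes.begin()) {
        return std::nullopt;  // every leaf starts after the query
    }
    const std::size_t i = static_cast<std::size_t>(it - codes.begin()) - 1;
    // Range containment: query must fall inside [codes[i], codes[i] + span).
    if (query - codes[i] < span(levels[i], max_level)) {
        return i;
    }
    return std::nullopt;
}

}  // namespace sketch
```

Each query costs one binary search, O(log N), and needs no auxiliary dictionary, which is the property the commit relies on.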
Geometric face-adjacency 2:1 check (gold standard), sorted+unique invariant, balance fixed-point, and Morton round-trip. Runnable without numba/GPU so it works as the CI parity net.
MortonCode, Coordinate, the morton:: encode/decode/SWAR ops and the geometric constants were triplicated across omp/ mpi/ cuda/. Hoist them into one header-only common/core.hpp: constexpr, __host__ __device__ via an AMR_HD macro, BMI2 fast path guarded to host-x86 (off the CUDA device pass), and both array-returning (host) and out-param (device) decode overloads so no call sites change. Backends keep their specifics (OMP Uninit + exclusive_scan, MPI collectives, CUDA Thrust glue) and now include ../common/core.hpp. Verified: omp 50038 assertions, mpi np4 50034, cuda main+tests compile.
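A minimal sketch of what such a shared header could look like, limited to 2D and to the portable SWAR path. The expansion of AMR_HD, the `Coord2` type and the function names are assumptions for illustration; the BMI2 fast path and the real MortonCode/Coordinate strong types are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Assumed expansion of the AMR_HD macro named above: host+device under nvcc,
// nothing for a plain host compiler.
#if defined(__CUDACC__)
#define AMR_HD __host__ __device__
#else
#define AMR_HD
#endif

namespace sketch {

// SWAR: spread the 32 bits of v onto the even bit positions of a 64-bit word.
AMR_HD constexpr std::uint64_t spread_bits(std::uint32_t v) {
    std::uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
    x = (x | (x << 8)) & 0x00FF00FF00FF00FFull;
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x << 2)) & 0x3333333333333333ull;
    x = (x | (x << 1)) & 0x5555555555555555ull;
    return x;
}

// Inverse of spread_bits: gather the even bit positions back into 32 bits.
AMR_HD constexpr std::uint32_t compact_bits(std::uint64_t x) {
    x &= 0x5555555555555555ull;
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return static_cast<std::uint32_t>(x);
}

struct Coord2 {
    std::uint32_t x;
    std::uint32_t y;
};

// x occupies the even bits, y the odd bits.
AMR_HD constexpr std::uint64_t encode2d(std::uint32_t x, std::uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}

// Value-returning decode, convenient on the host.
AMR_HD constexpr Coord2 decode2d(std::uint64_t code) {
    return Coord2{compact_bits(code), compact_bits(code >> 1)};
}

// Out-parameter decode, the style the text describes for device code.
AMR_HD constexpr void decode2d(std::uint64_t code, std::uint32_t& x, std::uint32_t& y) {
    x = compact_bits(code);
    y = compact_bits(code >> 1);
}

// Compile-time checks: x = 0b011, y = 0b101 interleave to 0b100111.
static_assert(encode2d(3u, 5u) == 39u, "interleave order");
static_assert(decode2d(encode2d(0xFFFFFFFFu, 0u)).x == 0xFFFFFFFFu, "round trip");

// Runtime spot check usable from a unit test.
inline void check_round_trip(std::uint32_t x, std::uint32_t y) {
    const Coord2 c = decode2d(encode2d(x, y));
    assert(c.x == x && c.y == y);
}

}  // namespace sketch
```

Keeping both decode overloads lets host call sites keep the value-returning form while device code uses the out-parameter form, which matches the "no call sites change" goal stated above.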
Root CMake now builds all three C++ backends behind opt-in flags: find_package(MPI) drives the mpi/ targets (ctest runs them on 4 ranks); enable_language(CUDA) + AMR_CUDA_ARCHITECTURES drives the cuda/ targets with --extended-lambda baked in. Catch2 is found-or-fetched (prefers a system install, no network when present). Verified: configure + build of all 6 targets; ctest AllTests + MpiTests pass (CudaTests registered, needs a GPU to run).
Summary
This branch reimplements the AMR engine as three uniform, parallel backends and adds build/test/doc scaffolding. It is a parallel restructure to the ERAD-SP line on main: the two took different directory layouts (this one: omp/ mpi/ cuda/; main: sequential/ openmp/ python/ results/ scripts/). Opening for review/discussion on how to reconcile the two, not as a clobber: nothing here overwrites the ERAD-SP work, and the pre-ERAD line is preserved in the archive/pre-restructure-main tag.

What's here

Backends (uniform core/physics/tree/viz/tests/main per dir)
- omp/: OpenMP backend (evolved from the flat cpp/ sequential sources)
- mpi/: new distributed backend with ghost exchange (MPI_Alltoallv), Z-curve repartition (MPI_Exscan), ghost-aware 2:1 balance, verify_global
- cuda/: Thrust-based, modular split around tree.cuh
- common/core.hpp: shared Morton + strong types (__host__ __device__, BMI2 host fast-path), de-duplicating ~160 lines across backends

Correctness fixes
- fix(mpi): coarsen() deadlocked at ≥2 ranks because empty-partition ranks returned before the all_reduce_or / update_partition_map collectives. Now all ranks reach them.
- fix(cuda/tree): wrap kernel syncs in CHECK_CUDA (was silently swallowing errors).
- perf(python): balance rewritten with bisect to match the C++ lower_bound, with a numba-free regression test (geometric 2:1 check).

Build / tooling
- AMR_BUILD_MPI / AMR_BUILD_CUDA flags; cuda/Makefile with --extended-lambda; Catch2 found-or-fetched.
- .clang-format, .pre-commit-config.yaml, REFERENCES.md, docs/CI_FUTURE.md, top-level README.md.
- benchmarks/: fresh OMP strong-scaling data + the historical ERAD-SP-era data/scripts (marked historical).

Verification (this host)
- ctest AllTests + MpiTests pass

All commits SSH-signed.
For discussion
The two restructures conflict at the directory level, so a clean merge needs a layout decision (keep omp/mpi/cuda, fold in ERAD-SP's run_experiments.sh/results, etc.). Happy to drive whichever direction we agree on.

🤖 Generated with Claude Code