
feat: support unified cast and type-safe host math fallbacks for reduce-type MPI operations#38

Merged
Ziminli merged 13 commits into master from feat/support-unified-cast
Jun 17, 2026

@Ziminli Ziminli commented Jun 17, 2026

Collaborator

Summary

This PR introduces a unified compile-time type-casting and capability evaluation framework across all currently supported devices.

Previously, the reduction and scaling steps in the core MPI collectives (AllReduce, Reduce, and ReduceScatter) had no standardized way to handle data types that the host does not support natively (e.g., half and bfloat16).

This change introduces a generic Caster alongside an extensible SFINAE-based expression detector (SupportsOp). Together, they let the reduction loops of each collective check type compatibility at compile time. The loops take the native intrinsic path where the operation is supported, and otherwise fall back to a transparent float bridge on platforms whose host toolchains do not support math on their device float types.
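The compile-time dispatch described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `FakeHalf`, `HalfToFloat`, `FloatToHalf`, `SupportsMulAssign`, and `ScaleInPlace` are hypothetical names, the detector covers only `*=`, and C++17 (`std::void_t`, `if constexpr`) is assumed.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a device half type whose host toolchain offers
// conversions but no arithmetic operators. It stores a float only to keep
// the sketch short.
struct FakeHalf {
  float value;
};

inline float HalfToFloat(FakeHalf h) { return h.value; }

inline FakeHalf FloatToHalf(float f) { return FakeHalf{f}; }

// Primary template: assume `T& *= S` is not a valid expression.
template <typename T, typename S, typename = void>
struct SupportsMulAssign : std::false_type {};

// Selected only when `T& *= S` compiles (expression SFINAE).
template <typename T, typename S>
struct SupportsMulAssign<
    T, S, std::void_t<decltype(std::declval<T&>() *= std::declval<S>())>>
    : std::true_type {};

// Scales natively when the operator exists, otherwise through a float bridge.
template <typename T>
void ScaleInPlace(T& value, float factor) {
  if constexpr (SupportsMulAssign<T, float>::value) {
    value *= factor;  // Native path.
  } else {
    value = FloatToHalf(HalfToFloat(value) * factor);  // Float bridge.
  }
}

// Exercises both paths once.
inline void RunScaleDemo() {
  float native = 3.0f;
  ScaleInPlace(native, 0.5f);
  assert(native == 1.5f);

  FakeHalf bridged{4.0f};
  ScaleInPlace(bridged, 0.25f);
  assert(HalfToFloat(bridged) == 1.0f);
}
```

Because the branch is chosen with `if constexpr`, the unsupported `*=` expression is never instantiated for `FakeHalf`, so the same loop body compiles for both kinds of types.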

Changes

  • Core Infrastructure & Type Utilities

    • Added a generic Caster abstraction layer in src/caster.h to achieve a unified interface for data type conversions across all the platforms;
    • Implemented the SFINAE-based expression detection trait SupportsOp in src/traits.h to check at compile time whether an operator (e.g., *=, +=) is supported for a given combination of operand type and scalar type;
    • Provided ToFloat<kDev>() and CastTo<kDev, T>() convenience shorthand wrappers to eliminate verbose template boilerplate;
    • Refactored the bridge file generation logic to discover, validate, and aggregate the platform-specific caster files in flexible combinations.
  • Platform-Specific Hardware Cast Implementations

    • Implemented specialized HardwareCastImpl variants for CPU, NVIDIA, Iluvatar, MetaX, Moore Threads (MThreads), and Cambricon backends;
    • Updated data_type_.h inside src/cambricon/ to give fp16 and bf16 structurally distinct types, so the two no longer collide on the same underlying primitive.
  • Collective Code Update

    • Applied the integrated SupportsOp and unified casting paths to the reduce-related OpenMPI implementations including all_reduce.h, reduce.h, and reduce_scatter.h.
  • Platform Adjustments & Host Toolchain Fixes

    • Added specialized template constraints to the Iluvatar and Moore Threads (MThreads) headers that explicitly make SupportsOp evaluate to false for half and bfloat16 during the host compilation pass, since math on these types is not really supported on the host side;
    • Updated AUTO_DETECT_DEVICES matching logic within the NVIDIA driver mapping layer to specifically and safely bind to GPU card 0;
    • Added explicit configuration targets for compiling device code sequences cleanly on Iluvatar architectures.
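As a rough picture of how a per-device cast implementation and the `CastTo<kDev, T>()` / `ToFloat<kDev>()` shorthands from the list above might fit together, here is a minimal sketch. Everything suffixed with `Sketch` is hypothetical, the device enumeration is invented for illustration, and the bfloat16 conversion uses plain truncation, which may differ from the rounding the real backends apply.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical device tags; the real enumeration is not shown in this PR.
enum class DeviceSketch { kCpu, kFakeGpu };

// Hypothetical bfloat16 storage: the upper 16 bits of an IEEE-754 float.
struct BFloat16Sketch {
  std::uint16_t bits;
};

// Primary template: a plain static_cast for natively supported types.
template <DeviceSketch kDev, typename To, typename From>
struct HardwareCastImplSketch {
  static To Apply(From v) { return static_cast<To>(v); }
};

// bfloat16 -> float: place the stored bits in the upper half of a float.
template <DeviceSketch kDev>
struct HardwareCastImplSketch<kDev, float, BFloat16Sketch> {
  static float Apply(BFloat16Sketch v) {
    const std::uint32_t word = static_cast<std::uint32_t>(v.bits) << 16;
    float result;
    std::memcpy(&result, &word, sizeof(result));
    return result;
  }
};

// float -> bfloat16: keep the upper 16 bits (truncation, no rounding).
template <DeviceSketch kDev>
struct HardwareCastImplSketch<kDev, BFloat16Sketch, float> {
  static BFloat16Sketch Apply(float v) {
    std::uint32_t word;
    std::memcpy(&word, &v, sizeof(word));
    return BFloat16Sketch{static_cast<std::uint16_t>(word >> 16)};
  }
};

// Shorthand wrapper: the source type is deduced from the argument.
template <DeviceSketch kDev, typename To, typename From>
To CastToSketch(From v) {
  return HardwareCastImplSketch<kDev, To, From>::Apply(v);
}

// Shorthand wrapper for the common "convert to float" case.
template <DeviceSketch kDev, typename From>
float ToFloatSketch(From v) {
  return CastToSketch<kDev, float>(v);
}

// Round-trips a value that bfloat16 represents exactly.
inline void RunCastDemo() {
  const BFloat16Sketch b =
      CastToSketch<DeviceSketch::kCpu, BFloat16Sketch>(1.5f);
  assert(b.bits == 0x3FC0);
  const float back = ToFloatSketch<DeviceSketch::kCpu>(b);
  assert(back == 1.5f);
}
```

Keeping the device tag as a template parameter lets each backend supply its own partial specializations while call sites stay identical across platforms.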

Platform and Backend Affected

Platform

  • CPU
  • NVIDIA GPU
  • Iluvatar GPU
  • MetaX GPU
  • Moore Threads GPU
  • Cambricon MLU

Backend

  • OpenMPI
  • MPICH

Performance Impact

  • No performance impact
  • Performance improved
  • Performance regression possible

Performance Notes
This PR is primarily architectural. Compared with a naive workaround that casts everything through float, it preserves performance: platforms such as NVIDIA keep executing natively on 16-bit types rather than being forced through an intermediate float bridge.

Known Issues & Future Work

  • Extend the specialized operators within the detail namespaces in traits.h to support more operations, including relational comparisons (LessThanOp, EqualityOp), so that the MIN and MAX collective operations can use the same type-safe unified path;
  • Native host-side reductions on CPU for kFloat16 and kBFloat16 data types are currently restricted. Full enablement requires future work to finalize base memory initialization routines, layout mapping allocations, and explicit host-side software arithmetic emulations.
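One possible shape for the comparison support mentioned in the first item is sketched below. It is speculative, since the PR lists this as future work: the tag-based layout, the `Sketch`-suffixed names, `OpaqueSketch`, and `AsFloat` are all assumptions made for illustration, and C++17 is assumed.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical operator tags: each one names the expression to probe.
struct MulAssignOpSketch {
  template <typename T, typename S>
  static auto Probe(T& a, const S& b) -> decltype(a *= b);
};

struct LessThanOpSketch {
  template <typename T, typename S>
  static auto Probe(T& a, const S& b) -> decltype(a < b);
};

// Primary template: the probed expression is assumed to be unsupported.
template <typename Op, typename T, typename S, typename = void>
struct SupportsOpSketch : std::false_type {};

// Selected only when the tag's probe expression is well-formed.
template <typename Op, typename T, typename S>
struct SupportsOpSketch<
    Op, T, S,
    std::void_t<decltype(Op::template Probe<T, S>(
        std::declval<T&>(), std::declval<const S&>()))>> : std::true_type {};

// Hypothetical type with a float conversion but no comparison operators.
struct OpaqueSketch {
  float value;
};

inline float AsFloat(const OpaqueSketch& o) { return o.value; }

// Picks the smaller element, comparing through float when `<` is missing.
template <typename T>
T MinSketch(const T& a, const T& b) {
  if constexpr (SupportsOpSketch<LessThanOpSketch, T, T>::value) {
    return b < a ? b : a;  // Native comparison.
  } else {
    return AsFloat(b) < AsFloat(a) ? b : a;  // Float bridge.
  }
}

// Exercises the native and the bridged comparison once each.
inline void RunMinDemo() {
  assert(MinSketch(3, 2) == 2);
  const OpaqueSketch smaller =
      MinSketch(OpaqueSketch{1.0f}, OpaqueSketch{-2.0f});
  assert(smaller.value == -2.0f);
}
```

With one tag per operator, adding a new operation only requires a new tag rather than a new trait, which is one way a detector like this could stay extensible.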

Test Results

Test Involved Platform

  • CPU
  • NVIDIA GPU
  • Iluvatar GPU
  • MetaX GPU
  • Moore Threads GPU
  • Cambricon MLU

Test Involved Backend

  • OpenMPI
  • MPICH

Note:

  1. Since this change primarily involves average reduction, all_reduce, reduce, and reduce_scatter were run with infinicclAvg as the reduction operation type;
  2. Due to the limitation described in Known Issues & Future Work, fp16 and bf16 calculations are not yet supported; the other native data types are supported, and casting works as expected.

CPU:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log

NVIDIA + MetaX:
send_recv.log
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log

Iluvatar:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log

Moore Threads:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log

Cambricon:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log


Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat: …, fix(nccl): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — no unrelated modifications were introduced (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene

  • The code is self-explanatory; comments were added only where the intent or rationale is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, inconsistent indentation, or mixed formatting styles remain.
  • Identifiers referenced in comments or error messages are wrapped in Markdown backticks (e.g. the `AllReduce` implementation) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 16, per .github/workflows/clang-format.yml) has been run against all modified applicable files; the diff is clean.
  • No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Type hints are added / kept consistent with the surrounding code.

Testing

  • All applicable example programs have been built and tested successfully on at least one supported heterogeneous cluster setup.

Build, CI, and Tooling

  • N/A: New backends or devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) or to if(AUTO_DETECT_BACKENDS) if applicable.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).

Documentation

  • N/A: README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • Any user-visible breaking change is called out explicitly under "Summary" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A: Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

Ziminli added 12 commits June 16, 2026 13:20
…reduce-related MPI implementations

 - add the generic `Caster` in `src/caster.h`
 - apply the generic caster in the reduce-related MPI implementations (i.e., `all_reduce`, `reduce`, and `reduce_scatter`)
 - update the bridge file generation logic to also include the platform-specific caster files and refactor it to search for a flexible combination
…ype_.h` to properly represent fp16 and bf16
… compile-time, provide convenience wrappers, and apply them to the reduce-related mpi implementations
… on Iluvatar since they are not really supported on the host side.
… on MThreads since they are not really supported on the host side.
@Ziminli Ziminli self-assigned this Jun 17, 2026
@Ziminli Ziminli merged commit 2c9455a into master Jun 17, 2026
2 checks passed
@Ziminli Ziminli deleted the feat/support-unified-cast branch June 17, 2026 13:59