feat: support unified cast and type-safe host math fallbacks for reduce-type MPI operations #38
Merged
…reduce-related MPI implementations
- add the generic `Caster` in `src/caster.h`
- apply the generic caster in the reduce-related MPI implementations (i.e., `all_reduce`, `reduce`, and `reduce_scatter`)
- update the bridge file generation logic to also include the platform-specific caster files and refactor it to search for a flexible combination
…ype_.h` to properly represent fp16 and bf16
… compile-time, provide convenience wrappers, and apply them to the reduce-related mpi implementations
… on Iluvatar since they are not really supported on the host side.
…ly looking for the number 0 GPU card
… on MThreads since they are not really supported on the host side.
Summary
This PR introduces a unified compile-time type-casting and capability evaluation framework across all currently supported devices.
Previously, performing reductions and scaling modifications within core MPI collectives (such as `AllReduce`, `Reduce`, and `ReduceScatter`) lacked a standardized approach for dealing with non-native host types (e.g., `half` and `bfloat16`).
This change introduces a generic `Caster` alongside a highly extensible SFINAE-driven expression detector (`SupportsOp`). Together, they allow collective reduction execution loops to automatically inspect type compatibility at compile time. They can now execute native, high-performance intrinsic paths where possible, while gracefully routing through transparent float-bridge emulation structures for platforms whose host toolchains do not support math on their device float types.
Changes
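As a rough illustration of the detector described in the summary, an SFINAE-driven check in the spirit of `SupportsOp` might look like the C++17 sketch below. The tag types `AddAssignOp` and `MulAssignOp`, the template parameter order, and the `StorageOnlyHalf` stand-in are assumptions made for this example; the actual definitions in `src/traits.h` may differ.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical operation tags: each one names the expression to probe.
// The functions are only declared, since they are used in unevaluated context.
struct AddAssignOp {
  template <typename T, typename U>
  static auto Apply(T& lhs, const U& rhs) -> decltype(lhs += rhs);
};

struct MulAssignOp {
  template <typename T, typename U>
  static auto Apply(T& lhs, const U& rhs) -> decltype(lhs *= rhs);
};

// Primary template: the expression is assumed to be unsupported.
template <typename Op, typename T, typename U, typename = void>
struct SupportsOp : std::false_type {};

// Selected only when `Op::Apply(T&, const U&)` is a well-formed expression.
template <typename Op, typename T, typename U>
struct SupportsOp<Op, T, U,
                  std::void_t<decltype(Op::Apply(std::declval<T&>(),
                                                 std::declval<const U&>()))>>
    : std::true_type {};

// A storage-only 16-bit type with no arithmetic operators, standing in for a
// device half type on a host toolchain that cannot do math on it.
struct StorageOnlyHalf {
  unsigned short bits;
};

static_assert(SupportsOp<MulAssignOp, float, float>::value);
static_assert(SupportsOp<AddAssignOp, double, int>::value);
static_assert(!SupportsOp<MulAssignOp, StorageOnlyHalf, float>::value);
static_assert(!SupportsOp<AddAssignOp, StorageOnlyHalf, StorageOnlyHalf>::value);
```

Keeping each probed expression in its own tag type means a new operation can be added without touching the detector itself, which is one plausible way to get the extensibility the summary mentions.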
Core Infrastructure & Type Utilities
- `Caster` abstraction layer in `src/caster.h` to achieve a unified interface for data type conversions across all the platforms;
- `SupportsOp` in `src/traits.h` to check operator support (e.g., `*=`, `+=`) at compile time across distinct type and scalar combinations;
- `ToFloat<kDev>()` and `CastTo<kDev, T>()` convenience shorthand wrappers to eliminate verbose template boilerplate;
Platform-Specific Hardware Cast Implementations
- `HardwareCastImpl` variants for CPU, NVIDIA, Iluvatar, MetaX, Moore Threads (MThreads), and Cambricon backends;
- `data_type_.h` inside `src/cambricon/` updated to explicitly decouple the fp16 and bf16 layouts, eliminating collisions between their underlying primitive types.
Collective Code Update
- `SupportsOp` and unified casting paths applied to the reduce-related OpenMPI implementations, including `all_reduce.h`, `reduce.h`, and `reduce_scatter.h`.
Platform Adjustments & Host Toolchain Fixes
- `SupportsOp` set to false for `half` and `bfloat16` structures on the host CPU pass, bypassing device-only compilation restrictions;
- `AUTO_DETECT_DEVICES` matching logic within the NVIDIA driver mapping layer changed to specifically and safely bind to GPU card 0;
Platform and Backend Affected
Platform
Backend
Performance Impact
Performance Notes
This PR is architectural. It preserves the performance of native paths compared to naive casting workarounds, because platforms such as NVIDIA keep operating on their native 16-bit types instead of being forced through an intermediate float bridge.
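The difference between the native path and the float bridge can be illustrated with a minimal C++17 sketch of compile-time dispatch. All names here (`SupportsMulAssign`, `HostBf16`, `BridgeToFloat`, `BridgeFromFloat`, `ScaleInPlace`) are hypothetical and chosen for this example only; they are not the PR's API, and the truncating conversion is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>

// Minimal stand-in for the PR's detector: true when `T *= U` compiles.
template <typename T, typename U, typename = void>
struct SupportsMulAssign : std::false_type {};

template <typename T, typename U>
struct SupportsMulAssign<
    T, U, std::void_t<decltype(std::declval<T&>() *= std::declval<const U&>())>>
    : std::true_type {};

// Hypothetical storage-only bfloat16: the upper 16 bits of an IEEE-754 float.
struct HostBf16 {
  std::uint16_t bits;
};

// Hypothetical bridge helper: widen bfloat16 storage to float.
inline float BridgeToFloat(HostBf16 v) {
  const std::uint32_t wide = static_cast<std::uint32_t>(v.bits) << 16;
  float out;
  std::memcpy(&out, &wide, sizeof(out));
  return out;
}

// Hypothetical bridge helper: narrow float to bfloat16 storage.
inline HostBf16 BridgeFromFloat(float v) {
  std::uint32_t wide;
  std::memcpy(&wide, &v, sizeof(wide));
  // Truncation keeps the sketch short; real code would round to nearest even.
  return HostBf16{static_cast<std::uint16_t>(wide >> 16)};
}

// Scales `count` elements in place, e.g. by 1/world_size for an average.
template <typename T>
void ScaleInPlace(T* data, std::size_t count, float factor) {
  assert(data != nullptr || count == 0);
  for (std::size_t i = 0; i < count; ++i) {
    if constexpr (SupportsMulAssign<T, float>::value) {
      data[i] *= factor;  // Native path: the type does its own math.
    } else {
      // Bridge path: widen to float, multiply, narrow back.
      data[i] = BridgeFromFloat(BridgeToFloat(data[i]) * factor);
    }
  }
}
```

Because the branch is resolved with `if constexpr`, a type that supports `*=` never pays for the widen and narrow steps, while a storage-only type compiles without ever instantiating the unsupported expression.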
Known Issues & Future Work
- Extend the `detail` namespaces inside `traits.h` to support more operations, including relational comparison expressions (`LessThanOp`, `EqualityOp`), to cleanly back a type-safe unified path for `MIN` and `MAX` collective operations;
- `kFloat16` and `kBFloat16` data types are currently restricted. Full enablement requires future work to finalize base memory initialization routines, layout mapping allocations, and explicit host-side software arithmetic emulations.
Test Results
Test Involved Platform
Test Involved Backend
Note:
`all_reduce`, `reduce`, and `reduce_scatter` are set to `infinicclAvg` for the reduction operation type;
CPU:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log
NVIDIA + MetaX:
send_recv.log
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
Iluvatar:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log
Moore Threads:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log
Cambricon:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
scatter.log
send_recv.log
Checklist
Title, Branch, and Commits
- … `feat: …`, `fix(nccl): …`).
- … `<type>/xxx-yyyy-zzzz` where `<type>` matches the PR title's Conventional Commits type and words are joined with hyphens (see `CONTRIBUTING.md` §Branches).
- … `CONTRIBUTING.md` §Pull Requests).
- … `master`: the branch is rebased cleanly on top of the current `master`.
- … `fixup!`/`squash!`/`wip` commits remain.
Scope and Design
- … `CONTRIBUTING.md` §Code/General).
- … `printf`/`std::cout`/`print(...)` left behind, or `TODO` without an owner and issue link.
General Code Hygiene
- … `CONTRIBUTING.md` §Code/General).
- … `CONTRIBUTING.md` §Code/General).
- … the `AllReduce` implementation) (`CONTRIBUTING.md` §Code/General).
- … `CONTRIBUTING.md` §Code/General).
- … `CONTRIBUTING.md` §Code/General; §Python).
C++ Specific (if C++ files changed)
- `clang-format` (version 16, per `.github/workflows/clang-format.yml`) has been run against all modified applicable files; the diff is clean.
- … `assert` with messages that include at least `__FILE__`, `__LINE__`, and `__func__` (`CONTRIBUTING.md` §C++).
- … `CONTRIBUTING.md` §C++).
- … `CONTRIBUTING.md` §C++).
- … `CONTRIBUTING.md` §C++).
- … `CONTRIBUTING.md` §C++).
- … `CONTRIBUTING.md` §C++).
Python Specific (if Python files changed)
- `ruff check` passes cleanly on CI (see `.github/workflows/ruff.yml`).
- `ruff format --check` passes cleanly; if not, run `ruff format` and commit the result.
- … `CONTRIBUTING.md` §Python).
- … `pytest.skip` messages without terminal period) are honored where applicable (`CONTRIBUTING.md` §Python).
- … `CONTRIBUTING.md` §Python).
- … `if`, `for`, and similar control-flow statements (`CONTRIBUTING.md` §Python).
- … `return`, except when it directly follows a control-flow statement (`CONTRIBUTING.md` §Python).
- … `CONTRIBUTING.md` §Python).
Testing
Build, CI, and Tooling
- … `CMakeLists.txt` under `if(AUTO_DETECT_DEVICES)` or to `if(AUTO_DETECT_BACKENDS)` if applicable.
- … `clang-format.yml`, `ruff.yml`) are green locally (or expected to be green on CI).
Documentation
- `README.md`, `CONTRIBUTING.md`, or inline docs updated when behavior, build flags, or developer workflow changed.
- … `!` or `BREAKING CHANGE:` footer.
Security and Safety