feat(moe): add MoE inference and expert parallel support#444

Open
qinyiqun wants to merge 1 commit into InfiniTensor:main from qinyiqun:moe

@qinyiqun

Contributor

Summary

  • Add a generic MoE layer stack under csrc/layers/moe.
  • Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
  • Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
  • Add a reserved deepep backend interface for future integration.
  • Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
  • Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
  • Optimize rank-local safetensors loading for EP expert weights.
  • Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.
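
For the GQA item above, the following is a minimal sketch of one common way to place KV heads when `num_key_value_heads < tp_size`: each KV head is replicated across a group of adjacent TP ranks. The function name `kv_heads_for_rank` and the replication layout are assumptions for illustration and are not taken from this PR's code.

```python
def kv_heads_for_rank(num_kv_heads: int, tp_size: int, rank: int) -> list[int]:
    """Return the global KV head indices that a TP rank would hold.

    Hypothetical helper: the layout is an assumption, not the PR's implementation.
    """
    if not 0 <= rank < tp_size:
        raise ValueError(f"rank {rank} is outside [0, {tp_size})")

    if num_kv_heads >= tp_size:
        # Usual case: KV heads are split evenly across ranks.
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        per_rank = num_kv_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))

    # Fewer KV heads than ranks: replicate each head on a group of ranks.
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    replicas = tp_size // num_kv_heads
    return [rank // replicas]
```

With 4 KV heads and `tp_size=8`, ranks 0 and 1 would both hold head 0, ranks 2 and 3 head 1, and so on, so every rank still owns exactly one KV head for its share of query heads.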

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

  • TP-only MoE works through the standard dispatcher.
  • DP=1 EP uses local_allreduce as the preferred current path.
  • allgather_reducescatter is available as a correctness-oriented backend.
  • DeepEP is explicitly reserved but not implemented in this PR.
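
As a rough illustration of the `local_allreduce` path listed above, the sketch below simulates one plausible reading of it in pure Python: with DP=1 every rank sees the same tokens and routing, applies only the experts it owns, and a sum-allreduce of the per-rank partial outputs reproduces the full MoE output. The helper names, the contiguous expert-to-rank mapping, and the scalar "experts" are all assumptions made for the example. The PR does not spell out the data flow at this level.

```python
def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Assumed contiguous sharding: rank r owns experts [r * n, (r + 1) * n)."""
    per_rank = num_experts // ep_size
    return expert_id // per_rank


def moe_reference(tokens, routes, experts):
    """Single-device reference: weighted sum over each token's routed experts."""
    return [
        sum(weight * experts[e](x) for e, weight in route)
        for x, route in zip(tokens, routes)
    ]


def moe_local_allreduce(tokens, routes, experts, ep_size):
    """Simulate EP: each rank computes a partial output, then partials are summed."""
    num_experts = len(experts)
    partials = []
    for rank in range(ep_size):
        partial = []
        for x, route in zip(tokens, routes):
            acc = 0.0
            for e, weight in route:
                # A rank only runs the experts it owns; others contribute zero.
                if expert_owner(e, num_experts, ep_size) == rank:
                    acc += weight * experts[e](x)
            partial.append(acc)
        partials.append(partial)
    # Stand-in for a sum-allreduce across EP ranks.
    return [sum(per_token) for per_token in zip(*partials)]


# Toy setup: 4 scalar "experts", 2 tokens, top-2 routing with fixed weights.
experts = [lambda x, k=k: (k + 1) * x for k in range(4)]
tokens = [1.0, 2.0]
routes = [[(0, 0.5), (3, 0.5)], [(1, 0.25), (2, 0.75)]]

full = moe_reference(tokens, routes, experts)
sharded = moe_local_allreduce(tokens, routes, experts, ep_size=2)
```

In this toy case `full` and `sharded` agree, which is the correctness property an EP dispatcher has to preserve regardless of which collective it uses.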

Type of Change

  • feat — new feature / new model
  • refactor — code restructuring without behavior change
  • perf — performance improvement (no behavioral change)
  • fix — bug fix
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

  • Qwen3-30B-A3B, TP=1, EP disabled
  • Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
  • Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
  • Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
  • Qwen3-8B-base non-MoE regression, TP=2, graph enabled
  • DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

  • Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
    • Prefill and decode are functional.
    • Decode performance is currently limited by MoE communication and by the quality of the temporary fused MoE kernel.
  • Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
    • Model loading and decode are functional.
    • Nsys shows decode is dominated by communication, especially allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

  • local_allreduce is the recommended current EP backend for DP=1.
  • allgather_reducescatter is correctness-oriented and expected to be slower.
  • deepep is intentionally a placeholder interface.
  • prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
  • DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
  • Non-MoE models should show MoE EP backend: disabled.
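
The backend selection rules in these notes could be checked with something like the sketch below. It is not the PR's code: `resolve_moe_ep_backend` is a hypothetical name, the set of four backend strings is inferred from this description, the MoE detection mirrors the `"moe" in model_type or "num_experts" in config` check visible in the diff, and raising `NotImplementedError` for `deepep` is an assumption about how a placeholder backend might behave.

```python
# Backend names inferred from the PR description (assumption).
VALID_BACKENDS = {"disabled", "local_allreduce", "allgather_reducescatter", "deepep"}


def is_moe_model(config: dict) -> bool:
    """Treat a model as MoE if its type mentions 'moe' or it declares experts."""
    model_type = str(config.get("model_type", "")).lower()
    return "moe" in model_type or "num_experts" in config


def resolve_moe_ep_backend(config: dict, requested: str) -> str:
    """Hypothetical resolver: validate the request and disable EP for non-MoE models."""
    backend = requested.strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown MoE EP backend: {requested!r}")
    if not is_moe_model(config):
        # Matches the note that non-MoE models report a disabled backend.
        return "disabled"
    if backend == "deepep":
        # Assumed behaviour for the reserved placeholder interface.
        raise NotImplementedError("deepep is reserved and not implemented")
    return backend
```

Keeping validation and normalization in a single helper like this is also one way to avoid duplicating the logic across entry points.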

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzz.
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
  • No stray merge commits from main.
  • No fixup! / squash! / wip commits remain.
  • Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

  • Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
  • No debug prints or temporary MoE logs are left behind.
  • Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

  • Changed files are formatted by scripts/format.py.
  • Project builds cleanly on NVIDIA.

Python Specific

  • Changed files are formatted by scripts/format.py.

Testing

  • Passed single request test, or reason for skipping is documented.
  • Passed offline performance test, or reason for skipping is documented.
  • Passed sanity test, or reason for skipping is documented.
  • Passed service test, or reason for skipping is documented.

@qinyiqun qinyiqun requested a review from a team June 18, 2026 02:17
- add reusable MoE router, dispatcher, runner, and expert abstractions
- enable Qwen3 MoE fused inference with TP-local expert parallel routing
- add graph-safe MoE workspace handling and EP backend selection through engine config
- preserve legacy MoE path for existing DeepSeek V2 code
struct CompiledResult {
InfinilmModel::Input input;
Compiled compiled;
std::shared_ptr<InfinilmModel::Output> replay_output;

@pengcheng888 pengcheng888 Jun 26, 2026

Collaborator

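Could you add comments or an explanation for the newly added replay_output variable and for the code added and modified around graph compilation? I can't tell what it is for.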

throw std::runtime_error(" Model object not found. ");
}
- return workers_.front()->state_dict_keys();
+ std::vector<std::string> keys;

Collaborator

It took me quite a while to understand this code.
It is equivalent to the version below: build a set first, then copy it into keys_vec at the end.
`
std::unordered_set<std::string> keys;
for (const auto& worker : workers_) {
    const auto& worker_keys = worker->state_dict_keys();
    keys.insert(worker_keys.begin(), worker_keys.end());
}

std::vector<std::string> keys_vec(keys.begin(), keys.end());
return keys_vec;
`

} else if (local_cmd == Command::LOAD_BATCH) {
try {
- model_->load_parameters_no_sync(local_params);
+ model_->load_parameters_no_sync(local_params, local_params_strict);

Collaborator

Is this equivalent to writing model_->load_parameters_no_sync(local_params, strict);?

self.parser.add_argument("--model", type=str, required=True)
self.parser.add_argument("--device", type=str, default="cpu")
self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1)
self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1)

Collaborator

Please provide the test commands.

Collaborator

Is this dp argument only used in Python? Is it not passed through to C++?

):
self.hf_config = read_hf_config(model_path)
self.hf_generation_config = read_hf_generation_config(model_path)
self.hf_config["moe_ep_backend"] = moe_ep_backend

@pengcheng888 pengcheng888 Jun 26, 2026

Collaborator

How can moe_ep_backend and moe_ep_size be placed in hf_config?

hf_config corresponds to the config_json variable of model_config on the C++ side, and its contents are only the information from config.json.

}


def _is_internal_moe_packed_weight(key: str) -> bool:

Collaborator

Does Qwen3-235B-A22B contain this weight?

Comment thread examples/bench.py
Comment on lines +227 to +228
if backend not in {
"disabled",

Collaborator

Could you explain these four backends and how each one should be used?

return "moe" in model_type or "num_experts" in config


def configure_moe_ep_backend(

Collaborator

The functions configure_moe_ep_backend, _is_moe_model, and _normalize_moe_ep_backend are duplicated.

@@ -386,7 +390,8 @@ def state_dict_keyname(self):

def load_state_dict(self, state_dict, strict=None):

Collaborator

A strict parameter was added, but it does not seem to be used anywhere in this PR.

namespace infinilm::models::qwen3_moe {

- class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module {
+ class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock {

Collaborator

After inheriting, the class does not seem to add anything. Could this simply be using Qwen3MoeSparseMoeBlock = infinilm::layers::moe::SparseMoeBlock; instead?
