feat(moe): add MoE inference and expert parallel support by qinyiqun · Pull Request #444 · InfiniTensor/InfiniLM

qinyiqun · 2026-06-18T02:17:45Z

Summary

Add a generic MoE layer stack under csrc/layers/moe.
Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
Add a reserved deepep backend interface for future integration.
Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
Optimize rank-local safetensors loading for EP expert weights.
Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.
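
The last item, GQA with num_key_value_heads < tp_size, is commonly handled by replicating each KV head across a group of tensor-parallel ranks. The sketch below illustrates that mapping only; the helper name `kv_heads_for_rank` is hypothetical, and the PR description does not state which scheme the implementation actually uses.

```python
def kv_heads_for_rank(rank: int, tp_size: int, num_kv_heads: int) -> list[int]:
    """Return the global KV head indices one tensor-parallel rank would hold.

    Assumption: heads are split evenly when there are enough of them, and
    replicated across consecutive ranks when there are fewer heads than ranks.
    """
    if not 0 <= rank < tp_size:
        raise ValueError("rank out of range")
    if num_kv_heads >= tp_size:
        # Usual case: each rank owns a contiguous slice of KV heads.
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        per_rank = num_kv_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))
    # Fewer KV heads than ranks: several ranks share (replicate) one head.
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    ranks_per_head = tp_size // num_kv_heads
    return [rank // ranks_per_head]
```

With 4 KV heads and tp_size 8, ranks 4 and 5 would both hold head 2 under this mapping.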

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

TP-only MoE works through the standard dispatcher.
DP=1 EP uses local_allreduce as the preferred current path.
allgather_reducescatter is available as a correctness-oriented backend.
DeepEP is explicitly reserved but not implemented in this PR.
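
As a rough illustration of the data flow behind an allreduce-based EP path, the pure-Python sketch below simulates ranks that each own a contiguous slice of experts, compute partial outputs only for their local experts, and then sum the partials. This is one plausible reading of local_allreduce, not the PR's implementation; `expert_fn`, `local_partial`, and `allreduce_sum` are hypothetical stand-ins.

```python
def expert_fn(expert_id, x):
    # Stand-in for an expert MLP: scale the token by (expert_id + 1).
    return [(expert_id + 1) * v for v in x]


def moe_reference(tokens, topk_ids, topk_weights):
    # Single-device reference: weighted sum over every selected expert.
    out = []
    for x, ids, ws in zip(tokens, topk_ids, topk_weights):
        acc = [0.0] * len(x)
        for e, w in zip(ids, ws):
            acc = [a + w * v for a, v in zip(acc, expert_fn(e, x))]
        out.append(acc)
    return out


def local_partial(rank, ep_size, num_experts, tokens, topk_ids, topk_weights):
    # Each rank sees all tokens but only runs the experts it owns.
    per_rank = num_experts // ep_size
    lo, hi = rank * per_rank, (rank + 1) * per_rank
    out = []
    for x, ids, ws in zip(tokens, topk_ids, topk_weights):
        acc = [0.0] * len(x)
        for e, w in zip(ids, ws):
            if lo <= e < hi:
                acc = [a + w * v for a, v in zip(acc, expert_fn(e, x))]
        out.append(acc)
    return out


def allreduce_sum(partials):
    # Simulated allreduce: elementwise sum of per-rank partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


tokens = [[1.0, 2.0], [4.0, 8.0]]
topk_ids = [[0, 3], [1, 2]]
topk_weights = [[0.75, 0.25], [0.5, 0.5]]
partials = [
    local_partial(r, 2, 4, tokens, topk_ids, topk_weights) for r in range(2)
]
combined = allreduce_sum(partials)
```

The summed result matches the single-device reference, which is why such a path is easy to validate for correctness even though every rank pays for a full-size allreduce.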

Type of Change

feat — new feature / new model
refactor — code restructuring without behavior change
perf — performance improvement (no behavioral change)
fix — bug fix
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

Qwen3-30B-A3B, TP=1, EP disabled
Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
Qwen3-8B-base non-MoE regression, TP=2, graph enabled
DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
- Prefill and decode are functional.
- Decode performance is currently limited by MoE communication and temporary fused MoE kernel quality.
Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
- Model loading and decode are functional.
- Nsys shows decode is dominated by communication, especially allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

local_allreduce is the recommended current EP backend for DP=1.
allgather_reducescatter is correctness-oriented and expected to be slower.
deepep is intentionally a placeholder interface.
prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
Non-MoE models should show MoE EP backend: disabled.
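
The backend rules in these notes can be sketched as a small resolver. The four backend names come from this description and the model check mirrors the `"moe" in model_type or "num_experts" in config` line quoted in the review; the function name `resolve_moe_ep_backend` and the rule that an EP size of 1 resolves to disabled are assumptions for illustration.

```python
_BACKENDS = {"disabled", "local_allreduce", "allgather_reducescatter", "deepep"}


def is_moe_model(config: dict) -> bool:
    # Same heuristic as the line quoted in the review comments.
    model_type = str(config.get("model_type", "")).lower()
    return "moe" in model_type or "num_experts" in config


def resolve_moe_ep_backend(config: dict, backend, ep_size: int) -> str:
    """Validate the requested backend and return the effective one."""
    name = (backend or "disabled").strip().lower()
    if name not in _BACKENDS:
        raise ValueError(f"unknown MoE EP backend: {backend!r}")
    # Assumption: non-MoE models and EP size 1 always report "disabled".
    if not is_moe_model(config) or ep_size <= 1:
        return "disabled"
    if name == "deepep":
        # Reserved placeholder interface; not implemented in this PR.
        raise NotImplementedError("deepep backend is reserved")
    return name
```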

Checklist

Title, Branch, and Commits

PR title follows Conventional Commits.
Branch name follows <type>/xxx-yyyy-zzzz.
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
No stray merge commits from main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
No debug prints or temporary MoE logs are left behind.
Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

Changed files are formatted by scripts/format.py.
Project builds cleanly on NVIDIA.

Python Specific

Changed files are formatted by scripts/format.py.

Testing

Passed single request test, or reason for skipping is documented.
Passed offline performance test, or reason for skipping is documented.
Passed sanity test, or reason for skipping is documented.
Passed service test, or reason for skipping is documented.

- add reusable MoE router, dispatcher, runner, and expert abstractions
- enable Qwen3 MoE fused inference with TP-local expert parallel routing
- add graph-safe MoE workspace handling and EP backend selection through engine config
- preserve legacy MoE path for existing DeepSeek V2 code

pengcheng888 · 2026-06-26T06:16:24Z

    struct CompiledResult {
        InfinilmModel::Input input;
        Compiled compiled;
+        std::shared_ptr<InfinilmModel::Output> replay_output;


Could you add comments or explain the new replay_output variable, along with the code added and modified in graph compilation? I can't tell what it is for.

pengcheng888 · 2026-06-26T06:24:54Z

        throw std::runtime_error(" Model object not found. ");
    }
-    return workers_.front()->state_dict_keys();
+    std::vector<std::string> keys;


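It took me a while to understand this code.
It is equivalent to the version below: build a set first, then copy it into keys_vec at the end.

```cpp
// Collect the union of parameter keys across all workers.
std::unordered_set<std::string> keys;
for (const auto &worker : workers_) {
    const auto &worker_keys = worker->state_dict_keys();
    keys.insert(worker_keys.begin(), worker_keys.end());
}

// Copy the de-duplicated keys into a vector (iteration order is unspecified).
std::vector<std::string> keys_vec(keys.begin(), keys.end());
return keys_vec;
```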

pengcheng888 · 2026-06-26T06:27:17Z

            } else if (local_cmd == Command::LOAD_BATCH) {
                try {
-                    model_->load_parameters_no_sync(local_params);
+                    model_->load_parameters_no_sync(local_params, local_params_strict);


Is this equivalent to `model_->load_parameters_no_sync(local_params, strict);`?

pengcheng888 · 2026-06-26T07:18:17Z

        self.parser.add_argument("--model", type=str, required=True)
        self.parser.add_argument("--device", type=str, default="cpu")
        self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1)
+        self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1)


Please provide the test command.

Is this dp value used only in Python? Is it not passed through to C++?

pengcheng888 · 2026-06-26T07:22:50Z

    ):
        self.hf_config = read_hf_config(model_path)
        self.hf_generation_config = read_hf_generation_config(model_path)
+        self.hf_config["moe_ep_backend"] = moe_ep_backend


Why are moe_ep_backend and moe_ep_size being put into hf_config?

hf_config corresponds to the config_json variable of model_config in C++, and it should contain only the information from config.json.
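
One way to picture this concern is a sketch that keeps runtime-only parallelism settings apart from the parsed config.json contents. The names `MoeRuntimeConfig` and `build_configs` are hypothetical and do not appear in the PR; this is an illustration of the separation being asked for, not a proposed patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoeRuntimeConfig:
    """Runtime-only MoE settings that are not part of config.json."""

    moe_ep_backend: str = "disabled"
    moe_ep_size: int = 1


def build_configs(hf_config: dict, moe_ep_backend: str, moe_ep_size: int):
    # Copy so the dict mirroring config.json is never mutated with runtime keys.
    return dict(hf_config), MoeRuntimeConfig(moe_ep_backend, moe_ep_size)
```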

pengcheng888 · 2026-06-26T09:18:41Z

 }


+def _is_internal_moe_packed_weight(key: str) -> bool:


Does Qwen3-235B-A22B have this weight?

pengcheng888 · 2026-06-26T09:23:55Z

+    if backend not in {
+        "disabled",


Could you explain these four backends and how each should be used?

pengcheng888 · 2026-06-26T09:25:34Z

+    return "moe" in model_type or "num_experts" in config
+
+
+def configure_moe_ep_backend(


The functions configure_moe_ep_backend, _is_moe_model, and _normalize_moe_ep_backend are duplicated.

pengcheng888 · 2026-06-26T09:36:51Z

@@ -386,7 +390,8 @@ def state_dict_keyname(self):

    def load_state_dict(self, state_dict, strict=None):


A strict parameter was added, but it does not seem to be used anywhere in this PR.

pengcheng888 · 2026-06-26T09:39:14Z

 namespace infinilm::models::qwen3_moe {

-class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module {
+class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock {


The subclass does not seem to add anything after inheriting. Could this simply be `using Qwen3MoeSparseMoeBlock = infinilm::layers::moe::SparseMoeBlock;`?

qinyiqun requested a review from a team June 18, 2026 02:17

qinyiqun force-pushed the moe branch from adb5ae9 to 4b3058a Compare June 18, 2026 09:30

qinyiqun force-pushed the moe branch from 4b3058a to f2d4861 Compare June 23, 2026 02:20

qinyiqun requested a review from pengcheng888 June 25, 2026 08:17

pengcheng888 reviewed Jun 26, 2026

qinyiqun wants to merge 1 commit into InfiniTensor:main from qinyiqun:moe


2 participants
