feat(moe): add MoE inference and expert parallel support#444
Conversation
- add reusable MoE router, dispatcher, runner, and expert abstractions - enable Qwen3 MoE fused inference with TP-local expert parallel routing - add graph-safe MoE workspace handling and EP backend selection through engine config - preserve legacy MoE path for existing DeepSeek V2 code
| struct CompiledResult { | ||
| InfinilmModel::Input input; | ||
| Compiled compiled; | ||
| std::shared_ptr<InfinilmModel::Output> replay_output; |
There was a problem hiding this comment.
这个新增的replay_output变量,以及graph编译时新增和修改的代码。可以注释或解释一下么,不知道啥意思
| throw std::runtime_error(" Model object not found. "); | ||
| } | ||
| return workers_.front()->state_dict_keys(); | ||
| std::vector<std::string> keys; |
There was a problem hiding this comment.
这个写法,我看了好一会才看懂。
其是等价于下面的写法。先求set, 最后再赋值给key_vec.
`
std::unordered_setstd::string keys;
for (const auto& worker : workers_) {
const auto& worker_keys = worker->state_dict_keys();
keys.insert(worker_keys.begin(), worker_keys.end());
}
std::vectorstd::string keys_vec(keys.begin(), keys.end());
return keys_vec;
`
| } else if (local_cmd == Command::LOAD_BATCH) { | ||
| try { | ||
| model_->load_parameters_no_sync(local_params); | ||
| model_->load_parameters_no_sync(local_params, local_params_strict); |
There was a problem hiding this comment.
等价于这个写法么 model_->load_parameters_no_sync(local_params, strict);
| self.parser.add_argument("--model", type=str, required=True) | ||
| self.parser.add_argument("--device", type=str, default="cpu") | ||
| self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1) | ||
| self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1) |
There was a problem hiding this comment.
这个dp只有python中被使用,不会传递给c++么
| ): | ||
| self.hf_config = read_hf_config(model_path) | ||
| self.hf_generation_config = read_hf_generation_config(model_path) | ||
| self.hf_config["moe_ep_backend"] = moe_ep_backend |
There was a problem hiding this comment.
moe_ep_backend和moe_ep_size,怎么能放进hf_config中。
hf_config对应c++中model_config的config_json变量,内容只是 config.json中的信息。
| } | ||
|
|
||
|
|
||
| def _is_internal_moe_packed_weight(key: str) -> bool: |
There was a problem hiding this comment.
Qwen3-235B-A22B中有这个权重么
| if backend not in { | ||
| "disabled", |
| return "moe" in model_type or "num_experts" in config | ||
|
|
||
|
|
||
| def configure_moe_ep_backend( |
There was a problem hiding this comment.
configure_moe_ep_backend , _is_moe_model, _normalize_moe_ep_backend 这几个函数是重复的
| @@ -386,7 +390,8 @@ def state_dict_keyname(self): | |||
|
|
|||
| def load_state_dict(self, state_dict, strict=None): | |||
There was a problem hiding this comment.
添加了strict参数,但感觉这个pr好像没有被使用到。
| namespace infinilm::models::qwen3_moe { | ||
|
|
||
| class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module { | ||
| class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock { |
There was a problem hiding this comment.
继承后貌似啥也没干。 这里直接 using Qwen3MoeSparseMoeBlock = public infinilm::layers::moe::SparseMoeBlock 可以么。
Summary
csrc/layers/moe.SparseMoeBlock,TopKRouter,FusedMoeExperts, andFusedMoErunner.local_allreduceandallgather_reducescatter.deepepbackend interface for future integration.MoeMLPintocsrc/layers/moe/legacyand keep DeepSeek-V2 on the legacy path.num_key_value_heads < tp_size.Motivation
Closes #
InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.
The current implementation focuses on correctness and data-flow alignment first:
local_allreduceas the preferred current path.allgather_reducescatteris available as a correctness-oriented backend.Type of Change
feat— new feature / new modelrefactor— code restructuring without behavior changeperf— performance improvement (no behavioral change)fix— bug fixtest— adding or fixing tests onlydocs— documentation onlybuild/ci— build system or CI configurationchore— tooling, formatting, or other non-code changesTest Results of Involved Models on Supported Platforms (Please attach screenshots)
Please attach screenshots for the final tested commands.
Suggested coverage:
local_allreduceallgather_reducescatterlocal_allreduceBenchmark / Performance Impact
Initial measured examples on A100:
local_allreduce, graph enabled:local_allreduce, graph enabled:This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.
Notes for Reviewers
local_allreduceis the recommended current EP backend for DP=1.allgather_reducescatteris correctness-oriented and expected to be slower.deepepis intentionally a placeholder interface.prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.layers/moe/legacyand is not migrated to the new fused Qwen3-MoE path.MoE EP backend: disabled.Checklist
Title, Branch, and Commits
<type>/xxx-yyyy-zzzz.main.fixup!/squash!/wipcommits remain.Scope and Design
C++ Specific
scripts/format.py.Python Specific
scripts/format.py.Testing