Fix moe acc #7988
Open
BingooYang wants to merge 143 commits into
…thout a slot(PaddlePaddle#7141) (PaddlePaddle#7181) * [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (PaddlePaddle#7163) * Set MC_MAX_MR_SIZE to avoid register hang * up * [fix] prevent requests from entering running state without a slot * [fix] count abort set * [fix] count preempted task in waiting list --------- Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com>
… (PaddlePaddle#7192) * fix MTP bugs in TP and overlap * fix
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com> Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
* [Feature]whl version * [Feature]whl version,set root_is_pure = false * [Feature]code style Co-authored-by: ChowMingSing <610208940@qq.com>
…7218 (PaddlePaddle#7256) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut
…s in SM90 flash_mask_attn (PaddlePaddle#7216)
…addle#7266) * Remove duplicate NICs from environment variables * Update version for xvllm in download_dependencies.sh Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addlePaddle#7191) * merge matmul and add * modify format * using paddle.nn.functional.linear * using _C_ops.linear * using paddle.nn.functional.linear * add FLAGS_use_legacy_linear env var in test case * fix format * add assert and remove env * modify format * using matmul for no bias * modify accurate baseline
…7277) * Update docs for release/2.5 * Update English docs for release/2.5 - Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link - Update docs/get_started/installation/nvidia_gpu.md: - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option - Update docs/zh/get_started/installation/nvidia_gpu.md: - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1 Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f * Clarify --extra-index-url usage in installation docs Add note explaining that --extra-index-url is only for downloading fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed from the Paddle source specified by -i. Applied to both Chinese and English nvidia_gpu.md installation guides. Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c * Update nvidia_gpu.md --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…nd bug (PaddlePaddle#7221) (PaddlePaddle#7296) Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
…#7276) * fix * refine code * refine code * refine code * refine code * refine code
…ion Params + CUDAGraph Validation (PaddlePaddle#7215,PaddlePaddle#7281) (PaddlePaddle#7301) * refactor cudagraph args * refactor quant cli param * fix * fix * tmp skip xpu * fix
…e#7320) (PaddlePaddle#7322) Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addlePaddle#7318) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci
) (PaddlePaddle#7339) * moe bf16 ep support paddle batch_gemm
…#7308) (PaddlePaddle#7310) * support quant use pow2scale * fix * fix
…ePaddle#7159) (PaddlePaddle#7351) * [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 * [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 * fix
…_stop_value kernels (PaddlePaddle#7370) - speculate_limit_thinking_content_length: update current_base_step to step_idx+1 (step_idx now records history count before current round); remove incorrect step_idx decrement on accept_num truncation; mark step_idx param as const. - speculate_set_stop_value_multi_seqs: fix can_stop gate to use step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx formula (remove stale -accept_num offset); use <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix accept_tokens index (remove -1). - Update unit tests for speculate_set_stop_value_multi_seqs kernel.
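…it scenario (PaddlePaddle#7364) (PaddlePaddle#7387) ## Motivation In the PD disaggregation scenario, after a decode node receives a request forwarded by a prefill node, it does not update the cache block hit information in time, which leads to a low prefix cache hit rate and hurts inference performance. ## Modifications 1. In the `_free_blocks_when_stop` method, additionally exclude prefill nodes (`splitwise_role == "prefill"`) from cache block updates, so that prefill nodes do not update the cache a second time and leave its state inconsistent. 2. After a decode node successfully allocates a request (`_alloc_requests_with_cache`), explicitly call `update_cache_blocks` with `need_prefill_tokens` to update the cache block information, so that the decode node correctly sees the prefix cache that has already been hit. Co-authored-by: kevin <chengyf112@gmail.com>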
* Reset buffer size of R3 * refine code
…addlePaddle#7843 (PaddlePaddle#7845) * [Feature]console metrics log for pd disaggregation * [Feature]console metrics log for pd disaggregation fix test
…ePaddle#7881) (PaddlePaddle#7831) * Add inner benchmark metrics component * Add window_mode * remove temp scripts * fix ut * increase coverage lines
* Update _xpu_4cards_case_test.yml * Update _xpu_8cards_case_test.yml
Co-authored-by: kevin <chengyf112@gmail.com>
…e threashold for prefill instance (PaddlePaddle#7871)
…ePaddle#7688) (PaddlePaddle#7729) * support c8 decode attention * support c16 attention && backend * opt kernel * fix * opt larger batch * inplace out * fix input_batch && remove fast_math * fix xpu * fix bug * fix ci * opt and fix mtp * fix merge * clean code * fix merge * update * update test * fix test * fix test * opt buffer * fix conflict --------- Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…dlePaddle#7883) (PaddlePaddle#7884) * opt mtp logprob * fix * fix test and log * fix bits * Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…ng CUDAGraph recapture(PaddlePaddle#7934) (PaddlePaddle#7933) * fix clear bug in rl * fix: use self.max_chunk_tokens instead of fd_config.get_max_chunk_tokens() for buffer recreation fd_config.get_max_chunk_tokens() without mm_max_tokens_per_item arg may return a smaller value than the actual initial buffer size when enable_mm and mm_max_tokens_per_item is None. Use self.max_chunk_tokens which is already computed during __init__ and consistent with first CUDAGraph capture.
…addle#7839) * PD send cache via storage & Refine swap_cache_layout op * skip messager * up * consider write cache error * fix ci * up
…ddlePaddle#7936) (PaddlePaddle#7917) * support fused noauxtc kernel on ep mode * fix unit test
…dle#7892) and Triton SamplerBackend (PaddlePaddle#7639) (PaddlePaddle#7910) * [CP][Feature] support new sampler backend with triton (PaddlePaddle#7639) * [Optimization] TopP=1.0 using _random_sample (PaddlePaddle#7892) * code check * add env FD_ENABLE_TOP_P_ONE_OPT control top_p=1 opt * defalut FD_ENABLE_TOP_P_ONE_OPT=0 * change FD_ENABLE_TOP_P_ONE_OPT=1 * fix mtp triton seed * change triton seed int64 * fix triton sampler * add seed for mtp triton sampler --------- Co-authored-by: Zero Rains <linjunlu@zerorains.top> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…ddle#7923) (PaddlePaddle#7922) * fix accurate issue * fix acc issue in ep + tp mode --------- Co-authored-by: root <root@tjzj-inf-sci-k8s-bzz2-0271.tjzj.baidu.com>
…in accuracy (PaddlePaddle#7960) * Reset buffer size of R3 * refine code * R3 fix Eos bug * pre-commit * fix r3 ci and support dsa * refine code * refine code * reset ci dir * refine code * fix dsv3
* Reset buffer size of R3 * refine code * R3 fix Eos bug * pre-commit * fix r3 ci and support dsa * refine code * refine code * reset ci dir * refine code * fix dsv3 * fix ernie5 mm bug
…lePaddle#7951) (PaddlePaddle#7971) * Add GDR streaming weight update path * [RL] Unify GDR and IPC weight update
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-03 17:32:34
📋 Review summary
PR overview: Fixes an inference accuracy issue in MoE models and also updates the CI build configuration (pins the PaddlePaddle wheel version, improves the container cleanup logic, migrates runners to the APPROVAL group).
Suggested split:
- PR 1: [CI] CI infrastructure updates: `.github/workflows/**`, `scripts/**`
- PR 2: [BugFix] MoE accuracy fix: `custom_ops/gpu_ops/moe/`, `fastdeploy/model_executor/layers/moe/`, `custom_ops/gpu_ops/grouped_topk_kernels.cu`
- PR 3: [Models] Model forward changes: `fastdeploy/model_executor/models/**`
- PR 4: [OP] Attention / Quantization kernel changes: `custom_ops/gpu_ops/append_attn/`, `custom_ops/gpu_ops/decode_unified_attention/`, `fastdeploy/model_executor/layers/quantization/`
Scope of changes: CI workflows, MoE kernels, Models, Attention backends, Quantization
Impact tags: [CI] [Models] [OP] [BugFix]
Issues
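| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | `custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu` | `topk_ids_numel` is declared as `int` but receives `topk_ids.numel()` (`int64_t`), so there is a risk of int32 truncation with large batches |
| 🟡 Suggestion | `fastdeploy/model_executor/layers/moe/triton_moe_kernels.py` | The new kernel `fused_moe_kernel_bf16` adds `offs_token.to(tl.int64)` to fix the stride overflow, but the old kernel `fused_moe_kernel_paddle` has not received the same fix |
| ❓ Question | `.github/workflows/_accuracy_test.yml` | `--ipc=host --pid=host` were removed, which may affect IPC between the distributed processes running inside the container |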
No blocking issues found. PR convention issues are reported in the section below.
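The truncation risk flagged for `tritonmoe_preprocess.cu` can be illustrated with a small pure-Python sketch. It only emulates what happens when a 64-bit element count is stored in a 32-bit signed `int` on a typical two's-complement platform; the helper names and the tensor shapes are hypothetical and are not taken from the PR.

```python
INT32_MAX = 2**31 - 1


def narrow_to_int32(value: int) -> int:
    """Emulate storing a 64-bit value in a 32-bit signed int (two's-complement wrap)."""
    wrapped = value & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped > INT32_MAX else wrapped


def topk_ids_numel(num_tokens: int, top_k: int) -> int:
    """Element count of a hypothetical [num_tokens, top_k] topk_ids tensor."""
    return num_tokens * top_k


# A typical size fits in int32 and survives the narrowing unchanged.
small = topk_ids_numel(4096, 8)

# An illustrative, deliberately huge size exceeds INT32_MAX and wraps to a negative value.
large = topk_ids_numel(400_000_000, 8)

print(small, "->", narrow_to_int32(small))
print(large, "->", narrow_to_int32(large))
```

Whether such sizes are reachable in practice depends on the deployment; keeping the count in a 64-bit variable, matching the return type of `numel()`, avoids the question entirely.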
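Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | `check-bypass.yml` uses `per_page=100` without pagination, so files are missed (this PR touches 319 files, but only the first 100 were actually checked) | |
| F2 | Hard-coded internal bcebos wheel URL, a long-term maintenance risk | |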
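Finding F1 comes down to reading only the first page of a paginated listing. The sketch below shows the general pattern under stated assumptions: `make_fake_api` and `fetch_page` are hypothetical stand-ins for the real file-listing call, and the loop simply keeps requesting pages until a short page signals the end.

```python
from typing import Callable, List

PER_PAGE = 100


def make_fake_api(total_files: int) -> Callable[[int, int], List[str]]:
    """Build a hypothetical paginated endpoint that serves `total_files` file names."""
    names = [f"file_{i}.txt" for i in range(total_files)]

    def fetch_page(page: int, per_page: int) -> List[str]:
        # Pages are 1-indexed, as is common for REST APIs.
        start = (page - 1) * per_page
        return names[start:start + per_page]

    return fetch_page


def list_first_page_only(fetch_page: Callable[[int, int], List[str]]) -> List[str]:
    """The flawed pattern: a single request, so anything past PER_PAGE is never seen."""
    return fetch_page(1, PER_PAGE)


def list_all_files(fetch_page: Callable[[int, int], List[str]]) -> List[str]:
    """Request pages until one comes back shorter than PER_PAGE."""
    files: List[str] = []
    page = 1
    while True:
        batch = fetch_page(page, PER_PAGE)
        files.extend(batch)
        if len(batch) < PER_PAGE:
            break
        page += 1
    return files


api = make_fake_api(319)  # the file count reported for this PR
first_only = list_first_page_only(api)
everything = list_all_files(api)
print(len(first_only), len(everything))
```

With 319 files the single-request version sees 100 entries while the loop sees all 319, which matches the gap described in F1.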
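📝 PR convention check
The title "Fix moe acc" lacks an official tag, and every description section is empty (template placeholders only). This is unchanged since the previous review.
Suggested title (can be copied directly):
[BugFix] Fix MoE accuracy regression
Suggested PR description (click to expand, can be copied directly)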
## Motivation
Fix an MoE model accuracy issue (an int32 overflow of `stride_cm * offs_token` in the Triton kernel caused abnormal accuracy), and update the CI build configuration to improve stability.
## Modifications
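- Add a new `fused_moe_kernel_bf16` Triton kernel that promotes `offs_token`, `off_experts`, and `offs_bn` to `tl.int64` before any index computation, fixing the stride multiplication overflow with large batches
- Pin the PaddlePaddle GPU wheel in CI to version 3.3.1.post20260420 (covering cu126/cu129/cu130/RL/XPU), replacing the previous nightly pre-release version
- Add a "Terminate and delete the container" step (`if: always()`) to all build/test workflows so that containers are cleaned up even after an abnormal exit
- Improve the workspace cleanup logic by adding a `find`-based force cleanup fallback, so that leftover directories no longer cause CI to hang
- Add the `--no-same-owner` option to all `tar` commands to avoid permission problems when extracting
- Migrate the runners of several workflows from `ubuntu-latest` to the `APPROVAL` group for a more consistent runner environment
- Remove the `--privileged` flag from the docker build containers to improve CI security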
## Usage or Command
N/A
## Accuracy Tests
N/A (please add before/after comparison data for the MoE accuracy fix)
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
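- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The CI infrastructure improvements are reasonable, and the MoE accuracy fix resolves the int32 stride overflow through a new Triton kernel, which is the right direction. Main concerns: the old kernel fused_moe_kernel_paddle has not received the int64 fix, tritonmoe_preprocess.cu carries an int truncation risk, and removing --ipc=host may affect distributed tests.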
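The stride overflow at the center of this review can be reproduced with plain integer arithmetic. The sketch below is a pure-Python emulation, not the Triton kernel: `wrap_int32` mimics 32-bit signed wraparound, and the `stride_cm` and `offs_token` values are hypothetical numbers chosen only so that their product crosses the int32 limit.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Emulate 32-bit signed two's-complement wraparound."""
    wrapped = value & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped > INT32_MAX else wrapped


def row_offset_int32(stride_cm: int, offs_token: int) -> int:
    """Offset computed with both operands as int32, so the product wraps."""
    return wrap_int32(stride_cm * offs_token)


def row_offset_int64(stride_cm: int, offs_token: int) -> int:
    """Offset after promoting the token index to int64 before multiplying."""
    return stride_cm * offs_token  # fits comfortably in 64 bits for these sizes


# Hypothetical values: a row stride of 7168 elements and a token row index of 300000.
stride_cm = 7168
offs_token = 300_000

bad = row_offset_int32(stride_cm, offs_token)
good = row_offset_int64(stride_cm, offs_token)
print(bad, good)
```

A negative or otherwise wrong offset makes the kernel read or write the wrong rows of the output, which would show up as an accuracy problem rather than a crash. In the kernel itself the review describes the promotion as `offs_token.to(tl.int64)` applied before the index arithmetic.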