Fix moe acc #7988

Open
BingooYang wants to merge 143 commits into PaddlePaddle:develop from BingooYang:fix_moe_acc

@BingooYang
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Jiang-Jia-Jun and others added 30 commits April 3, 2026 11:29
…thout a slot(PaddlePaddle#7141) (PaddlePaddle#7181)

* [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (PaddlePaddle#7163)

* Set MC_MAX_MR_SIZE to avoid register hang

* up

* [fix] prevent requests from entering running state without a slot

* [fix] count abort set

* [fix] count preempted task in waiting list

---------

Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com>
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
* [Feature]whl version

* [Feature]whl version,set root_is_pure = false

* [Feature]code style

Co-authored-by: ChowMingSing <610208940@qq.com>
…7218 (PaddlePaddle#7256)

* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
…addle#7266)

* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addlePaddle#7191)

* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
…7277)

* Update docs for release/2.5

* Update English docs for release/2.5

- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
  - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
  - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
  - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
  - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f



* Clarify --extra-index-url usage in installation docs

Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c



* Update nvidia_gpu.md

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…nd bug (PaddlePaddle#7221) (PaddlePaddle#7296)

Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
…#7276)

* fix

* refine code

* refine code

* refine code

* refine code

* refine code
…ion Params + CUDAGraph Validation (PaddlePaddle#7215,PaddlePaddle#7281) (PaddlePaddle#7301)

* refactor cudagraph args

* refactor quant cli param

* fix

* fix

* tmp skip xpu

* fix
…e#7320) (PaddlePaddle#7322)

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addlePaddle#7318)

* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
…ePaddle#7159) (PaddlePaddle#7351)

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
…_stop_value kernels (PaddlePaddle#7370)

- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
…it scenario (PaddlePaddle#7364) (PaddlePaddle#7387)

## Motivation

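In the PD disaggregation scenario, after a decode node receives a request forwarded by a prefill node, it does not promptly update the cache block hit information.
This lowers the prefix cache hit rate and hurts inference performance.

## Modifications

1. In the `_free_blocks_when_stop` method, additionally exclude prefill nodes (`splitwise_role == "prefill"`)
   from the cache block update, so that prefill nodes do not update the cache a second time and corrupt its state.
2. After a decode node successfully allocates a request (`_alloc_requests_with_cache`), explicitly call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache block information,
   so that the decode node correctly recognizes the prefix cache that was already hit.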

Co-authored-by: kevin <chengyf112@gmail.com>
gongshaotian and others added 22 commits May 20, 2026 20:15
* Reset buffer size of R3

* refine code
…addlePaddle#7843 (PaddlePaddle#7845)

* [Feature]console metrics log for pd disaggregation

* [Feature]console metrics log for pd disaggregation fix test
…ePaddle#7881) (PaddlePaddle#7831)

* Add inner benchmark metrics component

* Add window_mode

* remove temp scripts

* fix ut

* increase coverage lines
* Update _xpu_4cards_case_test.yml

* Update _xpu_8cards_case_test.yml
Co-authored-by: kevin <chengyf112@gmail.com>
…ePaddle#7688) (PaddlePaddle#7729)

* support c8 decode attention

* support c16 attention && backend

* opt kernel

* fix

* opt larger batch

* inplace out

* fix input_batch && remove fast_math

* fix xpu

* fix bug

* fix ci

* opt and fix mtp

* fix merge

* clean code

* fix merge

* update

* update test

* fix test

* fix test

* opt buffer

* fix conflict

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…dlePaddle#7883) (PaddlePaddle#7884)

* opt mtp logprob

* fix

* fix test and log

* fix bits

* Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…ng CUDAGraph recapture(PaddlePaddle#7934) (PaddlePaddle#7933)

* fix clear bug in rl

* fix: use self.max_chunk_tokens instead of fd_config.get_max_chunk_tokens() for buffer recreation

fd_config.get_max_chunk_tokens() without mm_max_tokens_per_item arg may
return a smaller value than the actual initial buffer size when enable_mm
and mm_max_tokens_per_item is None. Use self.max_chunk_tokens which is
already computed during __init__ and consistent with first CUDAGraph capture.
…addle#7839)

* PD send cache via storage & Refine swap_cache_layout op

* skip messager

* up

* consider write cache error

* fix ci

* up
…dle#7892) and Triton SamplerBackend (PaddlePaddle#7639) (PaddlePaddle#7910)

* [CP][Feature] support new sampler backend with triton (PaddlePaddle#7639)

* [Optimization] TopP=1.0 using _random_sample (PaddlePaddle#7892)

* code check

* add env FD_ENABLE_TOP_P_ONE_OPT control top_p=1 opt

* defalut FD_ENABLE_TOP_P_ONE_OPT=0

* change FD_ENABLE_TOP_P_ONE_OPT=1

* fix mtp triton seed

* change triton seed int64

* fix triton sampler

* add seed for mtp triton sampler

---------

Co-authored-by: Zero Rains <linjunlu@zerorains.top>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…ddle#7923) (PaddlePaddle#7922)

* fix accurate issue

* fix acc issue in ep + tp mode

---------

Co-authored-by: root <root@tjzj-inf-sci-k8s-bzz2-0271.tjzj.baidu.com>
…in accuracy (PaddlePaddle#7960)

* Reset buffer size of R3

* refine code

* R3 fix Eos bug

* pre-commit

* fix r3 ci and support dsa

* refine code

* refine code

* reset ci dir

* refine code

* fix dsv3
* Reset buffer size of R3

* refine code

* R3 fix Eos bug

* pre-commit

* fix r3 ci and support dsa

* refine code

* refine code

* reset ci dir

* refine code

* fix dsv3

* fix ernie5 mm bug
…lePaddle#7951) (PaddlePaddle#7971)

* Add GDR streaming weight update path

* [RL] Unify GDR and IPC weight update
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-03 17:32:34

📋 Review Summary

PR overview: Fixes an inference accuracy issue in MoE models and also updates the CI build configuration (pins the PaddlePaddle wheel version, improves container cleanup logic, and migrates runners to the APPROVAL group).

⚠️ This PR is large (319 files). Splitting it is recommended to reduce review difficulty and merge risk.

Suggested split

  • PR 1: [CI] CI infrastructure updates (.github/workflows/**, scripts/**)
  • PR 2: [BugFix] MoE accuracy fix (custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/moe/, custom_ops/gpu_ops/grouped_topk_kernels.cu)
  • PR 3: [Models] Model forward changes (fastdeploy/model_executor/models/**)
  • PR 4: [OP] Attention / Quantization kernel changes (custom_ops/gpu_ops/append_attn/, custom_ops/gpu_ops/decode_unified_attention/, fastdeploy/model_executor/layers/quantization/)

Scope of changes: CI workflows, MoE kernels, Models, Attention backends, Quantization

Affected tags: [CI] [Models] [OP] [BugFix]

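Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | topk_ids_numel is declared as int but receives topk_ids.numel() (int64_t), so there is a risk of int32 truncation with large batches |
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/triton_moe_kernels.py | The new kernel fused_moe_kernel_bf16 adds offs_token.to(tl.int64) to fix the stride overflow, but the old kernel fused_moe_kernel_paddle has not received the same fix |
| ❓ Question | .github/workflows/_accuracy_test.yml | Removes --ipc=host --pid=host, which may affect IPC between distributed multi-process workloads inside the container |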
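
To make the two int32 concerns above concrete, here is a minimal sketch in plain Python that emulates 32-bit signed arithmetic with `ctypes`. It is not the kernel code from this PR: the helper names and the values chosen for `stride_cm`, `offs_token`, and `numel` are assumptions picked only so that the results cross the int32 limit of 2**31 - 1.

```python
import ctypes

INT32_MAX = 2**31 - 1


def mul_int32(a: int, b: int) -> int:
    """Multiply two values the way 32-bit signed arithmetic would, wrapping on overflow."""
    return ctypes.c_int32(a * b).value


def mul_int64(a: int, b: int) -> int:
    """Multiply two values with 64-bit signed arithmetic."""
    return ctypes.c_int64(a * b).value


def narrow_to_int32(n: int) -> int:
    """Emulate storing a 64-bit count into a 32-bit int variable."""
    return ctypes.c_int32(n).value


# Hypothetical sizes, assumed for illustration only.
stride_cm = 8192        # elements between consecutive output rows
offs_token = 300_000    # row index of a token in a large batch

wrapped_offset = mul_int32(stride_cm, offs_token)  # wraps to a negative offset
exact_offset = mul_int64(stride_cm, offs_token)    # the intended offset

# Hypothetical element count that no longer fits in a 32-bit int.
numel = 2**31 + 5
truncated_numel = narrow_to_int32(numel)

print(f"int32 offset: {wrapped_offset}, int64 offset: {exact_offset}")
print(f"numel: {numel}, after narrowing to int32: {truncated_numel}")
```

With these assumed values the 32-bit product wraps to a negative number, which would address memory far from the intended row, while the 64-bit product is exact. The same wrap-around applies when a 64-bit element count is stored in a 32-bit variable.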

No blocking issues found. PR convention issues are reported in the section below.

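Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | check-bypass.yml uses per_page=100 without pagination (this PR touches 319 files, but only the first 100 were actually checked) | ⚠️ Still present |
| F2 | Hard-coded internal bcebos wheel URLs, a long-term maintenance risk | ⚠️ Still present (now extended to more workflows, including cu129/cu130/RL) |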
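
Finding F1 comes down to reading only the first page of a paginated listing. The sketch below shows the general pattern of looping until a short page is returned. It is an illustration under stated assumptions, not the contents of check-bypass.yml: `list_all_items`, `fake_fetch_page`, and the in-memory file list are hypothetical stand-ins for a real paginated API call.

```python
from typing import Callable, List


def list_all_items(
    fetch_page: Callable[[int, int], List[str]], per_page: int = 100
) -> List[str]:
    """Collect every item by requesting pages until one comes back short."""
    items: List[str] = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:
            # A short (or empty) page means there is nothing left to fetch.
            break
        page += 1
    return items


# Hypothetical stand-in for a paginated API: 319 changed files held in memory.
ALL_FILES = [f"file_{i}.py" for i in range(319)]


def fake_fetch_page(page: int, per_page: int) -> List[str]:
    """Return one 1-indexed page of the fake file list."""
    start = (page - 1) * per_page
    return ALL_FILES[start:start + per_page]


first_page_only = fake_fetch_page(1, 100)
everything = list_all_items(fake_fetch_page)

print(len(first_page_only), len(everything))
```

A single request with per_page=100 sees 100 of the 319 entries in this fake listing, while the loop sees all of them. When the total is an exact multiple of the page size, the loop makes one extra request that returns an empty page and then stops.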

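📝 PR Convention Check

The title "Fix moe acc" lacks an official tag, and every description section is empty (template placeholders only). This is unchanged since the previous review.

Suggested title (can be copied directly):

  • [BugFix] Fix MoE accuracy regression
Suggested PR description (can be copied directly):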
## Motivation

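Fix an accuracy issue in MoE models (an int32 overflow in `stride_cm * offs_token` in the Triton kernel produced incorrect results), and update the CI build configuration to improve stability.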

## Modifications

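- Add the `fused_moe_kernel_bf16` Triton kernel, which promotes `offs_token`, `off_experts`, and `offs_bn` to `tl.int64` before index computation, fixing stride multiplication overflow with large batches
- Pin the PaddlePaddle GPU wheel in CI to version 3.3.1.post20260420 (covering cu126/cu129/cu130/RL/XPU), replacing the previous nightly pre-release version
- Add a "Terminate and delete the container" step (`if: always()`) to all build/test workflows so that containers are cleaned up even after an abnormal exit
- Improve the workspace cleanup logic by adding a `find`-based force-cleanup fallback, so that leftover directories do not stall CI
- Add the `--no-same-owner` option to all `tar` commands to avoid permission problems during extraction
- Migrate the runners of several workflows from `ubuntu-latest` to the `APPROVAL` group for a more consistent runner environment
- Remove the `--privileged` flag from Docker build containers to improve CI security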

## Usage or Command

N/A

## Accuracy Tests

N/A (please add MoE accuracy comparison data from before and after the fix)

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

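Overall Assessment

The CI infrastructure improvements are reasonable. The MoE accuracy fix resolves the int32 stride overflow with a new Triton kernel, which is the right direction. Main concerns: the old kernel fused_moe_kernel_paddle has not received the int64 fix, tritonmoe_preprocess.cu carries an int truncation risk, and removing --ipc=host may affect distributed tests.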
