Skip to content

[Feature]Add output fallback support for OpenAI serving#7942

Open
luukunn wants to merge 25 commits into
PaddlePaddle:developfrom
luukunn:fallback
Open

[Feature]Add output fallback support for OpenAI serving#7942
luukunn wants to merge 25 commits into
PaddlePaddle:developfrom
luukunn:fallback

Conversation

@luukunn
Copy link
Copy Markdown
Collaborator

@luukunn luukunn commented May 27, 2026

Motivation

当前推理链路缺少统一的 output fallback 扩展机制,业务侧如果希望对模型输出进行兜底处理,只能在各个下游环节分别适配,难以统一管理。

本 PR 引入 output fallback framework,并将 output fallback 的实际处理前移到 data processor 中,在 reasoning/tool parsing 之前对原始 decoded stream 做统一处理。这样可以保证内容文本、reasoning 内容以及 tool call 相关文本都能共享同一套 fallback 逻辑,同时也为后续扩展自定义 fallback strategy 提供统一入口。

Modifications

本 PR 主要包含以下改动:

  1. 新增 output fallback framework

    • 新增 fastdeploy/output/fallback/ 模块
    • 新增 OutputFallbackStrategy 抽象基类
    • 新增 OutputFallbackContext
    • 新增 StreamFallbackDecision
    • 新增 OutputFallbackManager
    • 支持策略注册、实例化、链式执行、状态管理和插件导入
  2. 新增 output fallback 插件加载机制

    • 新增 fastdeploy.plugins.output_fallback
    • 支持通过插件组 fastdeploy.output_fallback_plugins 自动加载插件
    • 支持通过 --output-fallback-plugin 指定外部插件路径动态导入
  3. 新增 output fallback 相关启动参数

    • --output-fallback
    • --output-fallback-plugin
    • --output-fallback-config
  4. 将 output fallback 的应用前移到 data processor

    • fastdeploy/input/base_processor.py 中新增 output_fallback_manager
    • process_response_dict_normal() 中,对完整输出文本应用 fallback
    • process_response_dict_streaming() 中,对 streaming 增量文本应用 fallback
    • fallback 在 reasoning parser / tool parser 之前执行,确保后续解析基于修正后的文本进行
  5. 支持 streaming 场景下的 fallback 控制语义

    • send:发送当前 delta
    • hold:暂存当前 delta,本轮不输出
    • flush:在流结束时输出缓存内容
    • truncate:发送当前文本并提前终止后续生成
  6. 新增 processor 侧 fallback 状态管理

    • 新增 fallback_decode_status
    • 用于维护 fallback 修正后的流式历史文本
    • 避免 parser 继续基于未经修正的原始文本工作
    • 请求结束时同步清理 fallback 状态和 manager 状态
  7. 调整 OpenAI serving 层职责

    • serving 层不再直接调用 fallback manager 的 on_delta() / on_finish() / apply()
    • 改为消费 upstream processor 已处理好的输出结果
    • 通过 fallback_truncatedskipped 字段感知 fallback 结果
    • 在 streaming truncate 场景下设置 finish_reason = "length" 并主动 abort 对应 choice
    • 非流式场景直接透传 processor 修正后的最终文本
  8. 扩展 request / output 数据结构

    • CompletionOutput 中新增:
      • fallback_truncated
      • skipped
    • 并补充相关序列化 / 反序列化测试
  9. 补充测试

    • 新增 tests/output/test_fallback.py
    • 覆盖 strategy 默认行为、manager 链式执行、hold/flush/truncate、cleanup、插件导入等场景
    • 补充 input processor 中 fallback 应用与状态清理测试
    • 补充 OpenAI chat/completion 及 v1 serving 对 processor fallback 信号的兼容测试

Usage or Command

启用指定 fallback strategy:

--output-fallback your-strategy-name

加载自定义 fallback 插件:

--output-fallback your-strategy-name \
--output-fallback-plugin /path/to/custom_fallback.py

为策略传入配置:

--output-fallback your-strategy-name \
--output-fallback-plugin /path/to/custom_fallback.py \
--output-fallback-config '{"your-strategy-name": {"key": "value"}}'

如何增加自定义兜底协议

可以通过继承 OutputFallbackStrategy 并使用 OutputFallbackManager.register(...) 注册自定义策略。

示例:

from fastdeploy.output.fallback import (
    OutputFallbackContext,
    OutputFallbackManager,
    OutputFallbackStrategy,
    StreamFallbackDecision,
)


@OutputFallbackManager.register("custom-fallback")
class CustomFallbackStrategy(OutputFallbackStrategy):
    name = "custom-fallback"

    def should_apply(self, text: str, context: OutputFallbackContext) -> bool:
        return "bad" in text

    def apply(self, text: str, context: OutputFallbackContext) -> str:
        return text.replace("bad", "good")

    def on_delta(
        self,
        delta_text: str,
        context: OutputFallbackContext,
        state: dict,
    ) -> StreamFallbackDecision:
        # streaming 场景下可自定义增量处理逻辑
        if "[HOLD]" in delta_text:
            state["buffer"] = state.get("buffer", "") + delta_text.replace("[HOLD]", "")
            return StreamFallbackDecision(action="hold")

        if "[STOP]" in delta_text:
            return StreamFallbackDecision(action="truncate", text=delta_text.replace("[STOP]", ""))

        return StreamFallbackDecision(action="send", text=delta_text)

    def on_finish(
        self,
        context: OutputFallbackContext,
        state: dict,
    ) -> StreamFallbackDecision:
        return StreamFallbackDecision(action="flush", text=state.get("buffer", ""))

自定义策略说明:

  1. should_apply(text, context)
    • 判断当前文本是否需要应用 fallback
  2. apply(text, context)
    • 用于 non-streaming 场景下处理完整文本
    • 默认的 on_delta() 也会复用该逻辑处理无状态 streaming 文本
  3. on_delta(delta_text, context, state)
    • 用于 streaming 场景下处理每个增量文本
    • state 是按 request 维度维护的策略状态,可用于跨 chunk 缓存内容
    • 当前支持的 action:
      • send
      • hold
      • truncate
  4. on_finish(context, state)
    • 在流结束时返回 flush 内容
    • 常用于将 hold 阶段缓存的内容在最后统一输出

加载方式有两种:

  1. 通过插件路径加载
    • 使用:
       --output-fallback your-strategy-name \
       --output-fallback-plugin /path/to/custom_fallback.py
  2. 通过插件组自动加载
    • 将插件注册到 fastdeploy.output_fallback_plugins 对应的 entry point group

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 27, 2026 10:02
@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 27, 2026

Thanks for your contribution!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

本 PR 为 OpenAI 兼容服务新增 output fallback 兜底处理框架,在 streaming / non-streaming 路径上对模型输出做后处理(修复 Markdown 加粗冒号、Markdown 表格、检测重复输出截断),并通过策略注册 + 插件机制支持自定义扩展。

Changes:

  • 新增 fastdeploy/output/fallback/ 子包:定义 OutputFallbackStrategy 基类、OutputFallbackContextStreamFallbackDecisionOutputFallbackManager,并内置 markdown-bold-colon / markdown-table / repeat-truncate 三个策略。
  • EngineArgs / api_server 接入 --output-fallback--output-fallback-plugin--output-fallback-config 三个启动参数,并将 manager 注入到 v0 / v1 chat 和 completion 的 serving 类。
  • 在 streaming / non-streaming 处理流程中调用 manager 的 apply / on_delta / on_finish / cleanup;命中 repeat-truncate 时将 finish_reason 设为 repeat_truncate 并 abort 对应 choice。

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
fastdeploy/output/fallback/init.py 暴露公共类并导入三个内置策略以触发注册
fastdeploy/output/fallback/base.py 定义 fallback context / decision / 抽象基类
fastdeploy/output/fallback/manager.py 注册表 / 插件加载 / apply / on_delta / on_finish / cleanup
fastdeploy/output/fallback/markdown_bold_colon.py 修正 **xxx:** 冒号位置,支持跨 delta 缓存
fastdeploy/output/fallback/markdown_table.py 修复 Markdown 表格分隔行 / 列数不一致
fastdeploy/output/fallback/repeat_truncate.py 基于 token window 检测重复输出并触发 truncate
fastdeploy/engine/args_utils.py 增加 3 个新 CLI 参数
fastdeploy/entrypoints/openai/api_server.py 解析参数构建 manager 并注入各 handler,/config-info 暴露相应字段
fastdeploy/entrypoints/openai/serving_chat.py v0 chat 流/非流路径接入 fallback,含 repeat_truncate finish_reason
fastdeploy/entrypoints/openai/serving_completion.py v0 completion 流/非流路径接入 fallback
fastdeploy/entrypoints/openai/v1/serving_base.py 基类构造接收 manager 并在 finally 清理状态
fastdeploy/entrypoints/openai/v1/serving_chat.py v1 chat 接入 fallback(非多模态路径)
fastdeploy/entrypoints/openai/v1/serving_completion.py v1 completion 接入 fallback
tests/output/test_fallback.py 覆盖 manager、内置策略、流式 hold/flush/truncate、cleanup、插件导入

choice_completion_tokens = response_ctx.choice_completion_tokens_dict[output.index]
choice.finish_reason = self._calc_finish_reason(request_output, max_tokens, choice_completion_tokens)
if fallback_truncated:
choice.finish_reason = "repeat_truncate"
if res.get("error_msg") is not None and "Aborted" in res["error_msg"]:
choices[-1].finish_reason = "abort"
if fallback_truncated:
choices[-1].finish_reason = "repeat_truncate"
choice.finish_reason = "abort"

if fallback_truncated:
choice.finish_reason = "repeat_truncate"
Comment on lines +307 to +308
if fallback_truncated:
choice.finish_reason = "repeat_truncate"
PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

Codecov Report

❌ Patch coverage is 70.92199% with 82 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@76339ec). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/output/fallback/manager.py 67.46% 31 Missing and 10 partials ⚠️
fastdeploy/input/base_processor.py 62.16% 9 Missing and 5 partials ⚠️
fastdeploy/entrypoints/openai/api_server.py 20.00% 6 Missing and 2 partials ⚠️
fastdeploy/entrypoints/openai/v1/serving_chat.py 71.42% 3 Missing and 1 partial ⚠️
fastdeploy/plugins/output_fallback/__init__.py 60.00% 2 Missing and 2 partials ⚠️
...deploy/entrypoints/openai/v1/serving_completion.py 75.00% 1 Missing and 2 partials ⚠️
fastdeploy/entrypoints/openai/serving_chat.py 83.33% 1 Missing and 1 partial ⚠️
...astdeploy/entrypoints/openai/serving_completion.py 85.71% 1 Missing and 1 partial ⚠️
fastdeploy/entrypoints/openai/v1/serving_base.py 50.00% 1 Missing and 1 partial ⚠️
fastdeploy/output/fallback/base.py 92.30% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7942   +/-   ##
==========================================
  Coverage           ?   68.09%           
==========================================
  Files              ?      472           
  Lines              ?    66200           
  Branches           ?    10217           
==========================================
  Hits               ?    45081           
  Misses             ?    18262           
  Partials           ?     2857           
Flag Coverage Δ
GPU 78.11% <70.92%> (?)
XPU 15.94% <25.17%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@PaddlePaddle-bot
Copy link
Copy Markdown

PaddlePaddle-bot commented May 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-02 20:21:37

CI报告基于以下代码生成(30分钟更新一次):
PR commit: 22fe166 | Merge base: 42c66a7 (branch: develop)


1 Required任务 : 7/10 通过

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
38(0) 38 32 2 3 1 0

2 失败详情

ℹ️ 当前无 required 失败任务,CI 仍在运行中(3 个 required 任务进行中):xpu_4cards_case_test / run_xpu_4cards_casesRun FastDeploy Unit Tests and Coverage / run_tests_with_coverageRun Four Cards Tests / run_4_cards_tests

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

This comment was marked as outdated.

@luukunn luukunn changed the title [Feature][APIServer] Add output fallback support for OpenAI serving [Feature]Add output fallback support for OpenAI serving May 28, 2026
PaddlePaddle-bot

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings June 2, 2026 10:57

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings June 3, 2026 13:18

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings June 4, 2026 09:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

Comment on lines 214 to 218
self.data_processor.process_response_dict(
response_dict=request_output,
stream=stream,
include_stop_str_in_output=include_stop_str_in_output,
request=request,
Comment on lines +41 to +42
- ``hold`` : buffer this delta inside the strategy's ``state``; downstream
strategies still run, but the manager emits nothing this round.
Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-06-04 17:49:35

📋 Review 摘要

PR 概述:新增 output fallback 框架,为 OpenAI serving 层提供统一的模型输出兜底处理机制,支持自定义策略注册、流式/非流式场景处理及插件加载。

变更范围fastdeploy/output/fallback/(新模块)、fastdeploy/input/base_processor.pyfastdeploy/entrypoints/openai/(serving_chat/completion/v1/*)、fastdeploy/engine/args_utils.pyfastdeploy/plugins/output_fallback/

影响面 Tag[Feature] [APIServer] [DataProcessor] [Engine]

问题

级别 文件 概述
🟡 建议 fastdeploy/output/fallback/manager.py:124 hold 时提前 return,当前 round 所有 trial_state 变更未写回 self.states,有状态策略状态机无法正确推进
🟡 建议 fastdeploy/output/fallback/manager.py:151 on_finish 中 buffer 非空时错误调用了 strategy.on_delta 而非 strategy.on_finish,导致自定义策略的 flush 钩子在流结束时被完全跳过

历史 Findings 修复情况

Finding 问题 状态
F1 output_fallback 类型注解缺少 Optional ✅ 已修复
F2 v1 streaming 路径缺少 fallback_truncated_choices 保护集 ✅ 已修复(truncated_choices 已加入 ServeContext
F3 v1 completion streaming 路径同样缺少保护集 ✅ 已修复
F4 repeat_truncate 不是 OpenAI 标准 finish_reason ✅ 已修复(改为 "length"
F5 truncatehold 同时触发时截断文本被静默丢弃 🔄 部分修复(if not truncated 判断已加,但 hold 时 state 不回写问题仍存在,见 N1)
F6 _calc_finish_reason 返回类型注解包含 "repeat_truncate" ⚠️ 仍存在(相关代码未在本次 diff 中改动)
F7 asdict(output) 热路径深拷贝性能问题 ⚠️ 仍存在(未改动)
F8 on_finish 返回 action="truncate" 时调用方只检查 .text 不检查 .action ⚠️ 仍存在(base_processor.pyfinish_decision.action 仍未判断)
F9 PR 描述声明支持 drop 但代码无此 action ✅ 已修复(描述已同步为 send/hold/flush/truncate
F10 request.n > 1fallback_decode_status key 冲突 ✅ 已修复(使用完整 req_id(含 ::n::idx 后缀)作为 key,各 choice 相互隔离)
F11 多策略 on_finish 对 flush 文本做覆盖而非链式传递 ✅ 已修复(改为 pending += 拼接)
F12 serving 层 cleanup 使用 base request_id 但 manager key 为 choice_id ✅ 已修复(cleanup 移至 base_processor.py,serving 层不再直接持有 manager)
F13 trial_state = dict(original_state) 浅拷贝导致可变对象泄漏 ✅ 已修复(改为 copy.deepcopy
F14 context.delta_texton_finish 调用时保存的是原始 delta,而非经 on_delta 处理后的结果 ⚠️ 仍存在
F15 on_delta 中策略实际接收的是累积缓冲内容,而非参数名 delta_text 暗示的当前增量 ⚠️ 仍存在(current_text = buffer 仍传给策略)
F16 self.output_fallback_manager 在 serving 层存储但从未消费,疑为死代码 ✅ 已修复(manager 现在注入 engine_client.data_processor,serving 层不再持有)

📝 PR 规范检查

PR 标题 [Feature]Add output fallback support for OpenAI serving 标签后缺少空格,标题格式不符合规范。

标题建议(可直接复制):

  • [Feature] Add output fallback support for OpenAI serving

PR 描述结构完整,包含 Motivation、Modifications、Usage or Command、Accuracy Tests 和 Checklist 全部必填章节,内容充实,checklist 勾选状态符合实际变更。无需修改描述。

总体评价

本轮迭代解决了 F1–F4、F9–F13、F16 共 10 个历史问题,整体质量明显提升。当前仍需关注两个新发现的 manager.py 逻辑问题(N1/N2):hold 时 trial_state 未写回导致有状态策略状态机失效,以及 on_finish 中 buffer 非空分支未调用 strategy.on_finish 导致自定义 flush 逻辑无法执行。F6/F7/F8/F14/F15 为遗留问题,建议后续迭代修复。

"Failed to apply streaming output fallback strategy '%s'.", strategy.name
)
trial_states[strategy.name] = trial_state
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 hold 时提前 return,当前 round 所有 trial_state 变更未写回 self.states

当某策略返回 action="hold" 时,代码直接 return,跳过了末尾的状态写回循环。结果:

  1. 发出 hold 的策略在 trial_state 里记录的跨轮次状态(如已累积字节数)全部丢失;
  2. 链中排在该策略之前、本轮已运行过的其他策略,其 trial_state 变更同样丢失。

对于需要通过 state dict 追踪何时可放行的有状态策略,这会导致状态机永远无法正确推进。

建议修复:在 return 前先执行已积累的 trial_states 写回:

if decision.action == "hold":
    # 先写回已执行策略的状态,再决定是否 return
    for name, ts in trial_states.items():
        s = self._get_state(request_id, name)
        s.clear()
        s.update(ts)
    if not truncated:
        return StreamFallbackDecision(action="hold", text="")
    continue

return StreamFallbackDecision(action="send", text=current_text)

def on_finish(self, request_id: str, context: OutputFallbackContext) -> StreamFallbackDecision:
buffer = self._buffers.pop(request_id, "")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 on_finish 中 buffer 非空时错误调用了 strategy.on_delta 而非 strategy.on_finish,导致自定义策略的 flush 钩子在流结束时被完全跳过。

当最后一个 delta 处于 hold 状态、buffer 非空时,此分支对每个策略调用 on_delta,策略在 on_finish 里实现的 flush 逻辑(如将缓存内容在流结束时统一输出)永远不会被执行。

另:此处 state = self._get_state(...) 直接修改持久 state,与 on_delta 正常路径(先 deepcopy 再写回)不一致,遇到异常会导致 state 半更新。

建议修复:在处理完 buffer 的 on_delta 之后,额外调用每个策略的 on_finish

if buffer:
    # ... 现有 on_delta 处理逻辑(建议同样改用 trial_state + 写回)...
    # 再触发每个策略的 on_finish
    for strategy in self.instances:
        state = self._get_state(request_id, strategy.name)
        try:
            fdec = strategy.on_finish(context, state)
        except Exception:
            continue
        if fdec.text:
            current_text += fdec.text
    return StreamFallbackDecision(action="flush", text=current_text)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants