enable_thinking=True produces reasoning text without literal <think>...</think> tags — _ThinkingStateMachine cannot separate it from the final answer #113

@micDKpara

Environment

  • Hardware: Jetson AGX Thor, 128GB unified memory, JetPack 7.2, CUDA 13.2
  • TensorRT-Edge-LLM version: (check with cat ~/TensorRT-Edge-LLM/VERSION or git log)
  • Model: Qwen3-14B, quantized to NVFP4 via tensorrt_edgellm.scripts.quantize
  • Server: experimental OpenAI-compatible server (experimental/server),
    custom persistent wrapper around LLM.generate_stream()
  • Engine built with: llm_build --maxInputLen 40960 --maxKVCacheCapacity 40960

Description

When generating with SamplingParams(enable_thinking=True, ...), the model
produces reasoning content as plain text but does not emit literal
<think> / </think> tags in the streamed or final output.

api_server.py defines _ThinkingStateMachine (line ~219) specifically to
parse THINK_OPEN_TAG = "<think>" / THINK_CLOSE_TAG = "</think>" boundaries
in streamed text — but since the engine never emits these tags, the state
machine has nothing to key on and reasoning text is indistinguishable from
the final answer in raw output.
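
For reference, here is a minimal sketch of the kind of tag-driven splitting described above. It is not the repository's _ThinkingStateMachine; the class name ThinkSplitter and its methods are hypothetical, and only the two tag constants come from api_server.py as quoted. It shows why the approach depends entirely on the literal tags: when they are absent, every character lands in the answer bucket.

```python
from dataclasses import dataclass, field

THINK_OPEN_TAG = "<think>"
THINK_CLOSE_TAG = "</think>"


@dataclass
class ThinkSplitter:
    """Hypothetical streaming splitter, for illustration only."""

    in_think: bool = False
    pending: str = ""  # text held back because it may be a partial tag
    reasoning: list = field(default_factory=list)
    answer: list = field(default_factory=list)

    def _emit(self, text: str) -> None:
        # Route text to the bucket that matches the current state.
        if text:
            (self.reasoning if self.in_think else self.answer).append(text)

    def feed(self, chunk: str) -> None:
        buf = self.pending + chunk
        self.pending = ""
        while buf:
            tag = THINK_CLOSE_TAG if self.in_think else THINK_OPEN_TAG
            idx = buf.find(tag)
            if idx != -1:
                # Full tag found: flush what precedes it, then flip state.
                self._emit(buf[:idx])
                self.in_think = not self.in_think
                buf = buf[idx + len(tag):]
                continue
            # No full tag: hold back the longest suffix that could be
            # the start of the tag, so tags split across chunks still match.
            keep = 0
            for n in range(min(len(tag) - 1, len(buf)), 0, -1):
                if buf.endswith(tag[:n]):
                    keep = n
                    break
            self._emit(buf[:len(buf) - keep])
            self.pending = buf[len(buf) - keep:]
            buf = ""

    def finish(self):
        # Flush any held-back text and return (reasoning, answer).
        self._emit(self.pending)
        self.pending = ""
        return "".join(self.reasoning), "".join(self.answer)
```

Feeding it a tagged stream separates the two parts even when a tag is split across chunks. Feeding it the tag-free text described in this issue returns an empty reasoning string and the whole output as the answer.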

Steps to reproduce

  1. Quantize Qwen3-14B to NVFP4, export to ONNX, build TRT engine
    (standard pipeline per Quick Start Guide)
  2. Serve with a persistent Python wrapper calling
    llm.generate_stream(messages, SamplingParams(enable_thinking=True, ...))
  3. Send a simple prompt, e.g. "what is 2+2"
  4. Inspect raw streamed/concatenated token text
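
Step 4 can be done with a small helper along these lines. This is a sketch that assumes the streamed chunks are available as plain strings; the issue does not state what generate_stream() actually yields, so the caller may need to extract the text from each item first. The function name summarize_stream is hypothetical.

```python
from typing import Iterable


def summarize_stream(chunks: Iterable[str]) -> dict:
    """Concatenate streamed text chunks and count the markers of interest."""
    raw = "".join(chunks)
    return {
        "raw": raw,
        "open_tags": raw.count("<think>"),
        "close_tags": raw.count("</think>"),
        # Blank-line separators, which also occur inside reasoning text.
        "paragraph_breaks": raw.count("\n\n"),
    }
```

Printing repr(result["raw"]) rather than the string itself makes newlines and any stray tag fragments visible.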

Expected

Raw output should contain <think>\n...reasoning...\n</think>\n\n2 + 2 = 4
so _ThinkingStateMachine (or any consumer) can reliably split reasoning
from the final answer.

Actual

Raw output is plain text with no <think> tags at all, so the reasoning runs straight into the final answer.

Because reasoning itself frequently contains multiple \n\n-separated
paragraphs, there is no reliable text-based heuristic to find the boundary
between reasoning and the final answer — _ThinkingStateMachine is
unusable in this configuration.

Things I checked

  • Swapped generation_prompt and generation_prompt_thinking in
    processed_chat_template.json. Tag presence did not change: toggling
    enable_thinking=True/False only controls whether reasoning happens at
    all, not whether tags are emitted.
  • Confirmed enable_thinking=False reliably produces clean, tag-free,
    reasoning-free output — this works as expected, just disables the
    feature entirely rather than exposing it cleanly.
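
One further check that may be worth running is whether any string in processed_chat_template.json already contains a think tag. If a generation prompt did include the opening tag, that tag would be part of the prompt rather than the generated text. Whether that applies to this engine is not established here. The sketch below makes no assumption about the file's layout and simply walks whatever JSON it finds; the function names are hypothetical.

```python
import json


def find_think_strings(node, path="$"):
    """Return (json_path, value) for every string containing a think tag."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_think_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits += find_think_strings(value, f"{path}[{i}]")
    elif isinstance(node, str) and "think>" in node:
        # "think>" matches both the opening and the closing tag.
        hits.append((path, node))
    return hits


def scan_template(file_path):
    """Load a JSON file and report where think tags appear in it."""
    with open(file_path, encoding="utf-8") as fh:
        return find_think_strings(json.load(fh))
```

Calling scan_template("processed_chat_template.json") and printing each hit with repr() shows which entries, if any, carry the tags.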

Question

Is <think> tag emission expected to work with the persistent
LLM.generate_stream() Python API + a manually-quantized NVFP4 engine,
or is tag emission only supported through a specific server entrypoint /
chat template configuration we may be missing? If there's a known-correct
recipe for enable_thinking=True + reliable <think> tag output on a
custom-quantized Qwen3 engine, a pointer would be very helpful.

Happy to provide the full processed_chat_template.json, exact
quantize/export/build commands, or raw debug logs on request.
