Environment
- Hardware: Jetson AGX Thor, 128GB unified memory, JetPack 7.2, CUDA 13.2
- TensorRT-Edge-LLM version: (check with cat ~/TensorRT-Edge-LLM/VERSION or git log)
- Model: Qwen3-14B, quantized to NVFP4 via tensorrt_edgellm.scripts.quantize
- Server: experimental OpenAI-compatible server (experimental/server), custom persistent wrapper around LLM.generate_stream()
- Engine built with: llm_build --maxInputLen 40960 --maxKVCacheCapacity 40960
Description
When generating with SamplingParams(enable_thinking=True, ...), the model
produces reasoning content as plain text but does not emit literal
<think> / </think> tags in the streamed or final output.
api_server.py defines _ThinkingStateMachine (line ~219) specifically to
parse THINK_OPEN_TAG = "<think>" / THINK_CLOSE_TAG = "</think>" boundaries
in streamed text — but since the engine never emits these tags, the state
machine has nothing to key on and reasoning text is indistinguishable from
the final answer in raw output.
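For reference, here is a minimal sketch of the kind of tag-based splitting a streaming consumer relies on. It is not the _ThinkingStateMachine code from api_server.py; split_stream and collect are hypothetical names, and the sketch assumes the stream yields plain string chunks. It shows that without the tags, everything is classified as answer text.

```python
from typing import Dict, Iterable, Iterator, Tuple

THINK_OPEN_TAG = "<think>"
THINK_CLOSE_TAG = "</think>"


def split_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text) pairs, where kind is "reasoning" or "answer".

    Hypothetical helper: assumes each chunk is a plain str and that tags,
    if present, may be split across chunk boundaries.
    """
    buffer = ""
    in_think = False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = THINK_CLOSE_TAG if in_think else THINK_OPEN_TAG
            kind = "reasoning" if in_think else "answer"
            idx = buffer.find(tag)
            if idx >= 0:
                # Emit the text before the tag, drop the tag, flip state.
                if idx > 0:
                    yield kind, buffer[:idx]
                buffer = buffer[idx + len(tag):]
                in_think = not in_think
                continue
            # Hold back a tail that could be the start of a split tag.
            keep = 0
            for n in range(min(len(tag) - 1, len(buffer)), 0, -1):
                if buffer.endswith(tag[:n]):
                    keep = n
                    break
            emit = buffer[:len(buffer) - keep]
            if emit:
                yield kind, emit
            buffer = buffer[len(buffer) - keep:]
            break
    if buffer:
        # Flush whatever was held back when the stream ended.
        yield ("reasoning" if in_think else "answer"), buffer


def collect(chunks: Iterable[str]) -> Dict[str, str]:
    """Concatenate the split pieces per kind."""
    out = {"reasoning": "", "answer": ""}
    for kind, text in split_stream(chunks):
        out[kind] += text
    return out
```

With tagged input, collect() separates the two parts; with untagged input like the output described here, the reasoning ends up in the "answer" field because there is no boundary to detect.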
Steps to reproduce
- Quantize Qwen3-14B to NVFP4, export to ONNX, build TRT engine
(standard pipeline per Quick Start Guide)
- Serve with a persistent Python wrapper calling
llm.generate_stream(messages, SamplingParams(enable_thinking=True, ...))
- Send a simple prompt, e.g. "what is 2+2"
- Inspect raw streamed/concatenated token text
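The inspection step can be sketched as below. inspect_stream is a hypothetical helper, and it assumes the wrapper's stream can be consumed as an iterable of string chunks; the list used in the usage line is a stand-in, not real engine output.

```python
from typing import Dict, Iterable, Union


def inspect_stream(chunks: Iterable[str]) -> Dict[str, Union[int, str, bool]]:
    """Concatenate streamed chunks and report whether think tags appear.

    Hypothetical helper: assumes each chunk is a plain str.
    """
    pieces = list(chunks)
    raw = "".join(pieces)
    return {
        "num_chunks": len(pieces),
        "raw": raw,
        "has_open_tag": "<think>" in raw,
        "has_close_tag": "</think>" in raw,
    }


# Stand-in chunks for illustration. In the real wrapper this would be the
# iterable returned by llm.generate_stream(messages, SamplingParams(...)).
report = inspect_stream(["some ", "streamed ", "text"])
print(repr(report["raw"]), report["has_open_tag"], report["has_close_tag"])
```

Printing the repr of the concatenated text makes newlines and any stray control characters visible, which helps rule out tags being present but rendered invisibly by the client.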
Expected
Raw output should contain <think>\n...reasoning...\n</think>\n\n2 + 2 = 4
so _ThinkingStateMachine (or any consumer) can reliably split reasoning
from the final answer.
Actual
Raw output is plain text with no <think> tags at all, e.g.:
Because reasoning itself frequently contains multiple \n\n-separated
paragraphs, there is no reliable text-based heuristic to find the boundary
between reasoning and the final answer — _ThinkingStateMachine is
unusable in this configuration.
Things I checked
- Swapped generation_prompt and generation_prompt_thinking in processed_chat_template.json: tag presence in the output did not change. enable_thinking=True/False only controls whether reasoning happens at all, not whether tags are emitted.
- Confirmed enable_thinking=False reliably produces clean, tag-free, reasoning-free output. This works as expected, but it disables the feature entirely rather than exposing it cleanly.
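A quick way to double-check the template side is to look at whether either generation prompt already contains a think tag, since anything in the prompt is by definition not part of the generated text. The sketch below assumes generation_prompt and generation_prompt_thinking are top-level string fields in processed_chat_template.json, which may not match the actual file layout; summarize_template and summarize_template_file are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Key names taken from this report; their position in the JSON is assumed.
KEYS = ("generation_prompt", "generation_prompt_thinking")


def summarize_template(template: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Report, per key, whether it exists and whether it contains think tags."""
    summary = {}
    for key in KEYS:
        value = template.get(key)
        text = value if isinstance(value, str) else ""
        summary[key] = {
            "present": key in template,
            "repr": repr(value),  # repr keeps newlines visible
            "contains_open_tag": "<think>" in text,
            "contains_close_tag": "</think>" in text,
        }
    return summary


def summarize_template_file(path: str) -> Dict[str, Dict[str, Any]]:
    """Load the JSON file at path and summarize it."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return summarize_template(data)
```

Usage would be summarize_template_file("processed_chat_template.json") from the engine directory, with the path adjusted to wherever the file actually lives.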
Question
Is <think> tag emission expected to work with the persistent
LLM.generate_stream() Python API + a manually-quantized NVFP4 engine,
or is tag emission only supported through a specific server entrypoint /
chat template configuration we may be missing? If there's a known-correct
recipe for enable_thinking=True + reliable <think> tag output on a
custom-quantized Qwen3 engine, a pointer would be very helpful.
Happy to provide the full processed_chat_template.json, exact
quantize/export/build commands, or raw debug logs on request.