enable_thinking=True produces reasoning text without literal <think>...</think> tags — _ThinkingStateMachine cannot separate it from the final answer

## Environment
- Hardware: Jetson AGX Thor, 128GB unified memory, JetPack 7.2, CUDA 13.2
- TensorRT-Edge-LLM version: (check with `cat ~/TensorRT-Edge-LLM/VERSION` or git log)
- Model: Qwen3-14B, quantized to NVFP4 via `tensorrt_edgellm.scripts.quantize`
- Server: experimental OpenAI-compatible server (`experimental/server`), 
  custom persistent wrapper around `LLM.generate_stream()`
- Engine built with: `llm_build --maxInputLen 40960 --maxKVCacheCapacity 40960`

## Description

When generating with `SamplingParams(enable_thinking=True, ...)`, the model 
produces reasoning content as plain text but **does not emit literal 
`<think>` / `</think>` tags** in the streamed or final output.

`api_server.py` defines `_ThinkingStateMachine` (line ~219) specifically to 
parse `THINK_OPEN_TAG = "<think>"` / `THINK_CLOSE_TAG = "</think>"` boundaries 
in streamed text. Because the engine never emits these tags, the state 
machine has nothing to key on, and reasoning text is indistinguishable from 
the final answer in the raw output.
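To make the failure mode concrete, here is a minimal sketch of a tag-driven stream splitter. It is a hypothetical stand-in written for this issue, not the actual `_ThinkingStateMachine` code; `TagSplitter` and `split_stream` are names I made up. It only assumes the two tag constants quoted above.

```python
from dataclasses import dataclass, field

THINK_OPEN_TAG = "<think>"
THINK_CLOSE_TAG = "</think>"


@dataclass
class TagSplitter:
    """Hypothetical tag-driven splitter (NOT the real _ThinkingStateMachine)."""

    in_thinking: bool = False
    buffer: str = ""
    reasoning: list = field(default_factory=list)
    answer: list = field(default_factory=list)

    def feed(self, chunk: str) -> None:
        self.buffer += chunk
        while True:
            # Which tag ends the current state, and where does text go meanwhile?
            tag = THINK_CLOSE_TAG if self.in_thinking else THINK_OPEN_TAG
            sink = self.reasoning if self.in_thinking else self.answer
            idx = self.buffer.find(tag)
            if idx >= 0:
                sink.append(self.buffer[:idx])
                self.buffer = self.buffer[idx + len(tag):]
                self.in_thinking = not self.in_thinking
                continue
            # Hold back a tail that may be a tag split across two chunks.
            keep = 0
            for n in range(min(len(tag) - 1, len(self.buffer)), 0, -1):
                if tag.startswith(self.buffer[-n:]):
                    keep = n
                    break
            cut = len(self.buffer) - keep
            sink.append(self.buffer[:cut])
            self.buffer = self.buffer[cut:]
            return

    def finish(self) -> tuple:
        # Flush whatever is still buffered into the current state's sink.
        sink = self.reasoning if self.in_thinking else self.answer
        sink.append(self.buffer)
        self.buffer = ""
        return "".join(self.reasoning), "".join(self.answer)


def split_stream(chunks) -> tuple:
    """Return (reasoning, answer) for an iterable of text chunks."""
    splitter = TagSplitter()
    for chunk in chunks:
        splitter.feed(chunk)
    return splitter.finish()
```

With tagged input the split is clean, even when a tag straddles a chunk boundary. With untagged input the splitter never leaves its initial state, so every character lands in the answer and the reasoning string stays empty, which matches what I see.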

## Steps to reproduce

1. Quantize Qwen3-14B to NVFP4, export to ONNX, build TRT engine
   (standard pipeline per Quick Start Guide)
2. Serve with a persistent Python wrapper calling 
   `llm.generate_stream(messages, SamplingParams(enable_thinking=True, ...))`
3. Send a simple prompt, e.g. "what is 2+2"
4. Inspect raw streamed/concatenated token text
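For step 4, a small helper along these lines is enough to check the concatenated text. It is only a sketch: `tag_report` is a hypothetical name, and it assumes the pieces collected from the stream are plain `str` chunks (how to extract text from each streamed item depends on the wrapper).

```python
def tag_report(chunks):
    """Summarize whether literal think tags appear in the concatenated text.

    `chunks` is assumed to be an iterable of str pieces collected from the
    stream; adapt the extraction if your wrapper yields richer objects.
    """
    text = "".join(chunks)
    return {
        "chars": len(text),
        "open_tags": text.count("<think>"),
        "close_tags": text.count("</think>"),
        # repr() keeps newlines visible so paragraph breaks are easy to spot.
        "head": repr(text[:80]),
    }
```

In my runs with `enable_thinking=True`, both tag counts come back as 0.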

## Expected

Raw output should contain `<think>\n...reasoning...\n</think>\n\n2 + 2 = 4` 
so `_ThinkingStateMachine` (or any consumer) can reliably split reasoning 
from the final answer.

## Actual

Raw output is plain text with no `<think>` tags at all: the reasoning 
paragraphs run straight into the final answer.

Because reasoning itself frequently contains multiple `\n\n`-separated 
paragraphs, there is no reliable text-based heuristic to find the boundary 
between reasoning and the final answer, so `_ThinkingStateMachine` is 
unusable in this configuration.
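The ambiguity is easy to show by listing every place a `\n\n` heuristic could cut. This is a sketch only: `candidate_splits` is a hypothetical helper, and the sample string in the usage note is made up for illustration, not captured engine output.

```python
def candidate_splits(text, sep="\n\n"):
    """Return every (reasoning, answer) pair a blank-line heuristic could pick."""
    parts = text.split(sep)
    return [
        (sep.join(parts[:i]), sep.join(parts[i:]))
        for i in range(1, len(parts))
    ]
```

For an illustrative string such as `"The user asks for 2+2.\n\nAdding gives 4.\n\n2 + 2 = 4"` this returns two equally plausible cut points, and nothing in the text itself says which one is the real boundary. Picking the last one breaks as soon as the final answer has more than one paragraph.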

## Things I checked

- Swapped `generation_prompt` and `generation_prompt_thinking` in 
  `processed_chat_template.json`. Tag presence in the output did not 
  change: `enable_thinking=True/False` only controls whether reasoning 
  happens at all, not whether tags are emitted.
- Confirmed that `enable_thinking=False` reliably produces clean, tag-free, 
  reasoning-free output. This works as expected, but it disables the 
  feature entirely rather than exposing it cleanly.
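In case it helps triage, this is roughly how the template file can be scanned for the tags. It is a sketch that deliberately makes no assumption about the JSON layout beyond it being valid JSON: it walks every string value and reports where `<think>` or `</think>` appears. `find_think_tags` is a hypothetical helper name.

```python
import json


def find_think_tags(node, path="$"):
    """Recursively list (json_path, tags_found) for every string containing a think tag."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_think_tags(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits += find_think_tags(value, f"{path}[{i}]")
    elif isinstance(node, str):
        found = [t for t in ("<think>", "</think>") if t in node]
        if found:
            hits.append((path, found))
    return hits


# Usage (path is wherever your export step wrote the file):
# with open("processed_chat_template.json", encoding="utf-8") as f:
#     for json_path, tags in find_think_tags(json.load(f)):
#         print(json_path, tags)
```

If the open tag turns out to live inside the generation prompt rather than in generated tokens, that would at least explain a missing `<think>`, though not a missing `</think>`. I have not confirmed either way.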

## Question

Is `<think>` tag emission expected to work with the persistent 
`LLM.generate_stream()` Python API + a manually-quantized NVFP4 engine, 
or is tag emission only supported through a specific server entrypoint / 
chat template configuration we may be missing? If there's a known-correct 
recipe for `enable_thinking=True` + reliable `<think>` tag output on a 
custom-quantized Qwen3 engine, a pointer would be very helpful.

Happy to provide the full `processed_chat_template.json`, exact 
quantize/export/build commands, or raw debug logs on request.
