
Garbage output from Quick Start Guide example with release 0.8.0  #105

Description

@StefanoS90

Describe the bug

Hello, I am trying to run an LLM on the Jetson Thor.

I was able to run inference with Qwen3 0.6B end to end, without errors, using the high-level API as described in the Quick Start Guide: https://nvidia.github.io/TensorRT-Edge-LLM/latest/user_guide/getting_started/quick-start-guide.html#quick-start-guide.

However, the output for the prompt "What is the capital of the United States?" is garbage.
Do you have any idea what might be happening?

See the input command and the related terminal output below:




(venv) user@user:~/workspace/TensorRT-Edge-LLM$ python - <<'PY'
from experimental.server import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
outputs = llm.generate(
    ["What is the capital of the United States?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].text)
PY
Downloading (incomplete total...): 0.00B [00:00, ?B/s]                                    Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Downloading (incomplete total...):   2%|▎             | 32.6M/1.52G [00:19<05:06, 4.86MB/s]
Fetching 10 files:  10%|███▌                                | 1/10 [00:00<00:01,  6.89it/s]
Fetching 10 files: 100%|███████████████████████████████████| 10/10 [01:17<00:00,  7.80s/it]
Download complete: 100%|██████████████████████████████| 1.52G/1.52G [01:18<00:00, 19.4MB/s]
[torch.onnx] Obtain model graph for `_Wrapper([...]` with `torch.export.export(..., strict=False)`...
/usr/lib/python3.12/contextlib.py:144: UserWarning: The tensor attribute self._model.model.last_pre_norm_hidden_states was assigned during export. Such attributes must be registered as buffers using the `register_buffer` API (https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer).
  next(self.gen)
[torch.onnx] Obtain model graph for `_Wrapper([...]` with `torch.export.export(..., strict=False)`... ✅
[torch.onnx] Run decompositions...
/usr/lib/python3.12/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
  return cls.__new__(cls, *args)
[torch.onnx] Run decompositions... ✅
[torch.onnx] Translate the graph into ONNX...
[torch.onnx] Translate the graph into ONNX... ✅
[torch.onnx] Optimize the ONNX graph...
[torch.onnx] Optimize the ONNX graph... ✅
/home/adas/workspace/TensorRT-Edge-LLM/venv/lib/python3.12/site-packages/torch/onnx/_internal/exporter/_onnx_program.py:486: UserWarning: # The axis name: batch will not be used, since it shares the same shape constraints with another axis: batch.
  rename_mapping = _dynamic_shapes.create_rename_mapping(
/home/adas/workspace/TensorRT-Edge-LLM/venv/lib/python3.12/site-packages/torch/onnx/_internal/exporter/_onnx_program.py:486: UserWarning: # The axis name: past_len will not be used, since it shares the same shape constraints with another axis: past_len.
  rename_mapping = _dynamic_shapes.create_rename_mapping(
[13:47:37.747] [INFO] [llmBuilder.cpp:98:build] Using __LUNOWUD=-peep:match_dual_gemm=off
[13:47:37.747] [INFO] [trtUtils.h:67:loadEdgellmPluginLib] EDGELLM_PLUGIN_PATH variable is not set. Default to build/libNvInfer_edgellm_plugin.so
[13:47:37.865] [INFO] [TensorRT] [MemUsageChange] Init CUDA: CPU -17, GPU +0, now: CPU 1291, GPU 59946 (MiB)
[13:47:39.094] [INFO] [TensorRT] [MemUsageChange] Init builder kernel library: CPU +1227, GPU +1224, now: CPU 2640, GPU 61328 (MiB)
[13:47:39.094] [INFO] [llmBuilder.cpp:128:build] Parsing ONNX model: /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/model.onnx
[13:47:39.100] [INFO] [TensorRT] ----------------------------------------------------------------
[13:47:39.100] [INFO] [TensorRT] Input filename:   /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/model.onnx
[13:47:39.101] [INFO] [TensorRT] ONNX IR version:  0.0.10
[13:47:39.101] [INFO] [TensorRT] Opset version:    24
[13:47:39.101] [INFO] [TensorRT] Producer name:    pytorch
[13:47:39.101] [INFO] [TensorRT] Producer version: 2.12.0+cu130
[13:47:39.101] [INFO] [TensorRT] Domain:           
[13:47:39.101] [INFO] [TensorRT] Model version:    0
[13:47:39.101] [INFO] [TensorRT] Doc string:       
[13:47:39.101] [INFO] [TensorRT] ----------------------------------------------------------------
[13:47:39.101] [WARNING] [TensorRT] ModelImporter.cpp:653: Make sure input last_token_ids has Int64 binding.
[13:47:39.104] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.120] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.121] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.121] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.122] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.122] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.124] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.124] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.125] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.125] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.126] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.126] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.127] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.127] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.128] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.128] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.129] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.129] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.130] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.130] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.131] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.131] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.133] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.133] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.134] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.134] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.135] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.135] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.136] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.136] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.137] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.137] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.138] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.138] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.140] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.140] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.141] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.141] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.142] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.142] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.143] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.143] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.144] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.144] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.145] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.145] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.146] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.146] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.147] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.148] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.149] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.149] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.150] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.150] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.152] [INFO] [TensorRT] Searching for plugin: AttentionPlugin, plugin_version: 1, plugin_namespace: 
[13:47:39.152] [INFO] [TensorRT] Successfully created plugin: AttentionPlugin
[13:47:39.268] [INFO] [TensorRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[13:47:39.357] [INFO] [TensorRT] Compiler backend is used during engine build.
[13:48:10.366] [INFO] [TensorRT] Detected 33 inputs and 29 output network tensors.
[13:48:10.689] [INFO] [TensorRT] Total Host Persistent Memory: 80 bytes
[13:48:10.689] [INFO] [TensorRT] Total Device Persistent Memory: 0 bytes
[13:48:10.689] [INFO] [TensorRT] Max Scratch Memory: 117441024 bytes
[13:48:10.689] [INFO] [TensorRT] [BlockAssignment] Started assigning block shifts. This will take 1 steps to complete.
[13:48:10.689] [INFO] [TensorRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.01038ms to assign 1 blocks to 1 nodes requiring 117441024 bytes.
[13:48:10.689] [INFO] [TensorRT] Total Activation Memory: 117441024 bytes
[13:48:26.713] [INFO] [TensorRT] Detected 33 inputs and 29 output network tensors.
[13:48:26.969] [INFO] [TensorRT] Total Host Persistent Memory: 80 bytes
[13:48:26.969] [INFO] [TensorRT] Total Device Persistent Memory: 0 bytes
[13:48:26.969] [INFO] [TensorRT] Max Scratch Memory: 33575424 bytes
[13:48:26.969] [INFO] [TensorRT] [BlockAssignment] Started assigning block shifts. This will take 1 steps to complete.
[13:48:26.969] [INFO] [TensorRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.00837ms to assign 1 blocks to 1 nodes requiring 33575424 bytes.
[13:48:26.969] [INFO] [TensorRT] Total Activation Memory: 33575424 bytes
[13:48:27.043] [INFO] [TensorRT] Total Weights Memory: 1192100096 bytes
[13:48:27.049] [INFO] [TensorRT] Compiler backend is used during engine execution.
[13:48:27.049] [INFO] [TensorRT] Engine generation completed in 47.7824 seconds.
[13:48:27.050] [INFO] [TensorRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1136 MiB
[13:48:27.210] [INFO] [TensorRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 13828 MiB
[13:48:27.826] [INFO] [builderUtils.cpp:328:buildAndSerializeEngine] Engine saved to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/llm.engine
[13:48:27.829] [INFO] [llmBuilder.cpp:944:copyConfig] Copied config.json with builder config to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/config.json
[13:48:27.829] [INFO] [fileUtils.cpp:48:copyFile] Successfully copied /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/tokenizer_config.json to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/tokenizer_config.json
[13:48:27.829] [INFO] [llmBuilder.cpp:975:copyTokenizerFiles] Copied tokenizer file: tokenizer_config.json
[13:48:27.834] [INFO] [fileUtils.cpp:48:copyFile] Successfully copied /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/tokenizer.json to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/tokenizer.json
[13:48:27.834] [INFO] [llmBuilder.cpp:975:copyTokenizerFiles] Copied tokenizer file: tokenizer.json
[13:48:27.834] [INFO] [fileUtils.cpp:48:copyFile] Successfully copied /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/processed_chat_template.json to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/processed_chat_template.json
[13:48:27.834] [INFO] [llmBuilder.cpp:975:copyTokenizerFiles] Copied tokenizer file: processed_chat_template.json
[13:48:27.973] [INFO] [fileUtils.cpp:48:copyFile] Successfully copied /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/embedding.safetensors to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/embedding.safetensors
[13:48:27.973] [INFO] [llmBuilder.cpp:1129:copyEmbeddingFile] Copied embedding.safetensors to /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/embedding.safetensors
[13:48:29.281] [INFO] [trtUtils.h:67:loadEdgellmPluginLib] EDGELLM_PLUGIN_PATH variable is not set. Default to build/libNvInfer_edgellm_plugin.so
[13:48:29.402] [INFO] [llmRuntimeUtils.cpp:444:loadEmbeddingTable] Loaded FP16 embedding: [151936, 1024]
[13:48:29.402] [INFO] [llmEngineConfig.cpp:229:parseEngineConfig] reading /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/config.json
[13:48:29.404] [INFO] [llmRuntimeUtils.cpp:167:collectRopeConfig] Collected rope config: RopeConfig:  type: Default  rotaryScale: 1  rotaryTheta: 1e+06  maxPositionEmbeddings: 40960
[13:48:29.404] [INFO] [llmEngineConfig.cpp:331:parseEngineConfig] LLMEngineConfig{ hiddenSize=1024 vocabSize=151936 outputVocabSize=151936 numDecoderLayers=28 numAttentionLayers=28 numKVHeads=8 headDim=128 rotaryDim=128 maxBatch=1 maxInputLen=4096 maxKVCapacity=8192 useTrtNativeOps=false isSpecDecodeBase=false specDecodeType=0 loraRank=0 }
[13:48:29.407] [INFO] [engineExecutor.cpp:36:EngineExecutor] loading engine file: /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/llm.engine
[13:48:29.407] [INFO] [TensorRT] Loaded engine size: 1140 MiB
[13:48:29.538] [INFO] [TensorRT] [MS] Running engine with multi stream info
[13:48:29.538] [INFO] [TensorRT] [MS] Number of aux streams is 1
[13:48:29.538] [INFO] [TensorRT] [MS] Number of total worker streams is 2
[13:48:29.538] [INFO] [TensorRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[13:48:29.626] [INFO] [TensorRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1136 (MiB)
[13:48:29.626] [INFO] [engineExecutor.cpp:53:EngineExecutor] engine loaded successfully (62 I/O tensors)
[13:48:29.731] [INFO] [llmInferenceRuntime.cpp:118:initializeCommon] Base EngineExecutor successfully loaded from /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/llm.engine.
[13:48:29.731] [INFO] [llmInferenceRuntime.cpp:129:initializeCommon] Runtime batch size set to: 1 (from engine bundle)
[13:48:29.742] [INFO] [ropeCache.cpp:103:getOrCreate] RopeCache: creating new entry (rotaryDim=128, maxSeqLen=8192)
[13:48:29.753] [INFO] [llmInferenceRuntime.cpp:244:initializeCommon] Runtime tensors successfully allocated.
[13:48:29.753] [INFO] [llmInferenceRuntime.cpp:272:initializeCommon] Start loading tokenizer from model directory: /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm
[13:48:31.837] [INFO] [tokenizer.cpp:385:loadVocabulary] Loaded 151643 vocabulary tokens
[13:48:32.044] [INFO] [tokenizer.cpp:96:loadFromHF] Loaded 26 special tokens
[13:48:32.163] [INFO] [tokenizer.cpp:782:loadChatTemplate] Successfully loaded chat template from /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm/processed_chat_template.json (for model: /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca)
[13:48:32.163] [INFO] [tokenizer.cpp:123:loadFromHF] Successfully loaded tokenizer from /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm (vocab_size=151669)
[13:48:32.191] [INFO] [llmInferenceRuntime.cpp:274:initializeCommon] Tokenizer successfully loaded from model directory: /home/adas/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca/.edgellm/onnx/llm/.edgellm/engine/i4096_b1_kv8192/llm
[13:48:32.193] [INFO] [llmInferenceRuntime.cpp:372:initializeCommon] Setup shared execution context memory: 117441024 bytes (base requires: 117441024, strategy requires: 0, vision requires: 0, audio requires: 0, action requires: 0)
[13:48:32.194] [INFO] [TensorRT] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[13:48:32.294] [INFO] [engineExecutor.cpp:219:captureGraph] captured graph (hash=0xd5d500c02b15a11f)
[13:48:32.294] [INFO] [decoderRegistry.cpp:84:captureCudaGraphs] Successfully captured decoding CUDA graphs for active decoding strategies.
[13:48:32.295] [INFO] [TensorRT] Switching optimization profile from: 1 to 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[13:48:32.331] [INFO] [TensorRT] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
lec





1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1




Steps/Code to reproduce bug

Build configuration:


mkdir -p build
cd build
cmake .. \
  -DTRT_PACKAGE_DIR=/usr \
  -DCUDA_CTK_VERSION=13.0 \
  -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
  -DEMBEDDED_TARGET=jetson-thor \
  -DENABLE_CUTE_DSL=ALL \
  -DBUILD_PYTHON_BINDINGS=ON
make -j$(nproc)
cd ..
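
The log reports that EDGELLM_PLUGIN_PATH is not set and that the loader falls back to build/libNvInfer_edgellm_plugin.so. To record exactly which plugin file that corresponds to (for example, to rule out a stale build), a minimal sketch follows. It assumes the fallback is resolved relative to the current working directory, which the log does not confirm, and `resolve_plugin_path` is a hypothetical helper, not part of TensorRT Edge-LLM.

```python
import os
from pathlib import Path

# Fallback path quoted from the loadEdgellmPluginLib log line in this report.
DEFAULT_PLUGIN = "build/libNvInfer_edgellm_plugin.so"


def resolve_plugin_path(env=None, cwd=None):
    """Return (path, exists) for the plugin library implied by the log.

    Assumption: a relative path is resolved against the working directory.
    Hypothetical helper for diagnostics only.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    # Prefer the environment variable; otherwise use the logged default.
    raw = env.get("EDGELLM_PLUGIN_PATH") or DEFAULT_PLUGIN
    path = Path(raw)
    if not path.is_absolute():
        path = cwd / path
    return path, path.is_file()


if __name__ == "__main__":
    path, found = resolve_plugin_path()
    print(f"plugin path: {path} (exists: {found})")
```

Running this from the repository root, in the same shell as the runtime command, prints the resolved path and whether the file is present.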

Runtime command used:

python - <<'PY'
from experimental.server import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
outputs = llm.generate(
    ["What is the capital of the United States?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].text)
PY

Expected behavior

I would expect a correct text answer, but the actual output is:

lec

1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1
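
To flag this kind of degenerate output automatically, for example when comparing builds or releases, a minimal sketch follows. It measures how many distinct character n-grams the text contains. The function names, the n-gram size, and the 0.2 threshold are illustrative assumptions, not part of TensorRT Edge-LLM.

```python
def distinct_ngram_ratio(text, n=4):
    """Return distinct character n-grams divided by total n-grams.

    Whitespace runs are collapsed first. Values near 0 indicate heavy
    repetition; texts shorter than n characters return 1.0.
    """
    normalized = " ".join(text.split())
    if len(normalized) < n:
        return 1.0
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    return len(set(grams)) / len(grams)


def is_degenerate(text, n=4, threshold=0.2):
    """Heuristic: treat text as degenerate when the ratio is below threshold.

    The threshold is an illustrative assumption, not a calibrated value.
    """
    return distinct_ngram_ratio(text, n) < threshold


# Output in the same pattern as this report: "lec" followed by a "1." run.
observed = "lec\n\n" + "1." * 62 + "1"
reference = "The capital of the United States is Washington, D.C."

if __name__ == "__main__":
    print("observed degenerate:", is_degenerate(observed))
    print("reference degenerate:", is_degenerate(reference))
```

In practice, `outputs[0].text` from the runtime command would be passed to `is_degenerate` instead of the hard-coded sample.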

System information Edge Device

  • Platform: NVIDIA Jetson AGX Thor Developer Kit
  • JetPack package version: 7.1-b112
  • L4T / Jetson Linux release: # R38 (release), REVISION: 4.0, GCID: 43443517, BOARD: generic, EABI: aarch64, DATE: Wed Dec 31 00:15:19 UTC 2025
    KERNEL_VARIANT: oot
    TARGET_USERSPACE_LIB_DIR=nvidia
    TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia
    INSTALL_TYPE=
  • CPU architecture: aarch64
  • GPU compute capability: SM110
  • Total device memory: 122Gi
  • Build type: Release
  • Library versions:
    • TensorRT Edge-LLM version or commit hash: 0.8.0 (release; commit hash unknown)
    • CUDA: 13.0
    • TensorRT: 10.13.3
    • C++ compiler: GCC 13.3.0
  • CMake options used:
    • CMAKE_TOOLCHAIN_FILE: cmake/aarch64_linux_toolchain.cmake
    • EMBEDDED_TARGET: jetson-thor
    • TRT_PACKAGE_DIR: /usr
  • Any other details that may help: ?
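
The ONNX export details that TensorRT prints while parsing the model may also be useful here. A minimal sketch for pulling them out of a captured log follows. The sample lines are copied from the log in this report; `onnx_metadata` is a hypothetical helper, and the regular expression assumes the log format shown above.

```python
import re

# Matches the header fields TensorRT prints when parsing the ONNX model.
_FIELD = re.compile(
    r"\[TensorRT\]\s+"
    r"(ONNX IR version|Opset version|Producer name|Producer version):\s*(\S+)"
)


def onnx_metadata(log_text):
    """Return a dict of ONNX header fields found in the log text."""
    return {m.group(1): m.group(2) for m in _FIELD.finditer(log_text)}


# Lines copied from the terminal output in this report.
LOG = """\
[13:47:39.101] [INFO] [TensorRT] ONNX IR version:  0.0.10
[13:47:39.101] [INFO] [TensorRT] Opset version:    24
[13:47:39.101] [INFO] [TensorRT] Producer name:    pytorch
[13:47:39.101] [INFO] [TensorRT] Producer version: 2.12.0+cu130
"""

if __name__ == "__main__":
    for key, value in onnx_metadata(LOG).items():
        print(f"{key}: {value}")
```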

Labels: bug (Something isn't working)
