Describe the bug
qwen3_5_moe GPTQ-Int4 MoE checkpoints (the documented reference Qwen/Qwen3.5-35B-A3B-GPTQ-Int4, and also Qwen3.6-35B-A3B-GPTQ-Int4 variants) fail to produce correct results with TensorRT Edge-LLM v0.8.0. There are two distinct problems:
Bug 1 (blocker, reproducible from a clean checkout) — export crashes on shared_expert_gate
tensorrt-edgellm-export crashes during GPTQ weight repacking. In Qwen3_5SparseMoeBlock, shared_expert_gate is a Linear(hidden_size, 1) and is instantiated as a GPTQLinear. Because out_features == 1, the GPTQ qzeros layout has out_features // 8 == 0 columns, i.e. an empty [num_groups, 0] tensor, so repack_gptq_to_plugin fails:
File ".../tensorrt_edgellm/checkpoint/repacking.py", in repack_gptq_to_plugin
zeros[:, k::8] = (qz >> (4 * k)) & 0xF
RuntimeError: The expanded size of the tensor (1) must match the existing size (0)
at non-singleton dimension 1. Target sizes: [16, 1]. Tensor sizes: [16, 0]
Failing module (all 40 layers): model.layers.N.mlp.shared_expert_gate, with qweight=[256, 1], qzeros=[16, 0].
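The shape arithmetic behind this failure can be sketched in a few lines of plain Python. This is only an illustration: `gptq_packed_shapes` is a hypothetical helper, not part of TensorRT Edge-LLM, and the group size of 128 is an assumption inferred from hidden_size=2048 and the 16 groups seen in the traceback.

```python
def gptq_packed_shapes(in_features, out_features, group_size=128, bits=4):
    """Return the tensor shapes a 4-bit GPTQ layout would use.

    Hypothetical helper for illustration; group_size=128 is assumed.
    """
    pack = 32 // bits  # eight 4-bit values fit in one int32
    num_groups = in_features // group_size
    return {
        "qweight": (in_features // pack, out_features),
        "qzeros": (num_groups, out_features // pack),
        "scales": (num_groups, out_features),
    }


# shared_expert_gate is Linear(hidden_size=2048, 1)
gate = gptq_packed_shapes(2048, 1)
print(gate["qweight"], gate["qzeros"])  # (256, 1) (16, 0)
```

Any out_features below 8 yields a zero-width qzeros under this layout, so the unpacking loop has nothing to read for the first output column.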
The checkpoint actually stores shared_expert_gate.weight as a plain BF16 [1, hidden_size] tensor (it is not GPTQ-quantized), so make_linear should not select GPTQLinear for this module. Forcing it to FP16Linear(hidden_size, 1) lets export complete. (Same pattern as the MoE router Qwen3MoERouter, which is already a plain fp16 parameter.)
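One possible direction for a fix is to choose the layer type from the tensors the checkpoint actually stores rather than from the global quantization config. The sketch below assumes a hypothetical helper `pick_linear_kind`; the real `make_linear` signature is not shown in this report, and the key names other than `shared_expert_gate.weight` are illustrative.

```python
def pick_linear_kind(checkpoint_keys, prefix):
    """Pick a layer kind from the tensor names stored under `prefix`.

    Hypothetical helper: returns "gptq" when packed weights exist,
    "fp16" when only a plain weight tensor exists.
    """
    keys = set(checkpoint_keys)
    if f"{prefix}.qweight" in keys:
        return "gptq"
    if f"{prefix}.weight" in keys:
        return "fp16"
    raise KeyError(f"no weight tensor found for {prefix}")


# Illustrative key names; only the shared_expert_gate entry is
# confirmed by this report.
keys = [
    "model.layers.0.mlp.shared_expert_gate.weight",
    "model.layers.0.mlp.some_quantized_proj.qweight",
    "model.layers.0.mlp.some_quantized_proj.qzeros",
]
print(pick_linear_kind(keys, "model.layers.0.mlp.shared_expert_gate"))
print(pick_linear_kind(keys, "model.layers.0.mlp.some_quantized_proj"))
```

Under these assumptions the gate resolves to a plain layer and the quantized projection keeps the GPTQ path, without hard-coding module names.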
Bug 2 (correctness) — incoherent / empty inference output
After working around Bug 1 (forcing shared_expert_gate to FP16Linear), the model exports, builds, and runs on Jetson AGX Orin, but inference output is broken for both checkpoints with greedy decoding (temperature=0.0, top_k=1):
Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 → repeating gibberish, e.g.
oleyus Relationshipsdbuf Reh$ar=-dayalog劳动能力ránBVettel-policy ... 番PEndе--[ereumXR尔夫ajaraOC_literals Correspond (repeats), finish_reason=max-length.
Qwen3.6-35B-A3B-GPTQ-Int4 → empty output_text; the model emits only special/out-of-text token ids (in the padded range [248077, 248320)) and never emits <|im_end|>, so it runs to max-length with no decodable text.
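A quick way to confirm the second symptom is to count how many generated ids fall into the padded range quoted above. The helper below is a sketch; the sample ids are made up for illustration and are not taken from an actual run.

```python
PADDED_START = 248077  # start of the padded range quoted in this report
VOCAB_SIZE = 248320    # vocabSize from the parsed engine config


def count_padded_ids(token_ids, lo=PADDED_START, hi=VOCAB_SIZE):
    """Return (ids inside [lo, hi), ids outside it)."""
    inside = sum(1 for t in token_ids if lo <= t < hi)
    return inside, len(token_ids) - inside


# Made-up ids, purely to show the check.
sample = [248100, 248319, 248077, 1234]
print(count_padded_ids(sample))  # (3, 1)
```

If every generated id lands inside the range, the tokenizer has no text to decode, which matches the empty output_text.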
Prompt used: "What is the capital of South Korea? Answer briefly." (and several others). Engine builds cleanly; the Int4GroupwiseGemmPlugin / GDN CuTe DSL kernels load successfully. Engine config parsed as:
hiddenSize=2048 numDecoderLayers=40 numAttentionLayers=10 numLinearAttnLayers=30 numKVHeads=2 headDim=256 vocabSize=248320.
Because the documented reference checkpoint itself produces garbage, this looks like a framework-level correctness issue in the qwen3_5_moe (GDN-hybrid linear-attention + MoE, head-dim 256 / KV-ratio 8) export/runtime path rather than a user error or a model-selection problem. Note that Bug 2 was observed on a source tree patched to get past Bug 1; if Bug 1 is fixed differently upstream, please re-verify Bug 2.
Impact: blocker — no usable text generation from qwen3_5_moe GPTQ-Int4 on Jetson Orin.
Steps/Code to reproduce bug
Installation method: pip install -e . from source (v0.8.0, commit f9cc746).
Commands used:
# 1) Download the documented reference MoE checkpoint
hf download Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 --local-dir ./Qwen3.5-35B-A3B-GPTQ-Int4
# 2) Export to ONNX --> CRASHES here (Bug 1) on a clean checkout
tensorrt-edgellm-export ./Qwen3.5-35B-A3B-GPTQ-Int4 ./qwen3_5_moe_onnx
# --- after forcing shared_expert_gate to FP16Linear to get past Bug 1 ---
# 3) Build engine on Jetson Orin
./build/examples/llm/llm_build \
--onnxDir <dest>/onnx/llm \
--engineDir <dest>/engine/llm \
--maxInputLen 4096 --maxKVCacheCapacity 4096 --maxBatchSize 1
# 4) Inference --> garbage / empty output (Bug 2)
echo '{"max_generate_length": 64, "requests": [{"messages":[{"role":"user","content":"What is the capital of South Korea? Answer briefly."}],"sampling_params":{"temperature":0.0,"top_k":1}}]}' > in.json
./build/examples/llm/llm_inference \
--engineDir <dest>/engine/llm \
--inputFile in.json --outputFile out.json --dumpOutput
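For anyone who wants to try other prompts, the request file from step 4 can also be generated with a short script instead of a hand-written echo. This sketch only reproduces the fields used in the command above; `build_request` is a hypothetical helper, and no other fields of the input schema are assumed.

```python
import json


def build_request(prompt, max_generate_length=64):
    """Build the same greedy-decoding request used in step 4."""
    return {
        "max_generate_length": max_generate_length,
        "requests": [
            {
                "messages": [{"role": "user", "content": prompt}],
                "sampling_params": {"temperature": 0.0, "top_k": 1},
            }
        ],
    }


payload = build_request("What is the capital of South Korea? Answer briefly.")
request_json = json.dumps(payload)
print(request_json)  # redirect to in.json to use with llm_inference
```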
Expected behavior
tensorrt-edgellm-export completes without manual patching, and llm_inference returns a coherent answer (e.g. "Seoul.") for the reference Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 checkpoint.
System information
Export host (x86 with GPU)
- OS: Ubuntu 24.04.3 LTS
- CPU architecture: x86_64
- GPU: NVIDIA H100 80GB HBM3 (79.6 GB), 1 GPU
- Library versions:
- Python: 3.12.3
- TensorRT Edge-LLM: 0.8.0 (commit f9cc746)
- CUDA: 13.1
- PyTorch: 2.11.0+cu130
- Transformers: 5.8.1
- ONNX: 1.19.0
- ModelOpt: 0.44.0
Edge device (inference)
- Platform: NVIDIA Jetson AGX Orin Developer Kit (SM87)
- Software release: L4T R36.5.0 (JetPack 6.x), nvidia-l4t-core 36.5.0-20260115194252
- CPU architecture: aarch64
- Total device memory: 61 GB
- Library versions:
- TensorRT Edge-LLM: 0.8.0
- CUDA: 12.6
- TensorRT: 10.3.0.30 (+cuda12.5)
- C++ compiler: GCC 11.4.0
Additional note (separate, minor)
Running tensorrt-edgellm-quantize llm --quantization int4_awq on the BF16 base (Qwen/Qwen3.6-35B-A3B) also crashes earlier, in MTP-draft quantization, because Qwen3_5MoeTextConfig has no intermediate_size attribute (only moe_intermediate_size):
File ".../quantization/models/mtp_draft.py", in __init__
MtpDecoderLayer(hs, config.intermediate_size, ...)
AttributeError: 'Qwen3_5MoeTextConfig' object has no attribute 'intermediate_size'
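A defensive lookup along the following lines would avoid the AttributeError. It is only a sketch: `resolve_intermediate_size` is a hypothetical helper, and whether `moe_intermediate_size` is the correct width for the MTP draft layer is an assumption that would need checking against the model definition. The numeric values are placeholders.

```python
from types import SimpleNamespace


def resolve_intermediate_size(config):
    """Prefer intermediate_size, fall back to moe_intermediate_size.

    Hypothetical helper; the fallback choice is an assumption.
    """
    size = getattr(config, "intermediate_size", None)
    if size is None:
        size = getattr(config, "moe_intermediate_size", None)
    if size is None:
        raise AttributeError("config defines no intermediate size")
    return size


# Placeholder values, not the real model dimensions.
dense_cfg = SimpleNamespace(intermediate_size=1024)
moe_cfg = SimpleNamespace(moe_intermediate_size=512)
print(resolve_intermediate_size(dense_cfg), resolve_intermediate_size(moe_cfg))
```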