[qwen3_5_moe] GPTQ-Int4 MoE: export crash on shared_expert_gate + garbage inference on Jetson Orin #110

Description

@hanwhale11

Describe the bug

qwen3_5_moe GPTQ-Int4 MoE checkpoints (the documented reference Qwen/Qwen3.5-35B-A3B-GPTQ-Int4, and also Qwen3.6-35B-A3B-GPTQ-Int4 variants) fail to produce correct results with TensorRT Edge-LLM v0.8.0. There are two distinct problems:

Bug 1 (blocker, reproducible from a clean checkout) — export crashes on shared_expert_gate

tensorrt-edgellm-export crashes during GPTQ weight repacking. In Qwen3_5SparseMoeBlock, shared_expert_gate is a Linear(hidden_size, 1) and is instantiated as a GPTQLinear. GPTQ packs eight 4-bit zero points into each int32, so with out_features == 1 the qzeros tensor has out_features // 8 == 0 columns, i.e. an empty [num_groups, 0] tensor. The unpacked target slice zeros[:, 0::8] still has one column, so the assignment in repack_gptq_to_plugin fails:

File ".../tensorrt_edgellm/checkpoint/repacking.py", in repack_gptq_to_plugin
    zeros[:, k::8] = (qz >> (4 * k)) & 0xF
RuntimeError: The expanded size of the tensor (1) must match the existing size (0)
at non-singleton dimension 1.  Target sizes: [16, 1].  Tensor sizes: [16, 0]

Failing module (all 40 layers): model.layers.N.mlp.shared_expert_gate, with qweight=[256, 1], qzeros=[16, 0].
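
The shape arithmetic behind this failure can be reproduced without loading the checkpoint. The sketch below is illustrative only: gptq_int4_shapes and strided_columns are hypothetical helpers, not functions from tensorrt_edgellm, and the group size of 128 is an assumption inferred from hidden_size=2048 and the 16 groups reported above.

```python
PACK_FACTOR = 8  # eight 4-bit values fit in one int32 word


def gptq_int4_shapes(in_features, out_features, group_size=128):
    """Packed tensor shapes for a 4-bit GPTQ linear layer (hypothetical helper).

    group_size=128 is an assumption inferred from the shapes in the report.
    """
    num_groups = in_features // group_size
    return {
        "qweight": (in_features // PACK_FACTOR, out_features),
        "qzeros": (num_groups, out_features // PACK_FACTOR),
    }


def strided_columns(out_features, k):
    """Columns selected by zeros[:, k::8] on a [num_groups, out_features] tensor."""
    return len(range(out_features)[k::PACK_FACTOR])


shapes = gptq_int4_shapes(in_features=2048, out_features=1)
target_cols = strided_columns(out_features=1, k=0)  # destination slice width
source_cols = shapes["qzeros"][1]                   # packed qzeros width

print(shapes)                    # {'qweight': (256, 1), 'qzeros': (16, 0)}
print(target_cols, source_cols)  # 1 0
```

The destination slice is one column wide while the packed source has none, which matches the "Target sizes: [16, 1]. Tensor sizes: [16, 0]" mismatch in the traceback.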

The checkpoint actually stores shared_expert_gate.weight as a plain BF16 [1, hidden_size] tensor (it is not GPTQ-quantized), so make_linear should not select GPTQLinear for this module. Forcing it to FP16Linear(hidden_size, 1) lets export complete. (Same pattern as the MoE router Qwen3MoERouter, which is already a plain fp16 parameter.)
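
One way to express this selection rule is to decide per module from the tensor names actually present in the checkpoint, rather than from the global quantization setting. The following is a minimal sketch, not the project's make_linear: pick_linear_kind and its return labels are hypothetical, the key names in the toy set are illustrative, and the assumption is that GPTQ-quantized modules store a qweight tensor while unquantized ones store weight.

```python
def pick_linear_kind(prefix, checkpoint_keys):
    """Choose a linear implementation from the tensors stored under `prefix`.

    Hypothetical helper: returns a label instead of constructing a module.
    """
    if f"{prefix}.qweight" in checkpoint_keys:
        return "gptq"
    if f"{prefix}.weight" in checkpoint_keys:
        return "fp16"
    raise KeyError(f"no weight tensor found for {prefix}")


# Toy key set mirroring the layout described above (names are illustrative).
keys = {
    "model.layers.0.mlp.shared_expert.up_proj.qweight",
    "model.layers.0.mlp.shared_expert.up_proj.qzeros",
    "model.layers.0.mlp.shared_expert.up_proj.scales",
    "model.layers.0.mlp.shared_expert_gate.weight",
}

gate_kind = pick_linear_kind("model.layers.0.mlp.shared_expert_gate", keys)
proj_kind = pick_linear_kind("model.layers.0.mlp.shared_expert.up_proj", keys)
print(gate_kind, proj_kind)  # fp16 gptq
```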

Bug 2 (correctness) — incoherent / empty inference output

After working around Bug 1 (forcing shared_expert_gate to FP16Linear), the model exports, builds, and runs on Jetson AGX Orin, but inference output is broken for both checkpoints with greedy decoding (temperature=0.0, top_k=1):

  • Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 → repeating gibberish, e.g.
    oleyus Relationshipsdbuf Reh$ar=-dayalog劳动能力ránBVettel-policy ... 番PEndе--[ereumXR尔夫ajaraOC_literals Correspond (repeats), finish_reason=max-length.
  • Qwen3.6-35B-A3B-GPTQ-Int4 → empty output_text; the model emits only special/out-of-text token ids (in the padded range [248077, 248320)) and never emits <|im_end|>, so it runs to max-length with no decodable text.
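
The second symptom can be checked mechanically from dumped token ids by counting how many fall in the padded vocabulary range. This is a sketch only: padded_fraction is a hypothetical helper, the sample ids are made up for illustration, and the range bounds are the ones quoted above.

```python
PADDED_RANGE = range(248077, 248320)  # padded ids reported above, end exclusive


def padded_fraction(token_ids):
    """Fraction of generated ids that fall inside the padded vocabulary range."""
    if not token_ids:
        return 0.0
    hits = sum(1 for t in token_ids if t in PADDED_RANGE)
    return hits / len(token_ids)


# Illustrative ids only, not taken from a real run.
sample_ids = [248100, 248319, 248077, 1234]
frac = padded_fraction(sample_ids)
print(f"{frac:.2f}")  # 0.75
```

A fraction close to 1.0 over a whole generation would be consistent with the empty output_text described above.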

Prompt used: "What is the capital of South Korea? Answer briefly." (and several others). Engine builds cleanly; the Int4GroupwiseGemmPlugin / GDN CuTe DSL kernels load successfully. Engine config parsed as:
hiddenSize=2048 numDecoderLayers=40 numAttentionLayers=10 numLinearAttnLayers=30 numKVHeads=2 headDim=256 vocabSize=248320.

Because the documented reference checkpoint itself produces garbage, this looks like a framework-level correctness issue in the qwen3_5_moe (GDN-hybrid linear-attention + MoE, head-dim 256 / KV-ratio 8) export/runtime path rather than a user/model-selection problem. Note that Bug 2 was observed on a source tree patched to get past Bug 1; if Bug 1 is fixed differently upstream, please re-verify Bug 2.

Impact: blocker — no usable text generation from qwen3_5_moe GPTQ-Int4 on Jetson Orin.

Steps/Code to reproduce bug

Installation method: pip install -e . from source (v0.8.0, commit f9cc746).

Commands used:

# 1) Download the documented reference MoE checkpoint
hf download Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 --local-dir ./Qwen3.5-35B-A3B-GPTQ-Int4

# 2) Export to ONNX  --> CRASHES here (Bug 1) on a clean checkout
tensorrt-edgellm-export ./Qwen3.5-35B-A3B-GPTQ-Int4 ./qwen3_5_moe_onnx

# --- after forcing shared_expert_gate to FP16Linear to get past Bug 1 ---

# 3) Build engine on Jetson Orin
./build/examples/llm/llm_build \
    --onnxDir  <dest>/onnx/llm \
    --engineDir <dest>/engine/llm \
    --maxInputLen 4096 --maxKVCacheCapacity 4096 --maxBatchSize 1

# 4) Inference  --> garbage / empty output (Bug 2)
echo '{"max_generate_length": 64, "requests": [{"messages":[{"role":"user","content":"What is the capital of South Korea? Answer briefly."}],"sampling_params":{"temperature":0.0,"top_k":1}}]}' > in.json
./build/examples/llm/llm_inference \
    --engineDir <dest>/engine/llm \
    --inputFile in.json --outputFile out.json --dumpOutput
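
For trying several prompts, the request file from step 4 can also be generated programmatically. The sketch below rebuilds the same JSON structure as the echo command above; build_request and write_request are hypothetical helper names, and only the fields already shown in that command are used.

```python
import json
from pathlib import Path


def build_request(prompt, max_generate_length=64):
    """Build the request payload with greedy sampling (temperature 0.0, top_k 1)."""
    return {
        "max_generate_length": max_generate_length,
        "requests": [
            {
                "messages": [{"role": "user", "content": prompt}],
                "sampling_params": {"temperature": 0.0, "top_k": 1},
            }
        ],
    }


def write_request(path, payload):
    """Serialize the payload as JSON to `path`."""
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


payload = build_request("What is the capital of South Korea? Answer briefly.")
request_json = json.dumps(payload)
print(request_json)
```

Calling write_request("in.json", payload) produces a file that can be passed to llm_inference via --inputFile as in step 4.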

Expected behavior

tensorrt-edgellm-export completes without manual patching, and llm_inference returns a coherent answer (e.g. "Seoul.") for the reference Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 checkpoint.

System information

Export host (x86 with GPU)

  • OS: Ubuntu 24.04.3 LTS
  • CPU architecture: x86_64
  • GPU: NVIDIA H100 80GB HBM3 (79.6 GB), 1 GPU
  • Library versions:
    • Python: 3.12.3
    • TensorRT Edge-LLM: 0.8.0 (commit f9cc746)
    • CUDA: 13.1
    • PyTorch: 2.11.0+cu130
    • Transformers: 5.8.1
    • ONNX: 1.19.0
    • ModelOpt: 0.44.0

Edge device (inference)

  • Platform: NVIDIA Jetson AGX Orin Developer Kit (SM87)
  • Software release: L4T R36.5.0 (JetPack 6.x), nvidia-l4t-core 36.5.0-20260115194252
  • CPU architecture: aarch64
  • Total device memory: 61 GB
  • Library versions:
    • TensorRT Edge-LLM: 0.8.0
    • CUDA: 12.6
    • TensorRT: 10.3.0.30 (+cuda12.5)
    • C++ compiler: GCC 11.4.0

Additional note (separate, minor)

Running tensorrt-edgellm-quantize llm --quantization int4_awq on the BF16 base (Qwen/Qwen3.6-35B-A3B) also crashes earlier, in MTP-draft quantization, because Qwen3_5MoeTextConfig has no intermediate_size attribute (only moe_intermediate_size):

File ".../quantization/models/mtp_draft.py", in __init__
    MtpDecoderLayer(hs, config.intermediate_size, ...)
AttributeError: 'Qwen3_5MoeTextConfig' object has no attribute 'intermediate_size'
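
A defensive lookup along the following lines would avoid the AttributeError. This is a sketch under assumptions: resolve_intermediate_size is a hypothetical helper, the config objects are stand-ins with placeholder values, and whether moe_intermediate_size is the right width for the MTP draft layer is not confirmed here.

```python
from types import SimpleNamespace


def resolve_intermediate_size(config):
    """Return intermediate_size, falling back to moe_intermediate_size if absent."""
    size = getattr(config, "intermediate_size", None)
    if size is None:
        # Assumption: MoE text configs expose only moe_intermediate_size.
        size = getattr(config, "moe_intermediate_size", None)
    if size is None:
        raise AttributeError(
            "config defines neither intermediate_size nor moe_intermediate_size"
        )
    return size


# Stand-in config objects; the numeric values are placeholders, not real model dims.
moe_cfg = SimpleNamespace(hidden_size=2048, moe_intermediate_size=512)
dense_cfg = SimpleNamespace(hidden_size=2048, intermediate_size=4096)
print(resolve_intermediate_size(moe_cfg), resolve_intermediate_size(dense_cfg))  # 512 4096
```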
