[qwen3_5_moe] GPTQ-Int4 MoE: export crash on shared_expert_gate + garbage inference on Jetson Orin

# Describe the bug

`qwen3_5_moe` GPTQ-Int4 MoE checkpoints (the documented reference **`Qwen/Qwen3.5-35B-A3B-GPTQ-Int4`**, and also `Qwen3.6-35B-A3B-GPTQ-Int4` variants) cannot be exported cleanly and do not run correctly with TensorRT Edge-LLM v0.8.0. There are **two distinct problems**:

### Bug 1 (blocker, reproducible from a clean checkout) — export crashes on `shared_expert_gate`

`tensorrt-edgellm-export` crashes during GPTQ weight repacking. In `Qwen3_5SparseMoeBlock`, `shared_expert_gate` is a `Linear(hidden_size, 1)` and is instantiated as a `GPTQLinear`. GPTQ packs eight 4-bit zero points into each int32, so `qzeros` has `out_features // 8` columns. With `out_features == 1` that is 0 columns, i.e. an empty `[num_groups, 0]` tensor, and `repack_gptq_to_plugin` fails when it tries to unpack it:

```
File ".../tensorrt_edgellm/checkpoint/repacking.py", in repack_gptq_to_plugin
    zeros[:, k::8] = (qz >> (4 * k)) & 0xF
RuntimeError: The expanded size of the tensor (1) must match the existing size (0)
at non-singleton dimension 1.  Target sizes: [16, 1].  Tensor sizes: [16, 0]
```

Failing module (all 40 layers): `model.layers.N.mlp.shared_expert_gate`, with `qweight=[256, 1]`, `qzeros=[16, 0]`.
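
The shapes above follow from the usual GPTQ int32 packing arithmetic. A minimal sketch, assuming 4-bit weights and a group size of 128 (inferred from 16 groups at `hidden_size=2048`, not read from the checkpoint config); `gptq_packed_shapes` is a hypothetical helper, not part of TensorRT Edge-LLM:

```python
def gptq_packed_shapes(in_features, out_features, group_size=128, bits=4):
    """Return (qweight_shape, qzeros_shape) for a GPTQ layer packed into int32.

    group_size=128 is an assumption inferred from the shapes in this report.
    """
    pack = 32 // bits                      # 8 four-bit values per int32
    num_groups = in_features // group_size
    qweight = (in_features // pack, out_features)   # packed along the input dim
    qzeros = (num_groups, out_features // pack)     # packed along the output dim
    return qweight, qzeros


# shared_expert_gate: Linear(2048, 1) -> qzeros has zero columns
print(gptq_packed_shapes(2048, 1))    # ((256, 1), (16, 0))
# An output width that is a multiple of 8 packs without loss
print(gptq_packed_shapes(2048, 512))  # ((256, 512), (16, 64))
```

Any `out_features` below 8 floors to zero packed columns, so a 1-wide gate cannot be represented in this layout at all.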

The checkpoint actually stores `shared_expert_gate.weight` as a plain BF16 `[1, hidden_size]` tensor (it is **not** GPTQ-quantized), so `make_linear` should not select `GPTQLinear` for this module. Forcing it to `FP16Linear(hidden_size, 1)` lets export complete. This is the same pattern as the MoE router `Qwen3MoERouter`, which is already a plain fp16 parameter.
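
One way to confirm which tensors a checkpoint stores for a module, without loading any weights, is to read the safetensors header (an 8-byte little-endian length followed by a JSON table). The sketch below uses only the standard library; `read_safetensors_header` and `describe_module` are hypothetical helpers for illustration, not Edge-LLM APIs, and the `.qweight` / `.weight` key suffixes are assumed from the tensor names in this report:

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        return json.loads(f.read(header_len))


def describe_module(header, prefix):
    """Classify a module by which tensors the checkpoint stores for it."""
    if f"{prefix}.qweight" in header:
        return "gptq"
    if f"{prefix}.weight" in header:
        entry = header[f"{prefix}.weight"]
        return f"plain {entry['dtype']} {entry['shape']}"
    return "missing"
```

Because the checkpoint is sharded, this would be run over each `*.safetensors` file in the download directory with a prefix such as `model.layers.0.mlp.shared_expert_gate`. A module that reports `plain` rather than `gptq` is a candidate for the non-quantized linear path.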

### Bug 2 (correctness) — incoherent / empty inference output

After working around Bug 1 (forcing `shared_expert_gate` to `FP16Linear`), the model exports, builds, and runs on Jetson AGX Orin, but inference output is broken for **both** checkpoints with greedy decoding (`temperature=0.0, top_k=1`):

- **`Qwen/Qwen3.5-35B-A3B-GPTQ-Int4`** → repeating gibberish, e.g.
  `oleyus Relationshipsdbuf Reh$ar=-dayalog劳动能力ránBVettel-policy ... 番PEndе--[ereumXR尔夫ajaraOC_literals Correspond` (repeats), `finish_reason=max-length`.
- **`Qwen3.6-35B-A3B-GPTQ-Int4`** → empty `output_text`; the model emits only special/out-of-text token ids (in the padded range `[248077, 248320)`) and never emits `<|im_end|>`, so it runs to `max-length` with no decodable text.
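
To quantify the second symptom, generated token ids can be bucketed against the ranges quoted above. This is a rough diagnostic sketch: `split_token_ids` is a hypothetical helper, and it assumes that every id below 248077 belongs to the tokenizer while `[248077, 248320)` is padding, as described in this report.

```python
PADDED_START = 248077  # first padded id, from the range reported above
VOCAB_SIZE = 248320    # vocabSize from the parsed engine config


def split_token_ids(token_ids, padded_start=PADDED_START, vocab_size=VOCAB_SIZE):
    """Partition generated ids into tokenizer, padded, and invalid buckets."""
    buckets = {"tokenizer": [], "padded": [], "invalid": []}
    for tid in token_ids:
        if 0 <= tid < padded_start:
            buckets["tokenizer"].append(tid)   # decodable by the tokenizer
        elif padded_start <= tid < vocab_size:
            buckets["padded"].append(tid)      # embedding rows with no token text
        else:
            buckets["invalid"].append(tid)     # outside the logits range entirely
    return buckets


print(split_token_ids([5, 248077, 248319, 248320, -1]))
# {'tokenizer': [5], 'padded': [248077, 248319], 'invalid': [248320, -1]}
```

A run where nearly every id lands in `padded` points at the logits themselves being wrong, rather than at decoding or stop-token handling.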

Prompt used: `"What is the capital of South Korea? Answer briefly."` (and several others). Engine builds cleanly; the `Int4GroupwiseGemmPlugin` / GDN CuTe DSL kernels load successfully. Engine config parsed as:
`hiddenSize=2048 numDecoderLayers=40 numAttentionLayers=10 numLinearAttnLayers=30 numKVHeads=2 headDim=256 vocabSize=248320`.

Because the **documented reference checkpoint itself** produces garbage, this looks like a framework-level correctness issue in the `qwen3_5_moe` (GDN-hybrid linear-attention + MoE, head-dim 256 / KV-ratio 8) export/runtime path rather than a user/model-selection problem. Note that Bug 2 was observed on a source tree patched to get past Bug 1, so if Bug 1 is fixed differently upstream, please re-verify Bug 2.

**Impact:** blocker — no usable text generation from `qwen3_5_moe` GPTQ-Int4 on Jetson Orin.

### Steps/Code to reproduce bug

**Installation method:** `pip install -e .` from source (v0.8.0, commit `f9cc746`).

**Commands used:**
```bash
# 1) Download the documented reference MoE checkpoint
hf download Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 --local-dir ./Qwen3.5-35B-A3B-GPTQ-Int4

# 2) Export to ONNX  --> CRASHES here (Bug 1) on a clean checkout
tensorrt-edgellm-export ./Qwen3.5-35B-A3B-GPTQ-Int4 ./qwen3_5_moe_onnx

# --- after forcing shared_expert_gate to FP16Linear to get past Bug 1 ---

# 3) Build engine on Jetson Orin (<dest> is a placeholder; --onnxDir must hold the exported ONNX)
./build/examples/llm/llm_build \
    --onnxDir  <dest>/onnx/llm \
    --engineDir <dest>/engine/llm \
    --maxInputLen 4096 --maxKVCacheCapacity 4096 --maxBatchSize 1

# 4) Inference  --> garbage / empty output (Bug 2)
echo '{"max_generate_length": 64, "requests": [{"messages":[{"role":"user","content":"What is the capital of South Korea? Answer briefly."}],"sampling_params":{"temperature":0.0,"top_k":1}}]}' > in.json
./build/examples/llm/llm_inference \
    --engineDir <dest>/engine/llm \
    --inputFile in.json --outputFile out.json --dumpOutput
```
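
For trying other prompts, the request file from step 4 can be generated rather than hand-quoted in the shell. A small sketch that reproduces the same JSON structure; `build_request` is a hypothetical helper, and only the fields already shown in the `echo` command above are used:

```python
import json


def build_request(prompt, max_generate_length=64):
    """Build the llm_inference input with greedy sampling, as in step 4."""
    return {
        "max_generate_length": max_generate_length,
        "requests": [
            {
                "messages": [{"role": "user", "content": prompt}],
                "sampling_params": {"temperature": 0.0, "top_k": 1},
            }
        ],
    }


if __name__ == "__main__":
    request = build_request("What is the capital of South Korea? Answer briefly.")
    # Redirect stdout to in.json to use it with --inputFile
    print(json.dumps(request, ensure_ascii=False))
```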

### Expected behavior
`tensorrt-edgellm-export` completes without manual patching, and `llm_inference` returns a coherent answer (e.g. "Seoul.") for the reference `Qwen/Qwen3.5-35B-A3B-GPTQ-Int4` checkpoint.

## System information

### Export host (x86 with GPU)
- OS: Ubuntu 24.04.3 LTS
- CPU architecture: x86_64
- GPU: NVIDIA H100 80GB HBM3 (79.6 GB), 1 GPU
- Library versions:
  - Python: 3.12.3
  - TensorRT Edge-LLM: 0.8.0 (commit f9cc746)
  - CUDA: 13.1
  - PyTorch: 2.11.0+cu130
  - Transformers: 5.8.1
  - ONNX: 1.19.0
  - ModelOpt: 0.44.0

### Edge device (inference)
- Platform: NVIDIA Jetson AGX Orin Developer Kit (SM87)
- Software release: L4T R36.5.0 (JetPack 6.x), nvidia-l4t-core 36.5.0-20260115194252
- CPU architecture: aarch64
- Total device memory: 61 GB
- Library versions:
  - TensorRT Edge-LLM: 0.8.0
  - CUDA: 12.6
  - TensorRT: 10.3.0.30 (+cuda12.5)
  - C++ compiler: GCC 11.4.0

### Additional note (separate, minor)
Running `tensorrt-edgellm-quantize llm --quantization int4_awq` on the **BF16** base (`Qwen/Qwen3.6-35B-A3B`) fails at an even earlier stage, during MTP-draft quantization, because `Qwen3_5MoeTextConfig` has no `intermediate_size` attribute (only `moe_intermediate_size`):
```
File ".../quantization/models/mtp_draft.py", in __init__
    MtpDecoderLayer(hs, config.intermediate_size, ...)
AttributeError: 'Qwen3_5MoeTextConfig' object has no attribute 'intermediate_size'
```
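
A guarded lookup along the following lines would avoid the `AttributeError`. This is only a sketch: `resolve_intermediate_size` is a hypothetical helper, the config objects are stand-ins with made-up sizes, and whether the MTP draft layer should actually use the MoE expert width is a question for the maintainers.

```python
from types import SimpleNamespace


def resolve_intermediate_size(config):
    """Prefer intermediate_size; fall back to moe_intermediate_size if absent."""
    size = getattr(config, "intermediate_size", None)
    if size is None:
        size = getattr(config, "moe_intermediate_size", None)
    if size is None:
        raise AttributeError(
            "config defines neither intermediate_size nor moe_intermediate_size"
        )
    return size


# Stand-in configs; the sizes are illustrative, not taken from the real model.
moe_cfg = SimpleNamespace(moe_intermediate_size=512)
dense_cfg = SimpleNamespace(intermediate_size=1024)

print(resolve_intermediate_size(moe_cfg))    # 512
print(resolve_intermediate_size(dense_cfg))  # 1024
```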
