[Bug] Qwen3.5-4B-AWQ uses ~13GB RAM on Orin NX 16GB — GDN mamba workspace pool prevents ASR+TTS+LLM co-residency

## Describe the bug

`tensorrt-edge-llm` v0.7.1 (commit `5136119`) uses **12.9–13.5 GB** of RAM after loading `Qwen3.5-4B-AWQ` on Jetson Orin NX 16 GB at `maxInputLen=4096`, which prevents the typical edge stack (ASR + TTS + LLM co-residency) from fitting on the device. The same hardware comfortably hosts `Qwen3-4B-AWQ` (pure attention) at **~3 GB**, so Qwen3.5 has **roughly 4× the RAM footprint** for the same parameter count. This suggests the cost is specific to the Gated DeltaNet hybrid mamba/attention layout as currently exposed by the runtime, not an inherent property of mamba models.

Impact: **blocker** for any on-device deployment that needs the LLM and other models to share Orin NX 16 GB unified memory. Note that `Qwen/Qwen3.5-4B` is listed as officially supported (`docs/source/user_guide/getting_started/supported-models.md`).

Architecture of the model:
- 32 hidden layers: **24 mamba/SSM (GDN) + 8 attention**
- `hidden_size=2560`, `num_attn_heads=16`, `head_dim=256`, `kv_heads=8`
- `vocab_size=248320` (Qwen3.5 expanded vocab)
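
For reference, here is a back-of-envelope sketch of two sizes that follow directly from these numbers. It assumes FP16 (2 bytes per element), batch size 1, and a 4096-token context; the helper names are ours and are not part of the runtime.

```python
# Back-of-envelope sizes from the architecture figures listed above.
# Assumption: FP16 (2 bytes per element) for both the KV cache and the embedding table.
BYTES_FP16 = 2


def kv_cache_bytes(attn_layers, kv_heads, head_dim, tokens, bytes_per_elem=BYTES_FP16):
    """K and V tensors for every attention layer at full context, batch size 1."""
    return 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_elem


def embedding_bytes(vocab_size, hidden_size, bytes_per_elem=BYTES_FP16):
    """One dense embedding table of shape [vocab_size, hidden_size]."""
    return vocab_size * hidden_size * bytes_per_elem


kv = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=256, tokens=4096)
emb = embedding_bytes(vocab_size=248320, hidden_size=2560)

print(f"KV cache (8 attention layers, 4096 tokens): {kv / 2**20:.0f} MiB")
print(f"Embedding table: {emb / 1e9:.2f} GB")
```

Under these assumptions the attention KV cache comes to about 256 MiB and the embedding table to about 1.27 GB, which matches the size of `embedding.safetensors` reported below. Neither comes close to explaining a 12+ GB footprint.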

After exhausting common configuration knobs (full table below), only ~1.5 GB of further savings looks achievable in software (mmap embedding sidecar + reduced vocab). Weight streaming — the canonical TRT lever — is a no-op here because AWQ weights live in plugin/constant buffers and aren't in the streamable pool.

### Steps/Code to reproduce bug

**Engine build (cross-compiled from x86 or native on Orin NX):**
```bash
python -m llm_loader.export_all_cli \
    --src harvestsu/Qwen3.5-4B-AWQ \
    --dst ./onnx-mtp \
    --mtp

./build/examples/llm/llm_build \
    --onnxDir ./onnx-mtp/llm \
    --engineDir ./engines/qwen35-4b-awq/base \
    --maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 \
    --specBase --maxVerifyTreeSize 8
```

Resulting artifacts:
- `eagle_base.engine` 2.1 GB
- `eagle_draft.engine` 374 MB
- `embedding.safetensors` 1.27 GB

**Runtime command used:**
```bash
EDGELLM_PLUGIN_PATH=/path/to/libNvInfer_edgellm_plugin.so \
python -m experimental.server \
    --engine-dir /path/to/engines/qwen35-4b-awq/base \
    --served-model-name qwen3.5-4b-awq \
    --port 8100 --host 0.0.0.0
```

**Observed RAM after engine load (no requests yet):**
```
$ free -m
               total        used        free   ...
Mem:           15656       12947         385   ...
```

TRT log line at engine load:
```
[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... now: CPU 79, GPU 11891 (MiB)
```
→ **~12 GB is pre-allocated** in the GPU pool right after cuBLAS init, before any inference. The shared execution context manager later reports its own 2 GB allocation on top.
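
To track how the pool grows across load stages, we pull the `MemUsageChange` lines out of the server log with a small parser. This is a minimal sketch: it assumes every line of interest ends with the same `now: CPU <n>, GPU <n> (MiB)` suffix as the one quoted above, and `parse_mem_usage` is our own helper name.

```python
import re

# Matches a TensorRT "[MemUsageChange]" line like the one quoted above.
# Assumption: the "now: CPU <n>, GPU <n> (MiB)" suffix is stable across lines.
MEM_RE = re.compile(
    r"\[MemUsageChange\]\s*(?P<stage>[^:]+):.*?"
    r"now:\s*CPU\s+(?P<cpu>\d+),\s*GPU\s+(?P<gpu>\d+)\s*\(MiB\)"
)


def parse_mem_usage(lines):
    """Yield (stage, cpu_mib, gpu_mib, gpu_delta_mib) per MemUsageChange line.

    gpu_delta_mib is the change since the previous matching line (None for the first).
    """
    prev_gpu = None
    for line in lines:
        m = MEM_RE.search(line)
        if not m:
            continue
        gpu = int(m.group("gpu"))
        delta = None if prev_gpu is None else gpu - prev_gpu
        prev_gpu = gpu
        yield m.group("stage").strip(), int(m.group("cpu")), gpu, delta


sample = [
    "[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... now: CPU 79, GPU 11891 (MiB)",
]
for stage, cpu, gpu, delta in parse_mem_usage(sample):
    print(stage, cpu, gpu, delta)
```

This only exposes the aggregate totals that TensorRT logs, so it cannot attribute memory to individual layers or buffers.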

Same hardware, same toolchain, with `Qwen3-4B-AWQ` (pure attention 32 layers, vocab 152064, similar AWQ INT4 + FP16 setup) at `maxInputLen=4096`:
```
Mem:            ~3000   used   ...
```
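
For repeatable before/after comparisons, the same kind of number can be derived from `/proc/meminfo` instead of eyeballing `free -m`. The sketch below assumes a Linux host and defines "used" as `MemTotal - MemAvailable`, which is not necessarily identical to the `used` column printed by every `free` version. The sample text uses an illustrative `MemAvailable` value, not a measurement.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: kiB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def used_mib(info):
    """Memory not available to new allocations, in MiB (MemTotal - MemAvailable)."""
    return (info["MemTotal"] - info["MemAvailable"]) // 1024


def read_used_mib(path="/proc/meminfo"):
    """Read the current value from a live Linux system."""
    with open(path) as f:
        return used_mib(parse_meminfo(f.read()))


# Illustrative input only: MemTotal matches the 15656 MiB shown by `free -m` above,
# MemAvailable is a made-up value to demonstrate the arithmetic.
sample_text = "MemTotal:       16031744 kB\nMemAvailable:    2048000 kB\n"
print(used_mib(parse_meminfo(sample_text)), "MiB used")
```

Calling `read_used_mib()` once before starting the server and once after engine load gives the delta attributable to the process, which removes the OS baseline from the comparison.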

### Expected behavior

A 4 B parameter AWQ INT4 model should fit comfortably alongside an ASR engine and a TTS engine on Orin NX 16 GB unified memory. Community guidance for Qwen3.5-4B states that the Q4-quantized model "needs only ~2.5 GB" on disk and fits "4–6 GB GPUs", far from the 12+ GB we measure at runtime. NVIDIA's own Nemotron-H (92% Mamba-2 + 8% attention) is documented as **3× faster than Llama-3.1 at matched accuracy** thanks to mamba's constant-memory recurrent state. That expected memory advantage of the hybrid layout does not materialize in our v0.7.1 deployment of Qwen3.5-4B.

### What we tried (and why it didn't help)

| Lever | Expected | Measured | Notes |
|---|---|---|---|
| Lower `maxInputLen` 8192 → 4096 | Linear with seq | −1.0 GB | Workspace pool clearly not pure-seq-linear |
| Disable MTP draft engine | −400–500 MB | −500 MB | Draft is minor |
| TRT Weight Streaming (`setWeightStreamingBudgetV2`) | −2.5–3.7 GB | **~0** | AWQ weights are in plugin/constant buffers, not in `getStreamableWeightsSize()` (reports only 2.1 MB streamable for base, 92 KB for draft). All four budget values (`unset`, `min`, `1g`, `off`) produced identical RAM and tok/s, variance ≤ 1%. |
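
The streamable-weights figure can be checked independently of the server with the TensorRT Python bindings. This is a hedged sketch rather than the exact script we ran: `report_streamable` and `load_engine` are our own helper names, loading the plugin library through `ctypes` is an assumption about how the Edge-LLM plugins get registered, and the engine path in the usage comment is a placeholder.

```python
import ctypes


def report_streamable(engine):
    """Summarize how much of an engine's weights TensorRT considers streamable.

    `engine` is expected to be a tensorrt.ICudaEngine; `streamable_weights_size`
    is the Python counterpart of getStreamableWeightsSize() and returns bytes.
    """
    size = engine.streamable_weights_size
    return {"streamable_bytes": size, "streamable_mib": size / 2**20}


def load_engine(engine_path, plugin_path):
    """Deserialize an engine after loading the plugin shared library.

    Assumption: loading the .so is enough for its plugin creators to register.
    """
    import tensorrt as trt  # lazy import so report_streamable works without TensorRT

    ctypes.CDLL(plugin_path, mode=ctypes.RTLD_GLOBAL)
    runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
    with open(engine_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())


# Usage (paths are placeholders):
#   engine = load_engine("/path/to/engines/qwen35-4b-awq/base/eagle_base.engine",
#                        "/path/to/libNvInfer_edgellm_plugin.so")
#   print(report_streamable(engine))
```

If the reported size is only a few MiB, as on our engines, no weight streaming budget can free a meaningful amount of memory, which is consistent with the identical results across all four budget settings.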

### Hypothesis

The runtime / engine builder appears to size the **mamba/SSM scan and state buffers pessimistically** for every hidden layer at engine load time, regardless of the actual prefill batch shape. With 24 mamba layers at `hidden_size=2560`, plus parallel-scan workspace and conv state caches, this worst-case pool would account for most of the 12 GB.

By contrast, pure-attention models like Qwen3-4B only allocate KV cache (a few hundred MB at 4 k context for 8 KV heads) plus regular attention scratch, both of which scale much more gracefully.

If this is correct, the fix is in how the mamba/GDN layers' workspace pool is sized — potentially making it elastic with the batched prefill shape rather than fixed at `maxInputLen` worst-case.
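
As a rough sanity check on this hypothesis, the sketch below adds up the components we can name and compares them with the lower end of the measured range. It treats on-disk artifact sizes as a proxy for resident size, assumes an FP16 KV cache at batch size 1, and ignores the OS baseline, so the residual is only indicative.

```python
# Rough accounting in decimal GB. All inputs are figures quoted in this issue,
# except the KV-cache term, which is computed assuming FP16 and batch size 1.
kv_cache_gb = 2 * 8 * 8 * 256 * 4096 * 2 / 1e9  # K+V, 8 attention layers, 8 KV heads, head_dim 256

accounted_gb = {
    "eagle_base.engine": 2.1,          # file size, assumed ~= resident size
    "eagle_draft.engine": 0.374,       # file size, assumed ~= resident size
    "embedding.safetensors": 1.27,
    "kv_cache_attention_layers": kv_cache_gb,
    "shared_execution_context": 2.0,   # allocation reported by the context manager
}

measured_low_gb = 12.9  # lower end of the observed 12.9-13.5 GB range
total_gb = sum(accounted_gb.values())
residual_gb = measured_low_gb - total_gb

for name, gb in accounted_gb.items():
    print(f"{name:28s} {gb:5.2f} GB")
print(f"{'accounted':28s} {total_gb:5.2f} GB")
print(f"{'unexplained (approx.)':28s} {residual_gb:5.2f} GB")
```

Even with these generous assumptions, several GB remain unaccounted for, which is why we suspect a pre-sized mamba/GDN workspace rather than weights or KV cache.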

## System information (Edge Device)

- Platform: **NVIDIA Jetson Orin NX 16GB**
- Software release: **JetPack 6.2**
- CPU architecture: aarch64
- GPU compute capability: SM87
- Total device memory: 16 GB (unified)
- Build type: Release
- Library versions:
  - TensorRT Edge-LLM version or commit hash: **v0.7.1** + customvoice product layer migration (commit `5136119`)
  - CUDA: **12.6.68**
  - TensorRT: **10.3.0.30**
- Model: `harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine` (private — happy to share access)
- Engine build flags: `--maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 --specBase --maxVerifyTreeSize 8`

## Asks

1. Are mamba/GDN layer workspace allocations meant to be elastic at runtime in v0.7.1, or are they intentionally pre-sized at `maxInputLen`? If the latter, can a future release expose a knob to clamp this pool independently of `maxInputLen` (e.g., a `mambaPrefillMaxSeq`)?
2. Can AWQ weights for hybrid models be migrated into the regular TRT streamable weight pool, so `setWeightStreamingBudgetV2` becomes effective for AWQ engines? Today `getStreamableWeightsSize()` returns only ~2 MB on this engine, while ~2 GB of AWQ weights sit in plugin constant buffers.
3. Any pointers on profiling the 12 GB pool composition would also be very welcome (e.g., per-layer or per-buffer breakdown) — `--verbose` only shows aggregate `MemUsageChange` lines.

## Reproducer artifacts

- Engine: `harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine` (private — willing to share access)
- Server log + `free -m` snapshots: available on request
- Full investigation writeup mirrored at `docs/known-issues/qwen35-orin-nx-oom.md` on our fork `suharvest/TensorRT-Edge-LLM` branch `v071/customvoice-product`

Thanks for the great work on the framework!

