## Describe the bug

`tensorrt-edge-llm` v0.7.1 (commit `5136119`) uses **12.9–13.5 GB** of RAM after loading `Qwen3.5-4B-AWQ` on a Jetson Orin NX 16 GB at `maxInputLen=4096`, which prevents the typical edge stack (ASR + TTS + LLM co-residency) from fitting on the device. The same hardware comfortably hosts `Qwen3-4B-AWQ` (pure attention) at **~3 GB**. That is **roughly 4× the RAM footprint** for the same parameter count, which indicates that the cost is specific to the Gated DeltaNet (GDN) hybrid mamba/attention layout as currently exposed by the runtime, not an inherent property of mamba models.

Impact: **blocker** for any on-device deployment that needs the LLM and other models to share Orin NX 16 GB unified memory.

Note that `Qwen/Qwen3.5-4B` is listed as officially supported (`docs/source/user_guide/getting_started/supported-models.md`).

Architecture of the model:

- 32 hidden layers: **24 mamba/SSM (GDN) + 8 attention**
- `hidden_size=2560`, `num_attn_heads=16`, `head_dim=256`, `kv_heads=8`
- `vocab_size=248320` (Qwen3.5 expanded vocab)

After exhausting the common configuration knobs (see the table below), only ~1.5 GB of further savings looks achievable in software (mmap embedding sidecar + reduced vocab). Weight streaming, the canonical TRT lever, is a no-op here because the AWQ weights live in plugin/constant buffers and are not in the streamable pool.

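
As a rough sanity check on the architecture figures above, the embedding table and the attention KV cache can be sized by hand. This is only a sketch: it assumes FP16 (2 bytes per element) for both the embedding rows and the KV cache, which this report does not state explicitly, and the helper names are made up for illustration.

```python
# Back-of-the-envelope sizing from the architecture listed in this report.
# ASSUMPTION: embedding rows and KV cache entries are stored as FP16 (2 bytes).
BYTES_FP16 = 2


def embedding_bytes(vocab_size: int, hidden_size: int,
                    bytes_per_elem: int = BYTES_FP16) -> int:
    """Size of a dense [vocab_size, hidden_size] embedding table."""
    return vocab_size * hidden_size * bytes_per_elem


def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1,
                   bytes_per_elem: int = BYTES_FP16) -> int:
    """Size of the K and V caches across all attention layers."""
    k_and_v = 2  # one tensor for keys, one for values
    return (k_and_v * attn_layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)


# Values taken from the architecture list and build flags in this report.
emb = embedding_bytes(vocab_size=248320, hidden_size=2560)
kv = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=256, seq_len=4096)

print(f"embedding table:    {emb / 1e9:.2f} GB")
print(f"attention KV cache: {kv / 2**20:.0f} MiB")
```

Under these assumptions the embedding table comes to about 1.27 GB, in line with the `embedding.safetensors` size reported in the build artifacts, and the KV cache for the 8 attention layers comes to 256 MiB. Together that is roughly 1.5 GB, far short of the ~12 GB observed at load time.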
### Steps/Code to reproduce bug

**Engine build (cross from x86 / native on Orin NX):**

```bash
# Export the AWQ checkpoint to ONNX (with the MTP draft head)
python -m llm_loader.export_all_cli \
  --src harvestsu/Qwen3.5-4B-AWQ \
  --dst ./onnx-mtp \
  --mtp

# Build the TensorRT engines from the exported ONNX
./build/examples/llm/llm_build \
  --onnxDir ./onnx-mtp/llm \
  --engineDir ./engines/qwen35-4b-awq/base \
  --maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 \
  --specBase --maxVerifyTreeSize 8
```

Resulting artifacts:

- `eagle_base.engine` 2.1 GB
- `eagle_draft.engine` 374 MB
- `embedding.safetensors` 1.27 GB

**Runtime command used:**

```bash
EDGELLM_PLUGIN_PATH=/path/to/libNvInfer_edgellm_plugin.so \
python -m experimental.server \
  --engine-dir /path/to/engines/qwen35-4b-awq/base \
  --served-model-name qwen3.5-4b-awq \
  --port 8100 --host 0.0.0.0
```

**Observed RAM after engine load (no requests yet):**

```
$ free -m
       total   used   free ...
Mem:   15656  12947    385 ...
```

TRT log line at engine load:

```
[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... now: CPU 79, GPU 11891 (MiB)
```

In other words, **~12 GB is pre-allocated** in the GPU pool right after cuBLAS init, before any inference. The shared execution context manager later reports its own 2 GB allocation on top.

Same hardware and toolchain, with `Qwen3-4B-AWQ` (pure attention 32 layers, vocab 152064, similar AWQ INT4 + FP16 setup) at `maxInputLen=4096`:

```
Mem: ~3000 used ...
```

### Expected behavior

A 4B-parameter AWQ INT4 model should fit comfortably alongside an ASR engine and a TTS engine in Orin NX 16 GB unified memory. The community guidance for Qwen3.5-4B states that the Q4-quantized model "needs only ~2.5 GB" of disk and fits "4–6 GB GPUs", far from the 12+ GB we measure at runtime.

NVIDIA's own Nemotron-H (92% Mamba-2 + 8% attention) is documented as **3× faster than Llama-3.1 at matched accuracy** thanks to mamba's constant-memory recurrent state. The expected memory advantage of the hybrid layout is not present in our v0.7.1 deployment of Qwen3.5-4B.

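
For anyone trying to reproduce the `free -m` readings above, a small helper like the following can take a before/after snapshot around engine load. It is a minimal sketch, not part of the toolchain: the function names are hypothetical, and it reports `MemTotal - MemAvailable` from `/proc/meminfo`, which may differ slightly from the `used` column printed by `free` depending on the procps version.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field_name: value_in_KiB}."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_mib(fields: dict[str, int]) -> int:
    """Memory in use, in MiB, defined here as MemTotal - MemAvailable."""
    return (fields["MemTotal"] - fields["MemAvailable"]) // 1024


def snapshot(path: str = "/proc/meminfo") -> int:
    """Read the current system-wide used memory in MiB (Linux only)."""
    with open(path) as f:
        return used_mib(parse_meminfo(f.read()))
```

Calling `snapshot()` once before starting the server and once after the engine has loaded gives the load-time delta. Because Orin NX memory is unified, GPU pool allocations show up in this system-wide figure as well.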
### What we tried (and why it didn't help)

| Lever | Expected | Measured | Notes |
|---|---|---|---|
| Lower `maxInputLen` 8192 → 4096 | Linear with seq | −1.0 GB | Workspace pool clearly not pure-seq-linear |
| Disable MTP draft engine | −400–500 MB | −500 MB | Draft is minor |
| TRT Weight Streaming (`setWeightStreamingBudgetV2`) | −2.5–3.7 GB | **~0** | AWQ weights are in plugin/constant buffers, not in `getStreamableWeightsSize()` (reports only 2.1 MB streamable for base, 92 KB for draft). All four budget values (`unset`, `min`, `1g`, `off`) produced identical RAM and tok/s, variance ≤ 1%. |

### Hypothesis

The runtime / engine builder appears to size the **mamba/SSM scan and state buffers pessimistically** for every hidden layer at engine load time, regardless of the actual prefill batch shape. With 24 mamba layers at `hidden_size=2560`, plus parallel-scan workspace and conv state caches, the worst-case pool dominates the 12 GB.

By contrast, pure-attention models like Qwen3-4B only allocate KV cache (a few hundred MB at 4k context for 8 KV heads) plus regular attention scratch, both of which scale much more gracefully.

If this is correct, the fix lies in how the workspace pool for the mamba/GDN layers is sized, potentially by making it elastic with the batched prefill shape rather than fixed at the `maxInputLen` worst case.

## System information (Edge Device)

- Platform: **NVIDIA Jetson Orin NX 16GB**
- Software release: **JetPack 6.2**
- CPU architecture: aarch64
- GPU compute capability: SM87
- Total device memory: 16 GB (unified)
- Build type: Release
- Library versions:
  - TensorRT Edge-LLM version or commit hash: **v0.7.1** + customvoice product layer migration (commit `5136119`)
  - CUDA: **12.6.68**
  - TensorRT: **10.3.0.30**
- Model: `harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine` (private; happy to share access)
- Engine build flags: `--maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 --specBase --maxVerifyTreeSize 8`

## Asks

1. Are mamba/GDN layer workspace allocations meant to be elastic at runtime in v0.7.1, or are they intentionally pre-sized at `maxInputLen`? If the latter, can a future release expose a knob to clamp this pool independently of `maxInputLen` (e.g., a `mambaPrefillMaxSeq`)?
2. Can AWQ weights for hybrid models be migrated into the regular TRT streamable weight pool, so that `setWeightStreamingBudgetV2` becomes effective for AWQ engines? Today `getStreamableWeightsSize()` returns only ~2 MB on this engine, while ~2 GB of AWQ weights sit in plugin constant buffers.
3. Any pointers on profiling the composition of the 12 GB pool (e.g., a per-layer or per-buffer breakdown) would also be very welcome; `--verbose` only shows aggregate `MemUsageChange` lines.

## Reproducer artifacts

- Engine: `harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine` (private; willing to share access)
- Server log + `free -m` snapshots: available on request
- Full investigation writeup mirrored at `docs/known-issues/qwen35-orin-nx-oom.md` on our fork `suharvest/TensorRT-Edge-LLM`, branch `v071/customvoice-product`

Thanks for the great work on the framework!