Background
InfiniLM requires Marlin-compatible weight layouts for AWQ/GPTQ inference. The existing CPU-side repack works, but it adds to model loading time and does not use the new InfiniCore GPU repack operators.
Scope
- Use InfiniCore AWQ/GPTQ Marlin repack operators when available.
- Keep fallback behavior for environments without Marlin support.
- Integrate the repacked weights into the existing quantized linear process-weight flow.
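
The selection-with-fallback behavior listed above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual InfiniLM or InfiniCore code: `cpu_repack`, `select_repack`, `process_weights`, and the `gpu_repack` argument are all hypothetical names, and the weights are stand-in lists rather than real tensors.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical signature: a repack function takes packed weights and
# returns them in the target layout.
Repack = Callable[[Sequence[int]], List[int]]


def cpu_repack(qweight: Sequence[int]) -> List[int]:
    """Stand-in for the existing CPU-side repack path (identity here)."""
    return list(qweight)


def select_repack(gpu_repack: Optional[Repack]) -> Tuple[str, Repack]:
    """Prefer the GPU operator when one is provided, else use the CPU path."""
    if gpu_repack is not None:
        return "gpu", gpu_repack
    return "cpu", cpu_repack


def process_weights(
    qweight: Sequence[int], gpu_repack: Optional[Repack] = None
) -> Tuple[str, List[int]]:
    """Repack weights, falling back to CPU if the GPU operator is unsupported."""
    backend, repack = select_repack(gpu_repack)
    try:
        return backend, repack(qweight)
    except NotImplementedError:
        # Assumption: an unsupported operator signals itself this way.
        return "cpu", cpu_repack(qweight)
```

In this sketch the caller passes `gpu_repack=None` on builds without Marlin support, so the process-weight flow stays the same regardless of which path does the repack.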
Follow-up Work
- Improve workspace/cache reuse.
- Optimize server integration.
- Reduce communication overhead for tensor parallel inference.
- Continue profiling decode performance against vLLM.