[DEV] feat: Enable GPU Marlin repack in InfiniLM AWQ/GPTQ weight processing #458

Description

@qinyiqun

Background

InfiniLM currently needs Marlin-compatible weight layouts for AWQ/GPTQ inference. The existing CPU-side repack works, but it increases model loading cost and does not use the new InfiniCore GPU repack operators.

Scope

  • Use InfiniCore AWQ/GPTQ Marlin repack operators when available.
  • Keep fallback behavior for environments without Marlin support.
  • Integrate the repacked weights into the existing quantized linear process-weight flow.
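
To make the intended control flow concrete, here is a minimal sketch of the "GPU repack when available, otherwise fall back" dispatch described in the scope items above. It is an illustration only: `process_weight`, `cpu_repack_placeholder`, and the `gpu_repack` callable are hypothetical names, not InfiniCore or InfiniLM APIs, and the placeholder repack does not implement the real Marlin layout.

```python
from typing import Callable, List, Optional, Tuple

Weights = List[int]
RepackFn = Callable[[Weights], Weights]


def cpu_repack_placeholder(qweight: Weights) -> Weights:
    # Placeholder only: a real Marlin repack permutes and re-interleaves
    # packed 4-bit values. Reversing the list just marks that a repack ran.
    return list(reversed(qweight))


def process_weight(
    qweight: Weights,
    gpu_repack: Optional[RepackFn],
    cpu_repack: RepackFn = cpu_repack_placeholder,
) -> Tuple[Weights, str]:
    """Repack quantized weights, preferring the GPU operator if one is given.

    `gpu_repack` is None when the (hypothetical) GPU operator is not
    available in the current environment. Returns the repacked weights and
    the name of the path that produced them.
    """
    if gpu_repack is not None:
        try:
            return gpu_repack(qweight), "gpu"
        except (RuntimeError, NotImplementedError):
            # Assumption: an unsupported device or layout surfaces as one of
            # these exceptions; fall through to the CPU path in that case.
            pass
    return cpu_repack(qweight), "cpu"
```

Returning the path name alongside the weights is a design choice for this sketch: it makes it easy to log which path was taken during model loading and to compare loading cost between the two.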

Follow-up Work

  • Improve workspace/cache reuse.
  • Optimize server integration.
  • Reduce communication overhead for tensor parallel inference.
  • Continue profiling decode performance against vLLM.
