fix: wire InfiniOps paged attention workspace #1333
Merged
wooway777
approved these changes
Jun 24, 2026
Summary
- Provide workspace to the PagedAttentionInfinilm call.
- Pass --cuda_arch into the external InfiniOps CMake build through CMAKE_CUDA_ARCHITECTURES.
Motivation
The new InfiniOps InfiniLM paged attention decode path uses a split-KV fast path for head size 128 when sufficient workspace is provided. InfiniCore currently calls the operator without workspace, so InfiniOps falls back to the slower decode path. In addition, the external InfiniOps CMake build defaults to sm_75 unless CMAKE_CUDA_ARCHITECTURES is set, even when InfiniCore is configured with --cuda_arch=sm_80.
Together, these two integration gaps explain the large performance loss seen after switching paged attention to InfiniOps.
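A minimal Python sketch of the two gaps described above, for illustration only. It is not the InfiniOps or InfiniCore code: the function names, the `required_bytes` parameter, and the path labels are hypothetical, and the real dispatch lives in the operator implementation and build scripts. The first function models a workspace-gated choice between the split-KV fast path and the fallback; the second turns an `sm_80`-style architecture string into a `CMAKE_CUDA_ARCHITECTURES` define.

```python
import re

# Hypothetical names throughout; this is not the InfiniOps or InfiniCore API.
FAST_PATH_HEAD_SIZE = 128  # head size the description says the fast path covers


def select_decode_path(head_size: int, workspace_bytes: int, required_bytes: int) -> str:
    """Return the decode path a workspace-gated dispatcher would pick.

    required_bytes is an assumed, caller-supplied workspace requirement.
    """
    if head_size == FAST_PATH_HEAD_SIZE and workspace_bytes >= required_bytes:
        return "split_kv"
    # No (or too little) workspace, or an unsupported head size: slower path.
    return "fallback"


def cmake_cuda_arch_flag(cuda_arch: str) -> str:
    """Translate an 'sm_80'-style value into a CMAKE_CUDA_ARCHITECTURES define."""
    match = re.fullmatch(r"sm_(\d+[a-z]?)", cuda_arch.strip())
    if match is None:
        raise ValueError(f"unrecognized CUDA arch: {cuda_arch!r}")
    return f"-DCMAKE_CUDA_ARCHITECTURES={match.group(1)}"


if __name__ == "__main__":
    # Calling without workspace selects the fallback even for head size 128.
    print(select_decode_path(128, workspace_bytes=0, required_bytes=1024))
    print(select_decode_path(128, workspace_bytes=4096, required_bytes=1024))
    print(cmake_cuda_arch_flag("sm_80"))
```

In this model, a caller that passes zero workspace never reaches the fast path, and a build that omits the define leaves the external project on its own default architecture.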
This PR is the InfiniCore integration side. The InfiniOps split-count tuning is tracked separately in InfiniTensor/InfiniOps#741. After that PR lands, InfiniCore will also need the submodules/InfiniOps pointer bumped to a commit that contains it to reproduce the final benchmark numbers from a fresh checkout.
Validation
On ssh nvidia-1, using /tmp/codex-infinicore-perf-nvidia1-20260623-1703:
After rebuilding InfiniLM against the rebuilt InfiniCore root:
Smoke benchmark:
Formatting/checks:
Both passed.
ruff format --check . and ruff check . still fail on existing Python files unrelated to this PR; no Python files are changed here.
Performance Notes
With this PR plus InfiniTensor/InfiniOps#741 and a local submodule checkout containing #741, the full requested InfiniLM benchmark command group completed on nvidia-1. Decode throughput deltas versus the no-InfiniOps baseline:
The large negative first-sample outliers in some 8B batches are cold-start/first graph-capture effects; the duplicated warm cases return to near parity or better.
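For reference, a throughput delta against a baseline is usually reported as a signed percentage. The sketch below assumes that convention. The function name is hypothetical and the numbers are placeholders, not measurements from this PR.

```python
def throughput_delta_pct(candidate_tok_s: float, baseline_tok_s: float) -> float:
    """Signed percentage change of candidate throughput relative to baseline."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tok_s - baseline_tok_s) * 100.0 / baseline_tok_s


if __name__ == "__main__":
    # Placeholder values only: a positive result means the candidate is faster.
    print(throughput_delta_pct(105.0, 100.0))
    print(throughput_delta_pct(90.0, 100.0))
```

A negative value means the candidate decodes more slowly than the baseline, which is how a cold-start first sample shows up as an outlier.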