fix: wire InfiniOps paged attention workspace by voltjia · Pull Request #1333 · InfiniTensor/InfiniCore

voltjia · 2026-06-23T22:45:04Z

Summary

Allocate and pass workspace for the InfiniOps PagedAttentionInfinilm call.
Propagate xmake --cuda_arch into the external InfiniOps CMake build through CMAKE_CUDA_ARCHITECTURES.
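
As a rough illustration of the second item, the mapping from an xmake-style value such as sm_80 to a CMake architecture value can be sketched as below. This is a Python sketch, not the Lua in the actual xmake build script, and the helper names cuda_arch_to_cmake and cmake_arch_flag are hypothetical.

```python
import re


def cuda_arch_to_cmake(cuda_arch: str) -> str:
    """Convert an xmake-style arch such as 'sm_80' (or a bare '80') to '80'.

    Hypothetical helper: the real build script may normalize differently.
    """
    match = re.fullmatch(r"(?:sm_)?(\d+)", cuda_arch.strip())
    if match is None:
        raise ValueError(f"unrecognized CUDA arch: {cuda_arch!r}")
    return match.group(1)


def cmake_arch_flag(cuda_arch: str) -> str:
    """Build the -D argument that would be passed to the external CMake configure step."""
    return f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch_to_cmake(cuda_arch)}"


if __name__ == "__main__":
    print(cmake_arch_flag("sm_80"))
```

Rejecting unrecognized values, rather than passing them through, makes a silent fallback to the default architecture less likely.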

Motivation

The new InfiniOps InfiniLM paged attention decode path uses a split-KV fast path for head size 128 when sufficient workspace is provided. InfiniCore currently calls the operator without workspace, so InfiniOps falls back to the slower decode path. In addition, the external InfiniOps CMake build defaults to sm_75 unless CMAKE_CUDA_ARCHITECTURES is set, even when InfiniCore is configured with --cuda_arch=sm_80.

Together, these two integration gaps explain the large performance loss seen after switching paged attention to InfiniOps.
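
To show why a split-KV decode path needs workspace at all, here is a minimal pure-Python sketch of the general technique: each KV split writes a partial result (running max, exponential sum, unnormalized output), and a merge step combines them. This is not the InfiniOps kernel; the function names and the sizing formula in workspace_floats are assumptions made for illustration only.

```python
import math


def attend(q, keys, values, scale):
    """Reference single-query attention over the whole KV sequence."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attend_split_kv(q, keys, values, scale, num_splits):
    """Split-KV decode: per-split partials go to a workspace, then get merged."""
    n = len(keys)
    chunk = math.ceil(n / num_splits)
    dim = len(values[0])
    workspace = []  # one (max, exp_sum, unnormalized_out) entry per split
    for start in range(0, n, chunk):
        ks, vs = keys[start:start + chunk], values[start:start + chunk]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        acc = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(dim)]
        workspace.append((m, sum(weights), acc))
    # Merge: rescale every partial to a common max before summing.
    g = max(m for m, _, _ in workspace)
    denom = sum(l * math.exp(m - g) for m, l, _ in workspace)
    return [
        sum(acc[d] * math.exp(m - g) for m, _, acc in workspace) / denom
        for d in range(dim)
    ]


def workspace_floats(num_heads, num_splits, head_size):
    """Hypothetical sizing: head_size outputs plus max and sum, per head and split."""
    return num_heads * num_splits * (head_size + 2)
```

Without a buffer for the per-split partials there is nowhere to stage the merge, which is consistent with the operator falling back to the slower path when no workspace is passed.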

This PR is the InfiniCore integration side. The InfiniOps split-count tuning is tracked separately in InfiniTensor/InfiniOps#741. Once that PR lands, the submodules/InfiniOps pointer in InfiniCore must also be bumped to a commit that contains it; otherwise a fresh checkout will not reproduce the final benchmark numbers.

Validation

On nvidia-1 (over ssh), using /tmp/codex-infinicore-perf-nvidia1-20260623-1703:

GPU_ID=0 KEEP_BUILD=0 BASE=/tmp/codex-infinicore-perf-nvidia1-20260623-1703 \
  bash scripts/docker_build_core.sh ops true

infinicore import ok
CMAKE_CUDA_ARCHITECTURES:UNINITIALIZED=80
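
The cache line above follows CMake's NAME:TYPE=VALUE cache-entry format. A small checker along these lines could confirm the propagated value after a build. It is a sketch only: read_cache_entry is a hypothetical helper, and the location of the CMakeCache.txt file in this build is not stated in the PR.

```python
def read_cache_entry(cache_text: str, name: str):
    """Return the value of a NAME:TYPE=VALUE entry from CMakeCache.txt text, or None."""
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' / '//' comment lines CMake writes into the cache.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        var, _, _type = key.partition(":")
        if var == name:
            return value
    return None


if __name__ == "__main__":
    sample = "// comment\nCMAKE_CUDA_ARCHITECTURES:UNINITIALIZED=80\n"
    print(read_cache_entry(sample, "CMAKE_CUDA_ARCHITECTURES"))
```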

After rebuilding InfiniLM against the rebuilt InfiniCore root:

GPU_ID=0 BASE=/tmp/codex-infinicore-perf-nvidia1-20260623-1703 \
  bash scripts/docker_build_lm.sh ops

imports ok

Smoke benchmark:

python examples/bench.py --device nvidia \
  --model=/data-aisoft/mechdancer/models/9g_8b_thinking_llama \
  --enable-paged-attn --enable-graph \
  --input-len=32,32 --output-len=256 --batch-size=1

Decode throughput: 85.71 tok/s, 85.86 tok/s

Formatting/checks:

clang-format-18 --dry-run -Werror src/infinicore/ops/paged_attention/paged_attention_infiniops.cc
git diff --check

Both passed.

ruff format --check . and ruff check . still fail on existing Python files unrelated to this PR; no Python files are changed here.

Performance Notes

With this PR plus InfiniTensor/InfiniOps#741 and a local submodule checkout containing #741, the full requested InfiniLM benchmark command group completed on nvidia-1. Decode throughput deltas versus the no-InfiniOps baseline:

Case	Avg	Warm Avg
8B bs1	+7.83%	+8.33%
8B bs4	+2.62%	+5.35%
8B bs16	-2.66%	+0.93%
8B bs64	-1.64%	+1.33%
70B bs1	+9.81%	+10.46%
70B bs4	+8.24%	+8.78%
70B bs16	-0.71%	-0.66%

The large negative first-sample outliers in some 8B batches are cold-start/first graph-capture effects; the duplicated warm cases return to near parity or better.
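
For readers reproducing the table, the deltas are relative changes in decode throughput. The sketch below shows one way such numbers could be computed. It assumes that "Warm Avg" drops the first sample of each run list, which the PR does not state explicitly, and the sample values are made up for illustration rather than taken from the benchmark.

```python
def percent_delta(candidate: float, baseline: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


def mean(samples):
    return sum(samples) / len(samples)


def summarize(candidate_runs, baseline_runs):
    """Return (avg_delta, warm_avg_delta) in percent.

    Assumption: the warm average ignores the first (cold) sample of each list.
    """
    avg = percent_delta(mean(candidate_runs), mean(baseline_runs))
    warm = percent_delta(mean(candidate_runs[1:]), mean(baseline_runs[1:]))
    return avg, warm


if __name__ == "__main__":
    # Illustrative tok/s values only, not measurements from this PR.
    print(summarize([50.0, 110.0], [100.0, 100.0]))
```

A slow first sample pulls the plain average down while leaving the warm average untouched, which matches the pattern described for the 8B batches.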

fix: wire InfiniOps paged attention workspace

0f520cb

voltjia requested a review from a team June 23, 2026 22:45

wooway777 approved these changes Jun 24, 2026

voltjia merged commit b17ccd9 into main Jun 24, 2026
10 checks passed

voltjia deleted the fix/infinops-paged-attention-integration branch June 24, 2026 01:41
