fix: wire InfiniOps paged attention workspace #1333

Merged: voltjia merged 1 commit into main from fix/infinops-paged-attention-integration on Jun 24, 2026
voltjia (Collaborator) commented Jun 23, 2026

Summary

  • Allocate and pass workspace for the InfiniOps PagedAttentionInfinilm call.
  • Propagate xmake --cuda_arch into the external InfiniOps CMake build through CMAKE_CUDA_ARCHITECTURES.
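
The second bullet amounts to translating an xmake-style architecture name into the value CMake expects. The sketch below illustrates that mapping in Python; the real change lives in the xmake build script, which is not shown in this PR description, and the function names `cmake_cuda_arch` and `cmake_configure_args` are hypothetical.

```python
import re
from typing import List, Optional


def cmake_cuda_arch(xmake_arch: str) -> str:
    """Map an arch name such as 'sm_80' to the bare value CMake expects ('80')."""
    # Accept an optional 'sm_' or 'compute_' prefix followed by digits
    # and an optional one-letter suffix (for example '90a').
    match = re.fullmatch(r"(?:sm_|compute_)?(\d+[a-z]?)", xmake_arch.strip())
    if match is None:
        raise ValueError(f"unrecognised CUDA arch: {xmake_arch!r}")
    return match.group(1)


def cmake_configure_args(xmake_arch: Optional[str]) -> List[str]:
    """Build the extra configure argument for the external CMake project."""
    # Hypothetical policy: with no arch configured, pass nothing and let
    # the external build fall back to its own default.
    if not xmake_arch:
        return []
    return [f"-DCMAKE_CUDA_ARCHITECTURES={cmake_cuda_arch(xmake_arch)}"]


print(cmake_configure_args("sm_80"))
```

Passing the value explicitly keeps the external build on the architecture the outer build was configured for, instead of relying on the external project's default.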

Motivation

The new InfiniOps InfiniLM paged attention decode path uses a split-KV fast path for head size 128 when sufficient workspace is provided. InfiniCore currently calls the operator without workspace, so InfiniOps falls back to the slower decode path. In addition, the external InfiniOps CMake build defaults to sm_75 unless CMAKE_CUDA_ARCHITECTURES is set, even when InfiniCore is configured with --cuda_arch=sm_80.
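
The workspace-dependent dispatch described above can be modelled in a few lines. This is a minimal Python sketch of the decision only, not the InfiniOps implementation: the names `DecodeCall`, `required_workspace_bytes`, and `choose_decode_path` are hypothetical, and the sizing formula is an assumed, typical split-KV layout (one partial output per split plus softmax statistics), not the formula InfiniOps uses.

```python
from dataclasses import dataclass

SPLIT_KV_HEAD_SIZE = 128  # head size the PR names for the fast path


@dataclass
class DecodeCall:
    head_size: int
    workspace_bytes: int  # 0 models a call made without workspace


def required_workspace_bytes(num_seqs, num_heads, head_size, num_splits,
                             elem_bytes=4):
    # Assumed layout: each split writes a partial output vector plus
    # two softmax statistics (running max and sum) per sequence and head.
    partial_out = num_seqs * num_heads * num_splits * head_size * elem_bytes
    softmax_stats = 2 * num_seqs * num_heads * num_splits * elem_bytes
    return partial_out + softmax_stats


def choose_decode_path(call, needed_bytes):
    # The fast path needs both the supported head size and enough workspace.
    if call.head_size == SPLIT_KV_HEAD_SIZE and call.workspace_bytes >= needed_bytes:
        return "split_kv"
    return "fallback"


needed = required_workspace_bytes(num_seqs=1, num_heads=32, head_size=128,
                                  num_splits=4)
print(choose_decode_path(DecodeCall(128, 0), needed))       # no workspace passed
print(choose_decode_path(DecodeCall(128, needed), needed))  # workspace allocated
```

In this model a caller that never allocates workspace always lands on the fallback branch, which is the gap the PR closes on the InfiniCore side.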

Together, these two integration gaps explain the large performance loss seen after switching paged attention to InfiniOps.

This PR is the InfiniCore integration side. The InfiniOps split-count tuning is tracked separately in InfiniTensor/InfiniOps#741. After that PR lands, InfiniCore will also need the submodules/InfiniOps pointer bumped to a commit that contains it to reproduce the final benchmark numbers from a fresh checkout.

Validation

On nvidia-1 (over ssh), with BASE set to /tmp/codex-infinicore-perf-nvidia1-20260623-1703:

GPU_ID=0 KEEP_BUILD=0 BASE=/tmp/codex-infinicore-perf-nvidia1-20260623-1703 \
  bash scripts/docker_build_core.sh ops true

infinicore import ok
CMAKE_CUDA_ARCHITECTURES:UNINITIALIZED=80
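
The `CMAKE_CUDA_ARCHITECTURES:UNINITIALIZED=80` line has the `NAME:TYPE=VALUE` shape of a CMake cache entry; `UNINITIALIZED` is the type CMake records when a value is passed with `-D` and no explicit type. A check like this could be scripted as below. This is a sketch: the helper name `read_cache_entry` is hypothetical, and the location of the cache file inside the build tree is not given in this PR, so the example parses an in-memory string.

```python
def read_cache_entry(cache_text, name):
    """Return (type, value) for a NAME:TYPE=VALUE cache line, or None if absent."""
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blanks and the '#' / '//' comment lines CMake writes in the cache.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        entry_name, _, entry_type = key.partition(":")
        if entry_name == name:
            return entry_type, value
    return None


sample = """\
// Example cache content (illustrative)
CMAKE_BUILD_TYPE:STRING=Release
CMAKE_CUDA_ARCHITECTURES:UNINITIALIZED=80
"""
print(read_cache_entry(sample, "CMAKE_CUDA_ARCHITECTURES"))
```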

After rebuilding InfiniLM against the rebuilt InfiniCore root:

GPU_ID=0 BASE=/tmp/codex-infinicore-perf-nvidia1-20260623-1703 \
  bash scripts/docker_build_lm.sh ops

imports ok

Smoke benchmark:

python examples/bench.py --device nvidia \
  --model=/data-aisoft/mechdancer/models/9g_8b_thinking_llama \
  --enable-paged-attn --enable-graph \
  --input-len=32,32 --output-len=256 --batch-size=1

Decode throughput: 85.71 tok/s, 85.86 tok/s

Formatting/checks:

clang-format-18 --dry-run -Werror src/infinicore/ops/paged_attention/paged_attention_infiniops.cc
git diff --check

Both passed.

ruff format --check . and ruff check . still fail on existing Python files unrelated to this PR; no Python files are changed here.

Performance Notes

With this PR plus InfiniTensor/InfiniOps#741 and a local submodule checkout containing #741, the full requested InfiniLM benchmark command group completed on nvidia-1. Decode throughput deltas versus the no-InfiniOps baseline:

| Case | Avg | Warm Avg |
| --- | --- | --- |
| 8B bs1 | +7.83% | +8.33% |
| 8B bs4 | +2.62% | +5.35% |
| 8B bs16 | -2.66% | +0.93% |
| 8B bs64 | -1.64% | +1.33% |
| 70B bs1 | +9.81% | +10.46% |
| 70B bs4 | +8.24% | +8.78% |
| 70B bs16 | -0.71% | -0.66% |

The large negative first-sample outliers in some 8B batches are cold-start/first graph-capture effects; the duplicated warm cases return to near parity or better.
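
For readers reproducing the table, the two columns could be computed roughly as follows. This sketch assumes that "Avg" averages every sample of a case and "Warm Avg" drops the first (cold) sample before averaging; the PR does not spell out the exact method, and the throughput numbers in the example are made up for illustration, not measurements.

```python
def mean(values):
    return sum(values) / len(values)


def pct_delta(candidate, baseline):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def summarize(candidate_runs, baseline_runs):
    """Return (avg_delta, warm_avg_delta) for per-sample decode throughputs."""
    if len(candidate_runs) < 2 or len(baseline_runs) < 2:
        raise ValueError("need at least two samples to separate cold from warm")
    avg = pct_delta(mean(candidate_runs), mean(baseline_runs))
    # Assumption: the first sample of each case is the cold one
    # (first graph capture), so the warm average skips it.
    warm = pct_delta(mean(candidate_runs[1:]), mean(baseline_runs[1:]))
    return avg, warm


# Made-up tok/s values: a slow cold first sample followed by a faster warm one.
avg_delta, warm_delta = summarize([90.0, 110.0], [100.0, 100.0])
print(f"avg {avg_delta:+.2f}%  warm {warm_delta:+.2f}%")
```

A cold first sample pulls the plain average down while leaving the warm average untouched, which matches the pattern in the 8B bs16 and bs64 rows.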

voltjia requested a review from a team June 23, 2026 22:45
voltjia merged commit b17ccd9 into main Jun 24, 2026 (10 checks passed)
voltjia deleted the fix/infinops-paged-attention-integration branch June 24, 2026 01:41