
ch4/ipc/gpu: revise the IPC caching strategy #7862

Open
hzhou wants to merge 10 commits into pmodels:main from hzhou:2606_ipc_gpu

@hzhou hzhou commented Jun 30, 2026

Pull Request Description

Before this PR, we have:

  1. ch4 sender side handle cache
  2. ch4 receiver side map cache
  3. sender side specialized cache inside src/mpl/src/gpu/mpl_gpu.ze.c

The sender-side handle and the receiver-side mapping fundamentally need to be synchronized. With CUDA, a new mapping will fail if stale mappings with overlapping addresses remain. With ZE, stale cache entries on either side prevent memory release and eventually lead to device memory exhaustion.

Working with 3 separate caching facilities and managing their synchronization issues is too complex. Instead, in this new design, we use only a single sender-side cache and use explicit control messages to cache both the handle and the remote mappings, which ensures consistency.
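The single-cache idea can be pictured with a minimal sketch. It assumes a hypothetical `sender_entry_t` that stores, per destination rank, the address the receiver reported after mapping (the role the commits below give to the `MPIDI_IPC_send_mapaddr` AM), and that eviction yields the ranks that need an unmap notification (the role of `MPIDI_IPC_send_unmap`). All type and function names here are illustrative and are not taken from the PR's code.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_RANKS 4             /* hypothetical node-local rank count */

typedef enum { IPC_TYPE_HANDLE, IPC_TYPE_DIRECT } ipc_type_t;

/* Hypothetical sender-side entry: one buffer, plus what each peer mapped. */
typedef struct {
    uintptr_t local_addr;               /* base address of the sender's buffer */
    uintptr_t remote_addr[NUM_RANKS];   /* address mapped by each rank; 0 = not mapped */
} sender_entry_t;

/* Pick the path for a send: DIRECT if dest already mapped this buffer,
 * otherwise fall back to exchanging the IPC handle. */
static ipc_type_t choose_ipc_type(const sender_entry_t *e, int dest, uintptr_t *mapped)
{
    assert(dest >= 0 && dest < NUM_RANKS);
    *mapped = e->remote_addr[dest];
    return *mapped ? IPC_TYPE_DIRECT : IPC_TYPE_HANDLE;
}

/* Called when the receiver's "mapped address" control message arrives. */
static void record_mapaddr(sender_entry_t *e, int src, uintptr_t addr)
{
    assert(src >= 0 && src < NUM_RANKS);
    e->remote_addr[src] = addr;
}

/* On eviction, list the ranks that must be told to unmap and clear
 * their mappings. Returns the number of ranks written to ranks_out. */
static int collect_unmap_targets(sender_entry_t *e, int ranks_out[NUM_RANKS])
{
    int n = 0;
    for (int r = 0; r < NUM_RANKS; r++) {
        if (e->remote_addr[r]) {
            ranks_out[n++] = r;
            e->remote_addr[r] = 0;
        }
    }
    return n;
}
```

Because the sender owns the only record of who mapped what, the receiver never has to decide on its own whether a mapping is stale; it only acts on the sender's messages.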

MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE prevents the cache from hoarding device memory. Setting MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE=0 effectively disables caching. In principle, the cvar can be changed at runtime to dynamically control the caching behavior.

This PR is partially based on the work by @nmnobre in #7821

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou added 6 commits June 30, 2026 17:35
It's difficult to maintain a separate caching system inside
mpl_gpu_ze. Remove it and rely on MPIR layer caching.
The extra attr parameter is used by mpl_gpu_ze's special cache. It is
removed now.
We'll let sender side cache the mapped addresses and synchronize via
active messages.

Also remove MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE. We'll always use ipc
handle cache. To disable the cache, use
MPIR_CVAR_CH4_IPC_GPU_MAX_CACHE_ENTRIES=0.
Move all code related to ipc cache together in gpu_post.c to facilitate
refactoring and maintenance.
hzhou added 3 commits June 30, 2026 21:12
Refactor the IPC GPU handle cache from uthash to a static array with
LRU eviction (bounded by IPC_HANDLE_CACHE_MAX). Each cache entry now
tracks remote mapped addresses, enabling a DIRECT IPC path that bypasses
handle exchange and remote mapping on subsequent sends to the same rank.

Key changes:
- Replace uthash-based handle cache with a fixed-size array supporting
  LRU eviction and overlap detection for stale entries.
- Track per-rank mapped addresses in each cache entry; use them to
  switch to DIRECT ipc type on cache hits.
- Add MPIDI_IPC_send_mapaddr AM to notify senders of mapped addresses
  after receiver mapping, and MPIDI_IPC_send_unmap AM for cache eviction.
- Move handle validation into ipc_track_cache_search so callers only
  see valid entries.
- Split MPIDI_GPU_fill_ipc_handle into cached (p2p) and non-cached
  (win/coll) versions.
- Simplify handle_status enum to a bool handle_is_cached.
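As a rough illustration of the fixed-size array with LRU eviction and overlap detection described above, here is a minimal sketch. The names (`cache_t`, `cache_search`, `cache_insert`, `CACHE_MAX`) are hypothetical, the exact-match validity rule is an assumption, and the real code would additionally send an unmap AM wherever this sketch simply drops an entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_MAX 4     /* hypothetical static bound, kept tiny for illustration */

typedef struct {
    bool in_use;
    uintptr_t addr;         /* buffer base address */
    size_t len;             /* buffer length in bytes */
    uint64_t last_used;     /* logical timestamp for LRU ordering */
} cache_entry_t;

typedef struct {
    cache_entry_t entries[CACHE_MAX];
    uint64_t clock;         /* incremented on every hit or insert */
    int limit;              /* runtime limit; 0 disables caching */
} cache_t;

static bool overlaps(const cache_entry_t *e, uintptr_t addr, size_t len)
{
    return addr < e->addr + e->len && e->addr < addr + len;
}

/* Return the index of an exact match, or -1. Entries that overlap the
 * requested range without matching it are treated as stale and dropped,
 * so callers only ever see valid entries. */
static int cache_search(cache_t *c, uintptr_t addr, size_t len)
{
    int hit = -1;
    for (int i = 0; i < CACHE_MAX; i++) {
        cache_entry_t *e = &c->entries[i];
        if (!e->in_use)
            continue;
        if (e->addr == addr && e->len == len) {
            e->last_used = ++c->clock;
            hit = i;
        } else if (overlaps(e, addr, len)) {
            e->in_use = false;      /* stale: real code would also notify peers */
        }
    }
    return hit;
}

/* Insert a new entry, reusing a free slot or evicting the least recently
 * used one. Returns the slot index, or -1 when caching is disabled. */
static int cache_insert(cache_t *c, uintptr_t addr, size_t len)
{
    if (c->limit <= 0)
        return -1;
    int bound = c->limit < CACHE_MAX ? c->limit : CACHE_MAX;
    int slot = 0;
    for (int i = 0; i < bound; i++) {
        if (!c->entries[i].in_use) {
            slot = i;
            break;
        }
        if (c->entries[i].last_used < c->entries[slot].last_used)
            slot = i;
    }
    c->entries[slot] = (cache_entry_t) { true, addr, len, ++c->clock };
    return slot;
}
```

A linear scan over a small static array keeps eviction and overlap checks in one pass, which is the kind of simplification a hash table keyed on address does not offer for range overlap.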
Now that the old MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE is unused, rename
MPIR_CVAR_CH4_IPC_GPU_MAX_CACHE_ENTRIES to
MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE as the latter is more intuitive to
recall.

Set static IPC_HANDLE_CACHE_MAX to 1024 to allow more run time
experiments.
MPIDIU_get_grank is used in active message paths, and active messages
don't really require a communicator. Consider usages during init,
finalize, and potentially sessions. The comm is used in the shm active
message path only to look up the lpid via MPIDIU_get_grank. Make it work
when we already have the lpid but not comm_world.

hzhou commented Jul 1, 2026


test:mpich/ch4/ofi
test:mpich/ch4/gpu/ofi

When we clear the IPC handle cache at finalize, we send out unmap AM
messages to notify remote processes to unmap. It is not an error if the
remote processes have already exited, since the unmap is automatic at exit.

hzhou commented Jul 1, 2026


test:mpich/ch4/ofi
test:mpich/ch4/gpu/ofi
