ch4/ipc/gpu: revise the IPC caching strategy #7862
Open
hzhou wants to merge 10 commits into
Remove the dead code.
Add notes and design plans.
A separate caching system inside mpl_gpu_ze is difficult to maintain. Remove it and rely on the MPIR layer caching.
The extra attr parameter is used by mpl_gpu_ze's special cache. It is removed now.
We'll let the sender side cache the mapped addresses and synchronize via active messages. Also remove MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE. We'll always use the IPC handle cache. To disable the cache, use MPIR_CVAR_CH4_IPC_GPU_MAX_CACHE_ENTRIES=0.
Move all code related to the IPC cache together in gpu_post.c to facilitate refactoring and maintenance.
Refactor the IPC GPU handle cache from uthash to a static array with LRU eviction (bounded by IPC_HANDLE_CACHE_MAX). Each cache entry now tracks remote mapped addresses, enabling a DIRECT IPC path that bypasses handle exchange and remote mapping on subsequent sends to the same rank.

Key changes:
- Replace uthash-based handle cache with a fixed-size array supporting LRU eviction and overlap detection for stale entries.
- Track per-rank mapped addresses in each cache entry; use them to switch to DIRECT ipc type on cache hits.
- Add MPIDI_IPC_send_mapaddr AM to notify senders of mapped addresses after receiver mapping, and MPIDI_IPC_send_unmap AM for cache eviction.
- Move handle validation into ipc_track_cache_search so callers only see valid entries.
- Split MPIDI_GPU_fill_ipc_handle into cached (p2p) and non-cached (win/coll) versions.
- Simplify handle_status enum to a bool handle_is_cached.
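As an illustration of the fixed-size array with LRU eviction and overlap detection described in this commit, here is a minimal sketch. It is not the code from the PR: the names `cache_entry_t`, `cache_search`, `cache_insert`, and the capacity `CACHE_MAX` are hypothetical, and the sketch tracks only a local address range, whereas the commit says real entries also carry IPC handles and per-rank mapped addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; the commit bounds its cache by IPC_HANDLE_CACHE_MAX. */
#define CACHE_MAX 4

/* Hypothetical entry layout: only the tracked address range and an LRU stamp. */
typedef struct {
    int in_use;
    uintptr_t addr;      /* start of the tracked buffer */
    size_t len;          /* length of the tracked buffer */
    uint64_t last_used;  /* logical clock value of the last hit or insert */
} cache_entry_t;

static cache_entry_t cache[CACHE_MAX];
static uint64_t lru_clock;

static int ranges_overlap(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Return the slot of an exact match, or -1 on a miss. Entries that overlap
 * the query without matching it exactly are treated as stale and dropped,
 * so callers only ever see valid entries. */
static int cache_search(uintptr_t addr, size_t len)
{
    int hit = -1;
    for (int i = 0; i < CACHE_MAX; i++) {
        if (!cache[i].in_use)
            continue;
        if (cache[i].addr == addr && cache[i].len == len) {
            cache[i].last_used = ++lru_clock;
            hit = i;
        } else if (ranges_overlap(cache[i].addr, cache[i].len, addr, len)) {
            cache[i].in_use = 0;    /* stale: the buffer was reallocated */
        }
    }
    return hit;
}

/* Insert after a miss. Use a free slot if there is one; otherwise overwrite
 * the least recently used entry. (A real implementation would notify the
 * remote side before reusing the slot.) */
static int cache_insert(uintptr_t addr, size_t len)
{
    int slot = -1;
    for (int i = 0; i < CACHE_MAX; i++) {
        if (!cache[i].in_use) {
            slot = i;
            break;
        }
        if (slot < 0 || cache[i].last_used < cache[slot].last_used)
            slot = i;
    }
    assert(slot >= 0);
    cache[slot].in_use = 1;
    cache[slot].addr = addr;
    cache[slot].len = len;
    cache[slot].last_used = ++lru_clock;
    return slot;
}
```

A caller would try `cache_search` first and call `cache_insert` only on a miss. A linear scan is a reasonable trade-off for a small bounded array, since the same pass can both find the match and invalidate overlapping entries.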
Now that the old MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE is unused, rename MPIR_CVAR_CH4_IPC_GPU_MAX_CACHE_ENTRIES to MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE as the latter is more intuitive to recall. Set static IPC_HANDLE_CACHE_MAX to 1024 to allow more run time experiments.
MPIDIU_get_grank is used in active message paths, and active messages don't really require a communicator. Consider usages during init, finalize, and potentially sessions. The comm is used in the shm active message path only to look up the lpid via MPIDIU_get_grank. Make it work when we already have the lpid but not comm_world.
test:mpich/ch4/ofi
When we clear the IPC handle cache at finalize, we send out unmap AM messages to notify remote processes to unmap. It is not an error if the remote processes have already exited, since the unmap is automatic at exit.
test:mpich/ch4/ofi
Pull Request Description
Before this PR, we have:
src/mpl/src/gpu/mpl_gpu.ze.c
The sender-side handle and the receiver-side mapping fundamentally need to be synchronized. With CUDA, a new mapping will fail if stale entries with overlapping addresses remain. With ZE, stale cache entries on either side prevent memory release and eventually lead to device memory exhaustion.
Working with 3 separate caching facilities and managing their synchronization issues is too complex. Instead, the new design uses a single sender-side cache and explicit control messages to cache both the handle and the remote mappings, which ensures consistency.
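The single sender-side cache kept consistent by control messages can be pictured with a small single-process sketch. It is only an illustration under stated assumptions: every name here (`sender_entry_t`, `receiver_state_t`, `sender_send`, `sender_evict`, and the `IPC_TYPE_*` values) is hypothetical, plain function calls stand in for the mapaddr and unmap active messages, and the "mapping" is a made-up address offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two send paths. */
typedef enum { IPC_TYPE_HANDLE, IPC_TYPE_DIRECT } ipc_type_t;

/* Sender-side cache entry (one slot only, to keep the sketch short). */
typedef struct {
    int valid;
    uintptr_t local_addr;
    uintptr_t remote_mapped;  /* 0 until the receiver reports its mapping */
} sender_entry_t;

/* Receiver-side mapping state, owned by the receiver. */
typedef struct {
    int mapped;
    uintptr_t mapped_addr;
} receiver_state_t;

/* Receiver maps the buffer when a handle arrives and reports the mapped
 * address back; the return value plays the role of the mapaddr message. */
static uintptr_t receiver_on_handle(receiver_state_t *r, uintptr_t sender_addr)
{
    r->mapped = 1;
    r->mapped_addr = sender_addr + 0x100000;  /* made-up mapping */
    return r->mapped_addr;
}

/* Receiver unmaps when told to; this plays the role of the unmap message. */
static void receiver_on_unmap(receiver_state_t *r)
{
    assert(r->mapped);  /* an unmap should only follow a mapping */
    r->mapped = 0;
    r->mapped_addr = 0;
}

/* Evicting on the sender also tells the receiver to unmap, so the two
 * sides cannot drift apart. */
static void sender_evict(sender_entry_t *e, receiver_state_t *r)
{
    if (e->valid && e->remote_mapped != 0)
        receiver_on_unmap(r);
    e->valid = 0;
    e->remote_mapped = 0;
}

/* First send of a buffer goes through the handle path and records the
 * remote mapping; later sends of the same buffer take the direct path. */
static ipc_type_t sender_send(sender_entry_t *e, receiver_state_t *r, uintptr_t addr)
{
    if (e->valid && e->local_addr == addr && e->remote_mapped != 0)
        return IPC_TYPE_DIRECT;
    sender_evict(e, r);  /* drop any entry for a different buffer */
    e->valid = 1;
    e->local_addr = addr;
    e->remote_mapped = receiver_on_handle(r, addr);
    return IPC_TYPE_HANDLE;
}
```

Because the receiver only maps and unmaps in response to the sender, the sender's entry is the single record of what is mapped remotely. That is the property the description relies on to avoid stale entries on either side.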
MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE prevents the cache from hoarding device memory. Setting MPIR_CVAR_CH4_IPC_GPU_CACHE_SIZE=0 effectively disables the caching. In principle, the cvar can be used at runtime to dynamically control the caching behavior.

This PR is partially based on the work by @nmnobre in #7821.
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.