Implement/migrate Arena allocators to cuda plugin ep. #27931
yuslepukhin merged 38 commits into `main`
Conversation
Pull request overview
Adds a design document describing how to integrate arena-backed allocators (BFCArena + CUDA mempool arena) into the CUDA Plugin Execution Provider to close the performance gap vs the in-tree CUDA EP.
Changes:
- Document current allocator behavior differences between in-tree CUDA EP vs CUDA plugin EP.
- Propose an ORT-core-side BFCArena wrapping approach for plugin-provided allocators (shared + per-session paths).
- Outline a plan to migrate `CudaMempoolArena` into the plugin build, including logging and `OrtAllocator` wrapper considerations.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
tianleiwu left a comment
1. Problem Statement & Arena Mode Taxonomy (Sections 1–2)
Positive:
- Clear side-by-side table showing the in-tree vs. plugin allocator gap. This immediately justifies the work.
- The three arena modes (Default BFC, CUDA Mempool, No Arena) are accurately described. Correctly noting that `DisableCpuMemArena()` only affects the CPU EP — not the CUDA EP — prevents a common misconception.
2. Part A — BFCArena Integration via Option B (Section 3)
Positive:
- Option B (wrapping at the two ORT core entry points) is clearly superior to Option A (wrapping at all callers). The comparison matrix in Section 3.6 makes the tradeoff transparent.
- The allocator lifecycle diagrams for both Path 1 (shared) and Path 2 (per-session) are accurate when compared against the actual code in `environment.cc` (lines 567–582) and `ep_plugin_provider_interfaces.cc` (lines 666–705).
- The observation that `allocator_options` is always `nullptr` today in both paths is confirmed by the code. This is the correct gap to fix.
- The recommendation to extract a shared `MaybeWrapInArena(...)` helper (Section 3.7, Decided 4) is sound — the two sites need identical logic and must stay in sync.
Concern:
- ⚠️ `CreateAllocator(AllocatorCreationInfo)` requires an `AllocatorFactory` function, not an existing `IAllocator`: The design says to "wrap the returned `IAllocator` in BFCArena via `CreateAllocator(AllocatorCreationInfo{...})`." However, `AllocatorCreationInfo` takes a `device_alloc_factory` (a `std::function<std::unique_ptr<IAllocator>(OrtDevice::DeviceId)>`) — it constructs a new allocator, not wrapping an existing one. The proposed `MaybeWrapInArena` helper will need to either (a) directly construct `BFCArena`/`StreamAwareBFCArena` from the existing `IAllocator` (bypassing `CreateAllocator`), or (b) wrap the existing allocator in a factory lambda. The design should clarify which approach is intended, since option (a) duplicates the config-resolution logic in `CreateAllocator`, and option (b) has an awkward move-into-lambda lifecycle. Suggested clarification: construct `BFCArena`/`StreamAwareBFCArena` directly, reusing the config-resolution logic from `CreateAllocator` inside the helper, or factor the config-resolution out into a separate function.
- ⚠️ Stream-awareness determination assumes memory info alone is sufficient: Decided 1 says the helper determines stream-awareness from `OrtMemoryInfo` — "if the memory is on a GPU device, create `StreamAwareBFCArena`; if it is host-accessible (pinned), create `BFCArena`." But `OrtMemoryInfo` doesn't have an explicit "is GPU vs. pinned" flag. The code would need to compare `OrtDevice::Type()` (GPU) and `OrtDevice::MemType()` (DEFAULT vs. CUDA_PINNED). This is workable, but the design should be explicit about which fields are used so the helper doesn't silently pick the wrong arena type for future EPs with different device topologies (e.g., a unified-memory accelerator).
- ⚠️ `RegisterExecutionProviderLibrary` shared allocators may be created too early for meaningful arena config: The design acknowledges this in Section 3.3 — shared allocator creation happens at environment level before any session. The proposed fix requires the user to put arena config in `CreateEnvWithOptions` config entries with `ep_factory.<name>.arena.*` keys. This is a new usage pattern that doesn't exist today. The document should call out that this is a user-facing API change requiring documentation, even though it reuses the existing `OrtEnvCreationOptions::config_entries` mechanism. If no env-level arena config is provided, the fallback to the default `OrtArenaCfg{0, -1, -1, -1, -1, -1L}` means all plugin EPs that register shared allocators will get BFCArena wrapping by default. Confirm this is intentional — today plugin EPs get raw allocators, and some may rely on that behavior.
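To make the option (b) lifecycle concern concrete, here is a minimal sketch using stand-in types (`MiniAllocator`, `MiniArena`, `MakeFactoryFromExisting` are all hypothetical names, not ORT API; the real helper would deal with `IAllocator` and `BFCArena`). The point it illustrates is the move-into-lambda awkwardness: `std::function` requires a copyable callable, so a moved-in `std::unique_ptr` has to be promoted to `std::shared_ptr` before capture.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-ins for IAllocator / BFCArena to keep the sketch self-contained.
struct MiniAllocator {
  virtual ~MiniAllocator() = default;
};

struct MiniArena : MiniAllocator {
  // The arena owns the underlying device allocator it sub-allocates from.
  explicit MiniArena(std::shared_ptr<MiniAllocator> upstream)
      : upstream_(std::move(upstream)) {}
  std::shared_ptr<MiniAllocator> upstream_;
};

using MiniFactory = std::function<std::unique_ptr<MiniAllocator>()>;

// Option (b): wrap an *existing* allocator in a factory callable.
// std::function must be copyable, so the unique_ptr is promoted to shared_ptr
// before capture -- this is the "awkward move-into-lambda lifecycle".
MiniFactory MakeFactoryFromExisting(std::unique_ptr<MiniAllocator> existing) {
  std::shared_ptr<MiniAllocator> shared = std::move(existing);
  return [shared]() -> std::unique_ptr<MiniAllocator> {
    // Calling this factory twice would alias the same upstream allocator,
    // which a real helper must guard against (e.g., assert single use).
    return std::make_unique<MiniArena>(shared);
  };
}
```

This is only a shape sketch; the real decision between option (a) and (b) depends on how much of `CreateAllocator`'s config resolution can be factored out.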
3. Arena-Already-Handled Signal — Name Comparison (Section 3.5)
Positive:
- Listing four candidate signal mechanisms and evaluating their tradeoffs is thorough.
- Option (c) (allowing `OrtArenaAllocator` type) is correctly rejected — it reverses an explicitly enforced invariant in both `CreateSharedAllocatorImpl` and `CreatePreferredAllocators`.
Concern:
- ⚠️ Option (d) — name comparison — is fragile and has an inversion risk: The signal is "if the returned allocator's memory info name matches the `OrtEpDevice`'s memory info name, it's a raw allocator → wrap in BFCArena; otherwise it's a self-contained arena → skip." This means:
  - Any future factory that intentionally returns an allocator with a different name (e.g., for a debug allocator, a profiling wrapper, or a sub-device allocator) would be silently treated as "already arena-wrapped" and never get BFCArena.
  - Conversely, if a mempool implementation accidentally uses the same name as the device's baseline, it would get double-wrapped.
  - The convention is implicit — there's no compile-time or runtime assertion that the factory intentionally chose a different name.

  Consider adding a comment/requirement in the EP factory API documentation that the name mismatch is the arena signal. Better yet, add an explicit `bool is_arena` field to `OrtAllocator` or a new optional API `OrtAllocator::GetTraits()` that can be checked. This would be a small, forward-compatible API addition that prevents silent misclassification.

  Alternate approach: use option (c) but gate it differently — instead of rejecting `OrtArenaAllocator` outright, change the two call sites to skip BFCArena wrapping when the type is `OrtArenaAllocator` instead of failing. This is more explicit and less fragile than name comparison. The check is already there in both functions; it just needs to branch differently.
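The branch-instead-of-fail alternative amounts to a two-way decision on the reported allocator type. A minimal sketch with stand-in names (`AllocatorType`, `WrapDecision`, `DecideWrapping` are hypothetical; the real code would test the ORT C API's `OrtAllocatorType` at the two call sites):

```cpp
#include <cassert>

// Stand-in for OrtAllocatorType; values mirror the device/arena distinction.
enum class AllocatorType { kDevice, kArena };

enum class WrapDecision { kWrapInBfcArena, kSkipAlreadyArena };

// Current behavior at the two call sites: an arena-typed allocator is rejected
// with an error. Proposed behavior: treat it as "arena already handled" and
// skip wrapping, instead of inferring this from a memory-info name mismatch.
WrapDecision DecideWrapping(AllocatorType reported_type) {
  return reported_type == AllocatorType::kArena
             ? WrapDecision::kSkipAlreadyArena
             : WrapDecision::kWrapInBfcArena;
}
```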
4. Config Conflict & Prefix Mismatch (Sections 3.7–3.8)
Positive:
- The "conflict blindness" analysis is honest and correct. The EP only sees one config source at each lifecycle point, so conflicts are inherently undetectable.
- The acceptance rationale — shared and per-session allocators are independent and don't compete — is valid for the arena use case.
Concern:
⚠️ Prefix mismatch (`ep_factory.cuda.*` vs `ep.cudapluginexecutionprovider.*`) is user-hostile: The document correctly identifies this as "a guaranteed source of user confusion" but defers to "documentation and examples." This is the kind of inconsistency that causes support burden. Consider adding a validation helper at environment creation time that warns if arena keys appear under only one prefix but not the other, or if the registration name vs. EP type name can be mapped automatically. Even a log-level warning would significantly reduce debugging time.
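The suggested validation pass could look roughly like this (a sketch; `FindMismatchedArenaKeys` is a hypothetical helper, and the prefixes follow the document's examples): collect the `arena.*` suffixes seen under each prefix and report suffixes present under one but not the other.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical validation pass run at environment creation: a user who set
// ep_factory.cuda.arena.max_mem but put extend_strategy only under the
// ep.cudapluginexecutionprovider.* prefix gets a warning instead of silence.
std::vector<std::string> FindMismatchedArenaKeys(
    const std::set<std::string>& config_keys,
    const std::string& factory_prefix,    // e.g. "ep_factory.cuda.arena."
    const std::string& session_prefix) {  // e.g. "ep.cudapluginexecutionprovider.arena."
  auto suffixes_under = [&](const std::string& prefix) {
    std::set<std::string> out;
    for (const auto& key : config_keys) {
      if (key.rfind(prefix, 0) == 0) out.insert(key.substr(prefix.size()));
    }
    return out;
  };
  const auto factory_keys = suffixes_under(factory_prefix);
  const auto session_keys = suffixes_under(session_prefix);

  std::vector<std::string> mismatched;
  for (const auto& s : factory_keys)
    if (!session_keys.count(s)) mismatched.push_back(factory_prefix + s);
  for (const auto& s : session_keys)
    if (!factory_keys.count(s)) mismatched.push_back(session_prefix + s);
  return mismatched;  // caller emits a log-level warning per entry
}
```

Note this would warn on legitimately asymmetric configs too, so a log-level warning (not an error) is the right severity.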
5. Part B — CudaMempoolArena Migration (Section 4)
Positive:
- The dependency audit table in Section 4.1 is precise. The Logger dependency is correctly identified as the primary blocker — all other dependencies are plugin-safe.
- Option 1 (conditional logger with macro bridge) is the right recommendation. The 6 logging sites genuinely aid debugging and shouldn't be stripped.
- The `CudaMempoolOrtAllocator` wrapper design is clean and follows the same pattern as the existing `CudaDeviceAllocator`/`CudaPinnedAllocator`.
Concern:
- ⚠️ `CudaMempoolOrtAllocator` must report `OrtDeviceAllocator`, not `OrtArenaAllocator` — but Section 3.5 relies on a name difference to detect it as an arena: These two constraints are consistent in the current design (Section 4.4 says "alloc_type must be `OrtDeviceAllocator`" and Section 3.5 uses name comparison to detect arenas). But Section 4.4's phrasing "ORT must not double-wrap it" could confuse a reader into thinking `OrtArenaAllocator` is the mechanism — add a forward reference to Section 3.5's decided signal mechanism.
- ⚠️ `OrtSyncStream*` → `cudaStream_t` in `AllocOnStreamImpl`: The design mentions `OrtEpApi::SyncStream_GetHandle()` for resolving stream handles. Confirm that this API is available from within the plugin factory's allocator context (i.e., that the `OrtEpApi` function table is accessible when `AllocOnStream` is called, not just during factory initialization). In the current plugin architecture, `OrtEpApi` is typically stored on the factory. The allocator wrapper would need a back-pointer to the factory or a cached function pointer.
- ⚠️ `#ifdef BUILD_CUDA_EP_AS_PLUGIN` in shared source — maintenance cost: Introducing build-conditional code in `cuda_mempool_arena.h/cc` adds maintenance risk. Every future change to the logging in this file must consider both branches. Consider whether a thin `ICudaMempoolLogger` interface (virtual, two impls) would be cleaner than preprocessor branching, at the cost of one virtual dispatch per log call (acceptable — logging is inherently slow-path).
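A sketch of the suggested seam (`ICudaMempoolLogger` is the reviewer's proposed name; the impl names and the stderr forwarding are hypothetical — the in-tree impl would forward to ORT's logger macros, and the plugin impl could bridge to `OrtLogger` or stay a no-op):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Thin logging seam: cuda_mempool_arena.cc calls only this interface, so the
// in-tree build and the plugin build differ in which impl they inject, not in
// preprocessor branches scattered through the allocator code.
class ICudaMempoolLogger {
 public:
  virtual ~ICudaMempoolLogger() = default;
  virtual void Info(const std::string& msg) = 0;
};

// In-tree impl would forward to the ORT logger; sketched here as a stream sink.
class StreamLogger : public ICudaMempoolLogger {
 public:
  explicit StreamLogger(std::ostream& os) : os_(os) {}
  void Info(const std::string& msg) override { os_ << "[mempool] " << msg << '\n'; }
 private:
  std::ostream& os_;
};

// Plugin build can start with a no-op impl until the OrtLogger bridge exists.
class NullLogger : public ICudaMempoolLogger {
 public:
  void Info(const std::string&) override {}
};
```

With 6 log sites on the slow path, one virtual dispatch per call is a negligible cost compared with maintaining two preprocessor branches.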
6. Phased Plan (Section 5)
Positive:
- The three-phase plan is well-sequenced — Phase 1 (BFCArena infrastructure) is a prerequisite for Phase 2 (CudaMempoolArena), and Phase 3 (validation) ensures parity.
- Phase 3's checklist items are the right ones: mode selection, benchmarking, `DisableCpuMemArena` isolation, and shared-allocator replacement.
Concern:
- ⚠️ Missing: rollback/fallback plan if BFCArena wrapping causes regressions: Since Phase 1 changes ORT core behavior for all plugin EPs (shared allocators get BFCArena by default), there should be a session option or environment config key to disable arena wrapping for debugging/rollback — something like `ep_factory.<name>.arena.enable=0` that the `MaybeWrapInArena` helper checks before wrapping.
- ⚠️ Missing: thread-safety of shared allocator arena wrapping: `RegisterExecutionProviderLibrary` can be called from multiple threads (the environment is shared). BFCArena's construction is not inherently thread-safe if multiple sessions start simultaneously. The existing `shared_allocators_` access is presumably mutex-protected, but the design should confirm that arena wrapping doesn't introduce additional synchronization requirements (e.g., that BFCArena's internal state initialization is thread-safe if the allocator is shared across sessions).
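One way to make the once-only wrapping race-free is a `std::call_once` guard on the slot holding the shared allocator. A sketch with stand-in types (`SharedAllocatorSlot`, `MiniAllocator`, `MiniArena` are hypothetical; real code would live behind the environment's existing mutex):

```cpp
#include <cassert>
#include <memory>
#include <mutex>

struct MiniAllocator { virtual ~MiniAllocator() = default; };
struct MiniArena : MiniAllocator {
  explicit MiniArena(std::shared_ptr<MiniAllocator> up) : up_(std::move(up)) {}
  std::shared_ptr<MiniAllocator> up_;
};

// Hypothetical slot in the environment's shared-allocator table. std::call_once
// guarantees the arena wrapper is constructed exactly once even if two threads
// race through RegisterExecutionProviderLibrary / session creation.
class SharedAllocatorSlot {
 public:
  explicit SharedAllocatorSlot(std::shared_ptr<MiniAllocator> raw)
      : raw_(std::move(raw)) {}

  std::shared_ptr<MiniAllocator> GetOrWrap() {
    std::call_once(wrap_once_, [this] {
      wrapped_ = std::make_shared<MiniArena>(raw_);
    });
    return wrapped_;
  }

 private:
  std::shared_ptr<MiniAllocator> raw_;
  std::shared_ptr<MiniAllocator> wrapped_;
  std::once_flag wrap_once_;
};
```

This covers construction only; BFCArena's own `Alloc`/`Free` paths are internally locked, so sharing the wrapped instance across sessions should then be safe — the point the design needs to confirm.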
7. Decisions & Open Questions (Section 6)
Positive:
- The four "Decided" items are clear and well-reasoned.
- Decided 2's pinned-allocator exception (always default config, never user-configurable) matches the in-tree EP behavior and avoids unnecessary complexity.
- The "Needs validation" note about `max_mem=0` upfront allocation behavior is a critical item — BFCArena with `DEFAULT_MAX_MEM = std::numeric_limits<size_t>::max()` doesn't pre-allocate at construction (it grows on demand starting with `DEFAULT_INITIAL_CHUNK_SIZE_BYTES = 1MB`), so this should be safe. But worth confirming.
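The grow-on-demand claim can be checked with the doubling arithmetic alone (a sketch, not BFCArena code; only the 1MB starting constant is taken from the note above):

```cpp
#include <cassert>
#include <cstddef>

// With max_mem left at DEFAULT_MAX_MEM, nothing is reserved up front; each
// extension doubles the next chunk starting from 1MB until the request fits.
constexpr size_t kDefaultInitialChunkBytes = 1 << 20;  // 1MB

// Chunk size a power-of-two-doubling arena would allocate for the *first*
// request of `bytes`, having allocated nothing at construction.
size_t FirstChunkFor(size_t bytes) {
  size_t chunk = kDefaultInitialChunkBytes;
  while (chunk < bytes) chunk *= 2;  // real code must overflow-check here
  return chunk;
}
```

So under these assumptions, constructing the arena costs nothing, and the first allocation reserves only the rounded-up request — consistent with "should be safe."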
Concern:
⚠️ Missing open question: what happens to existing plugin EP instances that call `CreateSharedAllocator` after `RegisterExecutionProviderLibrary`? The public API `OrtApi::CreateSharedAllocator` (not from `RegisterExecutionProviderLibrary`) also flows through `CreateSharedAllocatorImpl`. If arena wrapping is added there, user-created shared allocators would also get wrapped. Is this intended? The helper should only wrap when the allocator comes from a plugin factory's `CreateAllocator`, not when the user provides a custom allocator via the public API.
Summary of Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | Arena Signal (§3.5) | Name-based arena detection is fragile; consider OrtArenaAllocator type as a skip-wrapping signal or an explicit trait |
| 2 | High | Phase 1 Rollback | No fallback mechanism to disable BFCArena wrapping if it causes regressions for plugin EPs |
| 3 | Suggestion | BFCArena Wrapping (§3.2) | CreateAllocator(AllocatorCreationInfo) takes a factory, not an existing allocator — design should clarify the wrapping mechanics |
| 4 | Suggestion | Shared Allocator Default (§3.2) | All plugin EPs get BFCArena by default — confirm this is intentional and won't break existing plugins |
| 5 | Suggestion | Config Prefix (§3.8) | Add validation/warning for mismatched ep_factory.* vs ep.* arena keys |
| 6 | Suggestion | Stream Handle (§4.4) | Confirm OrtEpApi::SyncStream_GetHandle() accessibility in allocator callback context |
| 7 | Suggestion | Public API (§6) | Distinguish plugin-factory allocators from user-provided allocators in CreateSharedAllocatorImpl to avoid unintended wrapping |
| 8 | Nitpick | Logger (§4.3) | Consider virtual ICudaMempoolLogger interface over #ifdef branching for maintainability |
| 9 | Nitpick | Stream-awareness (§3.2, Decided 1) | Document which OrtMemoryInfo/OrtDevice fields determine GPU-vs-pinned for the stream-awareness decision |
Verdict
COMMENT — The design is sound and Option B is the right architectural choice. The two high-priority items (fragile name-based arena signal and missing rollback mechanism) should be resolved before implementation begins. The remaining suggestions are refinements that can be addressed during implementation PRs.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
tianleiwu left a comment
Review Summary
This PR implements a complete arena allocator system for the CUDA plugin EP, closing the allocation-performance gap with the in-tree CUDA EP. The work is well-structured: thorough design document (12 review iterations), BFC arena adapted from the example plugin EP, native CUDA mempool allocator, per-device shared arena lifecycle with ref counting, session/environment-level config plumbing, and a new Shrink API on OrtAllocator.
Code quality is high overall — exception handling is disciplined at the C ABI boundary, overflow checks are present, and pointer stability guarantees are well-documented.
2 high-priority concerns (use-after-free risk in arena pointer access; empty non-MSVC GetEnvironmentVar branch), 3 suggestions, and 3 nitpicks below.
Positives
- `AllocGuard` RAII in `Extend()` prevents memory leaks on throw
- Overflow protection in power-of-two doubling path
- Config validation: negative-value rejection for `std::stoull`, `ORT_TRY`/`ORT_CATCH` around every parser, `IsValid()` final check
- Per-device `DeviceCacheEntry` with documented lock ordering (`device_cache_mutex_` before `arena_mutex`)
- `strtol()` with full bounds checking replaces `atoi()`
- Version-gated `Shrink`/`Reserve` C++ wrappers prevent UB on older allocators
- Comprehensive test coverage: config parsing, chunk splitting/coalescing, OOM, shrink, mempool, defensive tests
Pull request overview
Copilot reviewed 45 out of 45 changed files in this pull request and generated 2 comments.
Good progress — the arena migration architecture is sound, the per-device cache keyed by CUDA ordinal is correct, and many earlier review items have been addressed. Three blocking issues remain.
Addressed since last review
- ✅ `Extend()` memory leak — fixed with `AllocGuard` RAII
- ✅ Refcount underflow — now has `LogWarning` + early return instead of assert-only
- ✅ `provider_api_shims.cc` empty non-MSVC branch — fixed
- ✅ Use-after-free in stream plugin — fixed via `ResetDeviceArenaChunksUsingStream` holding `arena_mutex`
- ✅ `IndexFor` return type — changed to `size_t`
- ✅ `max_mem` negative value wrapping — now rejects leading `-`
- ✅ `ToKeyValuePairs` always emits stats (unconditional)
- ✅ Dead `total_allocated_` member removed from mempool allocator
- ✅ Redundant inner `kNextPowerOfTwo` check removed from `Extend`
- ✅ Lock ordering + pointer stability comments added
- ✅ Duplicate `DEFAULT_*` constants removed from `ArenaImpl`
Summary of remaining concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | `CudaArenaAllocator` / `CudaMempoolOrtAllocator` / `EXCEPTION_TO_STATUS_END` | `ORT_HANDLE_EXCEPTION` discards lambda return value — `OrtStatus*` leaks, errors silently swallowed |
| 2 | High | `cuda_arena.cc` `SplitChunk` / `ResetChunksUsingStream` | Split remainder chunks keep stale stream pointers; `ResetChunksUsingStream` only clears tracked handles |
| 3 | High | `ep_plugin_provider_interfaces.cc` | Per-session plugin arenas always wrapped as `IAllocatorImplWrapping` (not `IArenaImplWrapping`), so `ShrinkMemoryArenas` cannot find them |
| 4 | Suggestion | `cuda_ep_factory.cc` `CreateAllocatorImpl` | Session arena options silently ignored after default shared allocator is created |
| 5 | Suggestion | `cuda_arena.h` config parsing | `std::stoi` silently rejects valid >2GB chunk sizes; invalid `extend_strategy` silently defaults to power-of-two |
| 6 | Suggestion | `cuda_mempool_allocator_plugin.cc` `AllocInternal` | `ORT_THROW` on non-OOM CUDA error → `abort()` under `ORT_NO_EXCEPTIONS` |
| 7 | Suggestion | Test coverage | No tests for `CudaMempoolOrtAllocator` or multi-GPU arena isolation |
Pull request overview
Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.
tianleiwu left a comment
APPROVE — This is a high-quality, well-designed implementation that brings the CUDA plugin EP to feature parity with the in-tree EP's arena allocation. The ORT core changes (AsArena(), IArenaImplWrappingOrtAllocator, SafeArenaCast simplification) are clean improvements that benefit the entire plugin EP ecosystem. The concerns are all suggestions for additional documentation/comments rather than correctness issues. The test coverage is excellent.
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Suggestion | `IArenaImplWrappingOrtAllocator` | `AllocOnStream` casts `Stream*` → `OrtSyncStream*` without precondition comment |
| 2 | Suggestion | `ArenaImpl::SplitChunk` | Stream-cleared remainder chunks may prevent coalescing, causing fragmentation over many runs |
| 3 | Suggestion | `CudaMempoolOrtAllocator::AllocImpl` | Default-stream semantics with per-thread streams could cause cross-thread ordering issues (caller responsibility, but worth documenting) |
| 4 | Suggestion | `CudaMempoolOrtAllocator` destructor | Flow between `cudaFreeAsync` loop → `SyncAllKnownStreams` → clear is correct but could use a clarifying comment |
| 5 | Nitpick | `CudaEpFactory` | Lock ordering between `CreateAllocatorImpl` and `ReleaseAllocatorImpl` is subtly different — correct but deserves a cross-reference comment |
This pull request makes several improvements to the CUDA plugin Execution Provider (EP) in ONNX Runtime, with a major focus on integrating advanced memory arena support, improving CMake build logic, and updating documentation to reflect these enhancements. The most important changes are summarized below.

**Memory Arena Integration and Documentation:**
* Added a full BFC-style arena (`CudaArenaAllocator` / `ArenaImpl`) and a CUDA native mempool allocator (`CudaMempoolOrtAllocator`) to the plugin, with detailed documentation of their design, integration with the ONNX Runtime core, and file layout. This enables stream-aware, shrinkable memory allocation for the CUDA plugin EP, matching ORT core capabilities.
* Updated the file layout and design documentation to reflect the new arena and allocator classes, and documented how plugin allocators are now visible to session-level arena management via the standard `IArena` interface.
* Removed the previous "future work" section about memory arena parity, as it is now implemented, and updated the "future work" section to focus on remaining tasks such as profiling, stream/adapter parity, and kernel registration validation.

**CMake Build Improvements:**
* Refactored force-include logic for adapter headers in `cmake/onnxruntime_providers_cuda_plugin.cmake` to use generator expressions, removing redundant MSVC/GCC/Clang blocks and simplifying the build configuration.
* Mirrored the source directory structure in the Visual Studio solution for better organization, and set the solution folder for the CUDA plugin target. [[1]](diffhunk://#diff-38eb6ad4f3ce15c7a2395d8eb4edcf95f415e3e557eae50da2590e9e0bbccf8fR114-R117) [[2]](diffhunk://#diff-38eb6ad4f3ce15c7a2395d8eb4edcf95f415e3e557eae50da2590e9e0bbccf8fL290-R294)
* Added compile options to force `/permissive` and include `iso646.h` for MSVC builds, ensuring compatibility with CUTLASS headers and alternative C++ tokens.
* Updated unit test CMake to include plugin-specific CUDA test sources when the plugin is enabled.

**Documentation Fixes:**
* Fixed heading levels and updated descriptions for device synchronization and NHWC layout transformation support in the CUDA plugin EP design document. [[1]](diffhunk://#diff-7bfdbe8f16ed58d5800f488aeb1f7882d96827ea3a2c429b58a743811f9d371cL273-R273) [[2]](diffhunk://#diff-7bfdbe8f16ed58d5800f488aeb1f7882d96827ea3a2c429b58a743811f9d371cL324-R324)
* Clarified which files have been replaced or refactored in the plugin (e.g., `cuda_mempool_arena.cc` is now replaced by plugin-native arena allocator files).

These changes collectively improve the maintainability, feature parity, and documentation quality of the CUDA plugin EP in ONNX Runtime.
…cherry picks (#28035)

### Description
FF release cut to last merge prior to version bump. This also cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| 4e1c42e | #27931 | Implement/migrate Arena allocators to cuda plugin ep. |
| fbba40a | #27786 | Model Package Support |
| c159603 | #28015 | Remove unnecessary model package test |
| 87b0643 | #27997 | centralise feed authentication for ADO pipelines |

### Motivation and Context
Minimise # of cherry picks.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: eserscor <erscor@microsoft.com>