
Implement/migrate Arena allocators to cuda plugin ep. #27931

Merged
yuslepukhin merged 38 commits into main from yuslepukhin/cuda_arena_ep
Apr 8, 2026
Conversation

@yuslepukhin
Member

@yuslepukhin yuslepukhin commented Apr 1, 2026

This pull request makes several improvements to the CUDA plugin Execution Provider (EP) in ONNX Runtime, with a major focus on integrating advanced memory arena support, improving CMake build logic, and updating documentation to reflect these enhancements. The most important changes are summarized below.

Memory Arena Integration and Documentation:

  • Added a full BFC-style arena (CudaArenaAllocator / ArenaImpl) and a CUDA native mempool allocator (CudaMempoolOrtAllocator) to the plugin, with detailed documentation of their design, integration with the ONNX Runtime core, and file layout. This enables stream-aware, shrinkable memory allocation for the CUDA plugin EP, matching ORT core capabilities.
  • Updated the file layout and design documentation to reflect the new arena and allocator classes, and documented how plugin allocators are now visible to session-level arena management via the standard IArena interface.
  • Removed the previous "future work" section about memory arena parity, as it is now implemented, and updated the "future work" section to focus on remaining tasks such as profiling, stream/adapter parity, and kernel registration validation.

CMake Build Improvements:

  • Refactored force-include logic for adapter headers in cmake/onnxruntime_providers_cuda_plugin.cmake to use generator expressions, removing redundant MSVC/GCC/Clang blocks and simplifying the build configuration.
  • Mirrored the source directory structure in the Visual Studio solution for better organization, and set the solution folder for the CUDA plugin target. [1] [2]
  • Added compile options to force /permissive and include iso646.h for MSVC builds, ensuring compatibility with CUTLASS headers and alternative C++ tokens.
  • Updated unit test CMake to include plugin-specific CUDA test sources when the plugin is enabled.

Documentation Fixes:

  • Fixed heading levels and updated descriptions for device synchronization and NHWC layout transformation support in the CUDA plugin EP design document. [1] [2]
  • Clarified which files have been replaced or refactored in the plugin (e.g., cuda_mempool_arena.cc is now replaced by plugin-native arena allocator files).

These changes collectively improve the maintainability, feature parity, and documentation quality of the CUDA plugin EP in ONNX Runtime.

Contributor

Copilot AI left a comment


Pull request overview

Adds a design document describing how to integrate arena-backed allocators (BFCArena + CUDA mempool arena) into the CUDA Plugin Execution Provider to close the performance gap vs the in-tree CUDA EP.

Changes:

  • Document current allocator behavior differences between in-tree CUDA EP vs CUDA plugin EP.
  • Propose an ORT-core-side BFCArena wrapping approach for plugin-provided allocators (shared + per-session paths).
  • Outline a plan to migrate CudaMempoolArena into the plugin build, including logging and OrtAllocator wrapper considerations.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Contributor

@tianleiwu tianleiwu left a comment


1. Problem Statement & Arena Mode Taxonomy (Sections 1–2)

Positive:

  • Clear side-by-side table showing the in-tree vs. plugin allocator gap. This immediately justifies the work.
  • The three arena modes (Default BFC, CUDA Mempool, No Arena) are accurately described. Correctly noting that DisableCpuMemArena() only affects the CPU EP — not the CUDA EP — prevents a common misconception.

2. Part A — BFCArena Integration via Option B (Section 3)

Positive:

  • Option B (wrapping at the two ORT core entry points) is clearly superior to Option A (wrapping at all callers). The comparison matrix in Section 3.6 makes the tradeoff transparent.
  • The allocator lifecycle diagrams for both Path 1 (shared) and Path 2 (per-session) are accurate when compared against the actual code in environment.cc (lines 567–582) and ep_plugin_provider_interfaces.cc (lines 666–705).
  • The observation that allocator_options is always nullptr today in both paths is confirmed by the code. This is the correct gap to fix.
  • The recommendation to extract a shared MaybeWrapInArena(...) helper (Section 3.7, Decided 4) is sound — the two sites need identical logic and must stay in sync.

Concern:

  • ⚠️ CreateAllocator(AllocatorCreationInfo) requires an AllocatorFactory function, not an existing IAllocator: The design says to "wrap the returned IAllocator in BFCArena via CreateAllocator(AllocatorCreationInfo{...})." However, AllocatorCreationInfo takes a device_alloc_factory (a std::function<std::unique_ptr<IAllocator>(OrtDevice::DeviceId)>) — it constructs a new allocator, not wrapping an existing one. The proposed MaybeWrapInArena helper will need to either (a) directly construct BFCArena/StreamAwareBFCArena from the existing IAllocator (bypassing CreateAllocator), or (b) wrap the existing allocator in a factory lambda. The design should clarify which approach is intended, since option (a) duplicates the config-resolution logic in CreateAllocator, and option (b) has an awkward move-into-lambda lifecycle.

    Suggested clarification: construct BFCArena/StreamAwareBFCArena directly, reusing the config-resolution logic from CreateAllocator inside the helper, or factor out the config-resolution into a separate function.

  • ⚠️ Stream-awareness determination assumes memory info alone is sufficient: Decided 1 says the helper determines stream-awareness from OrtMemoryInfo — "if the memory is on a GPU device, create StreamAwareBFCArena; if it is host-accessible (pinned), create BFCArena." But OrtMemoryInfo doesn't have an explicit "is GPU vs. pinned" flag. The code would need to compare OrtDevice::Type() (GPU) and OrtDevice::MemType() (DEFAULT vs. CUDA_PINNED). This is workable but the design should be explicit about which fields are used so the helper doesn't silently pick the wrong arena type for future EPs with different device topologies (e.g., a unified-memory accelerator).

  • ⚠️ RegisterExecutionProviderLibrary shared allocators may be too early for meaningful arena config: The design acknowledges this in Section 3.3 — shared allocator creation happens at environment level before any session. The proposed fix requires the user to put arena config in CreateEnvWithOptions config entries with ep_factory.<name>.arena.* keys. This is a new usage pattern that doesn't exist today. The document should call out that this is a user-facing API change requiring documentation, even though it reuses the existing OrtEnvCreationOptions::config_entries mechanism. If no env-level arena config is provided, the fallback to default OrtArenaCfg{0, -1, -1, -1, -1, -1L} means all plugin EPs that register shared allocators will get BFCArena wrapping by default. Confirm this is intentional — today plugin EPs get raw allocators and some may rely on that behavior.
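To make the wrapping mechanics concrete, here is a minimal sketch of option (b), the factory-lambda adapter, using simplified stand-in types (`IAllocator`, `ArenaAllocator`, `AllocatorFactory`, and `WrapExistingAsFactory` are illustrative names, not the actual ORT declarations):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-ins; the names and shapes are illustrative only.
struct IAllocator {
  virtual ~IAllocator() = default;
  virtual void* Alloc(size_t size) = 0;
};

struct RawDeviceAllocator : IAllocator {
  void* Alloc(size_t size) override { return ::operator new(size); }
};

// Stand-in arena that owns the device allocator it wraps, as BFCArena does.
struct ArenaAllocator : IAllocator {
  explicit ArenaAllocator(std::unique_ptr<IAllocator> upstream)
      : upstream_(std::move(upstream)) {}
  void* Alloc(size_t size) override { return upstream_->Alloc(size); }  // a real arena would pool
  std::unique_ptr<IAllocator> upstream_;
};

using AllocatorFactory = std::function<std::unique_ptr<IAllocator>(int /*device_id*/)>;

// Option (b): adapt an already-constructed allocator to the factory-shaped
// API. std::function needs a copyable callable, so the move-only unique_ptr
// has to ride inside a shared_ptr -- the awkward move-into-lambda lifecycle
// flagged above: the factory is effectively single-shot.
AllocatorFactory WrapExistingAsFactory(std::unique_ptr<IAllocator> existing) {
  auto held = std::make_shared<std::unique_ptr<IAllocator>>(std::move(existing));
  return [held](int /*device_id*/) { return std::move(*held); };
}
```

The single-shot behavior is exactly the lifecycle awkwardness noted above; option (a), constructing the arena directly from the existing allocator, avoids it at the cost of duplicating the config-resolution logic.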


3. Arena-Already-Handled Signal — Name Comparison (Section 3.5)

Positive:

  • Listing four candidate signal mechanisms and evaluating their tradeoffs is thorough.
  • Option (c) (allowing OrtArenaAllocator type) is correctly rejected — it reverses an explicitly enforced invariant in both CreateSharedAllocatorImpl and CreatePreferredAllocators.

Concern:

  • ⚠️ Option (d) — name comparison — is fragile and has an inversion risk: The signal is "if the returned allocator's memory info name matches the OrtEpDevice's memory info name, it's a raw allocator → wrap in BFCArena; otherwise it's a self-contained arena → skip." This means:

    • Any future factory that intentionally returns an allocator with a different name (e.g., for a debug allocator, a profiling wrapper, or a sub-device allocator) would be silently treated as "already arena-wrapped" and never get BFCArena.
    • Conversely, if a mempool implementation accidentally uses the same name as the device's baseline, it would get double-wrapped.
    • The convention is implicit — there's no compile-time or runtime assertion that the factory intentionally chose a different name.

    Consider adding a comment/requirement in the EP factory API documentation that the name mismatch is the arena signal. Better yet, add an explicit bool is_arena field to OrtAllocator or a new optional API OrtAllocator::GetTraits() that can be checked. This would be a small, forward-compatible API addition that prevents silent misclassification.

    Alternate approach: use option (c) but gate it differently — instead of rejecting OrtArenaAllocator outright, change the two call sites to skip BFCArena wrapping when the type is OrtArenaAllocator instead of failing. This is more explicit and less fragile than name comparison. The check is already there in both functions; it just needs to branch differently.
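The inversion risk is easy to demonstrate with a toy model contrasting the name signal against an explicit type signal (`AllocatorInfo`, `ShouldWrapByName`, and `ShouldWrapByType` are hypothetical, not ORT API):

```cpp
#include <cassert>
#include <string>

// Hypothetical allocator descriptor; fields are illustrative.
enum class AllocType { Device, Arena };

struct AllocatorInfo {
  std::string mem_info_name;  // name reported by the returned allocator
  AllocType type;
};

// Fragile signal from the design (option d): wrap only when the allocator's
// name matches the device's baseline memory info name.
bool ShouldWrapByName(const AllocatorInfo& a, const std::string& device_name) {
  return a.mem_info_name == device_name;
}

// Explicit alternative (option c, branching instead of failing): skip
// wrapping whenever the allocator declares itself an arena, name aside.
bool ShouldWrapByType(const AllocatorInfo& a) {
  return a.type != AllocType::Arena;
}
```

The test cases below show both failure modes of the name signal: a renamed debug allocator is silently never wrapped, and a mempool that reuses the baseline name would be double-wrapped.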


4. Config Conflict & Prefix Mismatch (Sections 3.7–3.8)

Positive:

  • The "conflict blindness" analysis is honest and correct. The EP only sees one config source at each lifecycle point, so conflicts are inherently undetectable.
  • The acceptance rationale — shared and per-session allocators are independent and don't compete — is valid for the arena use case.

Concern:

  • ⚠️ Prefix mismatch (ep_factory.cuda.* vs ep.cudapluginexecutionprovider.*) is user-hostile: The document correctly identifies this as "a guaranteed source of user confusion" but defers to "documentation and examples." This is the kind of inconsistency that causes support burden. Consider adding a validation helper at environment creation time that warns if arena keys appear under only one prefix but not the other, or if the registration name vs. EP type name can be mapped automatically. Even a log-level warning would significantly reduce debugging time.
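One possible shape for the suggested validation helper, assuming config entries are available as a string map (`FindLonelyArenaPrefixes` is an invented name; only the two prefixes are taken from the design):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Warn when arena keys appear under only one of the two prefixes.
// Returns the prefixes that have no keys while the sibling prefix does.
std::vector<std::string> FindLonelyArenaPrefixes(
    const std::map<std::string, std::string>& config) {
  const std::string env_prefix = "ep_factory.cuda.arena.";
  const std::string sess_prefix = "ep.cudapluginexecutionprovider.arena.";
  bool has_env = false, has_sess = false;
  for (const auto& [key, value] : config) {
    if (key.rfind(env_prefix, 0) == 0) has_env = true;    // key starts with env prefix
    if (key.rfind(sess_prefix, 0) == 0) has_sess = true;  // key starts with session prefix
  }
  std::vector<std::string> warnings;
  if (has_env != has_sess)
    warnings.push_back(has_env ? sess_prefix : env_prefix);  // the empty side
  return warnings;
}
```

Emitting a log-level warning for each returned prefix at environment creation would catch the most common misconfiguration without changing behavior.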

5. Part B — CudaMempoolArena Migration (Section 4)

Positive:

  • The dependency audit table in Section 4.1 is precise. The Logger dependency is correctly identified as the primary blocker — all other dependencies are plugin-safe.
  • Option 1 (conditional logger with macro bridge) is the right recommendation. The 6 logging sites genuinely aid debugging and shouldn't be stripped.
  • The CudaMempoolOrtAllocator wrapper design is clean and follows the same pattern as the existing CudaDeviceAllocator / CudaPinnedAllocator.

Concern:

  • ⚠️ CudaMempoolOrtAllocator must report OrtDeviceAllocator, not OrtArenaAllocator — but Section 3.5 relies on name difference to detect it as an arena: These two constraints are consistent in the current design (Section 4.4 says "alloc_type must be OrtDeviceAllocator" and Section 3.5 uses name comparison to detect arenas). But Section 4.4's phrasing "ORT must not double-wrap it" could confuse a reader into thinking OrtArenaAllocator is the mechanism — add a forward reference to Section 3.5's decided signal mechanism.

  • ⚠️ OrtSyncStream* → cudaStream_t resolution in AllocOnStreamImpl: The design mentions OrtEpApi::SyncStream_GetHandle() for resolving stream handles. Confirm that this API is available from within the plugin factory's allocator context (i.e., that the OrtEpApi function table is accessible when AllocOnStream is called, not just during factory initialization). In the current plugin architecture, OrtEpApi is typically stored on the factory. The allocator wrapper would need a back-pointer to the factory or a cached function pointer.

  • ⚠️ #ifdef BUILD_CUDA_EP_AS_PLUGIN in shared source — maintenance cost: Introducing build-conditional code in cuda_mempool_arena.h/cc adds maintenance risk. Every future change to the logging in this file must consider both branches. Consider whether a thin ICudaMempoolLogger interface (virtual, two impls) would be cleaner than preprocessor branching, at the cost of one virtual dispatch per log call (acceptable — logging is inherently slow-path).
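A sketch of the virtual-interface alternative, with ostream sinks standing in for the real ORT and plugin logging backends (all names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// One interface, two implementations -- instead of #ifdef branches.
struct ICudaMempoolLogger {
  virtual ~ICudaMempoolLogger() = default;
  virtual void Verbose(const std::string& msg) = 0;
};

// In-tree build: would forward to the ORT Logger; modeled with a stream here.
struct OrtLoggerBridge : ICudaMempoolLogger {
  explicit OrtLoggerBridge(std::ostream& out) : out_(out) {}
  void Verbose(const std::string& msg) override { out_ << "[ORT] " << msg << '\n'; }
  std::ostream& out_;
};

// Plugin build: forward to whatever sink the plugin has.
struct PluginLoggerBridge : ICudaMempoolLogger {
  explicit PluginLoggerBridge(std::ostream& out) : out_(out) {}
  void Verbose(const std::string& msg) override { out_ << "[plugin] " << msg << '\n'; }
  std::ostream& out_;
};

// The arena keeps one pointer; each of the 6 logging sites pays a single
// virtual call, which is acceptable on the slow path.
void LogShrink(ICudaMempoolLogger& log, size_t bytes_released) {
  log.Verbose("mempool shrink released " + std::to_string(bytes_released) + " bytes");
}
```

Both build flavors then compile the same `cuda_mempool_arena.cc`; only the bridge chosen at construction differs.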


6. Phased Plan (Section 5)

Positive:

  • The three-phase plan is well-sequenced — Phase 1 (BFCArena infrastructure) is a prerequisite for Phase 2 (CudaMempoolArena), and Phase 3 (validation) ensures parity.
  • Phase 3's checklist items are the right ones: mode selection, benchmarking, DisableCpuMemArena isolation, and shared-allocator replacement.

Concern:

  • ⚠️ Missing: rollback / fallback plan if BFCArena wrapping causes regressions: Since Phase 1 changes ORT core behavior for all plugin EPs (shared allocators get BFCArena by default), there should be a session option or environment config key to disable arena wrapping for debugging/rollback. Something like ep_factory.<name>.arena.enable=0 that the MaybeWrapInArena helper checks before wrapping.

  • ⚠️ Missing: thread-safety of shared allocator arena wrapping: RegisterExecutionProviderLibrary can be called from multiple threads (the environment is shared). BFCArena's construction is not inherently thread-safe if multiple sessions start simultaneously. The existing shared_allocators_ access is presumably mutex-protected, but the design should confirm that arena wrapping doesn't introduce additional synchronization requirements (e.g., BFCArena's internal state initialization is thread-safe if the allocator is shared across sessions).
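The rollback check could be as small as a helper consulted by MaybeWrapInArena before wrapping; the `ep_factory.<name>.arena.enable` key below is the hypothetical one proposed above, not an existing config entry:

```cpp
#include <cassert>
#include <map>
#include <string>

// Escape hatch: "ep_factory.<name>.arena.enable" = "0" disables BFCArena
// wrapping for that EP. Absence of the key preserves the proposed default
// (wrap), so existing behavior is opt-out rather than opt-in.
bool ArenaWrappingEnabled(const std::map<std::string, std::string>& config,
                          const std::string& ep_name) {
  const auto it = config.find("ep_factory." + ep_name + ".arena.enable");
  if (it == config.end()) return true;  // default stays "wrap"
  return it->second != "0";
}
```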


7. Decisions & Open Questions (Section 6)

Positive:

  • The four "Decided" items are clear and well-reasoned.
  • Decided 2's pinned-allocator exception (always default config, never user-configurable) matches the in-tree EP behavior and avoids unnecessary complexity.
  • The "Needs validation" note about max_mem=0 upfront allocation behavior is a critical item — BFCArena with DEFAULT_MAX_MEM = std::numeric_limits<size_t>::max() doesn't pre-allocate at construction (it grows on demand starting with DEFAULT_INITIAL_CHUNK_SIZE_BYTES = 1MB), so this should be safe. But worth confirming.
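For reference, the grow-on-demand behavior described here can be sketched as a doubling loop with the overflow guard such an implementation needs (`NextChunkSize` is an illustrative helper, not the actual BFCArena code; only the 1 MB initial chunk constant comes from the review):

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

constexpr size_t kInitialChunkBytes = 1 << 20;  // DEFAULT_INITIAL_CHUNK_SIZE_BYTES (1 MB)

// Next chunk size under the power-of-two extend strategy: start small,
// double until the request fits. Returns 0 if doubling would overflow.
size_t NextChunkSize(size_t current_total, size_t requested) {
  size_t chunk = current_total == 0 ? kInitialChunkBytes : current_total;
  while (chunk < requested) {
    if (chunk > std::numeric_limits<size_t>::max() / 2) return 0;  // overflow guard
    chunk *= 2;
  }
  return chunk;
}
```

Nothing is allocated at construction time; the first allocation triggers the initial 1 MB chunk, consistent with the "should be safe" reading above.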

Concern:

  • ⚠️ Missing open question: what happens to existing plugin EP instances that call CreateSharedAllocator after RegisterExecutionProviderLibrary? The public API OrtApi::CreateSharedAllocator (not from RegisterExecutionProviderLibrary) also flows through CreateSharedAllocatorImpl. If arena wrapping is added there, user-created shared allocators would also get wrapped. Is this intended? The helper should only wrap when the allocator comes from a plugin factory's CreateAllocator, not when the user provides a custom allocator via the public API.

Summary of Concerns

| # | Severity | Component | Issue |
|---|----------|-----------|-------|
| 1 | High | Arena Signal (§3.5) | Name-based arena detection is fragile; consider OrtArenaAllocator type as a skip-wrapping signal or an explicit trait |
| 2 | High | Phase 1 Rollback | No fallback mechanism to disable BFCArena wrapping if it causes regressions for plugin EPs |
| 3 | Suggestion | BFCArena Wrapping (§3.2) | CreateAllocator(AllocatorCreationInfo) takes a factory, not an existing allocator — design should clarify the wrapping mechanics |
| 4 | Suggestion | Shared Allocator Default (§3.2) | All plugin EPs get BFCArena by default — confirm this is intentional and won't break existing plugins |
| 5 | Suggestion | Config Prefix (§3.8) | Add validation/warning for mismatched ep_factory.* vs ep.* arena keys |
| 6 | Suggestion | Stream Handle (§4.4) | Confirm OrtEpApi::SyncStream_GetHandle() accessibility in allocator callback context |
| 7 | Suggestion | Public API (§6) | Distinguish plugin-factory allocators from user-provided allocators in CreateSharedAllocatorImpl to avoid unintended wrapping |
| 8 | Nitpick | Logger (§4.3) | Consider virtual ICudaMempoolLogger interface over #ifdef branching for maintainability |
| 9 | Nitpick | Stream-awareness (§3.2, Decided 1) | Document which OrtMemoryInfo/OrtDevice fields determine GPU-vs-pinned for the stream-awareness decision |

Verdict

COMMENT — The design is sound and Option B is the right architectural choice. The two high-priority items (fragile name-based arena signal and missing rollback mechanism) should be resolved before implementation begins. The remaining suggestions are refinements that can be addressed during implementation PRs.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


Contributor

@tianleiwu tianleiwu left a comment


Review Summary

This PR implements a complete arena allocator system for the CUDA plugin EP, closing the allocation-performance gap with the in-tree CUDA EP. The work is well-structured: thorough design document (12 review iterations), BFC arena adapted from the example plugin EP, native CUDA mempool allocator, per-device shared arena lifecycle with ref counting, session/environment-level config plumbing, and a new Shrink API on OrtAllocator.

Code quality is high overall — exception handling is disciplined at the C ABI boundary, overflow checks are present, and pointer stability guarantees are well-documented.

2 high-priority concerns (use-after-free risk in arena pointer access; empty non-MSVC GetEnvironmentVar branch), 3 suggestions, and 3 nitpicks below.

Positives

  • AllocGuard RAII in Extend() prevents memory leaks on throw
  • Overflow protection in power-of-two doubling path
  • Config validation: negative value rejection for std::stoull, ORT_TRY/ORT_CATCH around every parser, IsValid() final check
  • Per-device DeviceCacheEntry with documented lock ordering (device_cache_mutex_ before arena_mutex)
  • strtol() with full bounds checking replaces atoi()
  • Version-gated Shrink/Reserve C++ wrappers prevent UB on older allocators
  • Comprehensive test coverage: config parsing, chunk splitting/coalescing, OOM, shrink, mempool, defensive tests
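The bounds-checked strtol() pattern praised above looks roughly like this (`ParseLongChecked` is an illustrative name, not the actual helper):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Replacement for atoi(): reports success/failure instead of silently
// returning 0, and rejects empty input, trailing garbage, and values
// out of range for long.
bool ParseLongChecked(const char* text, long& out) {
  if (text == nullptr || *text == '\0') return false;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (end == text || *end != '\0') return false;  // no digits, or trailing junk
  if (errno == ERANGE) return false;              // overflow or underflow
  out = value;
  return true;
}
```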

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 45 out of 45 changed files in this pull request and generated 2 comments.


Contributor

@tianleiwu tianleiwu left a comment


Good progress — the arena migration architecture is sound, the per-device cache keyed by CUDA ordinal is correct, and many earlier review items have been addressed. Three blocking issues remain.

Addressed since last review

  • ✅ Extend() memory leak — fixed with AllocGuard RAII
  • ✅ Refcount underflow — now has LogWarning + early return instead of assert-only
  • ✅ provider_api_shims.cc empty non-MSVC branch — fixed
  • ✅ Use-after-free in stream plugin — fixed via ResetDeviceArenaChunksUsingStream holding arena_mutex
  • ✅ IndexFor return type — changed to size_t
  • ✅ max_mem negative value wrapping — now rejects leading -
  • ✅ ToKeyValuePairs always emits stats (unconditional)
  • ✅ Dead total_allocated_ member removed from mempool allocator
  • ✅ Redundant inner kNextPowerOfTwo check removed from Extend
  • ✅ Lock ordering + pointer stability comments added
  • ✅ Duplicate DEFAULT_* constants removed from ArenaImpl

Summary of remaining concerns

| # | Severity | Component | Issue |
|---|----------|-----------|-------|
| 1 | High | CudaArenaAllocator / CudaMempoolOrtAllocator / EXCEPTION_TO_STATUS_END | ORT_HANDLE_EXCEPTION discards lambda return value — OrtStatus* leaks, errors silently swallowed |
| 2 | High | cuda_arena.cc SplitChunk / ResetChunksUsingStream | Split remainder chunks keep stale stream pointers; ResetChunksUsingStream only clears tracked handles |
| 3 | High | ep_plugin_provider_interfaces.cc | Per-session plugin arenas always wrapped as IAllocatorImplWrapping (not IArenaImplWrapping), so ShrinkMemoryArenas cannot find them |
| 4 | Suggestion | cuda_ep_factory.cc CreateAllocatorImpl | Session arena options silently ignored after default shared allocator is created |
| 5 | Suggestion | cuda_arena.h config parsing | std::stoi silently rejects valid >2GB chunk sizes; invalid extend_strategy silently defaults to power-of-two |
| 6 | Suggestion | cuda_mempool_allocator_plugin.cc AllocInternal | ORT_THROW on non-OOM CUDA error → abort() under ORT_NO_EXCEPTIONS |
| 7 | Suggestion | Test coverage | No tests for CudaMempoolOrtAllocator or multi-GPU arena isolation |

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.


@yuslepukhin yuslepukhin requested a review from tianleiwu April 8, 2026 17:25
Contributor

@tianleiwu tianleiwu left a comment


APPROVE — This is a high-quality, well-designed implementation that brings the CUDA plugin EP to feature parity with the in-tree EP's arena allocation. The ORT core changes (AsArena(), IArenaImplWrappingOrtAllocator, SafeArenaCast simplification) are clean improvements that benefit the entire plugin EP ecosystem. The concerns are all suggestions for additional documentation/comments rather than correctness issues. The test coverage is excellent.

| # | Severity | Component | Issue |
|---|----------|-----------|-------|
| 1 | Suggestion | IArenaImplWrappingOrtAllocator | AllocOnStream casts Stream* to OrtSyncStream* without precondition comment |
| 2 | Suggestion | ArenaImpl::SplitChunk | Stream-cleared remainder chunks may prevent coalescing, causing fragmentation over many runs |
| 3 | Suggestion | CudaMempoolOrtAllocator::AllocImpl | Default-stream semantics with per-thread streams could cause cross-thread ordering issues (caller responsibility, but worth documenting) |
| 4 | Suggestion | CudaMempoolOrtAllocator destructor | Flow between cudaFreeAsync loop → SyncAllKnownStreams → clear is correct but could use a clarifying comment |
| 5 | Nitpick | CudaEpFactory | Lock ordering between CreateAllocatorImpl and ReleaseAllocatorImpl is subtly different — correct but deserves a cross-reference comment |

@yuslepukhin yuslepukhin merged commit 4e1c42e into main Apr 8, 2026
96 checks passed
@yuslepukhin yuslepukhin deleted the yuslepukhin/cuda_arena_ep branch April 8, 2026 18:24
sanaa-hamel-microsoft pushed a commit that referenced this pull request Apr 10, 2026
sanaa-hamel-microsoft added a commit that referenced this pull request Apr 14, 2026
…cherry picks (#28035)

### Description

FF release cut to last merge prior to version bump.

This also cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| 4e1c42e | #27931 | Implement/migrate Arena allocators to cuda plugin ep. |
| fbba40a | #27786 | Model Package Support |
| c159603 | #28015 | Remove unnecessary model package test |
| 87b0643 | #27997 | centralise feed authentication for ADO pipelines |


### Motivation and Context

Minimise # of cherry picks.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: eserscor <erscor@microsoft.com>