CUDA Plugin EP: Test Coverage & Bug Fixes #27817
Merged
tianleiwu merged 5 commits into tlwu/20260320/cuda_plugin on Mar 26, 2026
- Add test_cuda_plugin_ep.py: comprehensive 5-stage test suite covering registration, ONNX ops, NHWC layout, contrib ops, and op-level validation
- Add cuda_plugin_ep_helper.py: helper for resolving CudaPluginExecutionProvider in existing tests
- Fix test_gqa.py: correct total_sequence_length tensor placement to CPU (was incorrectly on CUDA device) and route tests through plugin EP
- Update test_moe_cuda.py: route MoE tests through plugin EP when available
- Fix temp file collision risk in _run_model_test using tempfile module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds a CUDA Plugin Execution Provider (EP) Python test harness and updates existing transformer CUDA tests to optionally route execution through the plugin EP, alongside a small fix to GQA IO binding for stricter device placement requirements.
Changes:
- Introduces `test_cuda_plugin_ep.py` to validate CUDA plugin EP registration and run a growing set of operator correctness checks (including NHWC preference and selected contrib ops).
- Adds `cuda_plugin_ep_helper.py` to auto-register and transparently map `"CUDAExecutionProvider"` → `"CudaPluginExecutionProvider"` for tests.
- Updates `test_gqa.py` and `test_moe_cuda.py` to use the helper, plus fixes `total_sequence_length` binding to CPU in GQA.
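The provider-name mapping described in the list above can be pictured with a small sketch. This is not the code in `cuda_plugin_ep_helper.py`; the function name `resolve_providers` and its `available` argument are hypothetical, and the sketch only assumes that the plugin EP shows up under the name `CudaPluginExecutionProvider` once it has been registered.

```python
from typing import Iterable, List, Tuple, Union

CUDA_EP = "CUDAExecutionProvider"
PLUGIN_EP = "CudaPluginExecutionProvider"

# A provider entry is either a bare name or a (name, options) tuple.
ProviderEntry = Union[str, Tuple[str, dict]]


def resolve_providers(
    requested: Iterable[ProviderEntry], available: Iterable[str]
) -> List[ProviderEntry]:
    """Hypothetical resolver: swap the CUDA EP name for the plugin EP name.

    If the plugin EP is not among the available providers, the requested
    list is returned unchanged, so tests keep working on bundled builds.
    """
    requested = list(requested)
    if PLUGIN_EP not in set(available):
        return requested

    resolved: List[ProviderEntry] = []
    for entry in requested:
        if isinstance(entry, tuple):
            name, options = entry
            resolved.append((PLUGIN_EP, options) if name == CUDA_EP else entry)
        else:
            resolved.append(PLUGIN_EP if entry == CUDA_EP else entry)
    return resolved
```

Passing the available provider names in as an argument, rather than querying the runtime inside the function, keeps the mapping logic testable on machines without a GPU.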
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_moe_cuda.py | Routes provider selection via plugin EP resolver; currently sets an env var at import time. |
| onnxruntime/test/python/transformers/test_gqa.py | Routes sessions via resolver and fixes total_sequence_length device placement. |
| onnxruntime/test/python/transformers/test_cuda_plugin_ep.py | New plugin EP test suite covering registration and operator validation stages. |
| onnxruntime/test/python/transformers/cuda_plugin_ep_helper.py | New helper for plugin EP registration + provider name resolution. |
yuslepukhin reviewed Mar 23, 2026
tianleiwu added a commit that referenced this pull request Mar 31, 2026
## Description

This PR adds a standalone CUDA Plugin Execution Provider (`CudaPluginExecutionProvider`) built as a dynamically loadable shared library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP Plugin API. The implementation reuses the existing CUDA kernel stack through adapter/shim layers (force-included headers and macro-based registration overrides), eliminating the need to maintain a parallel copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally deferred until the plugin-facing EP API exposes the required session callbacks.

## Summary of Changes

### Build system and CMake

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Adds `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN` build option, records plugin build info, and includes the plugin-specific CMake file. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | **New.** Defines the plugin shared-library target: collects `.cc`/`.cu` sources from `core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components. |
| `cmake/onnxruntime_providers_cuda.cmake` | Minor additions to expose include paths needed by plugin builds. |
| `cmake/onnxruntime_unittests.cmake` | Enables dynamic plugin EP usage in provider tests and fills in missing CUDA include/link settings for the plugin configuration. |
| `cmake/external/cuda_configuration.cmake` | Adds CUDA configuration support for the plugin build path. |

### Plugin runtime implementation (new files)

| File | Purpose |
|------|---------|
| `plugin/cuda_ep_factory.cc/.h` | Implements `OrtEpFactory`: device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization. |
| `plugin/cuda_ep.cc/.h` | Plugin-side CUDA EP object deriving from `ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels). |
| `plugin/cuda_allocator_plugin.cc/.h` | Plugin allocators for device and pinned memory, exposed through the EP API. |
| `plugin/cuda_stream_plugin.cc/.h` | Plugin-owned CUDA stream, cuBLAS, cuBLASLt, and cuDNN handle management. Provides two stream adapter modes (`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc` contexts). |
| `plugin/cuda_data_transfer_plugin.cc/.h` | Data transfer bridge for host↔device copies used by plugin-backed tensors and Python bindings. |
| `plugin/cuda_memcpy_plugin.cc` | MemcpyToHost / MemcpyFromHost kernel implementations for the plugin path. |
| `plugin/cuda_controlflow_plugin.cc/.cu/.h` | Plugin-native `If`, `Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow hooks instead of inheriting from in-tree CPU base implementations. |
| `plugin/cuda_plugin_ep.cc` | Exports the DLL entry points (`OrtCreateEpFactory` / `OrtReleaseEpFactory`) used by ORT to create and release the CUDA EP factory. |
| `plugin/cuda_kernel_adapter.h` | **Core shim** (1088 lines). Provides `CudaKernel` base class, error-return macros, type helpers (`ToCudaType`), handle-management abstractions, and stream adapters. Force-included in all plugin `.cc` files to transparently adapt existing kernel code. |
| `plugin/cuda_plugin_kernels.cu/.h` | Aggregates self-registered kernel definitions via `PluginKernelCollector` macro overrides, replacing the centralized registration tables used in the bundled build. |
| `plugin/cuda_plugin_utils.h` | Shared utility helpers for the plugin (logging, error checking, config parsing). |
| `plugin/provider_api_shims.cc` | Stub implementations for shared-provider bridge functions that are not needed in the plugin path. |
| `plugin/cuda_plugin_ep_symbols.def` | Windows symbol export definitions for the plugin DLL. |

### EP adapter and API extensions

| File | Change |
|------|--------|
| `include/onnxruntime/ep/api.h` | Makes plugin API initialization thread-safe; preserves access to ORT, EP, and model editor API tables during plugin loading. |
| `include/onnxruntime/ep/adapter/node.h` | Adds node metadata accessors (operator domain, optional-output handling) needed by reused CUDA kernels. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds `RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts. |
| `include/onnxruntime/ep/adapter/op_kernel_info.h` | Extends adapter kernel-info with attribute and config accessors required by migrated kernels. |
| `include/onnxruntime/ep/adapter/allocator.h` | Minor allocator adapter adjustments for plugin compatibility. |
| `include/onnxruntime/ep/adapter/kernel_def_builder.h` | Adds kernel definition builder hooks for plugin registration. |
| `include/onnxruntime/core/framework/tensor.h` | Restores a plugin-only `Tensor::Create` compatibility path for kernels relying on the older static factory form. |
| `onnxruntime/core/providers/shared_library/provider_api.h` | Turns the shared-provider bridge into a no-op for plugin builds so the EP adapter facade owns type resolution. |

### CUDA kernel compatibility migration

- Adapts ~80 core CUDA and contrib CUDA kernel source files to compile under the plugin build via macro-based registration overrides and targeted compatibility fixes (not operator rewrites).
- Moves or templates reusable helper logic in shared CPU/CUDA headers (`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`, `ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile in adapter mode.
- Key contrib kernel adaptations: attention variants (MHA, GQA, paged, sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse, bias-dropout, matmul-nbits, qordered ops.
- Key core kernel adaptations: softmax, topk, conv/conv-transpose, batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum, identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze, gather-nd, concat, dropout, non-max-suppression.

### Python integration

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_module.cc` | Extends `get_available_providers()` to surface dynamically registered plugin EPs discovered from `OrtEpDevice` enumeration. |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Allows Python session creation to instantiate providers from registered plugin EP devices, including `device_id` selection, instead of only built-in or legacy dynamic-load EP paths. |
| `onnxruntime/python/onnxruntime_pybind_schema.cc` | Adds schema query support for plugin-registered operators. |

### Testing and validation

| File | Change |
|------|--------|
| `test/python/transformers/test_cuda_plugin_ep.py` | **New** (1861 lines). Comprehensive test suite covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation. |
| `test/python/transformers/cuda_plugin_ep_helper.py` | **New** (192 lines). Utility for transparently routing existing tests to the plugin EP. |
| `test/python/transformers/test_gqa.py` | Fixes `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through plugin EP. |
| `test/python/transformers/test_moe_cuda.py` | Routes through plugin EP when available. |
| `test/framework/dynamic_plugin_ep_test.cc` | **New** (120 lines). C++ unit test exercising dynamic plugin EP loading and device enumeration. |
| `test/unittest_util/base_tester.cc` | Routes CUDA test requests to `CudaPluginExecutionProvider` when registered, allowing existing CUDA provider tests to exercise the plugin path. |
| `tools/ci_build/cuda_plugin_parity_report.py` | **New** (737 lines). Comparison script that produces a parity report of ops in bundled-only vs. plugin-only vs. both builds, via static parsing or runtime registry interrogation. |

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | **New** (990 lines). Plugin architecture, build/deployment flow, operator exclusions, adapter design, and the decision to defer CUDA Graph support. |
| `docs/cuda_plugin_ep/QUICK_START.md` | **New** (108 lines). Build instructions, C++ and Python usage examples, and known limitations. |

### Other

| File | Change |
|------|--------|
| `tools/python/gen_opkernel_doc.py` | Extended to generate documentation for plugin-registered kernels. |
| `orttraining/.../reduction_ops.cc` | Minor compatibility fix for training reduction ops under the plugin build configuration. |

## Testing

- **Build**: Configure with `--build_cuda_ep_as_plugin` (or `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify `libonnxruntime_providers_cuda_plugin.so` is produced alongside existing CUDA provider artifacts.
- **C++ unit tests**: Run `onnxruntime_provider_test`; `BaseTester` routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new `dynamic_plugin_ep_test` for load/enumerate validation.
- **Python tests**: Register the plugin library, confirm `onnxruntime.get_available_providers()` includes `CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- **Parity report**: Run `tools/ci_build/cuda_plugin_parity_report.py` to verify kernel coverage parity between bundled and plugin builds.
- **Backward compatibility**: Verify unchanged behavior for the in-tree CUDA EP build path (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).
- **Known limitation**: CUDA graph support remains disabled in the plugin path and is documented as deferred.

## Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly coupling its release cycle to the core runtime. This PR creates a path to decouple CUDA EP delivery by implementing it as a standalone plugin using the EP Plugin API. The key design tradeoff is reusing the existing ~100+ CUDA kernel implementations through force-include adapter headers and macro-based registration overrides, rather than rewriting them. This approach validates the plugin EP against current CUDA coverage without maintaining a second kernel stack, at the cost of introducing adapter/shim complexity. CUDA Graph support is explicitly deferred until the EP Plugin API can represent the capture/replay lifecycle.

**Related**: PR #27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is squash-merged into this branch.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
- [ ] CI passes
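The parity report mentioned under Testing classifies ops as bundled-only, plugin-only, or present in both builds. At its core that is set arithmetic, sketched below. This is an illustration only, not the logic of `cuda_plugin_parity_report.py`: the function name `parity_report` is hypothetical, the op lists in the usage example are made up, and how the real script collects the two op lists (static parsing or registry queries) is not shown.

```python
from typing import Dict, Iterable, List


def parity_report(
    bundled_ops: Iterable[str], plugin_ops: Iterable[str]
) -> Dict[str, List[str]]:
    """Hypothetical helper: bucket op names by which build registers them."""
    bundled, plugin = set(bundled_ops), set(plugin_ops)
    return {
        "bundled_only": sorted(bundled - plugin),  # missing from the plugin
        "plugin_only": sorted(plugin - bundled),   # missing from the bundled EP
        "both": sorted(bundled & plugin),          # covered by both builds
    }


if __name__ == "__main__":
    # Made-up op lists, purely for illustration.
    report = parity_report(["OpA", "OpB", "OpC"], ["OpB", "OpC", "OpD"])
    for bucket, ops in report.items():
        print(f"{bucket}: {', '.join(ops) or '(none)'}")
```

Sorting each bucket keeps the output stable between runs, which makes a report like this easier to diff in CI.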
Summary
- Adds a comprehensive test suite (`test_cuda_plugin_ep.py`) covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation
- Adds the `cuda_plugin_ep_helper.py` utility for transparently routing existing tests to the plugin EP
- Fixes `test_gqa.py`: corrects `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout) and routes tests through plugin EP
- Updates `test_moe_cuda.py` to route through plugin EP when available
- Fixes temp file collision risk in `_run_model_test` by using `tempfile.NamedTemporaryFile`

Depends on: #27816
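The temp-file fix listed above relies on `tempfile.NamedTemporaryFile` picking a unique path for each call, so concurrent or repeated test runs do not overwrite each other's model files. A minimal sketch of that pattern follows. The helper name `write_model_to_unique_path` and the `.onnx` suffix are assumptions for illustration, not the body of `_run_model_test`.

```python
import os
import tempfile


def write_model_to_unique_path(model_bytes: bytes, suffix: str = ".onnx") -> str:
    """Hypothetical helper: write bytes to a uniquely named temp file.

    delete=False keeps the file on disk after the handle is closed, so a
    later call can reopen it by path; the caller must remove it afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(model_bytes)
        return handle.name


if __name__ == "__main__":
    path = write_model_to_unique_path(b"placeholder model bytes")
    try:
        print("model written to", path)
    finally:
        os.remove(path)  # clean up explicitly because delete=False was used
```

Closing the file before handing the path to another reader also avoids the Windows restriction on opening a file that is still held open by `NamedTemporaryFile`.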
Test plan
- Run `python test_cuda_plugin_ep.py` on a CUDA-capable machine with the plugin EP built
- Run `python -m pytest test_gqa.py` and confirm the `total_sequence_length` fix resolves the CPU/GPU tensor mismatch
- Run `python -m pytest test_moe_cuda.py` and confirm plugin EP routing works

🤖 Generated with Claude Code