CUDA Plugin EP: Test Coverage & Bug Fixes#27817

Merged
tianleiwu merged 5 commits into tlwu/20260320/cuda_plugin from tlwu/20260320/cuda_plugin_tests
Mar 26, 2026
@tianleiwu
Contributor

Summary

  • Adds comprehensive test suite for the CUDA Plugin EP (test_cuda_plugin_ep.py) covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation
  • Adds cuda_plugin_ep_helper.py utility for transparently routing existing tests to the plugin EP
  • Fixes test_gqa.py: corrects total_sequence_length tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout) and routes tests through plugin EP
  • Updates test_moe_cuda.py to route through plugin EP when available
  • Fixes temp file collision risk in _run_model_test by using tempfile.NamedTemporaryFile
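The temp file fix in the last bullet can be illustrated with a minimal sketch. The helper name `write_temp_model` is hypothetical and the real `_run_model_test` may be structured differently; the only part taken from the PR is the use of `tempfile.NamedTemporaryFile` to obtain a unique path per call.

```python
import os
import tempfile


def write_temp_model(model_bytes: bytes) -> str:
    """Write serialized model bytes to a uniquely named file and return its path.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    # delete=False keeps the file on disk after the handle closes, so a
    # session can later open it by path. The caller must remove it.
    with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as f:
        f.write(model_bytes)
        return f.name


# Two calls never share a path, so parallel tests cannot overwrite each other.
first = write_temp_model(b"model-a")
second = write_temp_model(b"model-b")
try:
    print(first != second)
finally:
    os.remove(first)
    os.remove(second)
```

A fixed file name shared by every test is what makes collisions possible under parallel runs; letting the `tempfile` module pick the name removes that shared state.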

Depends on: #27816

Test plan

  • Run python test_cuda_plugin_ep.py on a CUDA-capable machine with the plugin EP built
  • Verify all 5 test stages pass (registration, ONNX ops, NHWC, contrib ops, op validation)
  • Run python -m pytest test_gqa.py and confirm the total_sequence_length fix resolves the CPU/GPU tensor mismatch
  • Run python -m pytest test_moe_cuda.py and confirm plugin EP routing works
  • Verify no temp file collisions when running tests in parallel

🤖 Generated with Claude Code

- Add test_cuda_plugin_ep.py: comprehensive 5-stage test suite covering
  registration, ONNX ops, NHWC layout, contrib ops, and op-level validation
- Add cuda_plugin_ep_helper.py: helper for resolving CudaPluginExecutionProvider
  in existing tests
- Fix test_gqa.py: correct total_sequence_length tensor placement to CPU
  (was incorrectly on CUDA device) and route tests through plugin EP
- Update test_moe_cuda.py: route MoE tests through plugin EP when available
- Fix temp file collision risk in _run_model_test using tempfile module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a CUDA Plugin Execution Provider (EP) Python test harness and updates existing transformer CUDA tests to optionally route execution through the plugin EP, alongside a small fix to GQA IO binding for stricter device placement requirements.

Changes:

  • Introduces test_cuda_plugin_ep.py to validate CUDA plugin EP registration and run a growing set of operator correctness checks (including NHWC preference and selected contrib ops).
  • Adds cuda_plugin_ep_helper.py to auto-register and transparently map "CUDAExecutionProvider" → "CudaPluginExecutionProvider" for tests.
  • Updates test_gqa.py and test_moe_cuda.py to use the helper, plus fixes total_sequence_length binding to CPU in GQA.
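As a rough sketch of the provider-name mapping described above: the function name `resolve_providers` and its signature are assumptions made for illustration, not the helper's actual interface. The two provider names are the ones quoted in this review.

```python
BUILTIN_CUDA_EP = "CUDAExecutionProvider"
PLUGIN_CUDA_EP = "CudaPluginExecutionProvider"


def resolve_providers(requested, available):
    """Swap the built-in CUDA EP for the plugin EP when the plugin is registered.

    Hypothetical sketch: `requested` is a provider list as a test would pass it
    (names or (name, options) tuples); `available` is the list of provider
    names the runtime reports.
    """
    if PLUGIN_CUDA_EP not in available:
        # Plugin not registered: leave the request untouched.
        return list(requested)

    def swap(entry):
        if isinstance(entry, tuple):
            name, options = entry
            return (PLUGIN_CUDA_EP, options) if name == BUILTIN_CUDA_EP else entry
        return PLUGIN_CUDA_EP if entry == BUILTIN_CUDA_EP else entry

    return [swap(entry) for entry in requested]


print(resolve_providers(
    [BUILTIN_CUDA_EP, "CPUExecutionProvider"],
    [PLUGIN_CUDA_EP, "CPUExecutionProvider"],
))
```

Falling back to the original list when the plugin is absent is what would let existing tests keep running unchanged on builds without the plugin EP.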

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| onnxruntime/test/python/transformers/test_moe_cuda.py | Routes provider selection via plugin EP resolver; currently sets an env var at import time. |
| onnxruntime/test/python/transformers/test_gqa.py | Routes sessions via resolver and fixes total_sequence_length device placement. |
| onnxruntime/test/python/transformers/test_cuda_plugin_ep.py | New plugin EP test suite covering registration and operator validation stages. |
| onnxruntime/test/python/transformers/cuda_plugin_ep_helper.py | New helper for plugin EP registration + provider name resolution. |

@tianleiwu tianleiwu marked this pull request as draft March 23, 2026 22:40
@tianleiwu tianleiwu requested a review from Copilot March 26, 2026 16:59
@tianleiwu tianleiwu marked this pull request as ready for review March 26, 2026 16:59
@tianleiwu tianleiwu merged commit 3c7e3e0 into tlwu/20260320/cuda_plugin Mar 26, 2026
4 checks passed
@tianleiwu tianleiwu deleted the tlwu/20260320/cuda_plugin_tests branch March 26, 2026 17:24
tianleiwu added a commit that referenced this pull request Mar 31, 2026
## Description

This PR adds a standalone CUDA Plugin Execution Provider
(`CudaPluginExecutionProvider`) built as a dynamically loadable shared
library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP
Plugin API. The implementation reuses the existing CUDA kernel stack
through adapter/shim layers (force-included headers and macro-based
registration overrides), eliminating the need to maintain a parallel
copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally
deferred until the plugin-facing EP API exposes the required session
callbacks.

## Summary of Changes

### Build system and CMake

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Adds `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN` build option, records plugin build info, and includes the plugin-specific CMake file. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | **New.** Defines the plugin shared-library target: collects `.cc`/`.cu` sources from `core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components. |
| `cmake/onnxruntime_providers_cuda.cmake` | Minor additions to expose include paths needed by plugin builds. |
| `cmake/onnxruntime_unittests.cmake` | Enables dynamic plugin EP usage in provider tests and fills in missing CUDA include/link settings for the plugin configuration. |
| `cmake/external/cuda_configuration.cmake` | Adds CUDA configuration support for the plugin build path. |

### Plugin runtime implementation (new files)

| File | Purpose |
|------|---------|
| `plugin/cuda_ep_factory.cc/.h` | Implements `OrtEpFactory`: device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization. |
| `plugin/cuda_ep.cc/.h` | Plugin-side CUDA EP object deriving from `ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels). |
| `plugin/cuda_allocator_plugin.cc/.h` | Plugin allocators for device and pinned memory, exposed through the EP API. |
| `plugin/cuda_stream_plugin.cc/.h` | Plugin-owned CUDA stream, cuBLAS, cuBLASLt, and cuDNN handle management. Provides two stream adapter modes (`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc` contexts). |
| `plugin/cuda_data_transfer_plugin.cc/.h` | Data transfer bridge for host↔device copies used by plugin-backed tensors and Python bindings. |
| `plugin/cuda_memcpy_plugin.cc` | MemcpyToHost / MemcpyFromHost kernel implementations for the plugin path. |
| `plugin/cuda_controlflow_plugin.cc/.cu/.h` | Plugin-native `If`, `Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow hooks instead of inheriting from in-tree CPU base implementations. |
| `plugin/cuda_plugin_ep.cc` | Exports the DLL entry points (`OrtCreateEpFactory` / `OrtReleaseEpFactory`) used by ORT to create and release the CUDA EP factory. |
| `plugin/cuda_kernel_adapter.h` | **Core shim** (1088 lines). Provides `CudaKernel` base class, error-return macros, type helpers (`ToCudaType`), handle-management abstractions, and stream adapters. Force-included in all plugin `.cc` files to transparently adapt existing kernel code. |
| `plugin/cuda_plugin_kernels.cu/.h` | Aggregates self-registered kernel definitions via `PluginKernelCollector` macro overrides, replacing the centralized registration tables used in the bundled build. |
| `plugin/cuda_plugin_utils.h` | Shared utility helpers for the plugin (logging, error checking, config parsing). |
| `plugin/provider_api_shims.cc` | Stub implementations for shared-provider bridge functions that are not needed in the plugin path. |
| `plugin/cuda_plugin_ep_symbols.def` | Windows symbol export definitions for the plugin DLL. |

### EP adapter and API extensions

| File | Change |
|------|--------|
| `include/onnxruntime/ep/api.h` | Makes plugin API initialization thread-safe; preserves access to ORT, EP, and model editor API tables during plugin loading. |
| `include/onnxruntime/ep/adapter/node.h` | Adds node metadata accessors (operator domain, optional-output handling) needed by reused CUDA kernels. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds `RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts. |
| `include/onnxruntime/ep/adapter/op_kernel_info.h` | Extends adapter kernel-info with attribute and config accessors required by migrated kernels. |
| `include/onnxruntime/ep/adapter/allocator.h` | Minor allocator adapter adjustments for plugin compatibility. |
| `include/onnxruntime/ep/adapter/kernel_def_builder.h` | Adds kernel definition builder hooks for plugin registration. |
| `include/onnxruntime/core/framework/tensor.h` | Restores a plugin-only `Tensor::Create` compatibility path for kernels relying on the older static factory form. |
| `onnxruntime/core/providers/shared_library/provider_api.h` | Turns the shared-provider bridge into a no-op for plugin builds so the EP adapter facade owns type resolution. |

### CUDA kernel compatibility migration

- Adapts ~80 core CUDA and contrib CUDA kernel source files to compile
under the plugin build via macro-based registration overrides and
targeted compatibility fixes (not operator rewrites).
- Moves or templates reusable helper logic in shared CPU/CUDA headers
(`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`,
`ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile
in adapter mode.
- Key contrib kernel adaptations: attention variants (MHA, GQA, paged,
sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse,
bias-dropout, matmul-nbits, qordered ops.
- Key core kernel adaptations: softmax, topk, conv/conv-transpose,
batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum,
identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze,
gather-nd, concat, dropout, non-max-suppression.

### Python integration

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_module.cc` | Extends `get_available_providers()` to surface dynamically registered plugin EPs discovered from `OrtEpDevice` enumeration. |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Allows Python session creation to instantiate providers from registered plugin EP devices, including `device_id` selection, instead of only built-in or legacy dynamic-load EP paths. |
| `onnxruntime/python/onnxruntime_pybind_schema.cc` | Adds schema query support for plugin-registered operators. |

### Testing and validation

| File | Change |
|------|--------|
| `test/python/transformers/test_cuda_plugin_ep.py` | **New** (1861 lines). Comprehensive test suite covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation. |
| `test/python/transformers/cuda_plugin_ep_helper.py` | **New** (192 lines). Utility for transparently routing existing tests to the plugin EP. |
| `test/python/transformers/test_gqa.py` | Fixes `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through plugin EP. |
| `test/python/transformers/test_moe_cuda.py` | Routes through plugin EP when available. |
| `test/framework/dynamic_plugin_ep_test.cc` | **New** (120 lines). C++ unit test exercising dynamic plugin EP loading and device enumeration. |
| `test/unittest_util/base_tester.cc` | Routes CUDA test requests to `CudaPluginExecutionProvider` when registered, allowing existing CUDA provider tests to exercise the plugin path. |
| `tools/ci_build/cuda_plugin_parity_report.py` | **New** (737 lines). Comparison script that produces a parity report of ops in bundled-only vs. plugin-only vs. both builds, via static parsing or runtime registry interrogation. |
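
The three-way split reported by the parity script (bundled-only, plugin-only, both) comes down to set arithmetic over the two op lists. The sketch below is illustrative only: `parity_report` and the sample op names are assumptions, and it does not reproduce how `cuda_plugin_parity_report.py` collects its inputs.

```python
def parity_report(bundled_ops, plugin_ops):
    """Classify op names by which build registers them.

    Hypothetical sketch; the real script's inputs and output format may differ.
    """
    bundled, plugin = set(bundled_ops), set(plugin_ops)
    return {
        "bundled_only": sorted(bundled - plugin),  # missing from the plugin build
        "plugin_only": sorted(plugin - bundled),   # registered only by the plugin
        "both": sorted(bundled & plugin),          # covered by both builds
    }


# Sample op names chosen for illustration, not taken from a real report.
report = parity_report(
    bundled_ops=["Softmax", "TopK", "Conv", "Einsum"],
    plugin_ops=["Softmax", "TopK", "Conv"],
)
for bucket, ops in report.items():
    print(f"{bucket}: {ops}")
```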

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | **New** (990 lines). Plugin architecture, build/deployment flow, operator exclusions, adapter design, and the decision to defer CUDA Graph support. |
| `docs/cuda_plugin_ep/QUICK_START.md` | **New** (108 lines). Build instructions, C++ and Python usage examples, and known limitations. |

### Other

| File | Change |
|------|--------|
| `tools/python/gen_opkernel_doc.py` | Extended to generate documentation for plugin-registered kernels. |
| `orttraining/.../reduction_ops.cc` | Minor compatibility fix for training reduction ops under the plugin build configuration. |

## Testing

- **Build**: Configure with `--build_cuda_ep_as_plugin` (or
`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify
`libonnxruntime_providers_cuda_plugin.so` is produced alongside existing
CUDA provider artifacts.
- **C++ unit tests**: Run `onnxruntime_provider_test` — `BaseTester`
routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new
`dynamic_plugin_ep_test` for load/enumerate validation.
- **Python tests**: Register the plugin library, confirm
`onnxruntime.get_available_providers()` includes
`CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage
suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- **Parity report**: Run `tools/ci_build/cuda_plugin_parity_report.py`
to verify kernel coverage parity between bundled and plugin builds.
- **Backward compatibility**: Verify unchanged behavior for the in-tree
CUDA EP build path (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).
- **Known limitation**: CUDA graph support remains disabled in the
plugin path and is documented as deferred.

## Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly
coupling its release cycle to the core runtime. This PR creates a path
to decouple CUDA EP delivery by implementing it as a standalone plugin
using the EP Plugin API. The key design tradeoff is reusing the existing
100+ CUDA kernel implementations through force-include adapter headers
and macro-based registration overrides, rather than rewriting them. This
approach validates the plugin EP against current CUDA coverage without
maintaining a second kernel stack, at the cost of introducing
adapter/shim complexity. CUDA Graph support is explicitly deferred until
the EP Plugin API can represent the capture/replay lifecycle.

**Related**: PR #27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is
squash-merged into this branch.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
- [ ] CI passes