CUDA Plugin EP: Test Coverage & Bug Fixes by tianleiwu · Pull Request #27817 · microsoft/onnxruntime

tianleiwu · 2026-03-23T20:42:04Z

Summary

Adds comprehensive test suite for the CUDA Plugin EP (test_cuda_plugin_ep.py) covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation
Adds cuda_plugin_ep_helper.py utility for transparently routing existing tests to the plugin EP
Fixes test_gqa.py: corrects total_sequence_length tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout) and routes tests through plugin EP
Updates test_moe_cuda.py to route through plugin EP when available
Fixes temp file collision risk in _run_model_test by using tempfile.NamedTemporaryFile
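
The temp file fix can be illustrated with a minimal sketch. It assumes a test writes serialized model bytes to disk before creating a session; `write_model_to_unique_path` and `run_with_temp_model` are hypothetical names, not the functions in `test_cuda_plugin_ep.py`.

```python
import os
import tempfile


def write_model_to_unique_path(model_bytes: bytes) -> str:
    """Write model bytes to a uniquely named temp file and return its path."""
    # delete=False keeps the file on disk after the handle closes, so a
    # consumer can reopen it by path. The caller must remove it afterwards.
    with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as handle:
        handle.write(model_bytes)
        return handle.name


def run_with_temp_model(model_bytes: bytes, run_fn):
    """Call run_fn(path) on a unique temp file and always clean the file up."""
    path = write_model_to_unique_path(model_bytes)
    try:
        return run_fn(path)
    finally:
        os.remove(path)
```

Because each call gets its own file name from the operating system, two tests running in parallel cannot overwrite each other's model file, which is the risk a fixed file name carries.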

Depends on: #27816

Test plan

Run python test_cuda_plugin_ep.py on a CUDA-capable machine with the plugin EP built
Verify all 5 test stages pass (registration, ONNX ops, NHWC, contrib ops, op validation)
Run python -m pytest test_gqa.py and confirm the total_sequence_length fix resolves the CPU/GPU tensor mismatch
Run python -m pytest test_moe_cuda.py and confirm plugin EP routing works
Verify no temp file collisions when running tests in parallel

🤖 Generated with Claude Code

- Add test_cuda_plugin_ep.py: comprehensive 5-stage test suite covering registration, ONNX ops, NHWC layout, contrib ops, and op-level validation
- Add cuda_plugin_ep_helper.py: helper for resolving CudaPluginExecutionProvider in existing tests
- Fix test_gqa.py: correct total_sequence_length tensor placement to CPU (was incorrectly on CUDA device) and route tests through plugin EP
- Update test_moe_cuda.py: route MoE tests through plugin EP when available
- Fix temp file collision risk in _run_model_test using tempfile module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a CUDA Plugin Execution Provider (EP) Python test harness and updates existing transformer CUDA tests to optionally route execution through the plugin EP, alongside a small fix to GQA IO binding for stricter device placement requirements.

Changes:

Introduces test_cuda_plugin_ep.py to validate CUDA plugin EP registration and run a growing set of operator correctness checks (including NHWC preference and selected contrib ops).
Adds cuda_plugin_ep_helper.py to auto-register and transparently map "CUDAExecutionProvider" → "CudaPluginExecutionProvider" for tests.
Updates test_gqa.py and test_moe_cuda.py to use the helper, plus fixes total_sequence_length binding to CPU in GQA.
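
The provider-name mapping described above could look roughly like the sketch below. It is an illustration only: `resolve_provider` and `resolve_providers` are hypothetical names, not the helper's actual API, and the list of available providers is passed in explicitly (in a real test it would presumably come from `onnxruntime.get_available_providers()`).

```python
from typing import List, Sequence

CUDA_EP = "CUDAExecutionProvider"
PLUGIN_EP = "CudaPluginExecutionProvider"


def resolve_provider(requested: str, available: Sequence[str]) -> str:
    """Map the in-tree CUDA EP name to the plugin EP name when the plugin is registered."""
    if requested == CUDA_EP and PLUGIN_EP in available:
        return PLUGIN_EP
    # Any other provider, or CUDA without the plugin, passes through unchanged.
    return requested


def resolve_providers(requested: Sequence[str], available: Sequence[str]) -> List[str]:
    """Resolve a whole provider list, preserving its order."""
    return [resolve_provider(name, available) for name in requested]
```

With this shape, existing tests keep requesting `"CUDAExecutionProvider"` and only exercise the plugin path on machines where the plugin EP has been registered.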

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

File	Description
onnxruntime/test/python/transformers/test_moe_cuda.py	Routes provider selection via plugin EP resolver; currently sets an env var at import time.
onnxruntime/test/python/transformers/test_gqa.py	Routes sessions via resolver and fixes `total_sequence_length` device placement.
onnxruntime/test/python/transformers/test_cuda_plugin_ep.py	New plugin EP test suite covering registration and operator validation stages.
onnxruntime/test/python/transformers/cuda_plugin_ep_helper.py	New helper for plugin EP registration + provider name resolution.


## Description

This PR adds a standalone CUDA Plugin Execution Provider (`CudaPluginExecutionProvider`) built as a dynamically loadable shared library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP Plugin API. The implementation reuses the existing CUDA kernel stack through adapter/shim layers (force-included headers and macro-based registration overrides), eliminating the need to maintain a parallel copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally deferred until the plugin-facing EP API exposes the required session callbacks.

## Summary of Changes

### Build system and CMake

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Adds `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN` build option, records plugin build info, and includes the plugin-specific CMake file. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | **New.** Defines the plugin shared-library target: collects `.cc`/`.cu` sources from `core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components. |
| `cmake/onnxruntime_providers_cuda.cmake` | Minor additions to expose include paths needed by plugin builds. |
| `cmake/onnxruntime_unittests.cmake` | Enables dynamic plugin EP usage in provider tests and fills in missing CUDA include/link settings for the plugin configuration. |
| `cmake/external/cuda_configuration.cmake` | Adds CUDA configuration support for the plugin build path. |

### Plugin runtime implementation (new files)

| File | Purpose |
|------|---------|
| `plugin/cuda_ep_factory.cc/.h` | Implements `OrtEpFactory`: device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization. |
| `plugin/cuda_ep.cc/.h` | Plugin-side CUDA EP object deriving from `ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels). |
| `plugin/cuda_allocator_plugin.cc/.h` | Plugin allocators for device and pinned memory, exposed through the EP API. |
| `plugin/cuda_stream_plugin.cc/.h` | Plugin-owned CUDA stream, cuBLAS, cuBLASLt, and cuDNN handle management. Provides two stream adapter modes (`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc` contexts). |
| `plugin/cuda_data_transfer_plugin.cc/.h` | Data transfer bridge for host↔device copies used by plugin-backed tensors and Python bindings. |
| `plugin/cuda_memcpy_plugin.cc` | MemcpyToHost / MemcpyFromHost kernel implementations for the plugin path. |
| `plugin/cuda_controlflow_plugin.cc/.cu/.h` | Plugin-native `If`, `Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow hooks instead of inheriting from in-tree CPU base implementations. |
| `plugin/cuda_plugin_ep.cc` | Exports the DLL entry points (`OrtCreateEpFactory` / `OrtReleaseEpFactory`) used by ORT to create and release the CUDA EP factory. |
| `plugin/cuda_kernel_adapter.h` | **Core shim** (1088 lines). Provides `CudaKernel` base class, error-return macros, type helpers (`ToCudaType`), handle-management abstractions, and stream adapters. Force-included in all plugin `.cc` files to transparently adapt existing kernel code. |
| `plugin/cuda_plugin_kernels.cu/.h` | Aggregates self-registered kernel definitions via `PluginKernelCollector` macro overrides, replacing the centralized registration tables used in the bundled build. |
| `plugin/cuda_plugin_utils.h` | Shared utility helpers for the plugin (logging, error checking, config parsing). |
| `plugin/provider_api_shims.cc` | Stub implementations for shared-provider bridge functions that are not needed in the plugin path. |
| `plugin/cuda_plugin_ep_symbols.def` | Windows symbol export definitions for the plugin DLL. |

### EP adapter and API extensions

| File | Change |
|------|--------|
| `include/onnxruntime/ep/api.h` | Makes plugin API initialization thread-safe; preserves access to ORT, EP, and model editor API tables during plugin loading. |
| `include/onnxruntime/ep/adapter/node.h` | Adds node metadata accessors (operator domain, optional-output handling) needed by reused CUDA kernels. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds `RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts. |
| `include/onnxruntime/ep/adapter/op_kernel_info.h` | Extends adapter kernel-info with attribute and config accessors required by migrated kernels. |
| `include/onnxruntime/ep/adapter/allocator.h` | Minor allocator adapter adjustments for plugin compatibility. |
| `include/onnxruntime/ep/adapter/kernel_def_builder.h` | Adds kernel definition builder hooks for plugin registration. |
| `include/onnxruntime/core/framework/tensor.h` | Restores a plugin-only `Tensor::Create` compatibility path for kernels relying on the older static factory form. |
| `onnxruntime/core/providers/shared_library/provider_api.h` | Turns the shared-provider bridge into a no-op for plugin builds so the EP adapter facade owns type resolution. |

### CUDA kernel compatibility migration

- Adapts ~80 core CUDA and contrib CUDA kernel source files to compile under the plugin build via macro-based registration overrides and targeted compatibility fixes (not operator rewrites).
- Moves or templates reusable helper logic in shared CPU/CUDA headers (`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`, `ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile in adapter mode.
- Key contrib kernel adaptations: attention variants (MHA, GQA, paged, sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse, bias-dropout, matmul-nbits, qordered ops.
- Key core kernel adaptations: softmax, topk, conv/conv-transpose, batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum, identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze, gather-nd, concat, dropout, non-max-suppression.

### Python integration

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_module.cc` | Extends `get_available_providers()` to surface dynamically registered plugin EPs discovered from `OrtEpDevice` enumeration. |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Allows Python session creation to instantiate providers from registered plugin EP devices, including `device_id` selection, instead of only built-in or legacy dynamic-load EP paths. |
| `onnxruntime/python/onnxruntime_pybind_schema.cc` | Adds schema query support for plugin-registered operators. |

### Testing and validation

| File | Change |
|------|--------|
| `test/python/transformers/test_cuda_plugin_ep.py` | **New** (1861 lines). Comprehensive test suite covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation. |
| `test/python/transformers/cuda_plugin_ep_helper.py` | **New** (192 lines). Utility for transparently routing existing tests to the plugin EP. |
| `test/python/transformers/test_gqa.py` | Fixes `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through plugin EP. |
| `test/python/transformers/test_moe_cuda.py` | Routes through plugin EP when available. |
| `test/framework/dynamic_plugin_ep_test.cc` | **New** (120 lines). C++ unit test exercising dynamic plugin EP loading and device enumeration. |
| `test/unittest_util/base_tester.cc` | Routes CUDA test requests to `CudaPluginExecutionProvider` when registered, allowing existing CUDA provider tests to exercise the plugin path. |
| `tools/ci_build/cuda_plugin_parity_report.py` | **New** (737 lines). Comparison script that produces a parity report of ops in bundled-only vs. plugin-only vs. both builds, via static parsing or runtime registry interrogation. |

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | **New** (990 lines). Plugin architecture, build/deployment flow, operator exclusions, adapter design, and the decision to defer CUDA Graph support. |
| `docs/cuda_plugin_ep/QUICK_START.md` | **New** (108 lines). Build instructions, C++ and Python usage examples, and known limitations. |

### Other

| File | Change |
|------|--------|
| `tools/python/gen_opkernel_doc.py` | Extended to generate documentation for plugin-registered kernels. |
| `orttraining/.../reduction_ops.cc` | Minor compatibility fix for training reduction ops under the plugin build configuration. |

## Testing

- **Build**: Configure with `--build_cuda_ep_as_plugin` (or `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify `libonnxruntime_providers_cuda_plugin.so` is produced alongside existing CUDA provider artifacts.
- **C++ unit tests**: Run `onnxruntime_provider_test`; `BaseTester` routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new `dynamic_plugin_ep_test` for load/enumerate validation.
- **Python tests**: Register the plugin library, confirm `onnxruntime.get_available_providers()` includes `CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- **Parity report**: Run `tools/ci_build/cuda_plugin_parity_report.py` to verify kernel coverage parity between bundled and plugin builds.
- **Backward compatibility**: Verify unchanged behavior for the in-tree CUDA EP build path (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).
- **Known limitation**: CUDA graph support remains disabled in the plugin path and is documented as deferred.

## Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly coupling its release cycle to the core runtime. This PR creates a path to decouple CUDA EP delivery by implementing it as a standalone plugin using the EP Plugin API. The key design tradeoff is reusing the existing ~100+ CUDA kernel implementations through force-include adapter headers and macro-based registration overrides, rather than rewriting them. This approach validates the plugin EP against current CUDA coverage without maintaining a second kernel stack, at the cost of introducing adapter/shim complexity. CUDA Graph support is explicitly deferred until the EP Plugin API can represent the capture/replay lifecycle.

**Related**: PR #27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is squash-merged into this branch.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
- [ ] CI passes
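
The parity report mentioned in the description (ops in bundled-only vs. plugin-only vs. both builds) reduces to set arithmetic once each build's op names are known. The sketch below is an assumption about the core comparison only; `parity_report` is a hypothetical name, the op names are placeholders, and it does not reproduce how `cuda_plugin_parity_report.py` collects its inputs.

```python
from typing import Dict, List, Set


def parity_report(bundled_ops: Set[str], plugin_ops: Set[str]) -> Dict[str, List[str]]:
    """Partition op names by which build registers them."""
    return {
        "bundled_only": sorted(bundled_ops - plugin_ops),  # missing from the plugin
        "plugin_only": sorted(plugin_ops - bundled_ops),   # missing from the bundled build
        "both": sorted(bundled_ops & plugin_ops),          # registered in both builds
    }


# Placeholder op names, not taken from either build's registry.
report = parity_report({"OpA", "OpB", "OpC"}, {"OpB", "OpC", "OpD"})
for bucket, names in report.items():
    print(f"{bucket}: {', '.join(names)}")
```

An empty `bundled_only` bucket would be the signal that the plugin build covers every kernel the bundled build registers.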

yuslepukhin requested a review from Copilot March 23, 2026 20:45

Copilot started reviewing on behalf of yuslepukhin March 23, 2026 20:46

Copilot AI reviewed Mar 23, 2026

github-advanced-security AI found potential problems Mar 23, 2026

yuslepukhin reviewed Mar 23, 2026

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py

yuslepukhin reviewed Mar 23, 2026

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated

yuslepukhin reviewed Mar 23, 2026

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py

yuslepukhin reviewed Mar 23, 2026

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated

tianleiwu marked this pull request as draft March 23, 2026 22:40

tianleiwu added 2 commits March 26, 2026 09:08

update

5c44e89

addresss feedback

0476aff

tianleiwu requested a review from Copilot March 26, 2026 16:59

tianleiwu marked this pull request as ready for review March 26, 2026 16:59

Copilot started reviewing on behalf of tianleiwu March 26, 2026 17:00

tianleiwu added 2 commits March 26, 2026 10:10

refine

4a41843

refactoring

2dc52be

tianleiwu merged commit 3c7e3e0 into tlwu/20260320/cuda_plugin Mar 26, 2026
4 checks passed

tianleiwu deleted the tlwu/20260320/cuda_plugin_tests branch March 26, 2026 17:24

tianleiwu mentioned this pull request Mar 29, 2026

CUDA Plugin EP: Core Implementation #27816

Merged

4 tasks

tianleiwu added the cuda_plugin label Apr 5, 2026

CUDA Plugin EP: Test Coverage & Bug Fixes #27817
tianleiwu merged 5 commits into tlwu/20260320/cuda_plugin from tlwu/20260320/cuda_plugin_tests

4 participants
