Plugin EP data transfer and Stream support. #25254
TODO: Flesh out for usage on ORT API boundary for model inputs/outputs including usage involving interop.
Add CopyTensors to the public API, plus the ability to create an OrtSyncStream to use with the copy.
Will this PR allow streaming the models from disk, thereby reducing device memory usage?
It can be used with CopyTensors, with the cudaStream_t from the OrtSyncStream passed to the session via provider options (the existing setup). Updated CopyTensors to take a single stream; the user can make multiple calls if they need to use multiple streams.
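Because each CopyTensors call takes a single stream, a caller with copies destined for several streams groups them and issues one call per stream. The sketch below models only that calling pattern. It uses hypothetical stand-in types (`FakeTensor`, `FakeStream`, `CopyTensorsOnStream`, `CopyGroupedByStream`) and does not call the ONNX Runtime API. A null stream key stands in for a copy issued without a stream.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for OrtValue and OrtSyncStream (illustration only).
struct FakeTensor {
  std::vector<float> data;
};
struct FakeStream {
  int id;
  std::size_t calls = 0;  // number of copy calls submitted on this stream
};

// Models a CopyTensors-style call: copy src[i] into dst[i], all on ONE stream.
// A null stream stands in for a copy issued without a stream.
void CopyTensorsOnStream(const std::vector<const FakeTensor*>& src,
                         const std::vector<FakeTensor*>& dst, FakeStream* stream) {
  for (std::size_t i = 0; i < src.size(); ++i) {
    dst[i]->data = src[i]->data;
  }
  if (stream != nullptr) {
    ++stream->calls;
  }
}

struct CopyRequest {
  const FakeTensor* src;
  FakeTensor* dst;
  FakeStream* stream;
};

// Groups requests by stream and issues one CopyTensorsOnStream call per distinct
// stream. Returns the number of calls made.
std::size_t CopyGroupedByStream(const std::vector<CopyRequest>& requests) {
  std::map<FakeStream*, std::pair<std::vector<const FakeTensor*>, std::vector<FakeTensor*>>> groups;
  for (const CopyRequest& r : requests) {
    groups[r.stream].first.push_back(r.src);
    groups[r.stream].second.push_back(r.dst);
  }
  for (auto& entry : groups) {
    CopyTensorsOnStream(entry.second.first, entry.second.second, entry.first);
  }
  return groups.size();
}
```

Grouping keeps every copy that shares a stream inside one ordered submission. Copies on different streams remain independent of each other.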
The data copy test with CUDA works, including passing the cudaStream_t to the session via provider options. A follow-up PR will make it possible to provide an OrtSyncStream to Run instead.
Check return status in data copy unit test.
```
error: The difference between expected[i] and results[i] is 8.8930130004882812e-05, which exceeds 1e-5, where
expected[i] evaluates to -1.2537282705307007,
results[i] evaluates to -1.2536393404006958, and
1e-5 evaluates to 1.0000000000000001e-05.
```
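This failure comes from an absolute-tolerance check. The gap |-1.2537282705307007 - (-1.2536393404006958)| is about 8.89e-05, which exceeds 1e-5. Relative to the magnitude of the expected value, however, it is only about 7.1e-05. A combined absolute/relative comparison is one common way to tolerate rounding differences like this. The helper `NearlyEqual` below is hypothetical and is not taken from the PR's test code, and the PR does not say which tolerance the test ended up using.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not the PR's test code: treats two values as equal if their
// difference is within an absolute tolerance OR within a tolerance relative to the
// magnitude of the expected value.
bool NearlyEqual(double expected, double actual, double abs_tol, double rel_tol) {
  const double diff = std::fabs(expected - actual);
  return diff <= abs_tol || diff <= rel_tol * std::fabs(expected);
}
```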
…reateSyncStreamForDevice. An EP implementation may depend on per-session settings or need to create a per-session allocator (WebGPU does), so we need to either provide the OrtEp this way or add functions to OrtEp that create the allocator/stream. Add the new function implementations to the QNN OrtEpFactory.
Pull Request Overview
Adds support for asynchronous data transfer and stream synchronization for plugin execution providers (EPs), including new C API extensions, environment/allocator updates, and example implementations.
- Introduce `OrtSyncStream`, `OrtSyncNotification`, and `CopyTensors` APIs in the C API and internal `OrtApis`.
- Extend `Environment` to register/unregister plugin EP data transfer and allocators (including arena wrap), and to manage sync streams.
- Provide example stream/notification implementations for CUDA and the plugin EP test library, plus new tests in `test_data_copy.cc`.
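The stream/notification pair follows the usual device-stream pattern, as with CUDA streams and events. Work on a single stream runs in order. A notification activated on a producer stream lets a consumer stream wait until the producer's earlier work is done. The single-threaded simulation below sketches only those ordering semantics. `FakeStream`, `FakeNotification`, and `RunAll` are hypothetical and say nothing about how `OrtSyncStream` or `OrtSyncNotification` are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical model of a sync notification: a flag one stream sets and another waits on.
struct FakeNotification {
  bool activated = false;
};

// Hypothetical model of a sync stream: an in-order queue of work items.
// A work item returns false if it cannot run yet (it is blocked on a notification).
struct FakeStream {
  std::vector<std::function<bool()>> work;
  std::size_t next = 0;
  bool Done() const { return next == work.size(); }
  // Runs items in order until one blocks; returns true if any item completed.
  bool Step() {
    bool progressed = false;
    while (!Done() && work[next]()) {
      ++next;
      progressed = true;
    }
    return progressed;
  }
};

// Round-robin "device": advances every stream until all finish or none can move.
// Returns false if some stream is still blocked (a deadlock in this model).
bool RunAll(const std::vector<FakeStream*>& streams) {
  bool progressed = true;
  while (progressed) {
    progressed = false;
    for (FakeStream* s : streams) progressed = s->Step() || progressed;
  }
  for (FakeStream* s : streams) {
    if (!s->Done()) return false;
  }
  return true;
}
```

In a typical use, a compute stream enqueues a wait on a notification that the copy stream activates after its copy. In the model, the compute work then runs after the copy even when the compute stream is scheduled first.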
Reviewed Changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/shared_lib/test_data_copy.cc | Add tests for plugin EP data copy and optional stream use |
| onnxruntime/core/session/ort_apis.h | Extend C API with SyncStream, SyncNotification, and CopyTensors |
| onnxruntime/core/framework/plugin_ep_stream.* | Implement plugin EP stream & notification wrapper classes |
| onnxruntime/core/session/environment.cc | Register plugin EP data transfers and manage shared allocators |
| onnxruntime/core/session/ep_factory_internal.* | Refactor EP factory to support stream creation & data transfer |
Comments suppressed due to low confidence (1)
onnxruntime/test/autoep/library/ep_data_transfer.cc:66
- [nitpick] There's a typo in the comment ('teh'). Consider correcting the spelling or removing this outdated commented line.
// the implementation for a 'real' EP would be something along these lines.
Add functions to the OrtEp struct to create an allocator and a sync stream, so an EP can create per-session instances if it chooses to. This is needed for CUDA (the sync stream needs to read a session option for graph capture) and WebGPU (the session allocator uses a different memory type from the shared allocator).
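One way to picture these per-session hooks is as optional function pointers that fall back to the factory's shared instance. The sketch below shows that dispatch with hypothetical, simplified structs (`FakeEp`, `FakeEpFactory`, `GetSessionAllocator`, `CreatePerSessionAllocator`). The real `OrtEp` and `OrtEpFactory` fields and signatures differ. Treating a null hook as "use the factory's shared allocator" is an assumption made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical allocator stand-in; memory_type mimics "different memory type".
struct FakeAllocator {
  std::string memory_type;
};

// Hypothetical factory: owns one shared allocator reused across sessions.
struct FakeEpFactory {
  std::shared_ptr<FakeAllocator> shared_allocator;
};

// Hypothetical per-session EP with an OPTIONAL allocator hook (assumption:
// nullptr means the EP is fine with the factory's shared allocator).
struct FakeEp {
  FakeEpFactory* factory;
  std::shared_ptr<FakeAllocator> (*CreateAllocator)(const FakeEp* self);
  std::string session_memory_type;  // stands in for per-session settings
};

// Dispatch: prefer the EP's per-session hook, else fall back to the shared instance.
std::shared_ptr<FakeAllocator> GetSessionAllocator(const FakeEp& ep) {
  if (ep.CreateAllocator != nullptr) return ep.CreateAllocator(&ep);
  return ep.factory->shared_allocator;
}

// Example hook an EP might provide: a fresh allocator driven by session settings.
std::shared_ptr<FakeAllocator> CreatePerSessionAllocator(const FakeEp* self) {
  return std::make_shared<FakeAllocator>(FakeAllocator{self->session_memory_type});
}
```

Keeping the hook optional lets simple EPs reuse the shared instance unchanged. EPs whose behavior depends on session options can still opt in.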
Hi there! We haven't cut the release branch for this version yet, so I'm removing the |
Description
Plugin EP data transfer and Stream support.
Add the ability for a plugin EP to provide an IDataTransfer implementation and an OrtSyncStream implementation to do async data copy outside of an inference session.
Example usage added for CUDA EP.
Caveat: Support for providing the OrtSyncStream from the data copy to Session.Run will come in a follow-up PR. For the CUDA EP, the native cudaStream_t from the OrtSyncStream used for the data copy can be passed to Run via the CUDA EP provider options.
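The interim CUDA workaround amounts to passing a raw native handle through string-valued options. A sketch of that round trip follows. The option keys `has_user_compute_stream` and `user_compute_stream` are assumptions to check against the CUDA EP documentation, as is the convention of encoding the pointer as a decimal address. `MakeStreamOptions` and `ParseStreamOption` are hypothetical helpers. No CUDA calls are made; any pointer stands in for a `cudaStream_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical: encode a native stream handle (e.g. the cudaStream_t behind an
// OrtSyncStream) as option text. Key names and encoding are ASSUMPTIONS.
std::map<std::string, std::string> MakeStreamOptions(void* native_stream) {
  std::map<std::string, std::string> options;
  options["has_user_compute_stream"] = "1";
  options["user_compute_stream"] =
      std::to_string(reinterpret_cast<std::uintptr_t>(native_stream));
  return options;
}

// Hypothetical receiving side: parse the decimal address back into a pointer.
void* ParseStreamOption(const std::map<std::string, std::string>& options) {
  auto it = options.find("user_compute_stream");
  if (it == options.end()) return nullptr;
  std::uintptr_t address = 0;
  std::istringstream in(it->second);
  in >> address;
  return reinterpret_cast<void*>(address);
}
```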
Motivation and Context