4 changes: 2 additions & 2 deletions docs/execution-providers/OpenVINO-ExecutionProvider.md
@@ -73,7 +73,7 @@ Use `AUTO:<device 1><device 2>..` as the device name to delegate selection of an
From the application's point of view, this is just another device that handles all accelerators in the full system.

For more information on the Auto-Device plugin of OpenVINO, please refer to the
-[Intel OpenVINO Auto Device Plugin](https://docs.openvino.ai/latest/openvino_docs_IE_DG_supported_plugins_AUTO.html).
+[Intel OpenVINO Auto Device Plugin](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Hetero_execution.html).
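
As a small illustration of the device string described above, below is a minimal Python sketch of creating a session with the OpenVINO EP and an `AUTO:` device. The `device_type` provider option, the `AUTO:GPU,CPU` value, and the model filename are assumptions for illustration; the exact option names and accepted values depend on your ONNX Runtime / OpenVINO EP version.

```python
import onnxruntime as ort

# Assumed provider option: "device_type" with an "AUTO:<device 1>,<device 2>" value
# delegates selection of the actual accelerator to OpenVINO's Auto-Device plugin.
providers = [("OpenVINOExecutionProvider", {"device_type": "AUTO:GPU,CPU"})]
sess = ort.InferenceSession("my_model.onnx", providers=providers)
```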

### Model caching feature for OpenVINO EP

@@ -84,7 +84,7 @@ This feature enables users to save and load the blobs directly. These pre-compil

#### CL Cache capability for iGPU

-Starting from version 2021.4 OpenVINO supports [model caching](https://docs.openvino.ai/latest/openvino_docs_IE_DG_Model_caching_overview.html).
+Starting from version 2021.4 OpenVINO supports [model caching](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Model_caching_overview.html).

This feature enables users to save and load the cl_cache files directly. These cl_cache files can be loaded directly onto the iGPU hardware device target and inference can then be run. This feature is only supported on the iGPU hardware device target.
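
A minimal Python sketch of pointing the OpenVINO EP at a cache directory is shown below. The `cache_dir` and `device_type` option names, their values, and the model filename are assumptions for illustration and may differ between OpenVINO EP releases, so check the EP documentation for your version.

```python
import onnxruntime as ort

# Assumed provider options: "device_type" selects the iGPU target and "cache_dir"
# names a directory where compiled blobs / cl_cache files are written on the first
# run and reloaded on subsequent runs.
providers = [("OpenVINOExecutionProvider", {"device_type": "GPU_FP32", "cache_dir": "./ov_cache"})]
sess = ort.InferenceSession("my_model.onnx", providers=providers)
```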

22 changes: 15 additions & 7 deletions docs/performance/tune-performance.md
@@ -349,17 +349,24 @@ While using the CUDA EP, ORT supports the usage of [CUDA Graphs](https://develop
Currently, there are some constraints on using the CUDA Graphs feature, which are listed below:

1) Models with control-flow ops (i.e., models with `If`, `Loop`, and `Scan` ops) are not supported

2) Usage of CUDA Graphs is limited to models wherein all the model ops (graph nodes) can be partitioned to the CUDA EP

3) The input/output types of models need to be tensors

4) Shapes of inputs/outputs cannot change across inference calls. Dynamic shape models are supported - the only constraint is that the input/output shapes should be the same across all inference calls

5) By design, [CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/) reads from/writes to the same CUDA virtual memory addresses during the graph replay step as it does during the graph capture step. Due to this requirement, using this feature requires IOBinding so as to bind memory that will be used as input(s)/output(s) for the CUDA Graph machinery to read from/write to (please see the samples below)

6) While updating the input(s) for subsequent inference calls, the fresh input(s) need to be copied over to the corresponding CUDA memory location(s) of the bound `OrtValue` input(s) (please see the samples below for how this can be achieved). This is because the "graph replay" will require reading inputs from the same CUDA virtual memory addresses

7) Multi-threaded usage is currently not supported, i.e., `Run()` MAY NOT be invoked on the same `InferenceSession` object from multiple threads while using CUDA Graphs

NOTE: The very first `Run()` performs a variety of tasks under the hood like making CUDA memory allocations, capturing the CUDA graph for the model, and then performing a graph replay to ensure that the graph runs. Due to this, the latency associated with the first `Run()` is bound to be high. The subsequent `Run()`s only perform graph replays of the graph captured and cached in the first `Run()`.

* Python

```python
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": '1'})]
sess_options = ort.SessionOptions()
sess = ort.InferenceSession("my_model.onnx", sess_options = sess_options, providers=providers)
# ... @@ -373,26 +380,27 @@ (lines collapsed in the diff view) ...
y_ortvalue = onnxrt.OrtValue.ortvalue_from_numpy(y, 'cuda', 0)
session = onnxrt.InferenceSession("matmul_2.onnx", providers=providers)
io_binding = session.io_binding()

# Bind the input and output
io_binding.bind_ortvalue_input('X', x_ortvalue)
io_binding.bind_ortvalue_output('Y', y_ortvalue)

# One regular run for the necessary memory allocation and cuda graph capturing
session.run_with_iobinding(io_binding)
expected_y = np.array([[5.0], [11.0], [17.0]], dtype=np.float32)
np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

# After capturing, CUDA graph replay happens from this Run onwards
session.run_with_iobinding(io_binding)
np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

# Update input and then replay CUDA graph with the updated input
x_ortvalue.update_inplace(np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]], dtype=np.float32))
session.run_with_iobinding(io_binding)
```

* C/C++

```c++
const auto& api = Ort::GetApi();

struct CudaMemoryDeleter {
  // ... (remainder of the example collapsed in the diff view)