diff --git a/docs/execution-providers/OpenVINO-ExecutionProvider.md b/docs/execution-providers/OpenVINO-ExecutionProvider.md
index 400d6f84e1860..726b3c5ec2fb9 100644
--- a/docs/execution-providers/OpenVINO-ExecutionProvider.md
+++ b/docs/execution-providers/OpenVINO-ExecutionProvider.md
@@ -73,7 +73,7 @@ Use `AUTO:..` as the device name to delegate selection of an
 From the application point of view, this is just another device that handles all accelerators in full system.

 For more information on Auto-Device plugin of OpenVINO, please refer to the
-[Intel OpenVINO Auto Device Plugin](https://docs.openvino.ai/latest/openvino_docs_IE_DG_supported_plugins_AUTO.html).
+[Intel OpenVINO Auto Device Plugin](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_AUTO.html).

 ### Model caching feature for OpenVINO EP

@@ -84,7 +84,7 @@ This feature enables users to save and load the blobs directly. These pre-compil

 #### CL Cache capability for iGPU

-Starting from version 2021.4 OpenVINO supports [model caching](https://docs.openvino.ai/latest/openvino_docs_IE_DG_Model_caching_overview.html).
+Starting from version 2021.4 OpenVINO supports [model caching](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Model_caching_overview.html).
 This feature enables users to save and load the cl_cache files directly. These cl_cache files can be directly loaded on to igpu hardware device target and inferencing can be done.
 This feature is only supported on iGPU hardware device target.
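The model caching hunks above describe saving and reloading compiled blobs and cl_cache files, but the patch does not show how an application opts in. Below is a minimal sketch of one way this is commonly configured through OpenVINO EP provider options in Python. The `cache_dir` option name and the `GPU_FP32` device type used here are assumptions, not taken from this patch, and may differ across ONNX Runtime releases.

```python
import onnxruntime as ort

# Hedged sketch: point the OpenVINO EP at a cache directory so compiled blobs
# (or cl_cache files on an iGPU target) can be reused across sessions.
# "cache_dir" and the "GPU_FP32" device_type value are illustrative assumptions;
# the supported option names/values depend on the ONNX Runtime release.
provider_options = {"device_type": "GPU_FP32", "cache_dir": "./ov_cache"}
providers = [("OpenVINOExecutionProvider", provider_options)]

# The first session compiles the model and populates the cache directory;
# later sessions created with the same options can load the cached artifacts
# instead of recompiling the model.
sess = ort.InferenceSession("my_model.onnx", providers=providers)
```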
diff --git a/docs/performance/tune-performance.md b/docs/performance/tune-performance.md
index 09be513f5a9d7..708a9bdf86890 100644
--- a/docs/performance/tune-performance.md
+++ b/docs/performance/tune-performance.md
@@ -349,17 +349,24 @@ While using the CUDA EP, ORT supports the usage of [CUDA Graphs](https://develop
 Currently, there are some constraints with regards to using the CUDA Graphs feature which are listed below:
 1) Models with control-flow ops (i.e.) models with `If`, `Loop`, and `Scan` ops are not supported
+
 2) Usage of CUDA Graphs is limited to models where-in all the model ops (graph nodes) can be partitioned to the CUDA EP
-3) The input/output types of models need to be tensors
+
+3) The input/output types of models need to be tensors
+
 4) Shapes of inputs/outputs cannot change across inference calls. Dynamic shape models are supported - the only constraint is that the input/output shapes should be the same across all inference calls
+
 5) By design, [CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/) is designed to read from/write to the same CUDA virtual memory addresses during the graph replaying step as it does during the graph capturing step. Due to this requirement, usage of this feature requires using IOBinding so as to bind memory which will be used as input(s)/output(s) for the CUDA Graph machinery to read from/write to(please see samples below)
+
 6) While updating the input(s) for subsequent inference calls, the fresh input(s) need to be copied over to the corresponding CUDA memory location(s) of the bound `OrtValue` input(s) (please see samples below to see how this can be achieved). This is due to the fact that the "graph replay" will require reading inputs from the same CUDA virtual memory addresses
+
 7) Multi-threaded usage is not supported currently (i.e.) `Run()` MAY NOT be invoked on the same `InferenceSession` object from multiple threads while using CUDA Graphs

 NOTE: The very first `Run()` performs a variety of tasks under the hood like making CUDA memory allocations, capturing the CUDA graph for the model, and then performing a graph replay to ensure that the graph runs. Due to this, the latency associated with the first `Run()` is bound to be high. The subsequent `Run()`s only perform graph replays of the graph captured and cached in the first `Run()`.

 * Python
-```
+
+```python
 providers = [("CUDAExecutionProvider", {"enable_cuda_graph": '1'})]

 sess_options = ort.SessionOptions()
 sess = ort.InferenceSession("my_model.onnx", sess_options = sess_options, providers=providers)
@@ -373,26 +380,27 @@ y_ortvalue = onnxrt.OrtValue.ortvalue_from_numpy(y, 'cuda', 0)
 session = onnxrt.InferenceSession("matmul_2.onnx", providers=providers)
 io_binding = session.io_binding()

-'''Bind the input and output'''
+# Bind the input and output
 io_binding.bind_ortvalue_input('X', x_ortvalue)
 io_binding.bind_ortvalue_output('Y', y_ortvalue)

-'''One regular run for the necessary memory allocation and cuda graph capturing'''
+# One regular run for the necessary memory allocation and cuda graph capturing
 session.run_with_iobinding(io_binding)
 expected_y = np.array([[5.0], [11.0], [17.0]], dtype=np.float32)
 np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

-'''After capturing, CUDA graph replay happens from this Run onwards'''
+# After capturing, CUDA graph replay happens from this Run onwards
 session.run_with_iobinding(io_binding)
 np.testing.assert_allclose(expected_y, y_ortvalue.numpy(), rtol=1e-05, atol=1e-05)

-'''Update input and then replay CUDA graph with the updated input'''
+# Update input and then replay CUDA graph with the updated input
 x_ortvalue.update_inplace(np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]], dtype=np.float32))
 session.run_with_iobinding(io_binding)
 ```

 * C/C++
-```
+
+```c++
 const auto& api = Ort::GetApi();

 struct CudaMemoryDeleter {