docs/execution-providers/TensorRT-ExecutionProvider.md (17 changes: 15 additions & 2 deletions)
| **Precision and Performance** | | |
| Set TensorRT EP GPU memory usage limit | [trt_max_workspace_size](./TensorRT-ExecutionProvider.md#trt_max_workspace_size) | int |
| Enable FP16 precision for faster performance | [trt_fp16_enable](./TensorRT-ExecutionProvider.md#trt_fp16_enable) | bool |
| Enable BF16 precision for faster performance | [trt_bf16_enable](./TensorRT-ExecutionProvider.md#trt_bf16_enable) | bool |
| Enable INT8 precision for quantized inference | [trt_int8_enable](./TensorRT-ExecutionProvider.md#trt_int8_enable) | bool |
| Name INT8 calibration table for non-QDQ models | [trt_int8_calibration_table_name](./TensorRT-ExecutionProvider.md#trt_int8_calibration_table_name) | string |
| Use native TensorRT calibration tables | [trt_int8_use_native_calibration_table](./TensorRT-ExecutionProvider.md#trt_int8_use_native_calibration_table) | bool |

> Note: not all Nvidia GPUs support FP16 precision.

##### trt_bf16_enable

* Description: enable BF16 mode in TensorRT.

> Note: not all Nvidia GPUs support BF16 precision.

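As a rough sketch (assuming the Python API and a placeholder model path), the precision options listed in the table above can be passed as TensorRT EP provider options when creating a session:

```python
import onnxruntime as ort

# Sketch: request FP16/BF16 kernels through TensorRT EP provider options.
# "model.onnx" is a placeholder; option values are parsed by the EP at session creation.
trt_options = {
    "trt_fp16_enable": True,
    "trt_bf16_enable": True,
    "trt_max_workspace_size": 2 * 1024 * 1024 * 1024,  # 2 GB workspace limit
}

providers = [
    ("TensorrtExecutionProvider", trt_options),
    "CUDAExecutionProvider",  # fallback for nodes TensorRT does not take
]

session = ort.InferenceSession("model.onnx", providers=providers)
```

The environment variables described later on this page set the same knobs process-wide; provider options scope them to a single inference session.
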
##### trt_int8_enable

* Description: enable INT8 mode in TensorRT.

* The engine will be cached when it is built for the first time, so the next time a new inference session is created the engine can be loaded directly from the cache. To validate that a loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be used safely. Otherwise, if the input shapes are out of range, the profile cache is updated to cover the new shapes and the engine is recreated based on the new profile (and refreshed in the engine cache).

* Note that each engine is created for specific settings, such as model path/name, precision (FP32/FP16/BF16/INT8, etc.), workspace, and profiles, and for a specific GPU, so it is not portable. Make sure those settings do not change; otherwise the engine needs to be rebuilt and cached again.

> **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:**
>
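
A minimal sketch of enabling engine caching from Python, assuming the `trt_engine_cache_path` option (not listed above) for the cache directory and a placeholder model path:

```python
import onnxruntime as ort

# Sketch: persist built TensorRT engines so later sessions skip the build step.
# "./trt_cache" and "model.onnx" are placeholder paths.
trt_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
}

providers = [("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider"]

# The first session builds the engine and writes .engine/.profile files to ./trt_cache;
# later sessions with identical settings (model, precision, workspace, GPU) reuse them.
session = ort.InferenceSession("model.onnx", providers=providers)
```
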
The following environment variables can be set for the TensorRT execution provider.

* `ORT_TENSORRT_FP16_ENABLE`: Enable FP16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support FP16 precision.

* `ORT_TENSORRT_BF16_ENABLE`: Enable BF16 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support BF16 precision.

* `ORT_TENSORRT_INT8_ENABLE`: Enable INT8 mode in TensorRT. 1: enabled, 0: disabled. Default value: 0. Note not all Nvidia GPUs support INT8 precision.

* `ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME`: Specify the INT8 calibration table file for non-QDQ models in INT8 mode. Note that a calibration table should not be provided for a QDQ model, because TensorRT does not allow a calibration table to be loaded if there is any Q/DQ node in the model. By default the name is empty.

* `ORT_TENSORRT_DLA_CORE`: Specify DLA core to execute on. Default value: 0.

* `ORT_TENSORRT_ENGINE_CACHE_ENABLE`: Enable TensorRT engine caching. The purpose of engine caching is to save engine build time in cases where TensorRT may take a long time to optimize and build an engine. The engine is cached when it is built for the first time, so the next time a new inference session is created the engine can be loaded directly from the cache. To validate that a loaded engine is usable for the current inference, the engine profile is also cached and loaded along with the engine. If the current input shapes are within the range of the engine profile, the loaded engine can be used safely. Otherwise, if the input shapes are out of range, the profile cache is updated to cover the new shapes and the engine is recreated based on the new profile (and refreshed in the engine cache). Note that each engine is created for specific settings, such as model path/name, precision (FP32/FP16/BF16/INT8, etc.), workspace, and profiles, and for a specific GPU, so it is not portable; make sure those settings do not change, otherwise the engine needs to be rebuilt and cached again. 1: enabled, 0: disabled. Default value: 0.
* **Warning: Please clean up any old engine and profile cache files (.engine and .profile) if any of the following changes:**
* Model changes (if there are any changes to the model topology, opset version, operators etc.)
* ORT version changes (i.e. moving from ORT version 1.8 to 1.9)

export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=5

# Enable FP16 mode in TensorRT
export ORT_TENSORRT_FP16_ENABLE=1

# Enable BF16 mode in TensorRT
export ORT_TENSORRT_BF16_ENABLE=1

# Enable INT8 mode in TensorRT
export ORT_TENSORRT_INT8_ENABLE=1
