From 784a6864aa457cb5285f118ddc8d2a213ae71471 Mon Sep 17 00:00:00 2001
From: Adam Stachowicz <astachowicz@habana.ai>
Date: Wed, 4 Dec 2024 16:31:04 +0200
Subject: [PATCH 1/2] Update README example with bf16 sdp option

---
 examples/README.md                        | 13 +++++++++++++
 examples/audio-classification/README.md   | 15 +++++++++++++++
 examples/contrastive-image-text/README.md | 15 +++++++++++++++
 examples/image-classification/README.md   | 14 ++++++++++++++
 examples/question-answering/README.md     | 15 +++++++++++++++
 examples/speech-recognition/README.md     | 14 ++++++++++++++
 examples/stable-diffusion/README.md       | 16 ++++++++++++++++
 examples/text-classification/README.md    | 16 ++++++++++++++++
 8 files changed, 118 insertions(+)

diff --git a/examples/README.md b/examples/README.md
index 9b4a65f31d..af8e13dd8c 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -22,6 +22,19 @@ Other [examples](https://github.com/huggingface/transformers/tree/main/examples/
 - replacing the `TrainingArguments` from 🤗 Transformers with the `GaudiTrainingArguments` from 🤗 Optimum Habana.
 
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with the following setting:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
+
+
 ## Distributed training
 
 All the PyTorch training scripts in this repository work out of the box with distributed training.
diff --git a/examples/audio-classification/README.md b/examples/audio-classification/README.md
index aaa45425cc..f724583e61 100644
--- a/examples/audio-classification/README.md
+++ b/examples/audio-classification/README.md
@@ -20,6 +20,21 @@ The following examples showcase how to fine-tune `Wav2Vec2` for audio classifica
 
 Speech recognition models that have been pretrained in an unsupervised fashion on audio data alone, *e.g.* [Wav2Vec2](https://huggingface.co/transformers/main/model_doc/wav2vec2.html), have shown to require only very little annotated data to yield good performance on speech classification datasets.
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
+
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/contrastive-image-text/README.md b/examples/contrastive-image-text/README.md
index 7a095bc9ca..ab2fd6c23e 100644
--- a/examples/contrastive-image-text/README.md
+++ b/examples/contrastive-image-text/README.md
@@ -23,6 +23,21 @@ This folder contains two examples:
 
 Such models can be used for natural language image search and potentially zero-shot image classification.
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
+
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/image-classification/README.md b/examples/image-classification/README.md
index 08c4d67123..b03c5f87d1 100644
--- a/examples/image-classification/README.md
+++ b/examples/image-classification/README.md
@@ -18,6 +18,20 @@ limitations under the License.
 
 This directory contains a script that showcases how to fine-tune any model supported by the [`AutoModelForImageClassification` API](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForImageClassification) (such as [ViT](https://huggingface.co/docs/transformers/main/en/model_doc/vit) or [Swin Transformer](https://huggingface.co/docs/transformers/main/en/model_doc/swin)) on HPUs. They can be used to fine-tune models on both [datasets from the hub](#using-datasets-from-hub) as well as on [your own custom data](#using-your-own-data). This directory also contains a script to demonstrate a single HPU inference for [PyTorch-Image-Models/TIMM](https://huggingface.co/docs/timm/index).
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
 
 ## Requirements
 
diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md
index 654a9e02ad..5993663503 100755
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -26,6 +26,21 @@ uses special features of those tokenizers. You can check if your favorite model
 
 Note that if your dataset contains samples with no possible answers (like SQUAD version 2), you need to pass along the flag `--version_2_with_negative`.
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
+
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/speech-recognition/README.md b/examples/speech-recognition/README.md
index 02e4b53d66..dbb86915fe 100644
--- a/examples/speech-recognition/README.md
+++ b/examples/speech-recognition/README.md
@@ -26,6 +26,20 @@ limitations under the License.
 	- [Fine tuning](#single-hpu-whisper-fine-tuning-with-seq2seq)
 	- [Inference](#single-hpu-seq2seq-inference)
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
 
 ## Requirements
 
diff --git a/examples/stable-diffusion/README.md b/examples/stable-diffusion/README.md
index f4df474f09..62b5ab30ab 100644
--- a/examples/stable-diffusion/README.md
+++ b/examples/stable-diffusion/README.md
@@ -21,6 +21,22 @@ This directory contains a script that showcases how to perform text-to-image gen
 Stable Diffusion was proposed in [Stable Diffusion Announcement](https://stability.ai/blog/stable-diffusion-announcement) by Patrick Esser and Robin Rombach and the Stability AI team.
 
 
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
+
+
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/text-classification/README.md b/examples/text-classification/README.md
index 32bc3fd5f8..83ffba65d8 100644
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -27,6 +27,22 @@ and can also be used for a dataset hosted on our [hub](https://huggingface.co/da
 
 GLUE is made up of a total of 9 different tasks where the task name can be cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte or wnli.
 
+
+## Important Note on Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
+1. Using the following setting in your script:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+2. Using the --sdp_on_bf16 switch when calling the example script.
+
+Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
+
 ## Requirements
 
 First, you should install the requirements:

From dbb5056ddff564a8d4778aa0130a84d8307ee49d Mon Sep 17 00:00:00 2001
From: Adam Stachowicz <astachowicz@habana.ai>
Date: Fri, 6 Dec 2024 16:02:07 +0200
Subject: [PATCH 2/2] Remove per-example readme, add global readme

---
 README.md                                      | 16 ++++++++++++++++
 examples/README.md                             | 13 -------------
 examples/audio-classification/README.md        | 18 +++---------------
 examples/contrastive-image-text/README.md      | 15 ---------------
 examples/image-classification/README.md        | 18 ++++--------------
 examples/question-answering/README.md          | 15 ---------------
 .../paraphrases/training_paraphrases.py        |  1 +
 examples/speech-recognition/README.md          | 14 --------------
 examples/stable-diffusion/README.md            | 16 ----------------
 examples/stable-diffusion/training/README.md   |  7 +++++++
 examples/text-classification/README.md         | 16 ----------------
 11 files changed, 31 insertions(+), 118 deletions(-)

diff --git a/README.md b/README.md
index 7776f4cc5a..28dd124121 100644
--- a/README.md
+++ b/README.md
@@ -175,6 +175,22 @@ outputs = generator(
 ```
 
 
+## Important Note on PyTorch 2.5 Performance Degradation
+
+With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
+
+"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
+
+For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with the following setting:
+```python
+torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
+```
+Additionally, the next release of Optimum Habana will include a Gaudi-specific `safe_softmax` implementation that will also improve performance.
+
+More info:
+- [PyTorch notes on numerical accuracy](https://pytorch.org/docs/stable/notes/numerical_accuracy.html)
+
+
 ### Documentation
 
 Check out [the documentation of Optimum for Intel Gaudi](https://huggingface.co/docs/optimum/habana/index) for more advanced usage.
diff --git a/examples/README.md b/examples/README.md
index af8e13dd8c..9b4a65f31d 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -22,19 +22,6 @@ Other [examples](https://github.com/huggingface/transformers/tree/main/examples/
 - replacing the `TrainingArguments` from 🤗 Transformers with the `GaudiTrainingArguments` from 🤗 Optimum Habana.
 
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with the following setting:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
-
-
 ## Distributed training
 
 All the PyTorch training scripts in this repository work out of the box with distributed training.
diff --git a/examples/audio-classification/README.md b/examples/audio-classification/README.md
index f724583e61..dafced7a58 100644
--- a/examples/audio-classification/README.md
+++ b/examples/audio-classification/README.md
@@ -20,21 +20,6 @@ The following examples showcase how to fine-tune `Wav2Vec2` for audio classifica
 
 Speech recognition models that have been pretrained in an unsupervised fashion on audio data alone, *e.g.* [Wav2Vec2](https://huggingface.co/transformers/main/model_doc/wav2vec2.html), have shown to require only very little annotated data to yield good performance on speech classification datasets.
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
-
 ## Requirements
 
 First, you should install the requirements:
@@ -71,6 +56,7 @@ python run_audio_classification.py \
     --use_hpu_graphs_for_inference \
     --gaudi_config_name Habana/wav2vec2 \
     --throughput_warmup_steps 3 \
+    --sdp_on_bf16 \
     --bf16 \
     --trust_remote_code True
 ```
@@ -108,6 +94,7 @@ PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
     --use_lazy_mode False\
     --gaudi_config_name Habana/wav2vec2 \
     --throughput_warmup_steps 3 \
+    --sdp_on_bf16 \
     --bf16 \
     --trust_remote_code True \
     --torch_compile \
@@ -188,6 +175,7 @@ python run_audio_classification.py \
     --use_lazy_mode \
     --use_hpu_graphs_for_inference \
     --gaudi_config_name Habana/wav2vec2 \
+    --sdp_on_bf16 \
     --bf16 \
     --trust_remote_code True\
     --torch_compile \
diff --git a/examples/contrastive-image-text/README.md b/examples/contrastive-image-text/README.md
index ab2fd6c23e..7a095bc9ca 100644
--- a/examples/contrastive-image-text/README.md
+++ b/examples/contrastive-image-text/README.md
@@ -23,21 +23,6 @@ This folder contains two examples:
 
 Such models can be used for natural language image search and potentially zero-shot image classification.
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
-
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/image-classification/README.md b/examples/image-classification/README.md
index b03c5f87d1..01b19b25ba 100644
--- a/examples/image-classification/README.md
+++ b/examples/image-classification/README.md
@@ -18,20 +18,6 @@ limitations under the License.
 
 This directory contains a script that showcases how to fine-tune any model supported by the [`AutoModelForImageClassification` API](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForImageClassification) (such as [ViT](https://huggingface.co/docs/transformers/main/en/model_doc/vit) or [Swin Transformer](https://huggingface.co/docs/transformers/main/en/model_doc/swin)) on HPUs. They can be used to fine-tune models on both [datasets from the hub](#using-datasets-from-hub) as well as on [your own custom data](#using-your-own-data). This directory also contains a script to demonstrate a single HPU inference for [PyTorch-Image-Models/TIMM](https://huggingface.co/docs/timm/index).
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
 
 ## Requirements
 
@@ -71,6 +57,7 @@ PT_HPU_LAZY_MODE=0 python run_image_classification.py \
     --gaudi_config_name Habana/vit \
     --throughput_warmup_steps 6 \
     --dataloader_num_workers 1 \
+    --sdp_on_bf16 \
     --bf16
 ```
 
@@ -121,6 +108,7 @@ PT_HPU_LAZY_MODE=0 python run_image_classification.py \
     --gaudi_config_name Habana/vit \
     --throughput_warmup_steps 3 \
     --dataloader_num_workers 1 \
+    --sdp_on_bf16 \
     --bf16
 ```
 
@@ -225,6 +213,7 @@ PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
     --gaudi_config_name Habana/vit \
     --throughput_warmup_steps 8 \
     --dataloader_num_workers 1 \
+    --sdp_on_bf16 \
     --bf16
 ```
 
@@ -312,6 +301,7 @@ python run_image_classification.py \
     --use_hpu_graphs_for_inference \
     --gaudi_config_name Habana/vit \
     --dataloader_num_workers 1 \
+    --sdp_on_bf16 \
     --bf16
 ```
 
diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md
index 5993663503..654a9e02ad 100755
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -26,21 +26,6 @@ uses special features of those tokenizers. You can check if your favorite model
 
 Note that if your dataset contains samples with no possible answers (like SQUAD version 2), you need to pass along the flag `--version_2_with_negative`.
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
-
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/sentence-transformers-training/paraphrases/training_paraphrases.py b/examples/sentence-transformers-training/paraphrases/training_paraphrases.py
index d31bfd5796..67cb54f12b 100644
--- a/examples/sentence-transformers-training/paraphrases/training_paraphrases.py
+++ b/examples/sentence-transformers-training/paraphrases/training_paraphrases.py
@@ -101,6 +101,7 @@
     warmup_ratio=0.1,
     # fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
     # bf16=False,  # Set to True if you have a GPU that supports BF16
+    # sdp_on_bf16=True,  # Set to True for better performance (but this setting can affect accuracy)
     batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
     # We can use ROUND_ROBIN or PROPORTIONAL - to avoid focusing too much on one dataset, we will
     # use round robin, which samples the same amount of batches from each dataset, until one dataset is empty
diff --git a/examples/speech-recognition/README.md b/examples/speech-recognition/README.md
index dbb86915fe..02e4b53d66 100644
--- a/examples/speech-recognition/README.md
+++ b/examples/speech-recognition/README.md
@@ -26,20 +26,6 @@ limitations under the License.
 	- [Fine tuning](#single-hpu-whisper-fine-tuning-with-seq2seq)
 	- [Inference](#single-hpu-seq2seq-inference)
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
 
 ## Requirements
 
diff --git a/examples/stable-diffusion/README.md b/examples/stable-diffusion/README.md
index 62b5ab30ab..f4df474f09 100644
--- a/examples/stable-diffusion/README.md
+++ b/examples/stable-diffusion/README.md
@@ -21,22 +21,6 @@ This directory contains a script that showcases how to perform text-to-image gen
 Stable Diffusion was proposed in [Stable Diffusion Announcement](https://stability.ai/blog/stable-diffusion-announcement) by Patrick Esser and Robin Rombach and the Stability AI team.
 
 
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
-
-
 ## Requirements
 
 First, you should install the requirements:
diff --git a/examples/stable-diffusion/training/README.md b/examples/stable-diffusion/training/README.md
index a10c194066..afa4a0a61f 100644
--- a/examples/stable-diffusion/training/README.md
+++ b/examples/stable-diffusion/training/README.md
@@ -198,6 +198,7 @@ python train_controlnet.py \
  --train_batch_size=4 \
  --throughput_warmup_steps=3 \
  --use_hpu_graphs \
+ --sdp_on_bf16 \
  --bf16 \
  --trust_remote_code
 ```
@@ -217,6 +218,7 @@ python ../../gaudi_spawn.py --use_mpi --world_size 8 train_controlnet.py \
   --train_batch_size=4 \
   --throughput_warmup_steps 3 \
   --use_hpu_graphs \
+  --sdp_on_bf16 \
   --bf16 \
   --trust_remote_code
 ```
@@ -295,6 +297,7 @@ python train_text_to_image_sdxl.py \
   --gaudi_config_name Habana/stable-diffusion \
   --throughput_warmup_steps 3 \
   --dataloader_num_workers 8 \
+  --sdp_on_bf16 \
   --bf16 \
   --use_hpu_graphs_for_training \
   --use_hpu_graphs_for_inference \
@@ -330,6 +333,7 @@ python ../../gaudi_spawn.py --world_size 8 --use_mpi train_text_to_image_sdxl.py
   --gaudi_config_name Habana/stable-diffusion \
   --throughput_warmup_steps 3 \
   --dataloader_num_workers 8 \
+  --sdp_on_bf16 \
   --bf16 \
   --use_hpu_graphs_for_training \
   --use_hpu_graphs_for_inference \
@@ -365,6 +369,7 @@ python train_text_to_image_sdxl.py \
   --use_hpu_graphs_for_training \
   --use_hpu_graphs_for_inference \
   --checkpointing_steps 3000 \
+  --sdp_on_bf16 \
   --bf16
 ```
 
@@ -498,6 +503,7 @@ python ../text_to_image_generation.py \
     --use_habana \
     --use_hpu_graphs \
     --gaudi_config Habana/stable-diffusion \
+    --sdp_on_bf16 \
     --bf16
 ```
 
@@ -695,5 +701,6 @@ python ../text_to_image_generation.py \
     --use_habana \
     --use_hpu_graphs \
     --gaudi_config Habana/stable-diffusion \
+    --sdp_on_bf16 \
     --bf16
 ```
diff --git a/examples/text-classification/README.md b/examples/text-classification/README.md
index 83ffba65d8..32bc3fd5f8 100644
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -27,22 +27,6 @@ and can also be used for a dataset hosted on our [hub](https://huggingface.co/da
 
 GLUE is made up of a total of 9 different tasks where the task name can be cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte or wnli.
 
-
-## Important Note on Performance Degradation
-
-With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:
-
-"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."
-
-For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with one of the following options:
-1. Using the following setting in your script:
-```python
-torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
-```
-2. Using the --sdp_on_bf16 switch when calling the example script.
-
-Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.
-
 ## Requirements
 
 First, you should install the requirements: