diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx
index c882de2629..6762f8d846 100644
--- a/docs/source/quickstart.mdx
+++ b/docs/source/quickstart.mdx
@@ -121,7 +121,7 @@ pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
 With DeepSpeed successfully installed we can now run a distributed GPT-2 inference on an 8 HPU system as follows:
 ```bash
 number_of_devices=8 \
-python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
 run_generation.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --use_hpu_graphs \
@@ -167,7 +167,7 @@ python run_clm.py \
 To train GPT-2 model using multi-card Gaudi system:
 ```bash
 number_of_devices=8 \
-python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
 run_clm.py \
     --model_name_or_path gpt2 \
     --dataset_name wikitext \
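Review note on the two quickstart hunks above: because `number_of_devices=8 \` ends with a line continuation, the assignment is joined onto the launcher line and becomes a per-command prefix, just like the `PT_HPU_LAZY_MODE=1` this patch adds. In a POSIX shell the command's words are expanded before such prefix assignments take effect, so `${number_of_devices}` may expand to nothing unless it was already set. A minimal sketch of that behaviour follows; `sh -c` stands in for the real launcher, and the names `joined` and `separate` are illustrative only.

```shell
# Start from a clean state for the demonstration.
unset number_of_devices

# Joined form (as in the quickstart): the assignment is a prefix of the command,
# so ${number_of_devices} is expanded before the value 8 exists.
joined=$(number_of_devices=8 \
sh -c 'echo "$#"' _ ${number_of_devices})

# Separate statement: the variable is set first, then expanded on the next line.
number_of_devices=8
separate=$(sh -c 'echo "$#:$1"' _ ${number_of_devices})

echo "joined received $joined argument(s); separate received $separate"
```

If that matches the behaviour on your system, dropping the trailing backslash after `number_of_devices=8` (or exporting the variable beforehand) would be one way to make the documented command pass the intended world size.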
diff --git a/docs/source/usage_guides/multi_node_training.mdx b/docs/source/usage_guides/multi_node_training.mdx
index 19bdb80e54..3fe7ee9396 100644
--- a/docs/source/usage_guides/multi_node_training.mdx
+++ b/docs/source/usage_guides/multi_node_training.mdx
@@ -92,7 +92,7 @@ We are going to use the [causal language modeling example which is given in the
 
 The first step consists in training the model on several nodes with this command:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --hostfile path_to_hostfile --use_deepspeed run_clm.py \
     --model_name_or_path gpt2-xl \
     --gaudi_config_name Habana/gpt2 \
diff --git a/examples/contrastive-image-text/README.md b/examples/contrastive-image-text/README.md
index def6d74ec0..e1bbc8e8cd 100644
--- a/examples/contrastive-image-text/README.md
+++ b/examples/contrastive-image-text/README.md
@@ -173,7 +173,7 @@ For training BridgeTower, you need to run the `run_bridgetower.py` script.
 For instance, to reproduce the results presented in [this blog post](https://huggingface.co/blog/bridgetower), you should run:
 
 ```bash
-python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
   --output_dir /tmp/bridgetower-test \
   --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
   --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \
diff --git a/examples/image-classification/README.md b/examples/image-classification/README.md
index 01b19b25ba..61edea6eb8 100644
--- a/examples/image-classification/README.md
+++ b/examples/image-classification/README.md
@@ -176,7 +176,7 @@ $ huggingface-cli login
 3. When running the script, pass the following arguments:
 
 ```bash
-python run_image_classification.py \
+PT_HPU_LAZY_MODE=1 python run_image_classification.py \
     --push_to_hub \
     --push_to_hub_model_id <name-your-model> \
     ...
@@ -288,7 +288,7 @@ To run only inference, you can start from the commands above and you just have t
 
 For instance, you can run inference with ViT on Cifar10 on 1 Gaudi card with the following command:
 ```bash
-python run_image_classification.py \
+PT_HPU_LAZY_MODE=1 python run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --dataset_name cifar10 \
     --output_dir /tmp/outputs/ \
diff --git a/examples/language-modeling/README.md b/examples/language-modeling/README.md
index 5cce1528dc..b5daec00b7 100644
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -79,7 +79,7 @@ python run_clm.py \
 ### Multi-card Training (GPT2)
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_clm.py \
     --model_name_or_path gpt2 \
     --dataset_name wikitext \
@@ -109,7 +109,7 @@ Fine tuning on 8 HPU cards takes around 6 minutes with a batch size of 32 (4 per
 It reaches a perplexity of 14.011.
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_clm.py \
     --model_name_or_path EleutherAI/gpt-j-6b \
     --dataset_name wikitext \
@@ -143,7 +143,7 @@ It reaches a perplexity of 10.469.
 > Please refer to [this page](https://github.com/huggingface/optimum-habana/tree/main/examples/multi-node-training) for performing multi-node training properly.
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --hostfile path_to_my_hostfile --use_deepspeed run_clm.py \
     --model_name_or_path EleutherAI/gpt-neox-20b \
     --dataset_name wikitext \
@@ -175,7 +175,7 @@ converge slightly slower (over-fitting takes more epochs).
 ### Multi-card Training
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_mlm.py \
     --model_name_or_path roberta-base \
     --dataset_name wikitext \
@@ -292,7 +292,7 @@ python3 run_lora_clm.py \
 
 - Multi-card finetuning of gemma2 using chat template:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 2 --use_mpi run_lora_clm.py \
     --model_name_or_path google/gemma-2b-it \
     --per_device_train_batch_size 16 \
@@ -509,7 +509,7 @@ Default `peft_type` is `prompt_tuning`, you could enable prefix-tuning or p-tuni
 
 Use the prompt finetuned model for text-generation:
 ```bash
-python3 ../text-generation/run_generation.py \
+PT_HPU_LAZY_MODE=1 python3 ../text-generation/run_generation.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf  \
     --max_new_tokens 128 \
     --bf16 \
diff --git a/examples/protein-folding/README.md b/examples/protein-folding/README.md
index 8997c75143..b55090676b 100644
--- a/examples/protein-folding/README.md
+++ b/examples/protein-folding/README.md
@@ -50,7 +50,7 @@ python run_zero_shot_eval.py --bf16 --max_seq_length 1024
 ## Multi-HPU finetune for sequence classification task
 
 ```bash
-python ../gaudi_spawn.py --world_size 8 --use_mpi run_sequence_classification.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi run_sequence_classification.py \
     --output_dir ./out \
     --model_name_or_path mila-intel/protst-esm1b-for-sequential-classification \
     --tokenizer_name facebook/esm1b_t33_650M_UR50S \
diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md
index d7a83ea5c8..1d11cd533f 100755
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -40,7 +40,7 @@ pip install -r requirements.txt
 
 Here is a command you can run to train a Llama model for question answering:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
   --world_size 8 --use_deepspeed run_qa.py \
   --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
   --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
diff --git a/examples/speech-recognition/README.md b/examples/speech-recognition/README.md
index 1f0f8fbe38..3bc553fc5d 100644
--- a/examples/speech-recognition/README.md
+++ b/examples/speech-recognition/README.md
@@ -102,7 +102,7 @@ On a single HPU, this script should run in *ca.* 6 hours and yield a CTC loss of
 The following command shows how to fine-tune [wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60) on [Librispeech](https://huggingface.co/datasets/librispeech_asr) using 8 HPUs.
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_speech_recognition_ctc.py \
     --dataset_name librispeech_asr \
     --model_name_or_path facebook/wav2vec2-large-lv60 \
@@ -154,7 +154,7 @@ DeepSpeed can be used with almost the same command as for a multi-card run:
 
 For example:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_speech_recognition_ctc.py \
     --dataset_name librispeech_asr \
     --model_name_or_path facebook/wav2vec2-large-lv60 \
@@ -273,7 +273,7 @@ If training on a different language, you should be sure to change the `language`
 ### Multi HPU Whisper Training with Seq2Seq
 The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-large" \
     --dataset_name="mozilla-foundation/common_voice_11_0" \
diff --git a/examples/stable-diffusion/README.md b/examples/stable-diffusion/README.md
index 9919780543..1a3a2455d8 100644
--- a/examples/stable-diffusion/README.md
+++ b/examples/stable-diffusion/README.md
@@ -72,7 +72,7 @@ python text_to_image_generation.py \
 Distributed inference with multiple HPUs is also supported. Below is an example demonstrating how to generate images with two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 2 text_to_image_generation.py \
     --model_name_or_path CompVis/stable-diffusion-v1-4 \
     --prompts "An image of a squirrel in Picasso style" "A shiny flying horse taking off" \
@@ -147,7 +147,8 @@ python text_to_image_generation.py \
 Here is how to generate images and depth maps with two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
+    --world_size 2 text_to_image_generation.py \
     --model_name_or_path "Intel/ldm3d-4c" \
     --prompts "An image of a squirrel in Picasso style" "A shiny flying horse taking off" \
     --num_images_per_prompt 10 \
@@ -219,7 +220,8 @@ python text_to_image_generation.py \
 SDXL also supports distributed inferencing with Intel Gaudi accelerators. Below is an example of generating SDXL images in a distributed manner using two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
+    --world_size 2 text_to_image_generation.py \
     --model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
     --prompts "Sailing ship painting by Van Gogh" "A shiny flying horse taking off" \
     --prompts_2 "Red tone" "Blue tone" \
@@ -481,7 +483,8 @@ The ControlNet example can be run with multiple prompts by supplying more than o
 Additionally, it supports distributed execution. Below is an example of generating images conditioned by the Canny edge model using two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
+    --world_size 2 text_to_image_generation.py \
     --model_name_or_path CompVis/stable-diffusion-v1-4 \
     --controlnet_model_name_or_path lllyasviel/sd-controlnet-canny \
     --prompts "futuristic-looking woman" "a rusty robot" \
diff --git a/examples/summarization/README.md b/examples/summarization/README.md
index bdaef78edf..b934aca5ab 100644
--- a/examples/summarization/README.md
+++ b/examples/summarization/README.md
@@ -152,7 +152,7 @@ And as with the CSV files, you can specify which values to select from the file,
 
 Here is an example on 8 HPUs:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_summarization.py \
     --model_name_or_path t5-small \
     --do_train \
diff --git a/examples/text-classification/README.md b/examples/text-classification/README.md
index 9ffc78ae43..03f9b04826 100644
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -72,7 +72,7 @@ python run_glue.py \
 Here is how you would fine-tune the BERT large model (with whole word masking) on the text classification MRPC task using the `run_glue` script, with 8 HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_glue.py \
     --model_name_or_path bert-large-uncased-whole-word-masking \
     --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
@@ -101,7 +101,7 @@ python ../gaudi_spawn.py \
 Similarly to multi-card training, here is how you would fine-tune the BERT large model (with whole word masking) on the text classification MRPC task using DeepSpeed with 8 HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_glue.py \
     --model_name_or_path bert-large-uncased-whole-word-masking \
     --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
@@ -176,7 +176,7 @@ Llama Guard can be used for text classification. The Transformers library will c
 Llama Guard can be fine-tuned with DeepSpeed, here is how you would do it on the text classification MRPC task using DeepSpeed with 8 HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_glue.py \
     --model_name_or_path meta-llama/LlamaGuard-7b \
     --gaudi_config Habana/llama \
diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 161057b699..3e20936610 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -44,16 +44,16 @@ In this section, we present how to benchmark a model on Intel Gaudi AI Accelerat
 To run generation with DeepSpeed-inference, you must launch the script as follows:
 
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size number_of_devices run_generation.py ARGS
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size number_of_devices run_generation.py ARGS
 ```
 
 To run multiple DeepSpeed tasks simultaneously, you can launch them with different `master_port` and [`HABANA_VISIBLE_MODULES`](https://docs.habana.ai/en/latest/PyTorch/PT_Multiple_Tenants_on_HPU/Multiple_Dockers_each_with_Single_Workload.html#running-distributed-workload-inside-the-docker-container), for example:
 
 ```bash
 # the following tasks could run simultaneously in a container with 8 HPUs
-HABANA_VISIBLE_MODULES="0,1" python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py ARGS     # using the default master_port=29500
-HABANA_VISIBLE_MODULES="2,3,4,5" python ../gaudi_spawn.py --use_deepspeed --world_size 4 --master_port 29501 run_generation.py ARGS
-HABANA_VISIBLE_MODULES="6,7" python ../gaudi_spawn.py --use_deepspeed --world_size 2 --master_port 29502 run_generation.py ARGS
+HABANA_VISIBLE_MODULES="0,1" PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py ARGS     # using the default master_port=29500
+HABANA_VISIBLE_MODULES="2,3,4,5" PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 4 --master_port 29501 run_generation.py ARGS
+HABANA_VISIBLE_MODULES="6,7" PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 2 --master_port 29502 run_generation.py ARGS
 ```
 
 Without DeepSpeed-inference, you can run the script with:
@@ -136,7 +136,7 @@ Here are a few settings you may be interested in:
 
 For example, you can reproduce the results presented in [this blog post](https://huggingface.co/blog/habana-gaudi-2-bloom) with the following command:
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path bigscience/bloom \
 --batch_size 1 \
 --use_hpu_graphs \
@@ -147,7 +147,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 
 You can also run Llama2-70B on Gaudi2 with all optimizations enabled using the following command:
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-2-70b-hf \
 --max_new_tokens 4096 \
 --bf16 \
@@ -176,7 +176,7 @@ python run_generation.py \
 
 To run Falcon-40B inference on 8 Gaudi2 cards, use the following command:
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path tiiuae/falcon-40b \
 --max_new_tokens 2048 \
 --bf16 \
@@ -190,7 +190,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 
 To run Llama3-405B inference on 8 Gaudi3 cards use the following command:
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
 --max_new_tokens 2048 \
 --bf16 \
@@ -375,7 +375,7 @@ https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP
 
 Here is an example to measure the tensor quantization statistics on LLama2-70b:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_measure.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_lm_eval.py \
 -o acc_70b_bs1_measure.txt \
 --model_name_or_path meta-llama/Llama-2-70b-hf \
@@ -393,7 +393,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py
 
 Here is an example to quantize the model based on previous measurements for LLama2-70b:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_lm_eval.py \
 -o acc_70b_bs1_quant.txt \
 --model_name_or_path meta-llama/Llama-2-70b-hf \
@@ -411,7 +411,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
 
 Alternatively, here is another example to quantize the model based on previous measurements for LLama2-70b:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-2-70b-hf \
 --attn_softmax_bf16 \
@@ -457,7 +457,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generati
 Here is an example to measure the tensor quantization statistics on Falcon-180B with 8 cards:
 > Please note that Falcon-180B is a gated model, and users are required to request access to it. Please refer to the instructions provided in the StarCoder example above.
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_lm_eval.py \
 -o acc_falcon180b_bs1_quant.txt \
 --model_name_or_path tiiuae/falcon-180B \
@@ -474,7 +474,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ..
 
 Here is an example to quantize the model based on previous measurements for Falcon-180B with 8 cards:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path tiiuae/falcon-180B \
 --use_hpu_graphs \
@@ -494,7 +494,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
 Here is an example to measure the tensor quantization statistics on Llama3-405B with 8 cards:
 > Please note that Llama3-405B requires minimum 16 cards Gaudi2 and 8 cards Gaudi3.
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_lm_eval.py \
 -o acc_llama3_405b_bs1_quant.txt \
 --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
@@ -512,7 +512,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ..
 Here is an example to quantize the model based on previous measurements for Llama3-405B with 8 cards:
 > Please note that Llama3-405B requires minimum 16 cards Gaudi2 and 8 cards Gaudi3.
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
+QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
 --use_hpu_graphs \
@@ -670,7 +670,7 @@ You can load pre-quantized FP8 models using the `--load_quantized_model_with_inc
 
 Below is an example of how to load `neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8` on two cards.
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 --use_deepspeed --world_size 2 run_lm_eval.py \
 -o acc_load_fp8_model.txt \
 --model_name_or_path neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
@@ -745,7 +745,7 @@ Habana Flash Attention addresses large sequence lengths on prompt stage of infer
 Below example uses `flash_attention_recompute` mode in order to reduce memory consumption on prompt stage. Additionally since all sequences in a batch are of the same length it uses `flash_attention_causal_mask` which will further improve performance by taking advantage of specific lower-diagonal shape of inputs to softmax operation.
 
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
 --model_name_or_path meta-llama/Llama-2-70b-hf \
 --use_hpu_graphs \
 --limit_hpu_graphs \
diff --git a/examples/translation/README.md b/examples/translation/README.md
index 1d705d23fc..e417fc8495 100644
--- a/examples/translation/README.md
+++ b/examples/translation/README.md
@@ -135,7 +135,7 @@ python run_translation.py \
  Here is an example of distributing training on 8 HPUs:
 
  ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_translation.py \
     --model_name_or_path t5-small \
     --do_train \
@@ -167,7 +167,7 @@ python ../gaudi_spawn.py \
  Here is an example with DeepSpeed on 8 HPUs:
 
  ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_translation.py \
     --model_name_or_path t5-small \
     --do_train \
diff --git a/examples/trl/README.md b/examples/trl/README.md
index 5e488e7072..4b028f5dca 100644
--- a/examples/trl/README.md
+++ b/examples/trl/README.md
@@ -46,7 +46,7 @@ $ pip install -U -r requirements.txt
 2. Supervised fine-tuning of the mistralai/Mixtral-8x7B-Instruct-v0.1 on 4 cards:
 
     ```
-    DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
+    DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
         --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
         --dataset_name "philschmid/dolly-15k-oai-style" \
         --subset 'data/' \
@@ -88,7 +88,7 @@ steps like:
 1. Supervised fine-tuning of the base llama-v2-70b model to create llama-v2-70b-se:
 
     ```
-    DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py \
+    DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py \
         --model_name_or_path meta-llama/Llama-2-70b-hf \
         --dataset_name "lvwerra/stack-exchange-paired" \
         --deepspeed ../language-modeling/llama2_ds_zero3_config.json \
@@ -120,7 +120,7 @@ steps like:
 
 2. Run the DPO trainer using the model saved by the previous step:
     ```
-    DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed dpo.py \
+    DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed dpo.py \
         --model_name_or_path="sft/final_merged_checkpoint" \
         --tokenizer_name_or_path=meta-llama/Llama-2-70b-hf \
         --deepspeed ../language-modeling/llama2_ds_zero3_config.json \
@@ -147,7 +147,7 @@ which will also push the model to your HuggingFace hub account.
 We can load the DPO-trained LoRA adaptors which were saved by the DPO training step and run it through the [text-generation example](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).
 
 ```
-python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_generation.py \
 --model_name_or_path ../trl/stack-llama-2/ \
 --use_hpu_graphs --use_kv_cache --batch_size 1 --bf16 --max_new_tokens 100 \
 --prompt "Here is my prompt"
@@ -163,7 +163,7 @@ The following example is for the creation of StackLlaMa 2: a Stack exchange llam
 There are three main steps to the PPO training process:
 1. Supervised fine-tuning of the base llama-v2-7b model to create llama-v2-7b-se:
     ```
-    python ../gaudi_spawn.py --world_size 8 --use_mpi sft.py \
+    PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi sft.py \
         --model_name_or_path meta-llama/Llama-2-7b-hf \
         --dataset_name "lvwerra/stack-exchange-paired" \
         --output_dir="./sft" \
@@ -193,7 +193,7 @@ There are three main steps to the PPO training process:
     ```
 2. Reward modeling using dialog pairs from the SE dataset on the llama-v2-7b-se to create llama-v2-7b-se-rm
     ```
-    python ../gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py \
+    PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py \
         --model_name_or_path=./sft/final_merged_checkpoint \
         --tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \
         --output_dir=./rm
@@ -206,7 +206,7 @@ There are three main steps to the PPO training process:
 
 3. RL fine-tuning of llama-v2-7b-se with the llama-v2-7b-se-rm reward model:
     ```
-    python ../gaudi_spawn.py --world_size 8 --use_mpi ppo.py \
+    PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi ppo.py \
         --model_name_or_path=./sft/final_merged_checkpoint \
         --reward_model_name=./rm_merged_checkpoint \
         --tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \
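
Review note: every hunk in this patch uses the `VAR=value command` prefix form. The sketch below illustrates how that form scopes the variable to a single command and how several assignments can be chained, as the FP8 examples do with `QUANT_CONFIG`. It is only an illustration: `sh -c` stands in for `python ../gaudi_spawn.py`, and the names `scoped`, `after`, and `chained` are hypothetical.

```shell
# Start from a clean state for the demonstration.
unset PT_HPU_LAZY_MODE QUANT_CONFIG

# Prefix form: the variable is placed in the environment of this one command only.
scoped=$(PT_HPU_LAZY_MODE=1 sh -c 'printf "%s" "${PT_HPU_LAZY_MODE:-unset}"')

# The calling shell itself is left untouched.
after=${PT_HPU_LAZY_MODE:-unset}

# Several assignments may precede the same command.
chained=$(QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 \
    sh -c 'printf "%s %s" "$QUANT_CONFIG" "$PT_HPU_LAZY_MODE"')

echo "scoped=$scoped after=$after chained=$chained"
```

Readers who run many commands in one session could instead `export PT_HPU_LAZY_MODE=1` once; the prefix form used throughout the patch keeps each documented command self-contained.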