huggingface · jasi306 · Feb 24, 2025 · Mar 26, 2025
@@ -121,7 +121,7 @@ pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
 With DeepSpeed successfully installed we can now run a distributed GPT-2 inference on an 8 HPU system as follows:
 ```bash
 number_of_devices=8 \
-python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
 run_generation.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --use_hpu_graphs \
@@ -167,7 +167,7 @@ python run_clm.py \
 To train GPT-2 model using multi-card Gaudi system:
 ```bash
 number_of_devices=8 \
-python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
 run_clm.py \
     --model_name_or_path gpt2 \
     --dataset_name wikitext \

@@ -92,7 +92,7 @@ We are going to use the [causal language modeling example which is given in the
 
 The first step consists in training the model on several nodes with this command:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --hostfile path_to_hostfile --use_deepspeed run_clm.py \
     --model_name_or_path gpt2-xl \
     --gaudi_config_name Habana/gpt2 \

@@ -173,7 +173,7 @@ For training BridgeTower, you need to run the `run_bridgetower.py` script.
 For instance, to reproduce the results presented in [this blog post](https://huggingface.co/blog/bridgetower), you should run:
 
 ```bash
-python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
   --output_dir /tmp/bridgetower-test \
   --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
   --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \

@@ -176,7 +176,7 @@ $ huggingface-cli login
 3. When running the script, pass the following arguments:
 
 ```bash
-python run_image_classification.py \
+PT_HPU_LAZY_MODE=1 python run_image_classification.py \
     --push_to_hub \
     --push_to_hub_model_id <name-your-model> \
     ...
@@ -288,7 +288,7 @@ To run only inference, you can start from the commands above and you just have t
 
 For instance, you can run inference with ViT on Cifar10 on 1 Gaudi card with the following command:
 ```bash
-python run_image_classification.py \
+PT_HPU_LAZY_MODE=1 python run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --dataset_name cifar10 \
     --output_dir /tmp/outputs/ \

@@ -79,7 +79,7 @@ python run_clm.py \
 ### Multi-card Training (GPT2)
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_clm.py \
     --model_name_or_path gpt2 \
     --dataset_name wikitext \
@@ -109,7 +109,7 @@ Fine tuning on 8 HPU cards takes around 6 minutes with a batch size of 32 (4 per
 It reaches a perplexity of 14.011.
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_clm.py \
     --model_name_or_path EleutherAI/gpt-j-6b \
     --dataset_name wikitext \
@@ -143,7 +143,7 @@ It reaches a perplexity of 10.469.
 > Please refer to [this page](https://github.com/huggingface/optimum-habana/tree/main/examples/multi-node-training) for performing multi-node training properly.
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --hostfile path_to_my_hostfile --use_deepspeed run_clm.py \
     --model_name_or_path EleutherAI/gpt-neox-20b \
     --dataset_name wikitext \
@@ -175,7 +175,7 @@ converge slightly slower (over-fitting takes more epochs).
 ### Multi-card Training
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_mlm.py \
     --model_name_or_path roberta-base \
     --dataset_name wikitext \
@@ -292,7 +292,7 @@ python3 run_lora_clm.py \
 
 - Multi-card finetuning of gemma2 using chat template:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 2 --use_mpi run_lora_clm.py \
     --model_name_or_path google/gemma-2b-it \
     --per_device_train_batch_size 16 \
@@ -509,7 +509,7 @@ Default `peft_type` is `prompt_tuning`, you could enable prefix-tuning or p-tuni
 
 Use the prompt finetuned model for text-generation:
 ```bash
-python3 ../text-generation/run_generation.py \
+PT_HPU_LAZY_MODE=1 python3 ../text-generation/run_generation.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf  \
     --max_new_tokens 128 \
     --bf16 \

@@ -50,7 +50,7 @@ python run_zero_shot_eval.py --bf16 --max_seq_length 1024
 ## Multi-HPU finetune for sequence classification task
 
 ```bash
-python ../gaudi_spawn.py --world_size 8 --use_mpi run_sequence_classification.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi run_sequence_classification.py \
     --output_dir ./out \
     --model_name_or_path mila-intel/protst-esm1b-for-sequential-classification \
     --tokenizer_name facebook/esm1b_t33_650M_UR50S \

@@ -40,7 +40,7 @@ pip install -r requirements.txt
 
 Here is a command you can run to train a Llama model for question answering:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
   --world_size 8 --use_deepspeed run_qa.py \
   --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
   --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \

@@ -102,7 +102,7 @@ On a single HPU, this script should run in *ca.* 6 hours and yield a CTC loss of
 The following command shows how to fine-tune [wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60) on [Librispeech](https://huggingface.co/datasets/librispeech_asr) using 8 HPUs.
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_speech_recognition_ctc.py \
     --dataset_name librispeech_asr \
     --model_name_or_path facebook/wav2vec2-large-lv60 \
@@ -154,7 +154,7 @@ DeepSpeed can be used with almost the same command as for a multi-card run:
 
 For example:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_speech_recognition_ctc.py \
     --dataset_name librispeech_asr \
     --model_name_or_path facebook/wav2vec2-large-lv60 \
@@ -273,7 +273,7 @@ If training on a different language, you should be sure to change the `language`
 ### Multi HPU Whisper Training with Seq2Seq
 The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-large" \
     --dataset_name="mozilla-foundation/common_voice_11_0" \

@@ -72,7 +72,7 @@ python text_to_image_generation.py \
 Distributed inference with multiple HPUs is also supported. Below is an example demonstrating how to generate images with two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 2 text_to_image_generation.py \
     --model_name_or_path CompVis/stable-diffusion-v1-4 \
     --prompts "An image of a squirrel in Picasso style" "A shiny flying horse taking off" \
@@ -147,7 +147,8 @@ python text_to_image_generation.py \
 Here is how to generate images and depth maps with two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
+    --world_size 2 text_to_image_generation.py \
     --model_name_or_path "Intel/ldm3d-4c" \
     --prompts "An image of a squirrel in Picasso style" "A shiny flying horse taking off" \
     --num_images_per_prompt 10 \
@@ -219,7 +220,8 @@ python text_to_image_generation.py \
 SDXL also supports distributed inferencing with Intel Gaudi accelerators. Below is an example of generating SDXL images in a distributed manner using two prompts on two HPUs:
 
 ```bash
-python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
+    --world_size 2 text_to_image_generation.py \
     --model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
     --prompts "Sailing ship painting by Van Gogh" "A shiny flying horse taking off" \
     --prompts_2 "Red tone" "Blue tone" \
@@ -481,7 +483,30 @@ The ControlNet example can be run with multiple prompts by supplying more than o
 Additionally, it supports distributed execution. Below is an example of generating images conditioned by the Canny edge model using two prompts on two HPUs:
 
 ```bash
+<<<<<<< HEAD
 python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \
+=======
+python text_to_image_generation.py \
+    --model_name_or_path CompVis/stable-diffusion-v1-4 \
+    --controlnet_model_name_or_path lllyasviel/sd-controlnet-canny \
+    --prompts "futuristic-looking woman" "a rusty robot" \
+    --control_image https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png \
+    --num_images_per_prompt 28 \
+    --batch_size 7 \
+    --image_save_dir /tmp/controlnet_images \
+    --use_habana \
+    --use_hpu_graphs \
+    --gaudi_config Habana/stable-diffusion \
+    --sdp_on_bf16 \
+    --bf16
+```
+
+Here is how to generate images conditioned by canny edge model and with two prompts on two HPUs:
+
+```bash
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
+    --world_size 2 text_to_image_generation.py \
+>>>>>>> c6d15a26 ([SW-218526] Updated Readme files for explicite lazy mode (#174))
     --model_name_or_path CompVis/stable-diffusion-v1-4 \
     --controlnet_model_name_or_path lllyasviel/sd-controlnet-canny \
     --prompts "futuristic-looking woman" "a rusty robot" \

@@ -152,7 +152,7 @@ And as with the CSV files, you can specify which values to select from the file,
 
 Here is an example on 8 HPUs:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_summarization.py \
     --model_name_or_path t5-small \
     --do_train \

@@ -72,7 +72,7 @@ python run_glue.py \
 Here is how you would fine-tune the BERT large model (with whole word masking) on the text classification MRPC task using the `run_glue` script, with 8 HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_glue.py \
     --model_name_or_path bert-large-uncased-whole-word-masking \
     --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
@@ -101,7 +101,7 @@ python ../gaudi_spawn.py \
 Similarly to multi-card training, here is how you would fine-tune the BERT large model (with whole word masking) on the text classification MRPC task using DeepSpeed with 8 HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_glue.py \
     --model_name_or_path bert-large-uncased-whole-word-masking \
     --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
@@ -176,7 +176,7 @@ Llama Guard can be used for text classification. The Transformers library will c
 Llama Guard can be fine-tuned with DeepSpeed, here is how you would do it on the text classification MRPC task using DeepSpeed with 8 HPUs:
 
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_glue.py \
     --model_name_or_path meta-llama/LlamaGuard-7b \
     --gaudi_config Habana/llama \