huggingface · regisss · Jun 3, 2025 · Apr 10, 2025
@@ -92,7 +92,7 @@ To be able to run gated models like [Llama-2 7B](https://huggingface.co/meta-lla
 
 Run inference on a single Gaudi device (HPU) with the Llama-2 7B model:
 ```bash
-python run_generation.py \
+PT_HPU_LAZY_MODE=1 python run_generation.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --use_hpu_graphs \
     --use_kv_cache \
@@ -121,7 +121,7 @@ pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.20.0
 With DeepSpeed successfully installed, we can now run distributed Llama-2 7B inference on an 8-HPU system as follows:
 ```bash
 number_of_devices=8 \
-python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
 run_generation.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --use_hpu_graphs \
@@ -148,7 +148,7 @@ pip install -r requirements.txt
 
 To train the GPT-2 model on a single card, use:
 ```bash
-python run_clm.py \
+PT_HPU_LAZY_MODE=1 python run_clm.py \
     --model_name_or_path gpt2 \
     --dataset_name wikitext \
     --dataset_config_name wikitext-2-raw-v1 \
@@ -167,7 +167,7 @@ python run_clm.py \
 To train the GPT-2 model on a multi-card Gaudi system:
 ```bash
 number_of_devices=8 \
-python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \
 run_clm.py \
     --model_name_or_path gpt2 \
     --dataset_name wikitext \
@@ -200,7 +200,7 @@ pip install -r requirements.txt
 
 Here is an example of running Stable Diffusion text to image inference on Gaudi:
 ```bash
-python text_to_image_generation.py \
+PT_HPU_LAZY_MODE=1 python text_to_image_generation.py \
     --model_name_or_path CompVis/stable-diffusion-v1-4 \
     --prompts "An image of a squirrel in Picasso style" \
     --num_images_per_prompt 10 \

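The commands in this diff either gain a `PT_HPU_LAZY_MODE=1` prefix or lose a `PT_HPU_LAZY_MODE=0` prefix. In a shell, `VAR=value command` sets the variable for that one process only. The sketch below is a rough Python equivalent, assuming nothing about Gaudi itself: it starts a child interpreter with the variable set and reads it back. `run_with_lazy_mode` is a hypothetical helper for illustration, not part of Optimum Habana, and the child command is a stand-in for a real example script.

```python
import os
import subprocess
import sys


def run_with_lazy_mode(lazy: bool) -> str:
    """Start a child Python process with PT_HPU_LAZY_MODE set and return the value it sees."""
    # Copy the parent environment and override one variable,
    # which mirrors `PT_HPU_LAZY_MODE=1 python ...` in a shell.
    env = dict(os.environ, PT_HPU_LAZY_MODE="1" if lazy else "0")
    # Stand-in child command; a real launch would pass an example script and its arguments.
    child = [sys.executable, "-c", "import os; print(os.environ['PT_HPU_LAZY_MODE'])"]
    result = subprocess.run(child, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_with_lazy_mode(True))
```

Because the override lives in a copied dictionary, the parent process environment is left untouched, just as with the shell prefix.
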
@@ -68,7 +68,7 @@ All [our examples](https://github.com/huggingface/optimum-habana/tree/main/examp
 The reasoning is the same for every example: run the example script with `--do_eval` and `--per_device_eval_batch_size` and without `--do_train`.
 A simple template is the following:
 ```bash
-python path_to_the_example_script \
+PT_HPU_LAZY_MODE=1 python path_to_the_example_script \
   --model_name_or_path my_model_name \
   --gaudi_config_name my_gaudi_config_name \
   --dataset_name my_dataset_name \

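The rule stated above (add `--do_eval` and `--per_device_eval_batch_size`, drop `--do_train`) can be sketched as a small argument rewrite. `to_eval_only` is a hypothetical helper, and it assumes the flags are passed bare (`--do_train`, not `--do_train True`).

```python
from typing import List


def to_eval_only(train_args: List[str], eval_batch_size: int) -> List[str]:
    """Return a copy of train_args that evaluates instead of training."""
    # Drop the training switch; every other argument is kept as-is.
    args = [arg for arg in train_args if arg != "--do_train"]
    # Add the evaluation switch unless the caller already passed it.
    if "--do_eval" not in args:
        args.append("--do_eval")
    # Add an evaluation batch size unless one is already present.
    if "--per_device_eval_batch_size" not in args:
        args += ["--per_device_eval_batch_size", str(eval_batch_size)]
    return args


if __name__ == "__main__":
    training = ["--model_name_or_path", "my_model_name", "--do_train"]
    print(to_eval_only(training, eval_batch_size=8))
```
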
@@ -161,7 +161,7 @@ This will also save memory.
 You just need to pass `torch_dtype=torch.bfloat16` to `from_pretrained` when instantiating your pipeline.
 Here is how to do it:
 
-```py
+```python
 import torch
 
 pipeline = GaudiStableDiffusionPipeline.from_pretrained(

@@ -92,7 +92,7 @@ We are going to use the [causal language modeling example which is given in the
 
 The first step consists in training the model on several nodes with this command:
 ```bash
-python ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --hostfile path_to_hostfile --use_deepspeed run_clm.py \
     --model_name_or_path gpt2-xl \
     --gaudi_config_name Habana/gpt2 \
@@ -115,7 +115,7 @@ Evaluation is not performed in the same command because we do not recommend perf
 Once the model is trained, we can evaluate it with the following command.
 The argument `--model_name_or_path` should be equal to the argument `--output_dir` of the previous command.
 ```bash
-python run_clm.py \
+PT_HPU_LAZY_MODE=1 python run_clm.py \
     --model_name_or_path /tmp/gpt2_xl_multi_node \
     --gaudi_config_name Habana/gpt2 \
     --dataset_name wikitext \

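The evaluation step only works if `--model_name_or_path` points at the directory the training step wrote to with `--output_dir`. As a minimal sketch, the check below compares the two values in a pair of argument lists. Both function names are hypothetical, and the parser handles only the `--flag value` and `--flag=value` spellings.

```python
from typing import List, Optional


def flag_value(args: List[str], flag: str) -> Optional[str]:
    """Return the value given for `flag`, accepting `--flag value` and `--flag=value`."""
    for i, arg in enumerate(args):
        if arg == flag and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith(flag + "="):
            return arg[len(flag) + 1:]
    return None


def eval_uses_trained_model(train_args: List[str], eval_args: List[str]) -> bool:
    """True when evaluation loads the directory that training wrote to."""
    output_dir = flag_value(train_args, "--output_dir")
    model_path = flag_value(eval_args, "--model_name_or_path")
    return output_dir is not None and output_dir == model_path


if __name__ == "__main__":
    train = ["--model_name_or_path", "gpt2-xl", "--output_dir", "/tmp/gpt2_xl_multi_node"]
    evaluate = ["--model_name_or_path", "/tmp/gpt2_xl_multi_node"]
    print(eval_uses_trained_model(train, evaluate))
```
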
@@ -35,7 +35,7 @@ pip install -r requirements.txt
 The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset on a single HPU.
 
 ```bash
-python run_audio_classification.py \
+PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
     --dataset_name superb \
     --dataset_config_name ks \
@@ -75,7 +75,7 @@ On a single HPU, this script should run in ~13 minutes and yield an accuracy of
 The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/anton-l/common_language) on 8 HPUs.
 
 ```bash
-PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
+python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
     --dataset_name common_language \
@@ -118,7 +118,7 @@ To run only inference, you can start from the commands above and you just have t
 
 For instance, you can run inference with Wav2Vec2 on the Keyword Spotting subset on 1 Gaudi card with the following command:
 ```bash
-python run_audio_classification.py \
+PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
     --dataset_name superb \
     --dataset_config_name ks \

@@ -47,7 +47,7 @@ cd ..
 
 Having downloaded the COCO dataset manually, you should be able to load it with the `ydshieh/coco_dataset_script` dataset loading script:
 
-```py
+```python
 import os
 import datasets
 
@@ -65,7 +65,7 @@ Next, we create a [VisionTextDualEncoderModel](https://huggingface.co/docs/trans
 The `VisionTextDualEncoderModel` class lets you load any vision and text encoder model to create a dual encoder.
 Here is an example of how to load the model using pre-trained vision and text models.
 
-```python3
+```python
 from transformers import (
     VisionTextDualEncoderModel,
     VisionTextDualEncoderProcessor,
@@ -96,7 +96,7 @@ Finally, we can run the example script to train the model.
 Run the following command for single-device training:
 
 ```bash
-PT_HPU_LAZY_MODE=0 python run_clip.py \
+python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
     --data_dir $PWD/data \
@@ -128,7 +128,7 @@ PT_HPU_LAZY_MODE=0 python run_clip.py \
 Run the following command for distributed training:
 
 ```bash
-PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 \
+PT_ENABLE_INT64_SUPPORT=1 \
 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
     --output_dir=/tmp/clip_roberta \
     --model_name_or_path=./clip-roberta \
@@ -173,7 +173,7 @@ For training BridgeTower, you need to run the `run_bridgetower.py` script.
 For instance, to reproduce the results presented in [this blog post](https://huggingface.co/blog/bridgetower), you should run:
 
 ```bash
-python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
   --output_dir /tmp/bridgetower-test \
   --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
   --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \
@@ -204,7 +204,7 @@ To run only inference, you can start from the commands above and you just have t
 
 For instance, you can run inference with CLIP on COCO on 1 Gaudi card with the following command:
 ```bash
-python run_clip.py \
+PT_HPU_LAZY_MODE=1 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
     --data_dir $PWD/data \

@@ -33,7 +33,7 @@ pip install -r requirements.txt
 Here we show how to fine-tune a Vision Transformer (`ViT`) on Cifar10:
 
 ```bash
-PT_HPU_LAZY_MODE=0 python run_image_classification.py \
+python run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --dataset_name cifar10 \
     --output_dir /tmp/outputs/ \
@@ -94,7 +94,7 @@ root/cat/[...]/asd932_.png
 In other words, you need to organize your images in subfolders, based on their class. You can then run the script like this:
 
 ```bash
-PT_HPU_LAZY_MODE=0 python run_image_classification.py \
+python run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --train_dir <path-to-train-root> \
     --output_dir /tmp/outputs/ \
@@ -176,7 +176,7 @@ $ huggingface-cli login
 3. When running the script, pass the following arguments:
 
 ```bash
-python run_image_classification.py \
+PT_HPU_LAZY_MODE=1 python run_image_classification.py \
     --push_to_hub \
     --push_to_hub_model_id <name-your-model> \
     ...
@@ -188,7 +188,7 @@ python run_image_classification.py \
 Here is how you would fine-tune ViT on Cifar10 using 8 HPUs:
 
 ```bash
-PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
+python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --dataset_name cifar10 \
@@ -230,7 +230,7 @@ For Swin, you need to change/add the following arguments:
 Similarly to multi-HPU training, here is how you would fine-tune ViT on Cifar10 using 8 HPUs with DeepSpeed:
 
 ```bash
-PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
+python ../gaudi_spawn.py \
     --world_size 8 --use_deepspeed run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --dataset_name cifar10 \
@@ -288,7 +288,7 @@ To run only inference, you can start from the commands above and you just have t
 
 For instance, you can run inference with ViT on Cifar10 on 1 Gaudi card with the following command:
 ```bash
-python run_image_classification.py \
+PT_HPU_LAZY_MODE=1 python run_image_classification.py \
     --model_name_or_path google/vit-base-patch16-224-in21k \
     --dataset_name cifar10 \
     --output_dir /tmp/outputs/ \
@@ -312,7 +312,7 @@ This directory contains an example script that demonstrates using FastViT with g
 ### Single-HPU inference
 
 ```bash
-python3 run_timm_example.py \
+PT_HPU_LAZY_MODE=1 python3 run_timm_example.py \
     --model_name_or_path "timm/fastvit_t8.apple_in1k" \
     --image_path "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png" \
     --warmup 3 \

@@ -25,7 +25,7 @@ Habana FusedSDPA is a fused and optimized implementation of torch.nn.functional.
 To run Llama inference with SDPA, use the following command:
 
 ```bash
-python3 run_pipeline.py \
+PT_HPU_LAZY_MODE=1 python3 run_pipeline.py \
     --model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
     --use_hpu_graphs \
     --bf16 \
@@ -35,20 +35,20 @@ python3 run_pipeline.py \
 
 To run inference with THUDM/glm-4v-9b, use the following command (Note that you need to set the environment variable `GLM=4v` to distinguish between glm4v and chatglm, as these models are customized and share the same model type named "chatglm"):
 ```bash
-GLM=4v python3 run_pipeline.py \
+PT_HPU_LAZY_MODE=1 GLM=4v python3 run_pipeline.py \
     --model_name_or_path THUDM/glm-4v-9b \
     --use_hpu_graphs \
     --bf16 \
     --sdp_on_bf16 \
     --use_flash_attention \
     --use_kv_cache
-
+```
 
 ### Multi-cards inference with BF16
 
 Use the following commands to run Llama-3.2-90B-Vision-Instruct BF16 inference with FusedSDPA on 8 HPUs:
 ```bash
-PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
     --model_name_or_path meta-llama/Llama-3.2-90B-Vision-Instruct \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
@@ -66,7 +66,7 @@ More information on enabling FP8 in SynapseAI is available here:
 ### Single card inference with FP8
 Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
+PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
@@ -76,7 +76,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
 
 Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b with SDPA:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
+PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
@@ -87,7 +87,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python r
 ### Multi-cards inference with FP8
 Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
@@ -98,7 +98,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py
 
 Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
     --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
     --image_path "https://llava-vl.github.io/static/images/view.jpg" \
     --use_hpu_graphs \
@@ -112,7 +112,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python .
 Here are single-/multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.
 
 ```bash
-python3 run_image2text_lora_finetune.py \
+PT_HPU_LAZY_MODE=1 python3 run_image2text_lora_finetune.py \
     --model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
     --dataset_name nielsr/docvqa_1200_examples \
     --bf16 True \
@@ -145,7 +145,7 @@ python3 run_image2text_lora_finetune.py \
 ```
 
 ```bash
-python3 ../gaudi_spawn.py \
+PT_HPU_LAZY_MODE=1 python3 ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_image2text_lora_finetune.py \
     --model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
     --dataset_name nielsr/docvqa_1200_examples \