diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx index c882de2629..6762f8d846 100644 --- a/docs/source/quickstart.mdx +++ b/docs/source/quickstart.mdx @@ -121,7 +121,7 @@ pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0 With DeepSpeed successfully installed we can now run a distributed GPT-2 inference on an 8 HPU system as follows: ```bash number_of_devices=8 \ -python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \ run_generation.py \ --model_name_or_path meta-llama/Llama-2-7b-hf \ --use_hpu_graphs \ @@ -167,7 +167,7 @@ python run_clm.py \ To train GPT-2 model using multi-card Gaudi system: ```bash number_of_devices=8 \ -python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size ${number_of_devices} \ run_clm.py \ --model_name_or_path gpt2 \ --dataset_name wikitext \ diff --git a/docs/source/usage_guides/multi_node_training.mdx b/docs/source/usage_guides/multi_node_training.mdx index 19bdb80e54..3fe7ee9396 100644 --- a/docs/source/usage_guides/multi_node_training.mdx +++ b/docs/source/usage_guides/multi_node_training.mdx @@ -92,7 +92,7 @@ We are going to use the [causal language modeling example which is given in the The first step consists in training the model on several nodes with this command: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --hostfile path_to_hostfile --use_deepspeed run_clm.py \ --model_name_or_path gpt2-xl \ --gaudi_config_name Habana/gpt2 \ diff --git a/examples/contrastive-image-text/README.md b/examples/contrastive-image-text/README.md index def6d74ec0..e1bbc8e8cd 100644 --- a/examples/contrastive-image-text/README.md +++ b/examples/contrastive-image-text/README.md @@ -173,7 +173,7 @@ For training BridgeTower, you need to run the `run_bridgetower.py` script. 
For instance, to reproduce the results presented in [this blog post](https://huggingface.co/blog/bridgetower), you should run: ```bash -python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \ --output_dir /tmp/bridgetower-test \ --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \ --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \ diff --git a/examples/image-classification/README.md b/examples/image-classification/README.md index 01b19b25ba..61edea6eb8 100644 --- a/examples/image-classification/README.md +++ b/examples/image-classification/README.md @@ -176,7 +176,7 @@ $ huggingface-cli login 3. When running the script, pass the following arguments: ```bash -python run_image_classification.py \ +PT_HPU_LAZY_MODE=1 python run_image_classification.py \ --push_to_hub \ --push_to_hub_model_id \ ... @@ -288,7 +288,7 @@ To run only inference, you can start from the commands above and you just have t For instance, you can run inference with ViT on Cifar10 on 1 Gaudi card with the following command: ```bash -python run_image_classification.py \ +PT_HPU_LAZY_MODE=1 python run_image_classification.py \ --model_name_or_path google/vit-base-patch16-224-in21k \ --dataset_name cifar10 \ --output_dir /tmp/outputs/ \ diff --git a/examples/language-modeling/README.md b/examples/language-modeling/README.md index 5cce1528dc..b5daec00b7 100644 --- a/examples/language-modeling/README.md +++ b/examples/language-modeling/README.md @@ -79,7 +79,7 @@ python run_clm.py \ ### Multi-card Training (GPT2) ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_clm.py \ --model_name_or_path gpt2 \ --dataset_name wikitext \ @@ -109,7 +109,7 @@ Fine tuning on 8 HPU cards takes around 6 minutes with a batch size of 32 (4 per It reaches a perplexity of 14.011. 
```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_deepspeed run_clm.py \ --model_name_or_path EleutherAI/gpt-j-6b \ --dataset_name wikitext \ @@ -143,7 +143,7 @@ It reaches a perplexity of 10.469. > Please refer to [this page](https://github.com/huggingface/optimum-habana/tree/main/examples/multi-node-training) for performing multi-node training properly. ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --hostfile path_to_my_hostfile --use_deepspeed run_clm.py \ --model_name_or_path EleutherAI/gpt-neox-20b \ --dataset_name wikitext \ @@ -175,7 +175,7 @@ converge slightly slower (over-fitting takes more epochs). ### Multi-card Training ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_mlm.py \ --model_name_or_path roberta-base \ --dataset_name wikitext \ @@ -292,7 +292,7 @@ python3 run_lora_clm.py \ - Multi-card finetuning of gemma2 using chat template: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 2 --use_mpi run_lora_clm.py \ --model_name_or_path google/gemma-2b-it \ --per_device_train_batch_size 16 \ @@ -509,7 +509,7 @@ Default `peft_type` is `prompt_tuning`, you could enable prefix-tuning or p-tuni Use the prompt finetuned model for text-generation: ```bash -python3 ../text-generation/run_generation.py \ +PT_HPU_LAZY_MODE=1 python3 ../text-generation/run_generation.py \ --model_name_or_path meta-llama/Llama-2-7b-hf \ --max_new_tokens 128 \ --bf16 \ diff --git a/examples/protein-folding/README.md b/examples/protein-folding/README.md index 8997c75143..b55090676b 100644 --- a/examples/protein-folding/README.md +++ b/examples/protein-folding/README.md @@ -50,7 +50,7 @@ python run_zero_shot_eval.py --bf16 --max_seq_length 1024 ## Multi-HPU finetune for sequence classification task ```bash -python ../gaudi_spawn.py --world_size 8 --use_mpi run_sequence_classification.py \ 
+PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi run_sequence_classification.py \ --output_dir ./out \ --model_name_or_path mila-intel/protst-esm1b-for-sequential-classification \ --tokenizer_name facebook/esm1b_t33_650M_UR50S \ diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md index d7a83ea5c8..1d11cd533f 100755 --- a/examples/question-answering/README.md +++ b/examples/question-answering/README.md @@ -40,7 +40,7 @@ pip install -r requirements.txt Here is a command you can run to train a Llama model for question answering: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_deepspeed run_qa.py \ --model_name_or_path meta-llama/Llama-2-7b-chat-hf \ --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \ diff --git a/examples/speech-recognition/README.md b/examples/speech-recognition/README.md index 1f0f8fbe38..3bc553fc5d 100644 --- a/examples/speech-recognition/README.md +++ b/examples/speech-recognition/README.md @@ -102,7 +102,7 @@ On a single HPU, this script should run in *ca.* 6 hours and yield a CTC loss of The following command shows how to fine-tune [wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60) on [Librispeech](https://huggingface.co/datasets/librispeech_asr) using 8 HPUs. 
```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_speech_recognition_ctc.py \ --dataset_name librispeech_asr \ --model_name_or_path facebook/wav2vec2-large-lv60 \ @@ -154,7 +154,7 @@ DeepSpeed can be used with almost the same command as for a multi-card run: For example: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_deepspeed run_speech_recognition_ctc.py \ --dataset_name librispeech_asr \ --model_name_or_path facebook/wav2vec2-large-lv60 \ @@ -273,7 +273,7 @@ If training on a different language, you should be sure to change the `language` ### Multi HPU Whisper Training with Seq2Seq The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_speech_recognition_seq2seq.py \ --model_name_or_path="openai/whisper-large" \ --dataset_name="mozilla-foundation/common_voice_11_0" \ diff --git a/examples/stable-diffusion/README.md b/examples/stable-diffusion/README.md index 9919780543..1a3a2455d8 100644 --- a/examples/stable-diffusion/README.md +++ b/examples/stable-diffusion/README.md @@ -72,7 +72,7 @@ python text_to_image_generation.py \ Distributed inference with multiple HPUs is also supported. 
Below is an example demonstrating how to generate images with two prompts on two HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 2 text_to_image_generation.py \ --model_name_or_path CompVis/stable-diffusion-v1-4 \ --prompts "An image of a squirrel in Picasso style" "A shiny flying horse taking off" \ @@ -147,7 +147,8 @@ python text_to_image_generation.py \ Here is how to generate images and depth maps with two prompts on two HPUs: ```bash -python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ + --world_size 2 text_to_image_generation.py \ --model_name_or_path "Intel/ldm3d-4c" \ --prompts "An image of a squirrel in Picasso style" "A shiny flying horse taking off" \ --num_images_per_prompt 10 \ @@ -219,7 +220,8 @@ python text_to_image_generation.py \ SDXL also supports distributed inferencing with Intel Gaudi accelerators. Below is an example of generating SDXL images in a distributed manner using two prompts on two HPUs: ```bash -python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ + --world_size 2 text_to_image_generation.py \ --model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \ --prompts "Sailing ship painting by Van Gogh" "A shiny flying horse taking off" \ --prompts_2 "Red tone" "Blue tone" \ @@ -481,7 +483,8 @@ The ControlNet example can be run with multiple prompts by supplying more than o Additionally, it supports distributed execution. 
Below is an example of generating images conditioned by the Canny edge model using two prompts on two HPUs: ```bash -python ../gaudi_spawn.py --world_size 2 text_to_image_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ + --world_size 2 text_to_image_generation.py \ --model_name_or_path CompVis/stable-diffusion-v1-4 \ --controlnet_model_name_or_path lllyasviel/sd-controlnet-canny \ --prompts "futuristic-looking woman" "a rusty robot" \ diff --git a/examples/summarization/README.md b/examples/summarization/README.md index bdaef78edf..b934aca5ab 100644 --- a/examples/summarization/README.md +++ b/examples/summarization/README.md @@ -152,7 +152,7 @@ And as with the CSV files, you can specify which values to select from the file, Here is an example on 8 HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_summarization.py \ --model_name_or_path t5-small \ --do_train \ diff --git a/examples/text-classification/README.md b/examples/text-classification/README.md index 9ffc78ae43..03f9b04826 100644 --- a/examples/text-classification/README.md +++ b/examples/text-classification/README.md @@ -72,7 +72,7 @@ python run_glue.py \ Here is how you would 
fine-tune the BERT large model (with whole word masking) on the text classification MRPC task using the `run_glue` script, with 8 HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_glue.py \ --model_name_or_path bert-large-uncased-whole-word-masking \ --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \ @@ -101,7 +101,7 @@ python ../gaudi_spawn.py \ Similarly to multi-card training, here is how you would fine-tune the BERT large model (with whole word masking) on the text classification MRPC task using DeepSpeed with 8 HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_deepspeed run_glue.py \ --model_name_or_path bert-large-uncased-whole-word-masking \ --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \ @@ -176,7 +176,7 @@ Llama Guard can be used for text classification. The Transformers library will c Llama Guard can be fine-tuned with DeepSpeed, here is how you would do it on the text classification MRPC task using DeepSpeed with 8 HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_deepspeed run_glue.py \ --model_name_or_path meta-llama/LlamaGuard-7b \ --gaudi_config Habana/llama \ diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md index 161057b699..3e20936610 100755 --- a/examples/text-generation/README.md +++ b/examples/text-generation/README.md @@ -44,16 +44,16 @@ In this section, we present how to benchmark a model on Intel Gaudi AI Accelerat To run generation with DeepSpeed-inference, you must launch the script as follows: ```bash -python ../gaudi_spawn.py --use_deepspeed --world_size number_of_devices run_generation.py ARGS +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size number_of_devices run_generation.py ARGS ``` To run multiple DeepSpeed tasks simultaneously, you can launch them with 
different `master_port` and [`HABANA_VISIBLE_MODULES`](https://docs.habana.ai/en/latest/PyTorch/PT_Multiple_Tenants_on_HPU/Multiple_Dockers_each_with_Single_Workload.html#running-distributed-workload-inside-the-docker-container), for example: ```bash # the following tasks could run simultaneously in a container with 8 HPUs -HABANA_VISIBLE_MODULES="0,1" python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py ARGS # using the default master_port=29500 -HABANA_VISIBLE_MODULES="2,3,4,5" python ../gaudi_spawn.py --use_deepspeed --world_size 4 --master_port 29501 run_generation.py ARGS -HABANA_VISIBLE_MODULES="6,7" python ../gaudi_spawn.py --use_deepspeed --world_size 2 --master_port 29502 run_generation.py ARGS +HABANA_VISIBLE_MODULES="0,1" PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py ARGS # using the default master_port=29500 +HABANA_VISIBLE_MODULES="2,3,4,5" PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 4 --master_port 29501 run_generation.py ARGS +HABANA_VISIBLE_MODULES="6,7" PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 2 --master_port 29502 run_generation.py ARGS ``` Without DeepSpeed-inference, you can run the script with: @@ -136,7 +136,7 @@ Here are a few settings you may be interested in: For example, you can reproduce the results presented in [this blog post](https://huggingface.co/blog/habana-gaudi-2-bloom) with the following command: ```bash -python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path bigscience/bloom \ --batch_size 1 \ --use_hpu_graphs \ @@ -147,7 +147,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ You can also run Llama2-70B on Gaudi2 with all optimizations enabled using the following command: ```bash -python ../gaudi_spawn.py --use_deepspeed --world_size 8 
run_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path meta-llama/Llama-2-70b-hf \ --max_new_tokens 4096 \ --bf16 \ @@ -176,7 +176,7 @@ python run_generation.py \ To run Falcon-40B inference on 8 Gaudi2 cards, use the following command: ```bash -python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path tiiuae/falcon-40b \ --max_new_tokens 2048 \ --bf16 \ @@ -190,7 +190,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ To run Llama3-405B inference on 8 Gaudi3 cards use the following command: ```bash -python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \ --max_new_tokens 2048 \ --bf16 \ @@ -375,7 +375,7 @@ https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP Here is an example to measure the tensor quantization statistics on LLama2-70b: ```bash -QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_measure.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_lm_eval.py \ -o acc_70b_bs1_measure.txt \ --model_name_or_path meta-llama/Llama-2-70b-hf \ @@ -393,7 +393,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py Here is an example to quantize the model based on previous measurements for LLama2-70b: ```bash -QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_lm_eval.py \ -o acc_70b_bs1_quant.txt \ --model_name_or_path 
meta-llama/Llama-2-70b-hf \ @@ -411,7 +411,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \ Alternatively, here is another example to quantize the model based on previous measurements for LLama2-70b: ```bash -QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path meta-llama/Llama-2-70b-hf \ --attn_softmax_bf16 \ @@ -457,7 +457,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generati Here is an example to measure the tensor quantization statistics on Falcon-180B with 8 cards: > Please note that Falcon-180B is a gated model, and users are required to request access to it. Please refer to the instructions provided in the StarCoder example above. ```bash -QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_lm_eval.py \ -o acc_falcon180b_bs1_quant.txt \ --model_name_or_path tiiuae/falcon-180B \ @@ -474,7 +474,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python .. 
Here is an example to quantize the model based on previous measurements for Falcon-180B with 8 cards: ```bash -QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path tiiuae/falcon-180B \ --use_hpu_graphs \ @@ -494,7 +494,7 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \ Here is an example to measure the tensor quantization statistics on Llama3-405B with 8 cards: > Please note that Llama3-405B requires minimum 16 cards Gaudi2 and 8 cards Gaudi3. ```bash -QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_lm_eval.py \ -o acc_llama3_405b_bs1_quant.txt \ --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \ @@ -512,7 +512,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python .. Here is an example to quantize the model based on previous measurements for Llama3-405B with 8 cards: > Please note that Llama3-405B requires minimum 16 cards Gaudi2 and 8 cards Gaudi3. ```bash -QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \ +QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \ --use_hpu_graphs \ @@ -670,7 +670,7 @@ You can load pre-quantized FP8 models using the `--load_quantized_model_with_inc Below is an example of how to load `neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8` on two cards. 
```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --use_deepspeed --world_size 2 run_lm_eval.py \ -o acc_load_fp8_model.txt \ --model_name_or_path neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \ @@ -745,7 +745,7 @@ Habana Flash Attention addresses large sequence lengths on prompt stage of infer Below example uses `flash_attention_recompute` mode in order to reduce memory consumption on prompt stage. Additionally since all sequences in a batch are of the same length it uses `flash_attention_causal_mask` which will further improve performance by taking advantage of specific lower-diagonal shape of inputs to softmax operation. ```bash -python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \ --model_name_or_path meta-llama/Llama-2-70b-hf \ --use_hpu_graphs \ --limit_hpu_graphs \ diff --git a/examples/translation/README.md b/examples/translation/README.md index 1d705d23fc..e417fc8495 100644 --- a/examples/translation/README.md +++ b/examples/translation/README.md @@ -135,7 +135,7 @@ python run_translation.py \ Here is an example of distributing training on 8 HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_mpi run_translation.py \ --model_name_or_path t5-small \ --do_train \ @@ -167,7 +167,7 @@ python ../gaudi_spawn.py \ Here is an example with DeepSpeed on 8 HPUs: ```bash -python ../gaudi_spawn.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \ --world_size 8 --use_deepspeed run_translation.py \ --model_name_or_path t5-small \ --do_train \ diff --git a/examples/trl/README.md b/examples/trl/README.md index 5e488e7072..4b028f5dca 100644 --- a/examples/trl/README.md +++ b/examples/trl/README.md @@ -46,7 +46,7 @@ $ pip install -U -r requirements.txt 2. 
Supervised fine-tuning of the mistralai/Mixtral-8x7B-Instruct-v0.1 on 4 cards: ``` - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \ + DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \ --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \ --dataset_name "philschmid/dolly-15k-oai-style" \ --subset 'data/' \ @@ -88,7 +88,7 @@ steps like: 1. Supervised fine-tuning of the base llama-v2-70b model to create llama-v2-70b-se: ``` - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py \ + DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py \ --model_name_or_path meta-llama/Llama-2-70b-hf \ --dataset_name "lvwerra/stack-exchange-paired" \ --deepspeed ../language-modeling/llama2_ds_zero3_config.json \ @@ -120,7 +120,7 @@ steps like: 2. Run the DPO trainer using the model saved by the previous step: ``` - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed dpo.py \ + DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed dpo.py \ --model_name_or_path="sft/final_merged_checkpoint" \ --tokenizer_name_or_path=meta-llama/Llama-2-70b-hf \ --deepspeed ../language-modeling/llama2_ds_zero3_config.json \ @@ -147,7 +147,7 @@ which will also push the model to your HuggingFace hub account. We can load the DPO-trained LoRA adaptors which were saved by the DPO training step and run it through the [text-generation example](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation). 
``` -python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_generation.py \ +PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_generation.py \ --model_name_or_path ../trl/stack-llama-2/ \ --use_hpu_graphs --use_kv_cache --batch_size 1 --bf16 --max_new_tokens 100 \ --prompt "Here is my prompt" @@ -163,7 +163,7 @@ The following example is for the creation of StackLlaMa 2: a Stack exchange llam There are three main steps to the PPO training process: 1. Supervised fine-tuning of the base llama-v2-7b model to create llama-v2-7b-se: ``` - python ../gaudi_spawn.py --world_size 8 --use_mpi sft.py \ + PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi sft.py \ --model_name_or_path meta-llama/Llama-2-7b-hf \ --dataset_name "lvwerra/stack-exchange-paired" \ --output_dir="./sft" \ @@ -193,7 +193,7 @@ There are three main steps to the PPO training process: ``` 2. Reward modeling using dialog pairs from the SE dataset on the llama-v2-7b-se to create llama-v2-7b-se-rm ``` - python ../gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py \ + PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py \ --model_name_or_path=./sft/final_merged_checkpoint \ --tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \ --output_dir=./rm @@ -206,7 +206,7 @@ There are three main steps to the PPO training process: 3. RL fine-tuning of llama-v2-7b-se with the llama-v2-7b-se-rm reward model: ``` - python ../gaudi_spawn.py --world_size 8 --use_mpi ppo.py \ + PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --world_size 8 --use_mpi ppo.py \ --model_name_or_path=./sft/final_merged_checkpoint \ --reward_model_name=./rm_merged_checkpoint \ --tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \