diff --git a/README_GAUDI.md b/README_GAUDI.md index d557ea1186c7..1af0863c72bc 100644 --- a/README_GAUDI.md +++ b/README_GAUDI.md @@ -4,8 +4,8 @@ This README provides instructions on how to run vLLM with Intel Gaudi devices. # Requirements and Installation -Please follow the instructions provided in the [Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html) to set up the execution environment. -To achieve the best performance, please follow the methods outlined in the +To set up the execution environment, please follow the instructions in the [Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html). +To achieve the best performance on HPU, please follow the methods outlined in the [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html). ## Requirements @@ -16,7 +16,7 @@ To achieve the best performance, please follow the methods outlined in the - Intel Gaudi software version 1.21.0 and above ## Quick Start Using Dockerfile -Set up the container with latest release of Gaudi Software Suite using the Dockerfile: +Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile. ### Ubuntu @@ -26,12 +26,12 @@ $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_ ``` > [!TIP] -> If you are facing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Optional Packages" section +> If you are facing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to the "Install Optional Packages" section of [Install Driver and Software](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#install-driver-and-software) and "Configure Container Runtime" section of [Docker Installation](https://docs.habana.ai/en/latest/Installation_Guide/Installation_Methods/Docker_Installation.html#configure-container-runtime). Make sure you have ``habanalabs-container-runtime`` package installed and that ``habana`` container runtime is registered. -### Red Hat Enterprise Linux for use with Red Hat OpenShift AI. +### Red Hat Enterprise Linux for Use with Red Hat OpenShift AI ``` $ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env . @@ -54,7 +54,7 @@ Refer to [System Verification and Final Tests](https://docs.habana.ai/en/latest/ ### Run Docker Image -It is highly recommended to use the latest Docker image from Intel Gaudi vault. +It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers) for more details. Use the following commands to run a Docker image. Make sure to update the versions below as listed in the [Support Matrix](https://docs.habana.ai/en/latest/Support_Matrix/Support_Matrix.html): @@ -82,7 +82,7 @@ $ python setup.py develop #### 2. Build and Install the latest from vLLM-fork -Currently, the latest features and performance optimizations are being developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and periodically upstreamed to vLLM main repository. +Currently, the latest features and performance optimizations are being developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and periodically upstreamed to the vLLM main repository. 
To install latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following: ```{.console} @@ -94,7 +94,7 @@ $ pip install -r requirements-hpu.txt $ python setup.py develop ``` -#### 3. Build and Install from vLLM main source +#### 3. Build and Install from the vLLM main source If you prefer to build and install directly from the main vLLM source, where periodically we are upstreaming new features, run the following: @@ -113,77 +113,77 @@ $ python setup.py develop | HPU autodetection | HPU users do not need to specify the target platform, it will be detected automatically upon vLLM startup | N/A | | Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains a custom Paged Attention and cache operators implementations optimized for Gaudi devices. | N/A | | Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding. | N/A | -| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend support multi-HPU inference across multiple nodes with tensor parallelism with multiprocessing or Ray and HCCL. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html)
[Example](https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html)
[HCCL reference](https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/index.html) | -| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multi-node with pipeline parallelism. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html)
[How to run](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#pipeline-parallelism) | -| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)
[vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes)
[Optimization guide](https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#hpu-graph-capture) | -| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | [vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes) | -| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html) | -| AutoAWQ quantization | vLLM HPU backend supports the inference with models quantized using AutoAWQ library. | [Library](https://github.com/casper-hansen/AutoAWQ) | -| AutoGPTQ quantization | vLLM HPU backend supports the inference with models quantized using AutoGPTQ library. | [Library](https://github.com/AutoGPTQ/AutoGPTQ) | +| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism with multiprocessing or Ray and HCCL. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html)
[Example](https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html)
[HCCL reference](https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/index.html) | +| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multi-node with pipeline parallelism. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html)
[Running Pipeline Parallelism](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#pipeline-parallelism) | +| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)
[vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes)
[Optimization guide](https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#hpu-graph-capture) | +| Inference with torch.compile | vLLM HPU backend supports inference with `torch.compile`. | [vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes) | +| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html) | +| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | [Library](https://github.com/casper-hansen/AutoAWQ) | +| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | [Library](https://github.com/AutoGPTQ/AutoGPTQ) | | LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | [Documentation](https://docs.vllm.ai/en/stable/models/lora.html)
[Example](https://docs.vllm.ai/en/stable/getting_started/examples/multilora_inference.html)
[vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) | | Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by standard `--num-scheduler-steps` parameter. | [Feature RFC](https://github.com/vllm-project/vllm/issues/6854) | | Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by standard `--enable-prefix-caching` parameter. | [Documentation](https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html)
[Details](https://docs.vllm.ai/en/stable/automatic_prefix_caching/details.html) | -| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurabie via standard `--speculative_model` and `--num_speculative_tokens` parameters. (Not fully supported with t.compile execution mode) | [Documentation](https://docs.vllm.ai/en/stable/models/spec_decode.html)
[Example](https://docs.vllm.ai/en/stable/getting_started/examples/mlpspeculator.html) | +| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via standard `--speculative_model` and `--num_speculative_tokens` parameters. (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/stable/models/spec_decode.html)
[Example](https://docs.vllm.ai/en/stable/getting_started/examples/mlpspeculator.html) | | Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. The vLLM HPU backend supports it alongside Ray. | [Documentation](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) | | Multimodal | vLLM HPU backend supports the inference for multi-modal models. (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/latest/serving/multimodal_inputs.html) | -| Multinode support | vLLM HPU backend supports distributed, multiple nodes inferencing with Ray. | | -| vLLM v1 architecture (early release) | V1 architecture is now available for HPU backend, and will gradually enable it for every use case we plan to support. | [Documentation](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) | -| Guided decode | vLLM HPU supports guided decoding backend for the generation of structured outputs. | [Documentation](https://docs.vllm.ai/en/latest/features/structured_outputs.html) | +| Multinode support | vLLM HPU backend supports distributed, multiple-node inference with Ray. | | +| vLLM v1 architecture (early release) | V1 architecture is now available for the HPU backend, and we will gradually enable it for every use case we plan to support. | [Documentation](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) | +| Guided decode | vLLM HPU supports a guided decoding backend for generating structured outputs. | [Documentation](https://docs.vllm.ai/en/latest/features/structured_outputs.html) | | Delayed Sampling (experimental) | vLLM HPU supports delayed sampling scheduling for asynchronous execution, enabled by `VLLM_DELAYED_SAMPLING=true` environment variable. | N/A | | Exponential bucketing (experimental) | vLLM HPU supports exponential bucketing spacing instead of linear to automate configuration of bucketing mechanism, enabled by `VLLM_EXPONENTIAL_BUCKETING=true` environment variable. | N/A | > [!NOTE] -> All specified features are supported with --enforce-eager flag as well. +> All specified features are also supported with the `--enforce-eager` flag. # Unsupported Features - Beam search - Prefill chunking (mixed-batch inferencing) -# Validated models and configurations +# Validated Models and Configurations -The following configurations have been validated to be function with Gaudi2 or Gaudi 3 devices with random or greedy sampling. Configurations that are not listed may or may not work. +The following configurations have been validated to function with Gaudi 2 or Gaudi 3 devices with random or greedy sampling. Configurations that are not listed may or may not work.
| **Model** | **Tensor Parallelism [x HPU]** | **Datatype** | **Validated on** | |:--- |:---: |:---: |:---: | -| [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) | 1, 2, 8 | BF16 | Gaudi2, Gaudi3| -| [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 1, 2, 8 | BF16 | Gaudi2, Gaudi3| -| [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) | 8 | BF16 | Gaudi2, Gaudi3| -| [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 8 | BF16 | Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | 1, 2, 8 | BF16 | Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 1, 2, 8 | BF16 | Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | 8 | BF16 |Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 8 | BF16 |Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | 1 | BF16, FP8, INT4, FP16 (Gaudi2) | Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | 1 | BF16, FP8 | Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | 2, 4, 8 | BF16, FP8, INT4 |Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 2, 4, 8 | BF16, FP8, FP16 (Gaudi2) |Gaudi2, Gaudi3| -| [meta-llama/Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) | 8 | BF16, FP8 |Gaudi3| -| [meta-llama/Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) | 8 | BF16, FP8 |Gaudi3| -| [meta-llama/Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | 1 | BF16, FP8 | Gaudi2, Gaudi3| -| [meta-llama/Llama-3.2-90B-Vision](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision) | 4, 8 (min. for Gaudi 2) | BF16, FP8 | Gaudi2, Gaudi3| -| [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | 4, 8 (min. 
for Gaudi 2) | BF16 | Gaudi2, Gaudi3 | -| [meta-llama/Meta-Llama-3.2-405B](https://huggingface.co/meta-llama/Llama-3.2-405B) | 8 | BF16 | Gaudi3| -| [meta-llama/Meta-Llama-3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B) | 4 | BF16, FP8 | Gaudi3| -| [meta-llama/Granite-3B-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) | 1 | BF16 | Gaudi3| -| [meta-llama/Granite-3.0-8B-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) | 1 | BF16, FP8 | Gaudi2, Gaudi3| -| [meta-llama/Granite-20B-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) | 1 | BF16, FP8 | Gaudi2, Gaudi3| -| [meta-llama/Granite-34B-code-instruc-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) | 1 | BF16 | Gaudi3| -| [mistralai/Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) | 1, 4 | BF16 | Gaudi3| -| [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 1, 2 | BF16 | Gaudi2| -| [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | 2 | FP8, BF16 |Gaudi2, Gaudi3| -| [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 1, 8 | BF16 | Gaudi2, Gaudi3 | -| [princeton-nlp/gemma-2-9b-it-SimPO](https://huggingface.co/princeton-nlp/gemma-2-9b-it-SimPO) | 1 | BF16 |Gaudi2, Gaudi3| -| [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 | BF16 |Gaudi2| -| [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | 8 | BF16 |Gaudi2| -| [meta-llama/CodeLlama-34b-Instruct-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Instruct-hf) | 1 | BF16 |Gaudi3| -| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | 8 | FP8, BF16 |Gaudi2, Gaudi3| +| [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) | 1, 2, 8 | BF16 | Gaudi 2, Gaudi 3| +| [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 1, 2, 8 | BF16 | Gaudi 2, Gaudi 3| +| [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) | 8 | BF16 | Gaudi 2, Gaudi 3| +| [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 8 | BF16 | Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | 1, 2, 8 | BF16 | Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 1, 2, 8 | BF16 | Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | 8 | BF16 |Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 8 | BF16 |Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | 1 | BF16, FP8, INT4, FP16 (Gaudi 2) | Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | 2, 4, 8 | BF16, FP8, INT4 |Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 2, 4, 8 | BF16, FP8, FP16 (Gaudi 2) |Gaudi 2, Gaudi 3| +| [meta-llama/Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) | 8 | BF16, FP8 |Gaudi 3| +| 
[meta-llama/Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) | 8 | BF16, FP8 |Gaudi 3| +| [meta-llama/Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3| +| [meta-llama/Llama-3.2-90B-Vision](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision) | 4, 8 (min. for Gaudi 2) | BF16, FP8 | Gaudi 2, Gaudi 3| +| [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | 4, 8 (min. for Gaudi 2) | BF16 | Gaudi 2, Gaudi 3 | +| [meta-llama/Meta-Llama-3.2-405B](https://huggingface.co/meta-llama/Llama-3.2-405B) | 8 | BF16 | Gaudi 3| +| [meta-llama/Meta-Llama-3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B) | 4 | BF16, FP8 | Gaudi 3| +| [meta-llama/Granite-3B-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) | 1 | BF16 | Gaudi 3| +| [meta-llama/Granite-3.0-8B-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3| +| [meta-llama/Granite-20B-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3| +| [meta-llama/Granite-34B-code-instruc-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) | 1 | BF16 | Gaudi 3| +| [mistralai/Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) | 1, 4 | BF16 | Gaudi 3| +| [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 1, 2 | BF16 | Gaudi 2| +| [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | 2 | FP8, BF16 |Gaudi 2, Gaudi 3| +| [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 1, 8 | BF16 | Gaudi 2, Gaudi 3 | +| [princeton-nlp/gemma-2-9b-it-SimPO](https://huggingface.co/princeton-nlp/gemma-2-9b-it-SimPO) | 1 | BF16 |Gaudi 2, Gaudi 3| +| [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 | BF16 |Gaudi 2| +| [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | 8 | BF16 |Gaudi 2| +| [meta-llama/CodeLlama-34b-Instruct-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Instruct-hf) | 1 | BF16 |Gaudi 3| +| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
[quick start scripts](https://github.com/HabanaAI/vllm-fork/blob/deepseek_r1/scripts/DEEPSEEK_R1_ON_GAUDI.md) | 8 | FP8, BF16 |Gaudi 2, Gaudi 3| # Performance Tuning ## Execution Modes -Currently, vLLM for HPU supports four execution modes, determined by the selected HPU PyTorch Bridge backend (via the PT_HPU_LAZY_MODE environment variable) and the --enforce-eager flag. +Currently, vLLM for HPU supports four execution modes, determined by the selected HPU PyTorch Bridge backend (via the `PT_HPU_LAZY_MODE` environment variable) and the `--enforce-eager` flag. | `PT_HPU_LAZY_MODE` | `enforce_eager` | Execution Mode | | ------------------ | --------------- | ------------------ | @@ -193,23 +193,23 @@ Currently, vLLM for HPU supports four execution modes, determined by the selecte | 1 | 1 | PyTorch lazy mode | > [!NOTE] -> Starting with the 1.21.0 Intel Gaudi software release, the torch.compile execution mode became the default for vLLM. HPU Graphs mode remains supported to ensure backward compatibility. Please verify the compatibility of the torch.compile mode with the information in the [Supported Features](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#supported-features) table. +> Starting with the 1.21.0 Intel Gaudi software release, the `torch.compile` execution mode is the default for vLLM. HPU Graphs mode remains supported to ensure backward compatibility. Please verify the compatibility of the `torch.compile` mode with the information in the [Supported Features](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#supported-features) table. > [!TIP] -> We recommend experimenting with the PT_HPU_LAZY_MODE environment variable to determine whether HPU Graphs or torch.compile mode performs better for your specific use case. While both modes generally deliver comparable performance, certain edge cases may favor one over the other. +> We recommend experimenting with the `PT_HPU_LAZY_MODE` environment variable to determine whether HPU Graphs or `torch.compile` mode performs better for your specific use case. While both modes generally deliver comparable performance, certain edge cases may favor one over the other. ## Bucketing Mechanism Intel Gaudi accelerators perform best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) generates optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be highly dependent on input and output tensor shapes, requiring graph recompilation when encountering tensors with different shapes within the same topology. While these binaries efficiently utilize Gaudi, the compilation process itself can introduce noticeable overhead in end-to-end execution. -In dynamic inference serving scenarios, it is important to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently, this is achieved by +In dynamic inference serving scenarios, minimizing the number of graph compilations and reducing the risk of graph compilation occurring during server runtime is important. Currently, this is achieved by "bucketing" the model's forward pass across two dimensions: `batch_size` and `sequence_length`. > [!NOTE] -> Bucketing helps significantly reduce the number of required graphs, but it does not handle graph compilation or device code generation. 
These tasks are performed during the warmup and HPUGraph capture phase. +> Bucketing helps significantly reduce the number of required graphs, but does not handle graph compilation or device code generation. These tasks are performed during the warmup and HPUGraph capture phase. -Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters +Bucketing ranges are determined with 3 parameters - `min`, `step`, and `max`. They can be set separately for the prompt and decode phase, and batch size and sequence length dimensions. These parameters can be observed in logs during vLLM startup: ```{.} @@ -219,7 +219,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, ma INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] ``` -`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling - `min` gets multiplied by consecutive powers of two, until the multiplier is less than or equal to `step`. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, +`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, the interval between `min` and `step` has special handling - `min` gets multiplied by consecutive powers of two, until the multiplier is less than or equal to `step`. We call this the ramp-up phase, and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes. **Example with ramp-up** @@ -250,7 +250,7 @@ The boundaries of the buckets are user-configurable via environment variables, a For example, if a request with 3 sequences, each having a maximum sequence length of 412, is sent to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket. This is because the `batch_size` (number of sequences) will be padded to 4 (the nearest batch size dimension higher than 3), and the maximum sequence length will be padded to 512 (the nearest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will remain in this bucket until either the batch dimension changes (e.g., due to a request being completed), in which case it will become -a `(2, 512)` bucket, or the context length increases beyond 512 tokens, at which point it will become a `(4, 640)` bucket. +a `(2, 512)` bucket, or the context length increases beyond 512 tokens. It will become a `(4, 640)` bucket at that point. > [!NOTE] > Bucketing is transparent to the user – padding in the sequence length dimension is never returned, and padding in the batch dimension does not create new requests. 
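The snippet below is a minimal sketch of how the linear bucketing ranges could be constrained through the `VLLM_{phase}_{dim}_BUCKET_{param}` environment variables described under the Performance Tuning Knobs later in this README; the specific values are illustrative placeholders for a short-context workload, not tuned recommendations.

```bash
# Illustrative only: override the linear bucketing ranges before starting the server.
# Prompt buckets: batch size 1-16, sequence length 128-2048 in steps of 256.
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=4
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=128
export VLLM_PROMPT_SEQ_BUCKET_STEP=256
export VLLM_PROMPT_SEQ_BUCKET_MAX=2048
# Decode buckets use the BLOCK dimension instead of SEQ.
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BS_BUCKET_STEP=32
export VLLM_DECODE_BS_BUCKET_MAX=256
export VLLM_DECODE_BLOCK_BUCKET_MIN=128
export VLLM_DECODE_BLOCK_BUCKET_STEP=128
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048

# The resulting prompt and decode bucket configurations are printed at startup,
# as in the log excerpt above.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --device hpu
```

Narrower ranges mean fewer buckets to warm up and capture, at the cost of more padding for shapes that fall near the upper bounds.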
@@ -278,7 +278,7 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size ``` > [!TIP] -> Compiling all the buckets may take some time and can be disabled by setting the VLLM_SKIP_WARMUP=true environment variable. Keep in mind that if you do this, you may encounter graph compilations +> Compiling all the buckets may take some time and can be disabled by setting the `VLLM_SKIP_WARMUP=true` environment variable. Remember that if you do this, you may encounter graph compilations when executing a given bucket for the first time. > [!WARNING] @@ -305,18 +305,18 @@ regardless of the total device memory. You can also configure the strategy for capturing HPU graphs separately for the prompt and decode stages. The strategy affects the order in which graphs are captured. Two strategies are implemented: -- `max_bs` - The graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in an ascending order +- `max_bs` - The graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g., `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1,256)`), which is the default strategy for decode. -- `min_tokens` - The graph capture queue is sorted in an ascending order by the number of tokens each graph processes (`batch_size*sequence_length`), which is the default strategy for prompt. +- `min_tokens` - The graph capture queue is sorted in ascending order by the number of tokens each graph processes (`batch_size*sequence_length`), which is the default strategy for prompt. -When a large number of requests are pending, the vLLM scheduler attempts to fill the maximum batch size for decoding as quickly as possible. Once a request is finished, the decode batch size decreases. +When many requests are pending, the vLLM scheduler attempts to fill the maximum batch size for decoding as quickly as possible. Once a request is finished, the decode batch size decreases. When this happens, vLLM attempts to schedule a prefill iteration for requests in the waiting queue to restore the decode batch size to its previous state. In a fully loaded scenario, the decode batch size is often at its maximum, making large-batch HPU graphs critical to capture, as indicated by the `max_bs` strategy. Conversely, prefill iterations will typically be executed with very low batch sizes (1-4), as reflected in the `min_tokens` strategy. > [!NOTE] > `VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on the memory allocated for graphs in each stage (prefill and decode). vLLM first attempts to use the entire usable prefill graph memory -(usable graph memory * VLLM_GRAPH_PROMPT_RATIO) for capturing prefill HPU Graphs. It will then attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully +(usable graph memory * VLLM_GRAPH_PROMPT_RATIO) to capture prefilled HPU graphs. It will then attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured and there is unused memory remaining in the usable graph memory pool, vLLM will attempt to capture more graphs for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of this mechanism is illustrated in the example below. 
@@ -365,23 +365,23 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi **Diagnostic and Profiling Knobs:** -- `VLLM_PROFILER_ENABLED`: if `true` - enables high level profiler. Resulting JSON traces can be viewed at [perfetto.habana.ai](https://perfetto.habana.ai/#!/viewer). Disabled by default. +- `VLLM_PROFILER_ENABLED`: if `true` - enables high-level profiler. Resulting JSON traces can be viewed at [perfetto.habana.ai](https://perfetto.habana.ai/#!/viewer). Disabled by default. - `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION`: if `true` - logs graph compilations for each vLLM engine step, but only if any compilation occurs. It is highly recommended to use this in conjunction with `PT_HPU_METRICS_GC_DETAILS=1`. Disabled by default. - `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL`: if `true` - logs graph compilations for every vLLM engine step, even if no compilation occurs. Disabled by default. - `VLLM_HPU_LOG_STEP_CPU_FALLBACKS`: if `true` - logs CPU fallbacks for each vLLM engine step, but only if any fallback occurs. Disabled by default. -- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`: if `true` - logs CPU fallbacks for each vLLM engine step, even if no fallback occur. Disabled by default. -- `VLLM_T_COMPILE_FULLGRAPH`: if `true` - PyTorch compile function raises an error if any graph breaks happened during compilation. This allows an easy detection of existing graph breaks, which usually reduce the performance. Disabled by default. +- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`: if `true` - logs CPU fallbacks for each vLLM engine step, even if no fallback occurs. Disabled by default. +- `VLLM_T_COMPILE_FULLGRAPH`: if `true` - PyTorch compile function raises an error if any graph breaks happen during compilation. This allows for the easy detection of existing graph breaks, which usually reduce performance. Disabled by default. **Performance Tuning Knobs:** -- `VLLM_SKIP_WARMUP`: if `true` - warmup is skipped. `false` by default. -- `VLLM_GRAPH_RESERVED_MEM`: percentage of memory dedicated for HPUGraph capture, `0.1` by default. -- `VLLM_GRAPH_PROMPT_RATIO`: percentage of reserved graph memory dedicated for prompt graphs, `0.3` by default. -- `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt graph capture, `min_tokens` or `max_bs`, `min_tokens` by default. -- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`, `max_bs` by default. -- `VLLM_EXPONENTIAL_BUCKETING`, if `true`, enables exponential bucket spacing instead of linear (experimental). -- `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only). +- `VLLM_SKIP_WARMUP`: if `true`, warmup is skipped. The default is `false`. +- `VLLM_GRAPH_RESERVED_MEM`: percentage of memory dedicated to HPUGraph capture. The default is `0.1`. +- `VLLM_GRAPH_PROMPT_RATIO`: percentage of reserved graph memory dedicated to prompt graphs. The default is `0.3`. +- `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt graph capture, `min_tokens` or `max_bs`. The default is `min_tokens`. +- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`. +- `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear (experimental). +- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only). 
- `{phase}` is either `PROMPT` or `DECODE` - `{dim}` is either `BS`, `SEQ` or `BLOCK` - `{param}` is either `MIN`, `STEP` or `MAX` @@ -412,25 +412,24 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)` > [!NOTE] -> Model config may report very high max_model_len, -> please set it to max input_tokens+output_tokens rounded up to multiple of block_size as per actual requirements. +> If the model config reports a very high `max_model_len`, set it to the maximum `input_tokens+output_tokens` rounded up to a multiple of `block_size`, as per actual requirements. > [!TIP] -> When a deployed workload does not utilize the full context that a model can handle, it is good practice to limit the maximum values upfront based on the input and output token lengths that will be generated after serving the vLLM server.

EXAMPLE:

Let's assume that we want to deploy text generation model Qwen2.5-1.5B, which has a defined "max_position_embeddings" of 131072 (our max_model_len). At the same time, we know that our workload pattern will not use the full context length because we expect a maximum input token size of 1K and predict generating a maximum of 2K tokens as output. In this case, it is not necessary to start the vLLM server to be ready for the full context length. Instead, we should limit it upfront to achieve faster service preparation and decrease warmup time. The recommended values in this example should be: -> - `--max_model_len`: `3072` - the sum of input and output sequences (1+2)*1024 -> - `VLLM_PROMPT_SEQ_BUCKET_MAX`: `1024` - the maximum input token size that we expect to handle +> When a deployed workload does not utilize the full context that a model can handle, it is good practice to limit the maximum values upfront based on the input and output token lengths that will be generated after serving the vLLM server.

**Example:**

Let's assume that we want to deploy text generation model Qwen2.5-1.5B, which has a defined `max_position_embeddings` of 131072 (our `max_model_len`). At the same time, we know that our workload pattern will not use the full context length because we expect a maximum input token size of 1K and predict generating a maximum of 2K tokens as output. In this case, starting the vLLM server to be ready for the full context length is unnecessary. Instead, we should limit it upfront to achieve faster service preparation and decrease warmup time. The recommended values in this example should be: +> - `--max_model_len`: `3072` - the sum of input and output sequences (1+2)*1024. +> - `VLLM_PROMPT_SEQ_BUCKET_MAX`: `1024` - the maximum input token size that we expect to handle. -- `VLLM_HANDLE_TOPK_DUPLICATES`, if ``true`` - handles duplicates that are outside of top-k. `false` by default. -- `VLLM_CONFIG_HIDDEN_LAYERS` - configures how many hidden layers to run in a HPUGraph for model splitting among hidden layers when TP is 1. The default is 1. - It helps improve throughput by reducing inter-token latency limitations in some models. +- `VLLM_HANDLE_TOPK_DUPLICATES`: if ``true`` - handles duplicates outside top-k. The default is `false`. +- `VLLM_CONFIG_HIDDEN_LAYERS`: configures how many hidden layers to run in a HPUGraph for model splitting among hidden layers when TP is 1. + It helps to improve throughput by reducing inter-token latency limitations in some models. The default is `1`. Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution: -- `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used, if `1` PyTorch Lazy backend for Gaudi will be used. `1` is the default. +- `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used. If `1`, PyTorch Lazy backend for Gaudi will be used. The default is `0`. -- `PT_HPU_ENABLE_LAZY_COLLECTIVES` must be set to `true` for tensor parallel inference with HPU Graphs. `true` is the default. -- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE` must be set to `false` for llava, qwen and RoBERTa models. `false` is the default. -- `VLLM_PROMPT_USE_FLEX_ATTENTION` is enabled only for llama model, and allows to use torch.nn.attention.flex_attention instead of FusedSDPA. Note, this requires `VLLM_PROMPT_USE_FUSEDSDPA=0`. `false` is the default. +- `PT_HPU_ENABLE_LAZY_COLLECTIVES`: must be set to `true` for tensor parallel inference with HPU Graphs. The default is `true`. +- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE`: must be set to `false` for LLaVA, qwen, and RoBERTa models. The default is `false`. +- `VLLM_PROMPT_USE_FLEX_ATTENTION`: enabled only for the Llama model, allowing usage of `torch.nn.attention.flex_attention` instead of FusedSDPA. Requires `VLLM_PROMPT_USE_FUSEDSDPA=0`. The default is `false`. # Quantization, FP8 Inference and Model Calibration Process @@ -483,16 +482,16 @@ Set the following environment variables to avoid OOM/functional issues. Additio **32K context length flags examples:** -- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B. -- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs. -- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, the value should be increased. 
-- `VLLM_PROMPT_BS_BUCKET_MAX=16` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs. -- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576` - Suggested value, depends on warmup results. -- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` - Suggested value for Intel Gaudi 3. -- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768` - Value for context length of 32K. Use 16384 for 16K. -- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024` - Suggested value, depends on warmup results. -- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024` - Suggested value, depends on warmup results. -- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792` - `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example: +- `VLLM_GRAPH_RESERVED_MEM`: The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B. +- `VLLM_PROMPT_BS_BUCKET_MIN=1`: Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs. +- `VLLM_PROMPT_BS_BUCKET_STEP=16`: Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, the value should be increased. +- `VLLM_PROMPT_BS_BUCKET_MAX=16`: Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs. +- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576`: Suggested value, depends on warmup results. +- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048`: Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` - Suggested value for Intel Gaudi 3. +- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768`: Value for context length of 32K. Use 16384 for 16K. +- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024`: Suggested value, depends on warmup results. +- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024`: Suggested value, depends on warmup results. +- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792`: `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example: - `128 *((32 + 1)* 1024) / 128` - `32 *((32 + 1)* 1024) / 128` @@ -514,34 +513,38 @@ Configuration: (prompt, 1, 36864) was not warmed-up! Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. ``` -**Usage of Multi-Step Scheduling feature** -Enabling of Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details. +## Multi-Step Scheduling Feature Usage + +Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details. # Pipeline Parallelism Pipeline parallelism is a distributed model parallelization technique that splits the model vertically across its layers, distributing different parts of the model across multiple devices. -With this feature when running a model that does not fit on a single node with tensor parallelism and requires multi-node solution we can split the model vertically across its layers and distribute the slices across available nodes. 
-For example if we have two nodes - 8 HPUs each, we no longer have to use tensor_parallel_size=16 but instead we can set tensor_parallel_size-8 with pipeline_parallel_size=2. -Because pipeline parallelism runs pp_size amount of virtual engines on each device we have to accordingly lower max_num_seqs since it acts as a micro batch for each virtual engine. +With this feature, when running a model that does not fit on a single node with tensor parallelism and requires a multi-node solution, we can split the model vertically across its layers and distribute the slices across available nodes. +For example, if we have two nodes, each with 8 HPUs, we no longer have to use `tensor_parallel_size=16` but can instead set `tensor_parallel_size=8` with pipeline_parallel_size=2. +Because pipeline parallelism runs a `pp_size` number of virtual engines on each device, we have to lower `max_num_seqs` accordingly, since it acts as a micro batch for each virtual engine. + +## Running Pipeline Parallelism -## How to run +The following example shows how to use Pipeline parallelism with vLLM on HPU: ```bash vllm serve --device hpu --tensor-parallel-size 8 --pipeline_parallel_size 2 --distributed-executor-backend ray ``` > [!NOTE] -> Currently pipeline parallelism on lazy mode requires: PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0. +> Currently, pipeline parallelism on Lazy mode requires the `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0` flag. -# Multi-node support +# Multi-node Support -vLLM works with multi-node environment setup via Ray. To run models on multiple nodes run following steps: +vLLM works with a multi-node environment setup via Ray. To run models on multiple nodes, follow the procedure below. -## 1. Pre-requisites, follow the steps on all nodes: +## Prerequisites +Perform the following on all nodes: -- Install latest [vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm) +- Install the latest [vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm). -- Check if all Gaudi NIC ports are up. +- Check if all Gaudi NIC ports are up by running: > [!NOTE] > Following commands should be run on the host and NOT inside the container. @@ -555,7 +558,7 @@ cd /opt/habanalabs/qual/gaudi2/bin # Give it a minute for the NIC's to flip and check the status again ``` -- Set following envs: +- Set the following flags: ```bash # Check the network interface for outbound/inbound comms. Command 'ip a' or 'ifconfig' should list all the interfaces @@ -563,30 +566,30 @@ export GLOO_SOCKET_IFNAME=eth0 export HCCL_SOCKET_IFNAME=eth0 ``` -### 2. Start Ray on head node: +## 1. Start Ray on the head node: ```bash ray start --head --port=6379 ``` -#### 3. Add workers to the Ray cluster: +## 2. Add workers to the Ray cluster: ```bash ray start --address=':6379' ``` -#### 4. Start vLLM server: +## 3. Start the vLLM server: ```bash vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray ``` > [!NOTE] -> Running FP8 models with multi-node setup has been described in the documentation of FP8 calibration procedure: [README](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md) +> Running FP8 models with a multi-node setup is described in the documentation of FP8 calibration procedure: [README](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md). 
# Other Online Serving Examples -Please refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Benchmarking_on_vLLM/Online_Static#quick-start) of static-batched online serving example scripts designed to help the user reproduce performance numbers with vLLM on Gaudi for various types of models and varying context lengths. This a list of the models and examples scripts provided for 2K and 4K context length scenarios: +Please refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Benchmarking_on_vLLM/Online_Static#quick-start) of static-batched online serving example scripts designed to help the user reproduce performance numbers with vLLM on Gaudi for various types of models and varying context lengths. Below is a list of the models and example scripts provided for 2K and 4K context length scenarios: - deepseek-r1-distill-llama-70b_gaudi3_1.20_contextlen-2k - deepseek-r1-distill-llama-70b_gaudi3_1.20_contextlen-4k - llama-3.1-70b-instruct_gaudi3_1.20_contextlen-2k @@ -599,7 +602,7 @@ Please refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tr # Troubleshooting The following steps address Out of Memory related errors: -- Increase gpu_memory_utilization - This addresses insufficient overall memory. The vLLM pre-allocates HPU cache by using gpu_memory_utilization% of device memory. By default, gpu_memory_utilization is set to 0.9. By increasing this utilization, you can provide more KV cache space. -- Decrease max_num_seqs or max_num_batched_tokens - This may reduce the number of concurrent requests in a batch, thereby requiring less KV cache space when overall usable memory is limited. -- Increase tensor_parallel_size - This approach shards model weights, so each GPU has more memory available for KV cache. -- For maximizing memory available for KV cache, you can disable `HPUGraph` completely. With HPU Graphs disabled, you are trading latency and throughput at lower batches for potentially higher throughput on higher batches. You can do that by adding `--enforce-eager` flag to the server (for online inference), or by passing `enforce_eager=True` argument to LLM constructor (for offline inference). +- Increase `gpu_memory_utilization` - This addresses insufficient overall memory. vLLM pre-allocates the HPU cache using `gpu_memory_utilization`% of device memory. By default, `gpu_memory_utilization` is set to 0.9. By increasing this utilization, you can provide more KV cache space. +- Decrease `max_num_seqs` or `max_num_batched_tokens` - This may reduce the number of concurrent requests in a batch, thereby requiring less KV cache space when overall usable memory is limited. +- Increase `tensor_parallel_size` - This approach shards model weights, so each GPU has more memory available for KV cache. +- To maximize the memory available for the KV cache, you can disable `HPUGraph` completely. With HPU Graphs disabled, you are trading latency and throughput at lower batches for potentially higher throughput on higher batches. You can do that by adding the `--enforce-eager` flag to the server (for online inference), or by passing the `enforce_eager=True` argument to the LLM constructor (for offline inference). A combined example is sketched below.
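As a reference point, here is a hedged sketch of how the mitigations above might be combined when launching the server; the model name and values are illustrative assumptions rather than tuned settings.

```bash
# Illustrative OOM-mitigation example for online serving on HPU.
# Give more device memory to the KV cache and cap the batch limits;
# --enforce-eager disables HPU Graphs, trading some latency for extra free memory.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device hpu \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64 \
    --max-num-batched-tokens 8192 \
    --enforce-eager
```

If the model still does not fit, increasing `--tensor-parallel-size` on a multi-HPU system shards the weights and frees additional memory for the KV cache, as noted above.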