6 changes: 5 additions & 1 deletion doc/source/serve/llm/benchmarks.md
@@ -1,3 +1,7 @@
# Benchmarks

Performance in LLM serving depends heavily on your specific workload characteristics and hardware stack. From a Ray Serve perspective, the focus is on orchestration overhead and the effectiveness of serving pattern implementations. The Ray team maintains the [ray-serve-llm-perf-examples](https://github.com/anyscale/ray-serve-llm-perf-examples) repository with benchmarking snapshots, tooling, and lessons learned. These benchmarks validate the correctness and effectiveness of different serving patterns. You can use these benchmarks to validate your production stack more systematically.

## Replica Startup Latency

Replica startup for large models can be slow, leading to slow autoscaling and poor response to changing workloads. Experiments on replica startup can be found in [the replica_initialization examples](https://github.com/anyscale/ray-serve-llm-perf-examples/tree/master/replica_initialization). The experiments illustrate the effects of the various techniques described in [this guide](./user-guides/deployment-initialization.md), primarily targeting the latency cost of model loading and Torch Compile. As models grow larger, the effects of these optimizations become increasingly pronounced. For example, we see a nearly 3.88x reduction in startup latency on `Qwen/Qwen3-235B-A22B`.
@@ -1,47 +1,25 @@
(model-loading-guide)=
# Model loading
(deployment-initialization-guide)=
# Deployment Initialization

Configure model loading from Hugging Face, remote storage, or gated repositories.
The initialization phase of a `serve.llm` deployment involves many steps, including preparing model weights, initializing the engine (vLLM), and Ray Serve replica autoscaling overhead. A detailed breakdown of the steps involved in using `serve.llm` with vLLM follows.

Ray Serve LLM supports loading models from multiple sources:
## Startup Breakdown
- **Provisioning Nodes**: If a GPU node isn't available, a new instance must be provisioned.
- **Image Download**: Downloading the image to the target instance incurs latency correlated with image size.
- **Fixed Ray/Node Initialization**: Ray and vLLM incur some fixed overhead when spawning new processes for a new replica, including importing large libraries (such as vLLM) and preparing model and engine configurations.
- **Model Loading**: Retrieving the model from Hugging Face or cloud storage, including the time spent downloading the model and moving it to GPU memory.
- **Torch Compile**: `torch.compile` is integral to vLLM's design and is enabled by default.
- **Memory Profiling**: vLLM runs some inference on the model to determine the amount of available memory it can dedicate to the KV cache.
- **CUDA Graph Capture**: vLLM captures CUDA graphs for different input sizes ahead of time. More details are [here](https://docs.vllm.ai/en/latest/design/cuda_graphs.html).
- **Warmup**: Initializing the KV cache and running model inference.

- **Hugging Face Hub**: Load models directly from Hugging Face (default)
- **Remote storage**: Load from S3 or GCS buckets
- **Gated models**: Access private or gated Hugging Face models with authentication

You configure model loading through the `model_loading_config` parameter in `LLMConfig`.

## Load from Hugging Face Hub
This guide provides an overview of the ways to customize your deployment initialization.

By default, Ray Serve LLM loads models from Hugging Face Hub. Specify the model source with `model_source`:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    accelerator_type="A10G",
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

### Fast download from Hugging Face

Enable fast downloads with Hugging Face's `hf_transfer` library:

1. Install the library:

```bash
pip install hf_transfer
```
## Model Loading from Hugging Face

2. Set the `HF_HUB_ENABLE_HF_TRANSFER` environment variable:
By default, Ray Serve LLM loads models from Hugging Face Hub. Specify the model source with `model_source`:

```python
from ray import serve
@@ -53,20 +31,13 @@ llm_config = LLMConfig(
model_source="meta-llama/Meta-Llama-3-8B-Instruct",
),
accelerator_type="A10G",
runtime_env=dict(
env_vars={
"HF_HUB_ENABLE_HF_TRANSFER": "1"
}
),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

You can also use third-party integrations for streaming models directly to GPU, such as Run:ai Model Streamer.

## Load gated models
### Load gated models

Gated Hugging Face models require authentication. Pass your Hugging Face token through the `runtime_env`:

@@ -113,7 +84,42 @@ ray.init(
```


## Load from remote storage

### Fast download from Hugging Face

Enable fast downloads with Hugging Face's `hf_transfer` library:

1. Install the library:

```bash
pip install hf_transfer
```

2. Set the `HF_HUB_ENABLE_HF_TRANSFER` environment variable:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    accelerator_type="A10G",
    runtime_env=dict(
        env_vars={
            "HF_HUB_ENABLE_HF_TRANSFER": "1"
        }
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```


## Model Loading from Remote Storage

Load models from S3 or GCS buckets instead of Hugging Face. This is useful for:

@@ -229,6 +235,69 @@ llm_config = LLMConfig(

Use EC2 instance profiles or EKS service accounts with appropriate S3 read permissions.
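
As a quick sanity check that the replica's credentials can actually read the bucket, you can list the model prefix with the AWS CLI (bucket and prefix are illustrative):

```bash
# Lists the model files if the instance profile or service account has read access.
aws s3 ls s3://your-bucket/Meta-Llama-3-8B-Instruct/
```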


### S3 and RunAI Streamer
S3 can be combined with RunAI Streamer, an extension in vLLM that enables streaming the model weights directly from remote cloud storage into GPU memory, improving model load latency. More details can be found [here](https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html).

```python
llm_config = LLMConfig(
    ...
    model_loading_config={
        "model_id": "llama",
        "model_source": "s3://your-bucket/Meta-Llama-3-8B-Instruct",
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "load_format": "runai_streamer",
    },
    ...
)
```

### Model Sharding
Modern LLMs often outgrow the memory capacity of a single GPU, requiring tensor parallelism to split computation across multiple devices. In this paradigm, each GPU stores only a subset of the weights, and model sharding ensures that each device only loads its relevant portion of the model. By sharding the model files in advance, you can reduce load times significantly, since GPUs avoid loading unneeded weights. vLLM provides a utility script for this purpose: [save_sharded_state.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/save_sharded_state.py).
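
A minimal sketch of invoking the script, assuming a 4-way tensor-parallel deployment and an output directory of your choosing (flags can vary between vLLM versions; check the script's `--help`):

```bash
# Save a pre-sharded copy of the checkpoint so each tensor-parallel rank
# only loads its own slice of the weights.
python save_sharded_state.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 4 \
    --output /models/llama-3-8b-sharded
```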

Once the sharded weights have been saved, upload them to S3 and use the RunAI streamer with the `runai_streamer_sharded` load format to load them:

```python
llm_config = LLMConfig(
    ...
    engine_kwargs={
        "tensor_parallel_size": 4,
        "load_format": "runai_streamer_sharded",
    },
    ...
)
```

## Additional Optimizations

### Torch Compile Cache
`torch.compile` incurs some latency during initialization. You can mitigate this by reusing the torch compile cache that vLLM generates automatically. To find the cache location, run vLLM and look for a log line like the following:
```
(RayWorkerWrapper pid=126782) INFO 10-15 11:57:04 [backends.py:608] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9/rank_1_0/backbone for vLLM's torch.compile
```

In this example, the cache folder is located at `/home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9`. Upload this directory to your S3 bucket so that new replicas can retrieve it at startup. Ray Serve LLM provides a custom utility, the `CloudDownloader` callback, to download the compile cache from cloud storage: specify it in `LLMConfig`, supply the relevant arguments, and make sure that `cache_dir` in `compilation_config` points at the downloaded location.
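
A minimal sketch of the upload with the AWS CLI, assuming you own the bucket used in the example below (`s3://samplebucket/llama-3-8b-cache` is illustrative):

```bash
# Push the locally generated compile cache to cloud storage so new replicas can reuse it.
aws s3 sync /home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9 \
    s3://samplebucket/llama-3-8b-cache
```

The `CloudDownloader` configuration below then pulls the cache back down at replica startup: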

```python
llm_config = LLMConfig(
    ...
    callback_config={
        "callback_class": "ray.llm._internal.common.callbacks.cloud_downloader.CloudDownloader",
        "callback_kwargs": {"paths": [("s3://samplebucket/llama-3-8b-cache", "/home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache")]},
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "compilation_config": {
            "cache_dir": "/home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache",
        }
    },
    ...
)
```
NOTE: `CloudDownloader` is a callback that isn't public yet. We plan to make it public after stabilizing the API and incorporating user feedback. In the meantime, the compile cache can be retrieved using any preferred method, as long as the path to the cache is set in `compilation_config`.
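
For example, a minimal sketch of fetching the cache manually with the AWS CLI (same illustrative paths as above) before the engine starts:

```bash
# Pull the previously uploaded compile cache to the directory referenced by compilation_config.
aws s3 sync s3://samplebucket/llama-3-8b-cache \
    /home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache
```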

## Best practices

### Model source selection
@@ -248,14 +317,15 @@ Use EC2 instance profiles or EKS service accounts with appropriate S3 read permissions.
- **Co-locate storage and compute** in the same cloud region to reduce latency and egress costs.
- **Use fast download** (`HF_HUB_ENABLE_HF_TRANSFER`) for models larger than 10GB.
- **Cache models** locally if you're repeatedly deploying the same model.
- **See benchmarks** [here](../benchmarks.md) for detailed measurements of these optimizations.

## Troubleshooting

### Slow downloads from Hugging Face

- Install `hf_transfer`: `pip install hf_transfer`
- Set `HF_HUB_ENABLE_HF_TRANSFER=1` in `runtime_env`
- Consider moving the model to S3/GCS in your cloud region
- Consider moving the model to S3/GCS in your cloud region, streaming it with the RunAI streamer, and sharding large models

### S3/GCS access errors

2 changes: 1 addition & 1 deletion doc/source/serve/llm/user-guides/index.md
@@ -5,7 +5,7 @@ How-to guides for deploying and configuring Ray Serve LLM features.
```{toctree}
:maxdepth: 1

Model loading <model-loading>
Deployment Initialization <deployment-initialization>
Prefill/decode disaggregation <prefill-decode>
Prefix-aware routing <prefix-aware-routing>
Multi-LoRA deployment <multi-lora>