6 changes: 5 additions & 1 deletion doc/source/serve/llm/benchmarks.md
@@ -1,3 +1,7 @@
# Benchmarks

Performance in LLM serving depends heavily on your specific workload characteristics and hardware stack. From a Ray Serve perspective, the focus is on orchestration overhead and the effectiveness of serving pattern implementations. The Ray team maintains the [ray-serve-llm-perf-examples](https://github.com/anyscale/ray-serve-llm-perf-examples) repository with benchmarking snapshots, tooling, and lessons learned. These benchmarks validate the correctness and effectiveness of different serving patterns. You can use these benchmarks to validate your production stack more systematically.

## Replica Startup Latency

Replica startup for large models can be slow, leading to slow autoscaling and poor response to changing workloads. Experiments on replica startup can be found in [the replica_initialization examples](https://github.com/anyscale/ray-serve-llm-perf-examples/tree/master/replica_initialization). The experiments illustrate the effects of the various techniques described in [this guide](./user-guides/deployment-initialization.md), primarily targeting the latency cost of model loading and Torch Compile. As models grow larger, the effects of these optimizations become increasingly pronounced. For example, we see a nearly 3.88x reduction in startup latency on `Qwen/Qwen3-235B-A22B`.
@@ -1,47 +1,25 @@
(model-loading-guide)=
# Model loading
(deployment-initialization-guide)=
# Deployment Initialization

Configure model loading from Hugging Face, remote storage, or gated repositories.
The initialization phase of a `serve.llm` deployment involves many steps, including preparing model weights, initializing the engine (vLLM), and Ray Serve replica autoscaling overhead. A detailed breakdown of the steps involved in using `serve.llm` with vLLM follows.

Ray Serve LLM supports loading models from multiple sources:
## Startup Breakdown
- **Provisioning Nodes**: If a GPU node isn't available, a new instance must be provisioned.
- **Image Download**: Downloading the image to the target instance incurs latency correlated with image size.
- **Fixed Ray/Node Initialization**: Ray and vLLM incur some fixed overhead when spawning new processes for a new replica, including importing large libraries (such as vLLM) and preparing model and engine configurations.
- **Model Loading**: Retrieving the model from Hugging Face or cloud storage, including the time spent downloading the model and moving it to GPU memory.
- **Torch Compile**: `torch.compile` is integral to vLLM's design and is enabled by default.
- **Memory Profiling**: vLLM runs some inference on the model to determine the amount of available memory it can dedicate to the KV cache.
- **CUDA Graph Capture**: vLLM captures CUDA graphs for different input sizes ahead of time. More details are [here](https://docs.vllm.ai/en/latest/design/cuda_graphs.html).
- **Warmup**: Initializing the KV cache and running model inference.

- **Hugging Face Hub**: Load models directly from Hugging Face (default)
- **Remote storage**: Load from S3 or GCS buckets
- **Gated models**: Access private or gated Hugging Face models with authentication

You configure model loading through the `model_loading_config` parameter in `LLMConfig`.

## Load from Hugging Face Hub
This guide provides an overview of the ways to customize your deployment initialization.

By default, Ray Serve LLM loads models from Hugging Face Hub. Specify the model source with `model_source`:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    accelerator_type="A10G",
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

### Fast download from Hugging Face

Enable fast downloads with Hugging Face's `hf_transfer` library:

1. Install the library:

```bash
pip install hf_transfer
```
## Model Loading from Hugging Face

2. Set the `HF_HUB_ENABLE_HF_TRANSFER` environment variable:
By default, Ray Serve LLM loads models from Hugging Face Hub. Specify the model source with `model_source`:

```python
from ray import serve
@@ -53,20 +31,13 @@ llm_config = LLMConfig(
model_source="meta-llama/Meta-Llama-3-8B-Instruct",
),
accelerator_type="A10G",
runtime_env=dict(
env_vars={
"HF_HUB_ENABLE_HF_TRANSFER": "1"
}
),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

You can also use third-party integrations for streaming models directly to GPU, such as Run:ai Model Streamer.

## Load gated models
### Load gated models

Gated Hugging Face models require authentication. Pass your Hugging Face token through the `runtime_env`:

@@ -113,7 +84,42 @@ ray.init(
```


## Load from remote storage

### Fast download from Hugging Face

Enable fast downloads with Hugging Face's `hf_transfer` library:

1. Install the library:

```bash
pip install hf_transfer
```

2. Set the `HF_HUB_ENABLE_HF_TRANSFER` environment variable:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    accelerator_type="A10G",
    runtime_env=dict(
        env_vars={
            "HF_HUB_ENABLE_HF_TRANSFER": "1"
        }
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```


## Model Loading from Remote Storage

Load models from S3 or GCS buckets instead of Hugging Face. This is useful for:

@@ -229,6 +235,69 @@ llm_config = LLMConfig(

Use EC2 instance profiles or EKS service accounts with appropriate S3 read permissions.
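
As a quick sanity check that the replica's credentials can actually read the bucket, you can list the model prefix with the AWS CLI (bucket and prefix are illustrative):

```bash
# Lists the model files if the instance profile or service account has read access.
aws s3 ls s3://your-bucket/Meta-Llama-3-8B-Instruct/
```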


### S3 and RunAI Streamer
S3 can be combined with RunAI Streamer, an extension in vLLM that enables streaming the model weights directly from remote cloud storage into GPU memory, improving model load latency. More details can be found [here](https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html).

```python
llm_config = LLMConfig(
    ...
    model_loading_config={
        "model_id": "llama",
        "model_source": "s3://your-bucket/Meta-Llama-3-8B-Instruct",
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "load_format": "runai_streamer",
    },
    ...
)
```

### Model Sharding
Modern LLMs often outgrow the memory capacity of a single GPU, requiring tensor parallelism to split computation across multiple devices. In this paradigm, each GPU stores only a subset of the weights, and model sharding ensures that each device only loads its relevant portion of the model. By sharding the model files in advance, you can reduce load times significantly, since GPUs avoid loading unneeded weights. vLLM provides a utility script for this purpose: [save_sharded_state.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/save_sharded_state.py).
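
A minimal sketch of invoking the script, assuming a 4-way tensor-parallel deployment and an output directory of your choosing (flags can vary between vLLM versions; check the script's `--help`):

```bash
# Save a pre-sharded copy of the checkpoint so each tensor-parallel rank
# only loads its own slice of the weights.
python save_sharded_state.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 4 \
    --output /models/llama-3-8b-sharded
```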

Once the sharded weights have been saved, upload them to S3 and use the RunAI streamer with the `runai_streamer_sharded` load format to load them:

```python
llm_config = LLMConfig(
    ...
    engine_kwargs={
        "tensor_parallel_size": 4,
        "load_format": "runai_streamer_sharded",
    },
    ...
)
```

## Additional Optimizations

### Torch Compile Cache
`torch.compile` incurs some latency during initialization. You can mitigate this by reusing the torch compile cache that vLLM generates automatically. To find the cache location, run vLLM and look for a log line like the following:
```
(RayWorkerWrapper pid=126782) INFO 10-15 11:57:04 [backends.py:608] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9/rank_1_0/backbone for vLLM's torch.compile
```

In this example, the cache folder is located at `/home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9`. Upload this directory to your S3 bucket so that new replicas can retrieve it at startup. Ray Serve LLM provides a custom utility, the `CloudDownloader` callback, to download the compile cache from cloud storage: specify it in `LLMConfig`, supply the relevant arguments, and make sure that `cache_dir` in `compilation_config` points at the downloaded location.
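
A minimal sketch of the upload with the AWS CLI, assuming you own the bucket used in the example below (`s3://samplebucket/llama-3-8b-cache` is illustrative):

```bash
# Push the locally generated compile cache to cloud storage so new replicas can reuse it.
aws s3 sync /home/ray/.cache/vllm/torch_compile_cache/131ee5c6d9 \
    s3://samplebucket/llama-3-8b-cache
```

The `CloudDownloader` configuration below then pulls the cache back down at replica startup: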

```python
llm_config = LLMConfig(
    ...
    callback_config={
        "callback_class": "ray.llm._internal.common.callbacks.cloud_downloader.CloudDownloader",
        "callback_kwargs": {"paths": [("s3://samplebucket/llama-3-8b-cache", "/home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache")]},
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "compilation_config": {
            "cache_dir": "/home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache",
        }
    },
    ...
)
```
NOTE: `CloudDownloader` is a callback that isn't public yet. We plan to make it public after stabilizing the API and incorporating user feedback. In the meantime, the compile cache can be retrieved using any preferred method, as long as the path to the cache is set in `compilation_config`.
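
For example, a minimal sketch of fetching the cache manually with the AWS CLI (same illustrative paths as above) before the engine starts:

```bash
# Pull the previously uploaded compile cache to the directory referenced by compilation_config.
aws s3 sync s3://samplebucket/llama-3-8b-cache \
    /home/ray/.cache/vllm/torch_compile_cache/llama-3-8b-cache
```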

## Best practices

### Model source selection
@@ -248,14 +317,15 @@ Use EC2 instance profiles or EKS service accounts with appropriate S3 read permissions.
- **Co-locate storage and compute** in the same cloud region to reduce latency and egress costs.
- **Use fast download** (`HF_HUB_ENABLE_HF_TRANSFER`) for models larger than 10GB.
- **Cache models** locally if you're repeatedly deploying the same model.
- **See benchmarks** [here](../benchmarks.md) for detailed measurements of these optimizations.

## Troubleshooting

### Slow downloads from Hugging Face

- Install `hf_transfer`: `pip install hf_transfer`
- Set `HF_HUB_ENABLE_HF_TRANSFER=1` in `runtime_env`
- Consider moving the model to S3/GCS in your cloud region
- Consider moving the model to S3/GCS in your cloud region, streaming it with the RunAI streamer, and sharding large models

### S3/GCS access errors

2 changes: 1 addition & 1 deletion doc/source/serve/llm/user-guides/index.md
@@ -5,7 +5,7 @@ How-to guides for deploying and configuring Ray Serve LLM features.
```{toctree}
:maxdepth: 1

Model loading <model-loading>
Deployment Initialization <deployment-initialization>
Prefill/decode disaggregation <prefill-decode>
Prefix-aware routing <prefix-aware-routing>
Multi-LoRA deployment <multi-lora>