2 changes: 2 additions & 0 deletions docs/.nav.yml
@@ -33,6 +33,8 @@ nav:
- Parallelism Acceleration: user_guide/acceleration/parallelism_acceleration.md
- Models:
- models/supported_models.md
- Features:
- Sleep Mode: features/sleep_mode.md
- Developer Guide:
- General:
- contributing/README.md
39 changes: 39 additions & 0 deletions docs/features/sleep_mode.md
@@ -0,0 +1,39 @@
# Sleep Mode

vLLM-Omni's **Sleep Mode** lets you temporarily release most of the GPU memory a model uses, such as model weights and (for autoregressive models) key-value (KV) caches, **without stopping the server or shutting down the Docker container**.

This feature is inherited from [vLLM’s Sleep Mode](https://blog.vllm.ai/2025/10/26/sleep-mode.html), which provides zero-reload model switching for multi-model serving.

It is especially useful in **RLHF**, **training**, or **cost-saving scenarios**, where GPU resources must be freed between inference workloads.

---

## Omni Model

Omni models inherit this feature from vLLM's Sleep Mode.

This means:

- Both Level 1 and Level 2 sleep are supported, so both model weights and the KV cache can be released and reset.
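
The difference between the two levels can be illustrated with a toy model. This is only a sketch under assumptions: `ToyEngineMemory` is a hypothetical stand-in, not a vLLM or vLLM-Omni class, and it follows the level semantics described in upstream vLLM's Sleep Mode documentation (Level 1 keeps a CPU copy of the weights, Level 2 discards them).

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngineMemory:
    """Hypothetical stand-in for engine memory; NOT a vLLM or vLLM-Omni class."""

    gpu: dict = field(default_factory=lambda: {"weights": "W", "kv_cache": "KV"})
    cpu_backup: dict = field(default_factory=dict)

    def sleep(self, level: int = 1) -> None:
        if level not in (1, 2):
            raise ValueError(f"unsupported sleep level: {level}")
        if level == 1:
            # Level 1: keep a CPU copy of the weights so waking is cheap.
            self.cpu_backup["weights"] = self.gpu["weights"]
        # Both levels release the GPU copies of the weights and the KV cache.
        self.gpu.clear()

    def wake_up(self) -> None:
        # Weights come back only if a CPU copy exists (Level 1).
        # After Level 2 the caller has to reload or re-sync the weights.
        self.gpu["weights"] = self.cpu_backup.pop("weights", None)
        # The KV cache is always recreated empty.
        self.gpu["kv_cache"] = ""
```

In this toy model, waking from Level 1 restores the weights, while waking from Level 2 leaves them unset until the caller loads new ones, which is why Level 2 tends to suit workflows that update weights between runs.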

## Diffusion Model Extension

We added Sleep Mode support for **diffusion models**, which previously lacked this functionality.
In diffusion pipelines, this currently only offloads **model weight memory**, as these models typically do not use KV caches.

This means:

Review comment (Member):

These are not rendered correctly
https://vllm--660.org.readthedocs.build/projects/vllm-omni/en/660/features/sleep_mode/

This means: - Diffusion models can now enter Level 1 sleep. - Pipeline states (e.g., noise schedulers, buffers) remain intact after waking. - Useful for releasing VRAM between image generation or training cycles.

Reply from knlnguyen1802 (Contributor Author), Jan 6, 2026:

- Diffusion models can now enter Level 1 sleep.
- Pipeline states (e.g., noise schedulers, buffers) remain intact after waking.
- Useful for releasing VRAM between image generation or training cycles.

---

## Enable sleep mode
To enable Sleep Mode, set `enable_sleep_mode` in `engine_args` to `True`.


Example:
```python
omni = Omni(model=..., enable_sleep_mode=True)
```
21 changes: 21 additions & 0 deletions vllm_omni/diffusion/worker/gpu_worker.py
@@ -143,6 +143,16 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        return self.pipeline.load_weights(weights)

    def sleep(self, level: int = 1) -> bool:
        """
        Put the worker to sleep. The worker should not process any requests.
        The caller should guarantee that no requests are being processed
        during the sleep period, before `wake_up` is called.

        Args:
            level: The sleep level. Level 1 sleep will offload the model
                weights and discard the kv cache.
                Currently only level 1 is supported.
        """
        from vllm.device_allocator.cumem import CuMemAllocator

        free_bytes_before_sleep = torch.cuda.mem_get_info()[0]
@@ -166,6 +176,17 @@ def sleep(self, level: int = 1) -> bool:
        return True

    def wake_up(self, tags: list[str] | None = None) -> bool:
        """
        Wake up the worker from sleep mode. See the `sleep` method
        for more details.

        Args:
            tags: An optional list of tags to reallocate the worker memory
                for specific memory allocations. Values must be in
                `("weights",)`. If None, all memory is reallocated.
                wake_up should be called with all tags (or None) before the
                worker is used again.
        """
        from vllm.device_allocator.cumem import CuMemAllocator

        allocator = CuMemAllocator.get_instance()
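
The calling contract described in the two docstrings above (no requests while asleep, and `wake_up` with all tags or `None` before reuse) can be sketched without a GPU. Everything here is an assumption for illustration: `ToyWorker` and its `generate` method are hypothetical and do not reproduce the PR's `gpu_worker.py` implementation or the `CuMemAllocator` calls.

```python
from __future__ import annotations


class ToyWorker:
    """Hypothetical stand-in for a diffusion worker; no GPU or vLLM needed."""

    VALID_TAGS = ("weights",)

    def __init__(self) -> None:
        # Tags whose memory is currently released.
        self.asleep_tags: set[str] = set()

    def sleep(self, level: int = 1) -> bool:
        if level != 1:
            raise ValueError("this sketch only models level 1 sleep")
        self.asleep_tags = set(self.VALID_TAGS)
        return True

    def wake_up(self, tags: list[str] | None = None) -> bool:
        # None means "reallocate everything", mirroring the docstring.
        requested = set(self.VALID_TAGS) if tags is None else set(tags)
        unknown = requested - set(self.VALID_TAGS)
        if unknown:
            raise ValueError(f"unknown tags: {sorted(unknown)}")
        self.asleep_tags -= requested
        return True

    def generate(self, prompt: str) -> str:
        # The real worker relies on the caller to enforce this; the sketch
        # raises instead so that misuse is visible.
        if self.asleep_tags:
            raise RuntimeError(f"still asleep: {sorted(self.asleep_tags)}")
        return f"image for {prompt!r}"
```

A caller would typically pair the two calls around a phase that needs the freed memory, for example `worker.sleep()`, run the training step, then `worker.wake_up()` before sending the next request.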