vllm-project · hsliuustc0106 · Jan 6, 2026
@@ -33,6 +33,8 @@ nav:
       - Parallelism Acceleration: user_guide/acceleration/parallelism_acceleration.md
   - Models:
     - models/supported_models.md
+  - Features:
+    - Sleep Mode: features/sleep_mode.md
 - Developer Guide:
   - General:
     - contributing/README.md

@@ -0,0 +1,39 @@
+# Sleep Mode
+
+vLLM-Omni’s **Sleep Mode** allows you to temporarily release most of the GPU memory used by a model, such as model weights and (for autoregressive models) key-value (KV) caches, **without stopping the server or unloading the Docker container**.
+
+This feature is inherited from [vLLM’s Sleep Mode](https://blog.vllm.ai/2025/10/26/sleep-mode.html), which provides zero-reload model switching for multi-model serving.  
+
+It is especially useful in **RLHF**, **training**, or **cost-saving scenarios**, where GPU resources must be freed between inference workloads.
+
+---
+
+## Omni Model
+
+Omni models inherit this feature from vLLM’s Sleep Mode.
+
+This means:
+
+- Both Level 1 and Level 2 sleep are supported, allowing both model weights and the KV cache to be released and reset.
+
+## Diffusion Model Extension
+
+We added Sleep Mode support for **diffusion models**, which previously lacked this functionality.  
+In diffusion pipelines, this currently only offloads **model weight memory**, as these models typically do not use KV caches.
+
+This means:
+
+- Diffusion models can now enter Level 1 sleep.
+- Pipeline states (e.g., noise schedulers, buffers) remain intact after waking.
+- Useful for releasing VRAM between image generation or training cycles.
+
+---
+
+## Enable sleep mode
+To enable sleep mode, set `enable_sleep_mode` to `True` in `engine_args`.
+
+
+Example:
+```python
+omni = Omni(model=..., enable_sleep_mode=True)
+```
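
The lifecycle this page describes (sleep at Level 1, then wake up before serving again) can be sketched with a small stand-in class. This is only an illustration under stated assumptions: `StandInWorker` and its `generate` method are hypothetical names, no GPU memory is touched, and raising an error on unsupported levels or tags is an assumption rather than confirmed vLLM-Omni behaviour.

```python
from __future__ import annotations


class StandInWorker:
    """Hypothetical stand-in that mimics the documented sleep/wake_up contract.

    NOT the vLLM-Omni worker: state is tracked with plain booleans, and the
    error handling below is an assumption made for illustration only.
    """

    SUPPORTED_TAGS = ("weights",)

    def __init__(self) -> None:
        self.weights_on_device = True
        self.sleeping = False

    def sleep(self, level: int = 1) -> bool:
        # The diffusion worker documents level 1 as the only supported level.
        if level != 1:
            raise ValueError("only level 1 sleep is modelled here")
        self.weights_on_device = False
        self.sleeping = True
        return True

    def wake_up(self, tags: list[str] | None = None) -> bool:
        # None means "reallocate everything", i.e. every supported tag.
        requested = list(self.SUPPORTED_TAGS) if tags is None else tags
        unknown = [t for t in requested if t not in self.SUPPORTED_TAGS]
        if unknown:
            raise ValueError(f"unsupported tags: {unknown}")
        if "weights" in requested:
            self.weights_on_device = True
        # The worker is usable again only once its weights are back.
        self.sleeping = not self.weights_on_device
        return True

    def generate(self, prompt: str) -> str:
        # Callers must not send requests while the worker is asleep.
        if self.sleeping:
            raise RuntimeError("worker is asleep; call wake_up() first")
        return f"output for {prompt!r}"


worker = StandInWorker()
worker.sleep(level=1)   # release the (simulated) weight memory
worker.wake_up()        # tags=None restores everything
result = worker.generate("a cat")
```

The stand-in makes the ordering constraint explicit: a request issued between `sleep` and `wake_up` fails, which mirrors the docstring's requirement that the caller guarantees no requests are processed while the worker sleeps.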
@@ -143,6 +143,16 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
         return self.pipeline.load_weights(weights)
 
     def sleep(self, level: int = 1) -> bool:
+        """
+        Put the worker to sleep. The worker should not process any requests.
+        The caller should guarantee that no requests are being processed
+        during the sleep period, before `wake_up` is called.
+
+        Args:
+            level: The sleep level. Level 1 sleep offloads the model
+                weights and discards the KV cache.
+                Currently, only level 1 is supported.
+        """
         from vllm.device_allocator.cumem import CuMemAllocator
 
         free_bytes_before_sleep = torch.cuda.mem_get_info()[0]
@@ -166,6 +176,17 @@ def sleep(self, level: int = 1) -> bool:
         return True
 
     def wake_up(self, tags: list[str] | None = None) -> bool:
+        """
+        Wake up the worker from sleep mode. See the `sleep`
+        method for more details.
+
+        Args:
+            tags: An optional list of tags to reallocate the worker memory
+                for specific memory allocations. Values must be in
+                `("weights",)`. If None, all memory is reallocated.
+                wake_up should be called with all tags (or None) before the
+                worker is used again.
+        """
         from vllm.device_allocator.cumem import CuMemAllocator
 
         allocator = CuMemAllocator.get_instance()