vllm-project · ultism · Mar 29, 2026
@@ -1,60 +1,186 @@
 # LoRA (Low-Rank Adaptation) Guide
 
-LoRA (Low-Rank Adaptation) enables fine-tuning diffusion models by adding trainable low-rank matrices to existing model weights. vLLM-Omni currently supports PEFT-style LoRA adapters, allowing you to customize model behavior without modifying the base model weights.
+LoRA (Low-Rank Adaptation) enables fine-tuning diffusion models by adding trainable low-rank matrices to existing model weights. vLLM-Omni supports PEFT-style LoRA adapters, allowing you to customize model behavior without modifying the base model weights.
 
 ## Overview
 
-LoRA adapters are lightweight, model-specific fine-tuning weights that can be dynamically loaded and applied to diffusion models. vLLM-Omni uses a unified LoRA handling mechanism similar to vLLM with LRU cache management.
+vLLM-Omni exposes two complementary LoRA flows for diffusion models:
+
+1. **Init-time LoRA**: a single adapter is pre-loaded when `Omni` starts and is applied to every request. Lowest runtime overhead; best when all requests should share the same adapter.
+2. **Per-request LoRA**: zero or more adapters are attached to each request via `sampling_params.lora_requests`. Supports switching adapters between requests and composing multiple adapters in a single forward pass (multi-LoRA).
+
+Adapters are managed by an LRU cache so repeated activations avoid redundant weight reloads.
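
To illustrate the LRU behaviour described above, the sketch below models a fixed number of adapter slots with `collections.OrderedDict`. The class name `AdapterLRUCache`, the `load_fn` callback, and the eviction details are assumptions made for illustration; this is not vLLM-Omni's actual cache implementation.

```python
from collections import OrderedDict


class AdapterLRUCache:
    """Toy LRU cache keyed by adapter id (hypothetical, for illustration only)."""

    def __init__(self, max_loras, load_fn):
        if max_loras < 1:
            raise ValueError("max_loras must be >= 1")
        self.max_loras = max_loras
        self.load_fn = load_fn  # invoked only on a cache miss
        self._slots = OrderedDict()

    def activate(self, adapter_id):
        if adapter_id in self._slots:
            # Cache hit: mark as most recently used, no reload needed.
            self._slots.move_to_end(adapter_id)
            return self._slots[adapter_id]
        if len(self._slots) >= self.max_loras:
            # Cache full: evict the least recently used adapter.
            self._slots.popitem(last=False)
        weights = self.load_fn(adapter_id)
        self._slots[adapter_id] = weights
        return weights


loads = []


def fake_load(adapter_id):
    loads.append(adapter_id)  # record each simulated weight load
    return f"weights:{adapter_id}"


cache = AdapterLRUCache(max_loras=2, load_fn=fake_load)
for adapter in ["style_a", "style_b", "style_a", "style_c", "style_b"]:
    cache.activate(adapter)

print(loads)  # ['style_a', 'style_b', 'style_c', 'style_b']
```

The second activation of `style_a` is a hit and triggers no load, while `style_b` is evicted to make room for `style_c` and therefore has to be loaded again.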
 
 ## LoRA Adapter Format
 
-LoRA adapters must be in **PEFT (Parameter-Efficient Fine-Tuning)** format. A typical LoRA adapter directory structure:
+LoRA adapters must be in **PEFT (Parameter-Efficient Fine-Tuning)** format. A typical adapter directory:
 
 ```
 lora_adapter/
 ├── adapter_config.json
 └── adapter_model.safetensors
 ```
 
-The `adapter_config.json` file contains metadata about the LoRA adapter, including:
+`adapter_config.json` contains:
 - `r`: LoRA rank
 - `lora_alpha`: LoRA alpha scaling factor
-- `target_modules`: List of module names to apply LoRA to
+- `target_modules`: list of module names the adapter applies to
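
As a quick sanity check before handing an adapter to the engine, the three fields listed above can be read with the standard library. The helper `read_adapter_config` is a hypothetical name, and the demo values are made up; real adapter configs usually contain more fields than these.

```python
import json
import tempfile
from pathlib import Path


def read_adapter_config(adapter_dir):
    """Return (r, lora_alpha, target_modules) from a PEFT adapter directory."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    missing = [k for k in ("r", "lora_alpha", "target_modules") if k not in config]
    if missing:
        raise KeyError(f"adapter_config.json is missing fields: {missing}")
    return config["r"], config["lora_alpha"], config["target_modules"]


# Demo against a throwaway adapter directory; the values are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"r": 16, "lora_alpha": 32, "target_modules": ["to_q", "to_k", "to_v"]}
    (Path(tmp) / "adapter_config.json").write_text(json.dumps(demo), encoding="utf-8")
    r, lora_alpha, target_modules = read_adapter_config(tmp)

print(r, lora_alpha, target_modules)  # 16 32 ['to_q', 'to_k', 'to_v']
```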
+
+!!! note "Server-side Path Requirement"
+    The LoRA adapter path must be readable on the **server** machine. If your client and server are on different hosts, ensure the adapter is accessible via a shared mount or copied to the server.
 
-## Quick Start
 
-### Offline Inference
+## Init-time LoRA
 
-#### Pre-loaded LoRA
+### How It Works
 
-Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests:
+Passing `lora_path` to `Omni(...)` instructs the engine to register a single adapter at startup and activate it as the only adapter for every request. The adapter occupies one slot of the LoRA cache for the lifetime of the process.
+
+### Usage
 
 ```python
 from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
+
+omni = Omni(
+    model="Tongyi-MAI/Z-Image-Turbo",
+    lora_path="/path/to/lora_adapter",
+    lora_scale=1.0,  # optional, default 1.0
+)
+
+outputs = omni.generate(
+    "A piece of cheesecake",
+    OmniDiffusionSamplingParams(height=1024, width=1024, num_inference_steps=9),
+)
+images = outputs[0].request_output.images
+```
+
+The CLI wrapper `examples/offline_inference/text_to_image/text_to_image.py` exposes these two kwargs as `--lora-path` and `--lora-scale`:
+
+```bash
+python examples/offline_inference/text_to_image/text_to_image.py \
+  --model Tongyi-MAI/Z-Image-Turbo \
+  --prompt "A piece of cheesecake" \
+  --lora-path /path/to/lora_adapter \
+  --lora-scale 1.0 \
+  --output outputs/cheesecake.png
+```
+
+### Limitations
+
+- Exactly one adapter, chosen at init. The adapter cannot be swapped or disabled for individual requests — restart `Omni` to change it.
+- Mutually exclusive with `--lora-paths` in the example CLI. Use per-request LoRA when you need different adapters on different requests.
+
+
+## Per-request LoRA
+
+### How It Works
+
+Each request carries its own adapter set via `OmniDiffusionSamplingParams`:
+
+```python
+sampling_params = OmniDiffusionSamplingParams(
+    ...,
+    lora_requests=[req_a, req_b],  # list of LoRARequest
+    lora_scales=[1.0, 0.5],        # same length as lora_requests
+)
+```
+
+- `lora_requests=[]` (or omitted) → no LoRA applied to this request.
+- `lora_requests=[req]` → single adapter at the given scale.
+- `lora_requests=[req_a, req_b, ...]` → multi-LoRA: all listed adapters are activated simultaneously, each in its own cache slot, and their deltas are summed during the forward pass.
+
+The cache is sized by `max_loras` (defaults to 1). Set `Omni(..., max_loras=N)` when you plan to activate up to `N` adapters concurrently — requests exceeding this limit are rejected. The example CLI at `examples/offline_inference/text_to_image/text_to_image.py` auto-sizes this to `max(len(--lora-paths), 1)` when `--max-loras` is omitted.
+
+### Scale Semantics
+
+- `lora_scales[i]` multiplies adapter `i`'s contribution to the output delta.
+- `lora_scales[i] == 0.0` is a registered-but-inactive slot: the adapter remains in the cache but contributes nothing in this forward pass. This is distinct from omitting the adapter from `lora_requests`, which releases the slot.
+- When `lora_requests` is set and `lora_scales` is omitted, every adapter defaults to scale `1.0`.
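
The scale rules above can be sketched in plain Python on toy vectors. The helpers `resolve_scales` and `compose` are hypothetical, and each `delta` stands in for the output contribution of one adapter; the engine applies the equivalent arithmetic to tensors inside the model, not through these functions.

```python
def resolve_scales(num_adapters, lora_scales=None):
    """Default every adapter to 1.0 when scales are omitted; otherwise check the length."""
    if lora_scales is None:
        return [1.0] * num_adapters
    if len(lora_scales) != num_adapters:
        raise ValueError(
            f"expected {num_adapters} scales, got {len(lora_scales)}"
        )
    return list(lora_scales)


def compose(base_out, deltas, lora_scales=None):
    """Return base_out plus the scale-weighted sum of each adapter's delta."""
    scales = resolve_scales(len(deltas), lora_scales)
    out = list(base_out)
    for scale, delta in zip(scales, deltas):
        for i, d in enumerate(delta):
            out[i] += scale * d
    return out


base = [1.0, 2.0]
delta_a = [0.5, -0.5]
delta_b = [2.0, 4.0]

print(compose(base, [delta_a, delta_b], [1.0, 0.5]))  # [2.5, 3.5]
print(compose(base, [delta_a, delta_b], [1.0, 0.0]))  # [1.5, 1.5]
print(compose(base, [delta_a, delta_b]))              # [3.5, 5.5]
```

With a scale of `0.0`, the second adapter is still passed in but leaves the output identical to the single-adapter result.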
+
+### Usage
+
+**Single adapter (per-request):**
+
+```python
+from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
 from vllm_omni.lora.request import LoRARequest
+from vllm_omni.lora.utils import stable_lora_int_id
 
-lora_path="/path/to/lora_adapter"
+omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", max_loras=1)
 
-omni = Omni(
-    model="stabilityai/stable-diffusion-3.5-medium",
-    lora_path=lora_path
+req = LoRARequest(
+    lora_name="style_a",
+    lora_int_id=stable_lora_int_id("/path/to/style_a"),
+    lora_path="/path/to/style_a",
 )
 
-lora_request = LoRARequest(
-    lora_name="preloaded",
-    lora_int_id=1,
-    lora_path=lora_path
+outputs = omni.generate(
+    "A piece of cheesecake",
+    OmniDiffusionSamplingParams(
+        height=1024,
+        width=1024,
+        num_inference_steps=9,
+        lora_requests=[req],
+        lora_scales=[1.0],
+    ),
 )
+```
+
+**Multi-LoRA composition:**
+
+```python
+omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", max_loras=2)
+
+req_a = LoRARequest(lora_name="style_a", lora_int_id=stable_lora_int_id("/lora/a"), lora_path="/lora/a")
+req_b = LoRARequest(lora_name="style_b", lora_int_id=stable_lora_int_id("/lora/b"), lora_path="/lora/b")
 
 outputs = omni.generate(
-    prompt="A piece of cheesecake",
-    lora_request=lora_request,
-    lora_scale=2.0, # optional arg, default 1.0
+    "A piece of cheesecake",
+    OmniDiffusionSamplingParams(
+        height=1024,
+        width=1024,
+        num_inference_steps=9,
+        lora_requests=[req_a, req_b],
+        lora_scales=[1.0, 0.5],
+    ),
 )
 ```
 
-!!! note "Server-side Path Requirement"
-    The LoRA adapter path (`local_path`) must be readable on the **server** machine. If your client and server are on different machines, ensure the LoRA adapter is accessible via a shared mount or copied to the server.
+**Switching adapters between requests** — issue separate `omni.generate(...)` calls with different `OmniDiffusionSamplingParams`. `sampling_params_list` on `omni.generate` is stage-indexed (one entry per pipeline stage) and is shared across all prompts in a batch, so per-prompt adapter variance within a single batch call is not supported through that path.
+
+**CLI:**
+
+The example CLI exposes `--lora-paths` and `--lora-scales` for per-request composition, and `--axis` for Cartesian-product XYZ plots that can put any supported parameter on any of the three axes. Supported axis types are `prompt`, `lora_scale[i]` (the i-th `--lora-paths` entry), `guidance_scale`, `num_inference_steps`, and `seed`. X values become columns, Y values become rows, and Z writes one `grid_z{k}.png` per value.
+
+```bash
+# Compose two adapters on one prompt
+python examples/offline_inference/text_to_image/text_to_image.py \
+  --model Tongyi-MAI/Z-Image-Turbo \
+  --prompt "A piece of cheesecake" \
+  --lora-paths /lora/a /lora/b \
+  --lora-scales 1.0 0.5 \
+  --max-loras 2 \
+  --output-dir outputs/composed/
+
+# 2×2 LoRA-scale grid across 2 prompts (Z): produces grid_z00.png + grid_z01.png
+python examples/offline_inference/text_to_image/text_to_image.py \
+  --model Tongyi-MAI/Z-Image-Turbo \
+  --lora-paths /lora/a /lora/b \
+  --max-loras 2 \
+  --axis "x=lora_scale[0]:0|1" \
+  --axis "y=lora_scale[1]:0|1" \
+  --axis "z=prompt:a girl|a cat" \
+  --output-dir outputs/axis_test/
+```
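
To make the `--axis` grammar concrete, the sketch below parses `NAME=TYPE:v1|v2|...` specs and expands them into the Cartesian product of cells. The functions `parse_axis` and `expand_cells` are hypothetical and only illustrate the format; the example script's own parser may validate types, convert values, or order cells differently. Values are kept as strings here.

```python
import itertools
import re

# NAME is x, y or z; TYPE runs up to the first colon; values are separated by "|".
AXIS_RE = re.compile(r"^(?P<name>[xyz])=(?P<type>[^:]+):(?P<values>.+)$")


def parse_axis(spec):
    """Split one axis spec into (name, type, list of string values)."""
    match = AXIS_RE.match(spec)
    if match is None:
        raise ValueError(f"bad axis spec: {spec!r}")
    return match["name"], match["type"], match["values"].split("|")


def expand_cells(specs):
    """Return one dict per grid cell, mapping axis type to the chosen value."""
    axes = {}
    for spec in specs:
        name, axis_type, values = parse_axis(spec)
        if name in axes:
            raise ValueError(f"axis {name!r} given more than once")
        axes[name] = (axis_type, values)
    names = sorted(axes)  # x, then y, then z
    value_lists = [axes[n][1] for n in names]
    return [
        {axes[n][0]: v for n, v in zip(names, combo)}
        for combo in itertools.product(*value_lists)
    ]


cells = expand_cells(
    ["x=lora_scale[0]:0|1", "y=lora_scale[1]:0|1", "z=prompt:a girl|a cat"]
)
print(len(cells))  # 8
print(cells[0])    # {'lora_scale[0]': '0', 'lora_scale[1]': '0', 'prompt': 'a girl'}
```

The 2 × 2 × 2 spec from the second command yields eight cells, which matches two 2×2 grids, one per prompt.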
+
+### Limitations
+
+- Up to `max_loras` adapters per request. Requests that exceed the limit fail fast before inference.
+- All adapters in one request share the same forward pass, so they must target compatible modules, as declared in each adapter's PEFT `target_modules` field. Adapters that target disjoint modules compose trivially; where modules overlap, the scaled deltas add linearly.
+- `max_loras` sizes the cache at init and is not resizable at runtime.
+
 
 ## Wan2.2 LightX2V Offline Assembly
 

@@ -74,6 +74,7 @@ python text_to_image.py \
 | Argument | Type | Default | Description |
 | -------- | ---- | ------- | ----------- |
 | `--prompt` | str | `"a cup of coffee on the table"` | Text description for image generation |
+| `--prompts` | str+ | — | Multiple prompts for batched generation. Overrides `--prompt`. Requires `--output-dir`. |
 | `--seed` | int | `142` | Integer seed for deterministic sampling |
 | `--negative-prompt` | str | `None` | Negative prompt for classifier-free conditional guidance |
 | `--cfg-scale` | float | `4.0` | True CFG scale (model-specific guidance strength) |
@@ -82,16 +83,21 @@ python text_to_image.py \
 | `--num-inference-steps` | int | `50` | Diffusion sampling steps (more steps = higher quality, slower) |
 | `--height` | int | `1024` | Output image height in pixels |
 | `--width` | int | `1024` | Output image width in pixels |
-| `--output` | str | `"qwen_image_output.png"` | Path to save the generated image |
+| `--output` | str | `"qwen_image_output.png"` | Single-image output file path (one prompt, one LoRA combo, one image) |
+| `--output-dir` | str | — | Output directory for batch / multi-LoRA / XYZ runs. Files are named `cell_x{x}_y{y}_z{z}_n{n}.png`; `--axis` mode also writes `grid.png` (or `grid_z{k}.png` per Z value). |
 | `--vae-use-slicing` | flag | off | Enable VAE slicing for memory optimization |
 | `--vae-use-tiling` | flag | off | Enable VAE tiling for memory optimization |
 | `--cfg-parallel-size` | int | `1` | Set to `2` to enable CFG Parallel |
 | `--ulysses-degree` | int | `1` | Ulysses sequence parallel degree for multi-GPU inference |
 | `--ring-degree` | int | `1` | Ring sequence parallel degree for hybrid Ulysses + Ring inference |
 | `--ulysses-mode` | str | `"strict"` | Ulysses SP mode: `"strict"` or `"advanced_uaa"` |
 | `--enable-cpu-offload` | flag | off | Enable CPU offloading for diffusion models |
-| `--lora-path` | str | — | Path to PEFT LoRA adapter folder |
-| `--lora-scale` | float | `1.0` | Scale factor for LoRA weights |
+| `--lora-path` | str | — | Path to a PEFT LoRA adapter folder for init-time static load |
+| `--lora-scale` | float | `1.0` | Scale factor for `--lora-path` |
+| `--lora-paths` | str+ | — | One or more PEFT LoRA adapter folders for per-request composition. Mutually exclusive with `--lora-path`. |
+| `--lora-scales` | float+ | `1.0` per adapter | Per-adapter scales for `--lora-paths` (length must match) |
+| `--max-loras` | int | auto | LoRA cache slot count. Defaults to `max(len(--lora-paths), 1)` |
+| `--axis` | str (repeatable) | — | XYZ plot axis spec `NAME=TYPE:v1\|v2\|...` where NAME ∈ `{x,y,z}` and TYPE ∈ `{prompt, lora_scale[i], guidance_scale, num_inference_steps, seed}`. Cartesian product of axes defines cells; X=cols, Y=rows, Z produces one `grid_z{k}.png` per value. Repeat up to 3 times. |
 | `--use-system-prompt` | str | `None` | System prompt preset: `en_unified`, `en_vanilla`, `en_recaption`, `en_think_recaption`, `dynamic`, `None`, or custom text. Recommended: `en_unified`. Only for HunyuanImage-3.0.|
 | `--system-prompt` | str | `None` | Custom system prompt text. Only used when `--use-system-prompt` is set to `custom`. Only for HunyuanImage-3.0.|
 
@@ -252,7 +258,9 @@ See more examples in the [cfg_parallel user guide](../../../docs/user_guide/para
 
 #### LoRA
 
-This example supports PEFT-compatible LoRA (Low-Rank Adaptation) adapters for diffusion models. Pass `--lora-path` to use a LoRA adapter and optionally `--lora-scale` (default `1.0`); omit it to use the base model only.
+This example supports PEFT-compatible LoRA (Low-Rank Adaptation) adapters in two modes — see the [LoRA feature guide](../../../docs/user_guide/diffusion/lora.md) for a full description.
+
+**Init-time LoRA** — one adapter is pre-loaded when `Omni` starts and applied to every generation:
 
 ```bash
 python text_to_image.py \
@@ -263,6 +271,36 @@ python text_to_image.py \
   --output output.png
 ```
 
+**Per-request LoRA (incl. multi-LoRA composition)** — one or more adapters are attached to each request via `sampling_params.lora_requests`. Size the adapter cache with `--max-loras`:
+
+```bash
+python text_to_image.py \
+  --model Tongyi-MAI/Z-Image-Turbo \
+  --prompt "A piece of cheesecake" \
+  --lora-paths /lora/style_a /lora/style_b \
+  --lora-scales 1.0 0.5 \
+  --max-loras 2 \
+  --output-dir outputs/composed/
+```
+
+**XYZ plot** — put any parameter on any axis and take the Cartesian product. Each `--axis` has the form `NAME=TYPE:v1|v2|...` where `NAME` is `x` / `y` / `z` and `TYPE` is one of `prompt`, `lora_scale[i]` (targets the i-th `--lora-paths` entry), `guidance_scale`, `num_inference_steps`, or `seed`. X/Y compose a 2D grid; Z writes one `grid_z{k}.png` per value.
+
+```bash
+# 2 × 2 scale sweep of two LoRAs, across 2 prompts (Z)
+python text_to_image.py \
+  --model Tongyi-MAI/Z-Image-Turbo \
+  --lora-paths /lora/style_a /lora/style_b \
+  --max-loras 2 \
+  --axis "x=lora_scale[0]:0|1" \
+  --axis "y=lora_scale[1]:0|1" \
+  --axis "z=prompt:a girl|a cat" \
+  --output-dir outputs/axis_test/
+```
+
+Grid cells are labeled with `{adapter}\n{scale}` for LoRA-scale axes and the prompt text for a prompt axis; the Z banner shows the current slice.
+
+`--lora-path` and `--lora-paths` are mutually exclusive. `--output-dir` is required whenever the script produces more than one image (multiple prompts, any `--axis`, or `--num-images-per-prompt > 1`).
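
The flag rules above can be expressed with `argparse` as follows. This is a minimal sketch covering a hypothetical subset of the flags, not the code of `text_to_image.py`; `build_parser` and `resolve_max_loras` are illustrative names, and the default follows the documented `max(len(--lora-paths), 1)` rule.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical subset of the example CLI")
    # argparse rejects the command line if both options in the group are given.
    lora = parser.add_mutually_exclusive_group()
    lora.add_argument("--lora-path")
    lora.add_argument("--lora-paths", nargs="+")
    parser.add_argument("--max-loras", type=int, default=None)
    return parser


def resolve_max_loras(args):
    """Use --max-loras if given, else max(len(--lora-paths), 1)."""
    if args.max_loras is not None:
        return args.max_loras
    return max(len(args.lora_paths or []), 1)


parser = build_parser()
args = parser.parse_args(["--lora-paths", "/lora/style_a", "/lora/style_b"])
print(resolve_max_loras(args))  # 2

try:
    parser.parse_args(["--lora-path", "/lora/a", "--lora-paths", "/lora/b"])
except SystemExit:
    # argparse prints a usage error to stderr and exits on conflicting flags.
    print("rejected: --lora-path and --lora-paths are mutually exclusive")
```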
+
 LoRA adapters must be in PEFT format. A typical adapter directory structure:
 
 ```