Open
24 commits
86553dc
[Feature] Support multi-LoRA composition for diffusion models
ultism Mar 29, 2026
ca4d418
[Bugfix] Default lora_scales to 1.0 when lora_requests is set via gen…
ultism Mar 29, 2026
fcfd6ed
[CI] Fix ruff formatting for pre-commit checks
ultism Mar 29, 2026
e023c7e
[CI] Fix ruff formatting in test_lora_manager.py
ultism Mar 29, 2026
eee6318
[Docs] Clarify zero-scale semantics in set_active_adapters
ultism Apr 14, 2026
8259080
Merge upstream/main into multi-lora-v2
ultism Apr 14, 2026
f32d76f
[Docs] Restore original Punica semantics docstring in DiffusionBaseLi…
ultism Apr 14, 2026
ad18467
[Multi-LoRA] Address PR #2309 reviewer follow-ups
ultism Apr 16, 2026
0f93d17
[LTX2] Update Stage 2 distilled LoRA loader to new set_active_adapter…
ultism Apr 16, 2026
0e48abf
[Example] Extend text_to_image.py with per-request multi-LoRA and XYZ…
ultism Apr 16, 2026
355ecea
[Docs] Rewrite lora.md with init-time vs per-request sections
ultism Apr 16, 2026
396da3f
[Example] Validate --max-loras >= len(--lora-paths) and clarify XYZ help
ultism Apr 16, 2026
5be2773
[Docs] Fix sampling_params_list claim and note CLI auto-sizes max_loras
ultism Apr 16, 2026
d8c4ba1
[Multi-LoRA] Fix plural lora_requests reconstruction across IPC
ultism Apr 17, 2026
4769ee8
[Example] Fix XYZ grid cell key order so combos become columns
ultism Apr 17, 2026
950def4
[Example] Add --xy-sweep scale matrix and axis-labeled grids
ultism Apr 17, 2026
5b235d4
[Example] Replace --xyz/--xy-sweep with unified --axis XYZ system
ultism Apr 17, 2026
7f5f7fd
[Docs] Update README and lora.md for --axis XYZ system
ultism Apr 17, 2026
035283a
[Docs] Drop stale reference to removed --xyz mode in README
ultism Apr 17, 2026
35f8b34
[LoRA] Document _set_lora_for_layer branch dispatch
ultism Apr 17, 2026
58154a6
Merge remote-tracking branch 'upstream/main' into multi-lora-v2
ultism Apr 17, 2026
af6ca8a
[Multi-LoRA] Adapt upstream /v1/images callers to plural LoRA API
ultism Apr 17, 2026
c6c83a4
Merge remote-tracking branch 'upstream/main' into multi-lora-v2
ultism Apr 25, 2026
f01722e
[Bugfix] Fix leftover singular _parse_lora_request call after merge
ultism Apr 25, 2026
170 changes: 148 additions & 22 deletions docs/user_guide/diffusion/lora.md
@@ -1,60 +1,186 @@
# LoRA (Low-Rank Adaptation) Guide

LoRA (Low-Rank Adaptation) enables fine-tuning diffusion models by adding trainable low-rank matrices to existing model weights. vLLM-Omni currently supports PEFT-style LoRA adapters, allowing you to customize model behavior without modifying the base model weights.
LoRA (Low-Rank Adaptation) enables fine-tuning diffusion models by adding trainable low-rank matrices to existing model weights. vLLM-Omni supports PEFT-style LoRA adapters, allowing you to customize model behavior without modifying the base model weights.

## Overview

LoRA adapters are lightweight, model-specific fine-tuning weights that can be dynamically loaded and applied to diffusion models. vLLM-Omni uses a unified LoRA handling mechanism similar to vLLM with LRU cache management.
vLLM-Omni exposes two complementary LoRA flows for diffusion models:

1. **Init-time LoRA**: a single adapter is pre-loaded when `Omni` starts and is applied to every request. Lowest runtime overhead; best when all requests should share the same adapter.
2. **Per-request LoRA**: zero or more adapters are attached to each request via `sampling_params.lora_requests`. Supports switching adapters between requests and composing multiple adapters in a single forward pass (multi-LoRA).

Adapters are managed by an LRU cache so repeated activations avoid redundant weight reloads.
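
The LRU behaviour described above can be pictured with a small sketch. This is not vLLM-Omni's cache implementation: `AdapterCache`, `load_fn`, and `activate` are hypothetical names chosen for illustration, and only the general idea (a fixed number of slots, with the least recently used adapter evicted first) is taken from the text.

```python
from collections import OrderedDict


class AdapterCache:
    """Toy LRU cache keyed by adapter id (hypothetical, for illustration only)."""

    def __init__(self, max_loras, load_fn):
        self.max_loras = max_loras
        self.load_fn = load_fn        # called only on a cache miss
        self._slots = OrderedDict()   # adapter_id -> loaded weights

    def activate(self, adapter_id):
        if adapter_id in self._slots:
            # Cache hit: mark as most recently used, no reload needed.
            self._slots.move_to_end(adapter_id)
            return self._slots[adapter_id]
        if len(self._slots) >= self.max_loras:
            # Cache full: evict the least recently used adapter.
            self._slots.popitem(last=False)
        self._slots[adapter_id] = self.load_fn(adapter_id)
        return self._slots[adapter_id]


loads = []  # records every (simulated) weight load
cache = AdapterCache(max_loras=2, load_fn=lambda i: loads.append(i) or f"weights-{i}")
for adapter_id in ["a", "b", "a", "c", "b"]:
    cache.activate(adapter_id)
```

In this run the second activation of `"a"` is served from the cache, while `"b"` has to be reloaded at the end because it was evicted to make room for `"c"`.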

## LoRA Adapter Format

LoRA adapters must be in **PEFT (Parameter-Efficient Fine-Tuning)** format. A typical LoRA adapter directory structure:
LoRA adapters must be in **PEFT (Parameter-Efficient Fine-Tuning)** format. A typical adapter directory:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```

The `adapter_config.json` file contains metadata about the LoRA adapter, including:
`adapter_config.json` contains:
- `r`: LoRA rank
- `lora_alpha`: LoRA alpha scaling factor
- `target_modules`: List of module names to apply LoRA to
- `target_modules`: list of module names the adapter applies to

!!! note "Server-side Path Requirement"
The LoRA adapter path must be readable on the **server** machine. If your client and server are on different hosts, ensure the adapter is accessible via a shared mount or copied to the server.

## Quick Start

### Offline Inference
## Init-time LoRA

#### Pre-loaded LoRA
### How It Works

Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests:
Passing `lora_path` to `Omni(...)` instructs the engine to register a single adapter at startup and activate it as the only adapter for every request. The adapter occupies one slot of the LoRA cache for the lifetime of the process.

### Usage

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Tongyi-MAI/Z-Image-Turbo",
    lora_path="/path/to/lora_adapter",
    lora_scale=1.0,  # optional, default 1.0
)

outputs = omni.generate(
    "A piece of cheesecake",
    OmniDiffusionSamplingParams(height=1024, width=1024, num_inference_steps=9),
)
images = outputs[0].request_output.images
```

The CLI wrapper `examples/offline_inference/text_to_image/text_to_image.py` exposes these two kwargs as `--lora-path` and `--lora-scale`:

```bash
python examples/offline_inference/text_to_image/text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora_adapter \
--lora-scale 1.0 \
--output outputs/cheesecake.png
```

### Limitations

- Exactly one adapter, chosen at init. The adapter cannot be swapped or disabled for individual requests — restart `Omni` to change it.
- Mutually exclusive with `--lora-paths` in the example CLI. Use per-request LoRA when you need different adapters on different requests.


## Per-request LoRA

### How It Works

Each request carries its own adapter set via `OmniDiffusionSamplingParams`:

```python
sampling_params = OmniDiffusionSamplingParams(
    ...,
    lora_requests=[req_a, req_b],  # list of LoRARequest
    lora_scales=[1.0, 0.5],        # same length as lora_requests
)
```

- `lora_requests=[]` (or omitted) → no LoRA applied to this request.
- `lora_requests=[req]` → single adapter at the given scale.
- `lora_requests=[req_a, req_b, ...]` → multi-LoRA: all listed adapters are activated simultaneously, each in its own cache slot, and their deltas are summed during the forward pass.

The cache is sized by `max_loras` (defaults to 1). Set `Omni(..., max_loras=N)` when you plan to activate up to `N` adapters concurrently — requests exceeding this limit are rejected. The example CLI at `examples/offline_inference/text_to_image/text_to_image.py` auto-sizes this to `max(len(--lora-paths), 1)` when `--max-loras` is omitted.
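
The sizing rule above can be sketched as a small helper. `resolve_max_loras` is a hypothetical name and not a function of the example script; it only mirrors the documented `max(len(--lora-paths), 1)` default and assumes that an explicit value smaller than the number of paths is rejected up front.

```python
def resolve_max_loras(lora_paths, max_loras=None):
    """Return the cache slot count for a list of adapter paths (illustrative only)."""
    if max_loras is None:
        # Auto-size: one slot per adapter, and never fewer than one slot.
        return max(len(lora_paths), 1)
    if max_loras < len(lora_paths):
        # Assumed behaviour: fail before any model is loaded.
        raise ValueError(
            f"max_loras={max_loras} is smaller than the number of adapters "
            f"({len(lora_paths)})"
        )
    return max_loras


auto_empty = resolve_max_loras([])
auto_two = resolve_max_loras(["/lora/a", "/lora/b"])
explicit = resolve_max_loras(["/lora/a"], max_loras=4)
```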

### Scale Semantics

- `lora_scales[i]` multiplies adapter `i`'s contribution to the output delta.
- `lora_scales[i] == 0.0` marks a registered-but-inactive slot: the adapter remains in the cache but contributes nothing to this forward pass. This is distinct from omitting the adapter from `lora_requests`, which releases the slot.
- When `lora_requests` is set and `lora_scales` is omitted, every adapter defaults to scale `1.0`.
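
The scale rules above amount to a weighted sum of per-adapter deltas on top of the base output. The following is a minimal numeric sketch under that reading. `normalize_scales` and `composed_output` are hypothetical helpers, each delta is a plain list of floats standing in for an adapter's contribution, and the real engine applies this per layer rather than on a flat vector.

```python
def normalize_scales(num_adapters, lora_scales=None):
    """Default every adapter to 1.0 when no scales are given (illustrative only)."""
    if lora_scales is None:
        return [1.0] * num_adapters
    if len(lora_scales) != num_adapters:
        raise ValueError("lora_scales must have the same length as lora_requests")
    return list(lora_scales)


def composed_output(base_out, deltas, scales):
    """Add each adapter's delta, multiplied by its scale, to the base output."""
    out = list(base_out)
    for delta, scale in zip(deltas, scales):
        for j, d in enumerate(delta):
            out[j] += scale * d
    return out


base = [1.0, 2.0]
delta_a = [0.5, -1.0]
delta_b = [2.0, 4.0]

# Two adapters at scales 1.0 and 0.5.
y = composed_output(base, [delta_a, delta_b], normalize_scales(2, [1.0, 0.5]))

# A zero scale leaves the base output unchanged.
y_zero = composed_output(base, [delta_a], [0.0])
```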

### Usage

**Single adapter (per-request):**

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.lora.request import LoRARequest
from vllm_omni.lora.utils import stable_lora_int_id

lora_path="/path/to/lora_adapter"
omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", max_loras=1)

omni = Omni(
model="stabilityai/stable-diffusion-3.5-medium",
lora_path=lora_path
req = LoRARequest(
lora_name="style_a",
lora_int_id=stable_lora_int_id("/path/to/style_a"),
lora_path="/path/to/style_a",
)

lora_request = LoRARequest(
lora_name="preloaded",
lora_int_id=1,
lora_path=lora_path
outputs = omni.generate(
"A piece of cheesecake",
OmniDiffusionSamplingParams(
height=1024,
width=1024,
num_inference_steps=9,
lora_requests=[req],
lora_scales=[1.0],
),
)
```

**Multi-LoRA composition:**

```python
omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", max_loras=2)

req_a = LoRARequest(lora_name="style_a", lora_int_id=stable_lora_int_id("/lora/a"), lora_path="/lora/a")
req_b = LoRARequest(lora_name="style_b", lora_int_id=stable_lora_int_id("/lora/b"), lora_path="/lora/b")

outputs = omni.generate(
prompt="A piece of cheesecake",
lora_request=lora_request,
lora_scale=2.0, # optional arg, default 1.0
"A piece of cheesecake",
OmniDiffusionSamplingParams(
height=1024,
width=1024,
num_inference_steps=9,
lora_requests=[req_a, req_b],
lora_scales=[1.0, 0.5],
),
)
```

!!! note "Server-side Path Requirement"
The LoRA adapter path (`local_path`) must be readable on the **server** machine. If your client and server are on different machines, ensure the LoRA adapter is accessible via a shared mount or copied to the server.
**Switching adapters between requests:** issue separate `omni.generate(...)` calls, each with its own `OmniDiffusionSamplingParams`. The `sampling_params_list` argument of `omni.generate` is stage-indexed (one entry per pipeline stage) and is shared by all prompts in a batch, so a single batched call cannot assign different adapters to different prompts.

**CLI:**

The example CLI exposes `--lora-paths` + `--lora-scales` for per-request composition, and `--axis` for Cartesian-product XYZ plots that can put any parameter on any of the three axes. Supported axis types are `prompt`, `lora_scale[i]` (i-th `--lora-paths` entry), `guidance_scale`, `num_inference_steps`, and `seed`. X is columns, Y is rows, Z writes one `grid_z{k}.png` per value.

```bash
# Compose two adapters on one prompt
python examples/offline_inference/text_to_image/text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--prompt "A piece of cheesecake" \
--lora-paths /lora/a /lora/b \
--lora-scales 1.0 0.5 \
--max-loras 2 \
--output-dir outputs/composed/

# 2×2 LoRA-scale grid across 2 prompts (Z): produces grid_z00.png + grid_z01.png
python examples/offline_inference/text_to_image/text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--lora-paths /lora/a /lora/b \
--max-loras 2 \
--axis "x=lora_scale[0]:0|1" \
--axis "y=lora_scale[1]:0|1" \
--axis "z=prompt:a girl|a cat" \
--output-dir outputs/axis_test/
```
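
To make the `--axis` format concrete, here is a rough sketch of how a spec of the form `NAME=TYPE:v1|v2|...` could be split and expanded into a Cartesian product. `parse_axis` is a hypothetical helper and not the example script's parser; values are kept as strings, and converting them to floats or ints per axis type is left out.

```python
from itertools import product


def parse_axis(spec):
    """Split 'NAME=TYPE:v1|v2|...' into (name, type, values). Illustrative only."""
    name, _, rest = spec.partition("=")
    axis_type, _, raw_values = rest.partition(":")
    if name not in {"x", "y", "z"} or not axis_type or not raw_values:
        raise ValueError(f"bad axis spec: {spec!r}")
    return name, axis_type, raw_values.split("|")


specs = ["x=lora_scale[0]:0|1", "y=lora_scale[1]:0|1", "z=prompt:a girl|a cat"]
axes = [parse_axis(s) for s in specs]

# One cell per combination of axis values, in the order (x, y, z).
cells = list(product(*(values for _, _, values in axes)))
```

With the three axes from the command above this gives eight cells, that is, a 2×2 grid for each of the two prompts.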

### Limitations

- Up to `max_loras` adapters per request. Requests that exceed the limit fail fast before inference.
- All adapters in one request share the same forward pass, so they must target compatible modules (as declared in each adapter's PEFT `target_modules` field). Adapters that target disjoint modules compose trivially; where modules overlap, the adapters' deltas add linearly.
- `max_loras` sizes the cache at init and is not resizable at runtime.


## Wan2.2 LightX2V Offline Assembly

46 changes: 42 additions & 4 deletions examples/offline_inference/text_to_image/README.md
@@ -74,6 +74,7 @@ python text_to_image.py \
| Argument | Type | Default | Description |
| -------- | ---- | ------- | ----------- |
| `--prompt` | str | `"a cup of coffee on the table"` | Text description for image generation |
| `--prompts` | str+ | — | Multiple prompts for batched generation. Overrides `--prompt`. Requires `--output-dir`. |
| `--seed` | int | `142` | Integer seed for deterministic sampling |
| `--negative-prompt` | str | `None` | Negative prompt for classifier-free conditional guidance |
| `--cfg-scale` | float | `4.0` | True CFG scale (model-specific guidance strength) |
@@ -82,16 +83,21 @@ python text_to_image.py \
| `--num-inference-steps` | int | `50` | Diffusion sampling steps (more steps = higher quality, slower) |
| `--height` | int | `1024` | Output image height in pixels |
| `--width` | int | `1024` | Output image width in pixels |
| `--output` | str | `"qwen_image_output.png"` | Path to save the generated image |
| `--output` | str | `"qwen_image_output.png"` | Single-image output file path (one prompt, one LoRA combo, one image) |
| `--output-dir` | str | — | Output directory for batch / multi-LoRA / XYZ runs. Files are named `cell_x{x}_y{y}_z{z}_n{n}.png`; `--axis` mode also writes `grid.png` (or `grid_z{k}.png` per Z value). |
| `--vae-use-slicing` | flag | off | Enable VAE slicing for memory optimization |
| `--vae-use-tiling` | flag | off | Enable VAE tiling for memory optimization |
| `--cfg-parallel-size` | int | `1` | Set to `2` to enable CFG Parallel |
| `--ulysses-degree` | int | `1` | Ulysses sequence parallel degree for multi-GPU inference |
| `--ring-degree` | int | `1` | Ring sequence parallel degree for hybrid Ulysses + Ring inference |
| `--ulysses-mode` | str | `"strict"` | Ulysses SP mode: `"strict"` or `"advanced_uaa"` |
| `--enable-cpu-offload` | flag | off | Enable CPU offloading for diffusion models |
| `--lora-path` | str | — | Path to PEFT LoRA adapter folder |
| `--lora-scale` | float | `1.0` | Scale factor for LoRA weights |
| `--lora-path` | str | — | Path to a PEFT LoRA adapter folder for init-time static load |
| `--lora-scale` | float | `1.0` | Scale factor for `--lora-path` |
| `--lora-paths` | str+ | — | One or more PEFT LoRA adapter folders for per-request composition. Mutually exclusive with `--lora-path`. |
| `--lora-scales` | float+ | `[1.0 ...]` | Per-adapter scales for `--lora-paths` (length must match) |
| `--max-loras` | int | auto | LoRA cache slot count. Defaults to `max(len(--lora-paths), 1)` |
| `--axis` | str (repeatable) | — | XYZ plot axis spec `NAME=TYPE:v1\|v2\|...` where NAME ∈ `{x,y,z}` and TYPE ∈ `{prompt, lora_scale[i], guidance_scale, num_inference_steps, seed}`. Cartesian product of axes defines cells; X=cols, Y=rows, Z produces one `grid_z{k}.png` per value. Repeat up to 3 times. |
| `--use-system-prompt` | str | `None` | System prompt preset: `en_unified`, `en_vanilla`, `en_recaption`, `en_think_recaption`, `dynamic`, `None`, or custom text. Recommended: `en_unified`. Only for HunyuanImage-3.0.|
| `--system-prompt` | str | `None` | Custom system prompt text. Only used when `--use-system-prompt` is set to `custom`. Only for HunyuanImage-3.0.|

@@ -252,7 +258,9 @@ See more examples in the [cfg_parallel user guide](../../../docs/user_guide/para

#### LoRA

This example supports PEFT-compatible LoRA (Low-Rank Adaptation) adapters for diffusion models. Pass `--lora-path` to use a LoRA adapter and optionally `--lora-scale` (default `1.0`); omit it to use the base model only.
This example supports PEFT-compatible LoRA (Low-Rank Adaptation) adapters in two modes — see the [LoRA feature guide](../../../docs/user_guide/diffusion/lora.md) for a full description.

**Init-time LoRA** — one adapter is pre-loaded when `Omni` starts and applied to every generation:

```bash
python text_to_image.py \
@@ -263,6 +271,36 @@ python text_to_image.py \
--output output.png
```

**Per-request LoRA (incl. multi-LoRA composition)** — one or more adapters are attached to each request via `sampling_params.lora_requests`. Size the adapter cache with `--max-loras`:

```bash
python text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--prompt "A piece of cheesecake" \
--lora-paths /lora/style_a /lora/style_b \
--lora-scales 1.0 0.5 \
--max-loras 2 \
--output-dir outputs/composed/
```

**XYZ plot** — put any parameter on any axis and take the Cartesian product. Each `--axis` has the form `NAME=TYPE:v1|v2|...` where `NAME` is `x` / `y` / `z` and `TYPE` is one of `prompt`, `lora_scale[i]` (targets the i-th `--lora-paths` entry), `guidance_scale`, `num_inference_steps`, or `seed`. X/Y compose a 2D grid; Z writes one `grid_z{k}.png` per value.

```bash
# 2 × 2 scale sweep of two LoRAs, across 2 prompts (Z)
python text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--lora-paths /lora/style_a /lora/style_b \
--max-loras 2 \
--axis "x=lora_scale[0]:0|1" \
--axis "y=lora_scale[1]:0|1" \
--axis "z=prompt:a girl|a cat" \
--output-dir outputs/axis_test/
```

Grid cells are labeled with `{adapter}\n{scale}` for LoRA-scale axes and the prompt text for a prompt axis; the Z banner shows the current slice.

`--lora-path` and `--lora-paths` are mutually exclusive. `--output-dir` is required whenever the script produces more than one image (multiple prompts, any `--axis`, or `--num-images-per-prompt > 1`).

LoRA adapters must be in PEFT format. A typical adapter directory structure:

```