[Feature] Support multi-LoRA composition for diffusion models#2309
[Feature] Support multi-LoRA composition for diffusion models#2309ultism wants to merge 24 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d04ada9f74
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if lora_requests: | ||
| sampling_params.lora_requests = lora_requests |
There was a problem hiding this comment.
Set default scales when lora_requests override sampling params
When generate_batch(..., lora_requests=...) (and similarly generate) is used, only sampling_params.lora_requests is populated while sampling_params.lora_scales is left at its default empty list. The worker then calls set_active_adapters() and hits the length check (lora_requests vs lora_scales), so requests fail with ValueError whenever callers rely on this new convenience argument without manually setting scales in sampling_params.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Thanks for catching this! Fixed in commit ca4d418 — when lora_requests is passed via generate() / generate_batch() without an explicit lora_scales, lora_scales now defaults to [1.0] * len(lora_requests) so the length check in set_active_adapters no longer trips. Callers using the convenience argument without specifying scales will get uniform 1.0 scaling, which matches the most common use case.
| if len(lora_requests) > self.max_loras: | ||
| raise ValueError( | ||
| f"Requested {len(lora_requests)} adapters but max_loras={self.max_loras}" |
There was a problem hiding this comment.
Filter zero-scale adapters before enforcing max_loras
This limit check runs before zero-scale adapters are removed, so requests like [adapter_a, adapter_b] with scales [1.0, 0.0] are rejected when max_loras=1 even though only one adapter would actually be active. Since zero scales are explicitly treated as disabled later in the method, enforcing max_loras on the unfiltered list causes avoidable request failures.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Thanks for flagging — this is actually intentional. In this design, scale=0.0 is semantically distinct from omitting the adapter from lora_requests:
0.0→ adapter stays registered and occupies a LoRA cache slot, but contributes nothing to this forward pass.- Absent from the list → slot is released.
So max_loras is enforced against the full lora_requests list on purpose. If we filtered zero-scale adapters first, the same list [A, B] would pass under max_loras=1 with scales [1.0, 0.0] but fail with [1.0, 1.0] — a state-dependent capacity error that's harder to reason about than rejecting up front.
To release a slot, callers should remove the adapter from the request rather than pass 0.0. I've added an inline comment on the check to make this contract explicit.
Enable composing multiple LoRA adapters per request (e.g. pose + style + control), which is standard practice in CivitAI/ComfyUI workflows. Each adapter occupies its own slot in the pre-allocated lora_a_stacked/lora_b_stacked buffers, and apply() accumulates deltas from all active slots. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
…erate() When callers pass lora_requests through the generate/generate_batch convenience parameter without explicitly setting lora_scales in sampling_params, the worker would hit a length mismatch ValueError. Default to scale 1.0 for each adapter when scales are not provided. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
Remove extra blank line and collapse single-expression ValueError raises to satisfy the ruff formatter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
Expand multi-argument LoRALayerWeights() calls and long type annotations to satisfy the ruff formatter line-length rules. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
|
|
||
|
|
||
| @pytest.mark.parametrize("model_name", models) | ||
| def test_diffusion_model(model_name: str, tmp_path: Path): |
There was a problem hiding this comment.
please add @pytest.mark.diffusion @pytest.mark.advanced_model
For more information about tags, please refer to docs/contributing/ci/tests_markers.md
There was a problem hiding this comment.
Both markers are in place — @pytest.mark.diffusion and @pytest.mark.advanced_model are applied to test_diffusion_model at test_diffusion_lora.py:93-94, matching docs/contributing/ci/tests_markers.md. Thanks for flagging.
SamitHuang
left a comment
There was a problem hiding this comment.
For the test plan, please give the whole test script or add an example
There was a problem hiding this comment.
please also update the lora feature docs.
There was a problem hiding this comment.
Done — docs/user_guide/diffusion/lora.md has been reworked to cover the new flows:
355ecea5— rewrote the guide around init-time vs per-request LoRA.5be2773f— clarified thatsampling_params_listis stage-indexed (not per-prompt) and noted CLI auto-sizesmax_loras.7f5f7fd8— documented the--axisXYZ system for Cartesian-product grids acrosslora_scale[i]/prompt/guidance_scale/num_inference_steps/seed.
The guide now has separate "Init-time LoRA" and "Per-request LoRA" sections with zero-scale slot semantics, max_loras sizing, multi-LoRA composition examples, and CLI snippets. Let me know if anything's still missing.
lishunyang12
left a comment
There was a problem hiding this comment.
left a few comments. the zero-scale filtering ordering bug is the main one.
| if len(lora_requests) > self.max_loras: | ||
| raise ValueError(f"Requested {len(lora_requests)} adapters but max_loras={self.max_loras}") | ||
|
|
||
| # Filter out zero-scale adapters |
There was a problem hiding this comment.
+1 on the codex bot's comment — max_loras is enforced before zero-scale adapters are filtered out a few lines below, so [a, b] with scales [1.0, 0.0] blows up when max_loras=1 even though only one adapter would actually be active.
Move the len(lora_requests) > self.max_loras check to after the zero-scale filter loop, operating on active_requests instead.
There was a problem hiding this comment.
Good catch, but this ordering is intentional and I'd like to keep it. The key semantic distinction is that scale=0.0 is not equivalent to omitting the adapter — a zero-scale adapter is still registered and occupies a LoRA cache slot, it just contributes nothing to this forward pass. Releasing the slot requires removing the adapter from the request entirely.
If we moved the max_loras check after the zero-scale filter, [A, B] with scales [1.0, 0.0] would pass under max_loras=1, but the same list with scales [1.0, 1.0] (or even [1.0, 0.5]) would fail. That state-dependent capacity behavior is more surprising than the current upfront rejection.
I've pushed a commit that adds an inline comment on the check, so the intent is obvious at the call site. Happy to revisit if you see a concrete use case where zero-scale "soft disable" without a slot is required — but that feels like a distinct feature (e.g. an explicit "reserved but inactive" flag) rather than overloading 0.0.
Will address the _are_active_at_scales ordering TODO and the dead _parse_lora_request cleanup in a follow-up commit.
| return False | ||
| if self._get_rounded_scale(scale) != self._get_rounded_scale(active_scale): | ||
| return False | ||
| return True |
There was a problem hiding this comment.
_are_active_at_scales is order-sensitive — requesting [A, B] then [B, A] with the same scales will trigger a full re-activation even though the composed result is identical (addition is commutative). Probably fine for now but worth a TODO.
There was a problem hiding this comment.
Good call — added the TODO at manager.py:508-511 inline above the length check, noting the multiset comparison as the follow-up. Keeping the order-sensitive behavior for now since it's correct, just suboptimal.
|
|
||
| def _parse_lora_requests(lora_body: dict[str, Any] | list[dict[str, Any]] | None): | ||
| try: | ||
| return parse_lora_requests(lora_body) |
There was a problem hiding this comment.
Nit: is _parse_lora_request (singular) still called anywhere? Both call sites now use _parse_lora_requests. If it's dead code, remove it.
There was a problem hiding this comment.
Cleaned up in the multi-LoRA refactor — the singular _parse_lora_request no longer exists anywhere in the tree (rg _parse_lora_request returns nothing). What's at api_server.py:1672 now is the plural _parse_lora_requests, which is a thin wrapper that translates ValueError from parse_lora_requests (in entrypoints/openai/utils.py) into HTTPException(400) — kept because the util is also called from non-HTTP paths (serving_video.py, serving_chat.py) that shouldn't raise HTTP errors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
Resolve conflicts from upstream refactor (PR vllm-project#2006 "Separate AR/Generation/Diffusion stage processes", RFC vllm-project#967): - Accept deletion of vllm_omni/entrypoints/async_omni_diffusion.py; AsyncOmniDiffusion is gone, replaced by StageDiffusionProc subprocess. - Migrate lora_scales defaulting from AsyncOmniDiffusion.generate() to OmniDiffusionSamplingParams.__post_init__ so any caller setting lora_requests without lora_scales still gets 1.0 per adapter. - Merge stage_engine_args refactor in async_omni_engine.py; add max_loras=kwargs.get("max_loras", 1) into the unified dict. - Accept upstream's test conftest (FakeLinearBase / DummyBaseLayerWithLoRA public names); upgrade conftest fixtures to multi-slot signatures so both upstream and multi-LoRA tests pass. - Migrate tests/e2e/offline_inference/test_diffusion_lora.py to OmniRunner context manager. - Keep parse_lora_requests (plural) in serving_chat/serving_video/utils; drop upstream's isinstance(lora_body, dict) guard so list-of-dicts still works. - Fix stale self._active_adapter_id (singular) ref in DiffusionLoRAManager._deactivate_all_adapters uncovered during merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
…nearLayerWithLoRA Commit 86553dc (multi-LoRA composition) inadvertently replaced the original Punica explanation — including the Shrink/Expand semantics matching PunicaWrapperGPU.add_lora_linear() — with a multi-LoRA blurb. Restore the original text from vllm-project#758 and append the multi-LoRA note as a separate paragraph so both pieces of context coexist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
- Add @pytest.mark.diffusion @pytest.mark.advanced_model to test_diffusion_model and test_diffusion_multi_lora_composition (yenuo26). - Add TODO on _are_active_at_scales flagging order-sensitivity — LoRA delta composition is commutative so same-set-different-order re-requests needlessly trigger full re-activation (lishunyang12). - Remove dead singular _parse_lora_request wrapper and its import from api_server.py; only _parse_lora_requests (plural) is called (lishunyang12). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
…s API PR vllm-project#2309 renamed DiffusionLoRAManager.set_active_adapter (singular) to set_active_adapters (plural) with list signatures. LTX2 distilled stage was added to upstream in vllm-project#2260 after vllm-project#2309 branched, so its two call sites were written against the old singular API and broke when this branch merged upstream/main. Wrap the single LoRARequest / scale in one-element lists to match the new signature; behavior is identical. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
… plot The existing CLI only exposed init-time LoRA (--lora-path / --lora-scale) which pre-loads a single adapter in the engine. Add per-request flags and a grid-plot mode to surface the multi-LoRA composition added in PR vllm-project#2309 and to give reviewers an end-to-end example script (SamitHuang asked for either a full test script or an example). New CLI args: - --prompts (nargs='+') batched generation in one omni.generate() call - --lora-paths (nargs='+') per-request adapters; mutex with --lora-path - --lora-scales matching scales (default 1.0 each) - --max-loras sizes the adapter cache; auto-derived from len(--lora-paths) when omitted - --xyz render {baseline, each adapter alone, composed} x prompts matrix, stitch to grid.png - --output-dir required whenever the run produces >1 image; cells saved as p{p}_c{c}_n{n}.png Behavior: - No --lora-paths: one implicit combo, same path as before. - --xyz: resolves to baseline + each-adapter + composed (when >=2 adapters). - Per combo, a fresh torch.Generator(seed) is constructed so seed is the only source of noise variance across combos (matches the assertion pattern in tests/e2e/offline_inference/test_diffusion_lora.py). README.md updated with Init-time / Per-request / XYZ subsections under Advanced Features > LoRA and new rows in the Key Arguments table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
The previous LoRA guide documented an Omni.generate(lora_request=..., lora_scale=...) signature that never existed on the unified AsyncOmni.generate API — callers following the docs hit "unexpected keyword argument". It also pre-dated multi-LoRA composition (PR vllm-project#2309) and treated LoRA as a single feature. Restructure into two top-level sections that match the cpu_offload_diffusion style: - Init-time LoRA: Omni(lora_path=..., lora_scale=...) pre-loads a single adapter and applies it to every request. Lowest overhead, no runtime swap. - Per-request LoRA: sampling_params.lora_requests=[...] and .lora_scales attach 0..N adapters per request, enabling multi-LoRA composition and per-request adapter switching. Includes the zero-scale semantics (scale=0.0 occupies a slot but contributes nothing, distinct from omitting the adapter) and the max_loras cache-size constraint. Multi-LoRA is folded into the Per-request section as the natural list-length-greater-than-one case rather than a separate chapter. All code examples reference real APIs and are runnable. The Wan2.2 LightX2V appendix and See Also links are preserved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
Previously when the user passed --max-loras smaller than the number of --lora-paths together with --xyz, the composed combo would fail at request time with a confusing cache-full rejection. Add a pre-flight check so the failure mode is obvious. Also tighten the --xyz help text to call out that the composed column is skipped for a single-adapter run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
The previous lora.md claim that different requests in one omni.generate batch can carry distinct adapter sets via sampling_params_list= is incorrect. sampling_params_list on omni.generate is stage-indexed — one entry per pipeline stage — and is shared across all prompts in the batch (see vllm_omni/entrypoints/omni.py:131). Per-prompt adapter variance must be done via separate generate() calls. Also annotate the max_loras paragraph with the example CLI's auto-sizing behavior so readers don't think they must always set it explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
_reconstruct_sampling_params only reconverted the legacy singular lora_request key; the new plural lora_requests list was left as raw msgspec-serialized items after IPC, so worker-side iteration surfaced each element as list and crashed at manager.set_active_adapters with 'list' object has no attribute lora_int_id. Iterate and convert every element; drop the dead singular branch since the field is gone from OmniDiffusionSamplingParams. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
cell_images was keyed as (combo_idx, p_idx) but _compose_grid and the num_rows/num_cols call site both treat the key as (row=prompt, col=combo). With 4 combos × 2 prompts the combo index overflowed the canvas height and the right half of grid.png came out blank. Key as (p_idx, combo_idx) and unpack accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
--xy-sweep SCALE... runs an N×N grid for exactly two --lora-paths,
with the first adapter's scale varying along X (columns) and the
second along Y (rows). Cells are saved as xy_x{x}_y{y}_n{n}.png
and stitched into a labeled grid.png.
_compose_grid gains optional row_labels/col_labels, so --xyz mode's
grid now shows combo labels (adapter name + scale) as column headers
and prompts as row headers instead of an unlabeled 2D paste.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Any parameter can go on any axis via NAME=TYPE:v1|v2|... — types are
prompt, lora_scale[i], guidance_scale, num_inference_steps, seed.
X/Y compose a 2D grid; Z produces one grid_z{k}.png per Z value.
Cells emit as cell_x{x}_y{y}_z{z}_n{n}.png.
Grid composer gains an optional title banner (used for Z slice);
row labels honor explicit newlines from axis labels so LoRA-scale
headers like "lora_chardesign\n1.00" wrap cleanly in the margin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Replace --xyz / --xy-sweep references with the unified --axis syntax.
Update output filename convention (cell_x{x}_y{y}_z{z}.png and
grid_z{k}.png) and add an example showing two LoRA scales on X/Y plus
prompt on Z to demonstrate the Cartesian-product semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
Annotate the four weight-shape cases (missing / packed / fused multi-slice / fused single-slice) and the non-obvious sub-suffix lookup for packed layers, so the branch structure is readable without tracing call sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
|
Re: test plan — added both:
Full walkthroughs with CLI snippets are in |
Signed-off-by: ultranationalism <www913363043@gmail.com>
Upstream main added new call sites that reference the singular `_parse_lora_request` / `parse_lora_request` helpers and the corresponding singular `.lora_request` / `.lora_scale` fields, both of which were removed by the multi-LoRA refactor. Update the three new callers in `api_server.py` and `serving_chat.py` to use the plural `_parse_lora_requests` / `parse_lora_requests` helpers and assign the plural `.lora_requests` / `.lora_scales` fields, widening the type guard to accept `list` so multi-LoRA bodies are not silently dropped. Also apply `ruff format` touch-ups to `text_to_image.py` and `stage_diffusion_proc.py` that became due after the merge. Signed-off-by: ultranationalism <www913363043@gmail.com>
Signed-off-by: ultranationalism <www913363043@gmail.com> # Conflicts: # examples/offline_inference/text_to_image/text_to_image.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultranationalism <www913363043@gmail.com>
|
@hsliuustc0106 Could you take a look when you have time? The branch has been rebased onto latest upstream/main and pre-commit is now passing. Ready for review. |
Summary
Implements multi-LoRA composition for diffusion models, enabling multiple LoRA adapters to be active simultaneously on a single request with independent per-adapter scales. This is a core workflow for the diffusion community (CivitAI, ComfyUI, etc.) and was tracked in RFC #2149.
Key changes:
DiffusionLoRAManagernow manages N GPU buffer slots (max_loras) instead of a single slot. Each adapter occupies one slot; unused slots are reset.apply():DiffusionBaseLinearLayerWithLoRA.apply()iterates all active adapter slots and accumulates LoRA deltas:y += Σ (x @ Aᵢᵀ) @ Bᵢᵀlora_request/lora_scale→lora_requests: list/lora_scales: listacross sampling params, entrypoints, and OpenAI-compatible protocollorafield in/v1/images/generations,/v1/images/edits,/v1/chat/completions, and/v1/videos/generationsmax_cpu_lorascontrols cache size,max_lorascontrols concurrent GPU slotsNew config parameter:
API Usage
Python (offline inference)
OpenAI-compatible endpoint
{ "prompt": "1girl, portrait", "lora": [ {"name": "style_a", "path": "/path/to/lora_a", "scale": 0.8}, {"name": "style_b", "path": "/path/to/lora_b", "scale": 0.6} ] }E2E Validation
Tested on RTX 5090 with Z-Image Turbo base model and two CivitAI LoRAs:
XY grid (Jinx scale × Chardesign scale, values: 0, 0.25, 0.5, 1.0):

The grid demonstrates:
Files Changed (15 files, +775/-381)
diffusion/lora/manager.py_set_lora_for_layer(),_activate_adapters()diffusion/lora/layers/base_linear.pyapply()loop, per-slot active-slice trackingdiffusion/data.pymax_lorasparam, validationinputs/data.pylora_request→lora_requests,lora_scale→lora_scalesdiffusion/worker/diffusion_worker.pymax_loras, updatedset_active_adaptercallsengine/async_omni_engine.pymax_lorasthrough engine configentrypoints/async_omni_diffusion.pylora_requestsparameterentrypoints/openai/utils.pyparse_lora_requests()fordict | list[dict] | Noneentrypoints/openai/api_server.pyentrypoints/openai/serving_chat.pyentrypoints/openai/serving_video.pyentrypoints/openai/protocol/images.pylorafield acceptslist[dict]entrypoints/openai/protocol/videos.pylorafield acceptslist[dict]tests/diffusion/lora/test_lora_manager.pytests/e2e/offline_inference/test_diffusion_lora.pytest_diffusion_multi_lora_compositionE2E testTest Plan
set_active_adapters, slot-indexedset_calls/reset_calls)test_diffusion_multi_lora_composition: creates two synthetic LoRA adapters perturbing different QKV slices, verifies composed output differs from each adapter alone and from baseline — passedBreaking Changes
OmniDiffusionSamplingParams.lora_request→lora_requests(list)OmniDiffusionSamplingParams.lora_scale→lora_scales(list)DiffusionLoRAManager.set_active_adapter()→set_active_adapters()These are internal APIs not yet part of a stable release.
Closes #2149
🤖 Generated with Claude Code