Skip to content

[Feature] Support multi-LoRA composition for diffusion models#2309

Open
ultism wants to merge 24 commits into
vllm-project:mainfrom
ultism:multi-lora-v2
Open

[Feature] Support multi-LoRA composition for diffusion models#2309
ultism wants to merge 24 commits into
vllm-project:mainfrom
ultism:multi-lora-v2

Conversation

@ultism
Copy link
Copy Markdown

@ultism ultism commented Mar 29, 2026

Summary

Implements multi-LoRA composition for diffusion models, enabling multiple LoRA adapters to be active simultaneously on a single request with independent per-adapter scales. This is a core workflow for the diffusion community (CivitAI, ComfyUI, etc.) and was tracked in RFC #2149.

Key changes:

  • Slot-based buffer management: DiffusionLoRAManager now manages N GPU buffer slots (max_loras) instead of a single slot. Each adapter occupies one slot; unused slots are reset.
  • Additive composition in apply(): DiffusionBaseLinearLayerWithLoRA.apply() iterates all active adapter slots and accumulates LoRA deltas: y += Σ (x @ Aᵢᵀ) @ Bᵢᵀ
  • API surface updated to plural: lora_request/lora_scalelora_requests: list/lora_scales: list across sampling params, entrypoints, and OpenAI-compatible protocol
  • OpenAI endpoint accepts single dict or list of dicts for the lora field in /v1/images/generations, /v1/images/edits, /v1/chat/completions, and /v1/videos/generations
  • LRU cache manages adapter registration independently of activation — max_cpu_loras controls cache size, max_loras controls concurrent GPU slots

New config parameter:

max_loras: int = 1   # max adapters composed per request (GPU slot count)
max_cpu_loras: int    # max adapters cached on CPU (LRU), defaults to max_loras

API Usage

Python (offline inference)

from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.lora.request import LoRARequest

params = OmniDiffusionSamplingParams(
    lora_requests=[lora_a, lora_b],
    lora_scales=[0.8, 0.6],
    # ... other params
)

OpenAI-compatible endpoint

{
  "prompt": "1girl, portrait",
  "lora": [
    {"name": "style_a", "path": "/path/to/lora_a", "scale": 0.8},
    {"name": "style_b", "path": "/path/to/lora_b", "scale": 0.6}
  ]
}

E2E Validation

Tested on RTX 5090 with Z-Image Turbo base model and two CivitAI LoRAs:

XY grid (Jinx scale × Chardesign scale, values: 0, 0.25, 0.5, 1.0):
xy_grid

The grid demonstrates:

  • Each axis independently controls its respective adapter's influence
  • Composition is additive — both styles blend when both scales are non-zero
  • Scale 0 correctly disables the adapter without affecting the other

Files Changed (15 files, +775/-381)

Area Files Description
Core diffusion/lora/manager.py Multi-slot activation, extracted _set_lora_for_layer(), _activate_adapters()
Core diffusion/lora/layers/base_linear.py Multi-adapter apply() loop, per-slot active-slice tracking
Config diffusion/data.py Added max_loras param, validation
Data inputs/data.py lora_requestlora_requests, lora_scalelora_scales
Worker diffusion/worker/diffusion_worker.py Plumbed max_loras, updated set_active_adapter calls
Engine engine/async_omni_engine.py Pass max_loras through engine config
Entrypoints entrypoints/async_omni_diffusion.py Plural lora_requests parameter
Entrypoints entrypoints/openai/utils.py New parse_lora_requests() for dict | list[dict] | None
Entrypoints entrypoints/openai/api_server.py Updated image gen/edit endpoints
Entrypoints entrypoints/openai/serving_chat.py Updated chat-based image generation
Entrypoints entrypoints/openai/serving_video.py Updated video generation
Protocol entrypoints/openai/protocol/images.py lora field accepts list[dict]
Protocol entrypoints/openai/protocol/videos.py lora field accepts list[dict]
Tests tests/diffusion/lora/test_lora_manager.py 7 new multi-LoRA unit tests + updated existing tests
Tests tests/e2e/offline_inference/test_diffusion_lora.py New test_diffusion_multi_lora_composition E2E test

Test Plan

  • Updated existing LoRA unit tests to use the new plural API (set_active_adapters, slot-indexed set_calls/reset_calls)
  • Added 7 new multi-LoRA unit tests: multi-slot activation, unused slot reset, max_loras exceeded, mismatched lengths, empty deactivation, zero-scale filtering, scale comparison — all passed
  • Added E2E test test_diffusion_multi_lora_composition: creates two synthetic LoRA adapters perturbing different QKV slices, verifies composed output differs from each adapter alone and from baseline — passed
  • Manual E2E: XY grid with real CivitAI LoRAs on RTX 5090 (Z-Image Turbo, 16 image combinations) — passed

Breaking Changes

  • OmniDiffusionSamplingParams.lora_requestlora_requests (list)
  • OmniDiffusionSamplingParams.lora_scalelora_scales (list)
  • DiffusionLoRAManager.set_active_adapter()set_active_adapters()

These are internal APIs not yet part of a stable release.

Closes #2149

🤖 Generated with Claude Code

@ultism ultism requested a review from hsliuustc0106 as a code owner March 29, 2026 13:02
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d04ada9f74

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +220 to +221
if lora_requests:
sampling_params.lora_requests = lora_requests
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Set default scales when lora_requests override sampling params

When generate_batch(..., lora_requests=...) (and similarly generate) is used, only sampling_params.lora_requests is populated while sampling_params.lora_scales is left at its default empty list. The worker then calls set_active_adapters() and hits the length check (lora_requests vs lora_scales), so requests fail with ValueError whenever callers rely on this new convenience argument without manually setting scales in sampling_params.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! Fixed in commit ca4d418 — when lora_requests is passed via generate() / generate_batch() without an explicit lora_scales, lora_scales now defaults to [1.0] * len(lora_requests) so the length check in set_active_adapters no longer trips. Callers using the convenience argument without specifying scales will get uniform 1.0 scaling, which matches the most common use case.

Comment thread vllm_omni/diffusion/lora/manager.py Outdated
Comment on lines +243 to +245
if len(lora_requests) > self.max_loras:
raise ValueError(
f"Requested {len(lora_requests)} adapters but max_loras={self.max_loras}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Filter zero-scale adapters before enforcing max_loras

This limit check runs before zero-scale adapters are removed, so requests like [adapter_a, adapter_b] with scales [1.0, 0.0] are rejected when max_loras=1 even though only one adapter would actually be active. Since zero scales are explicitly treated as disabled later in the method, enforcing max_loras on the unfiltered list causes avoidable request failures.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging — this is actually intentional. In this design, scale=0.0 is semantically distinct from omitting the adapter from lora_requests:

  • 0.0 → adapter stays registered and occupies a LoRA cache slot, but contributes nothing to this forward pass.
  • Absent from the list → slot is released.

So max_loras is enforced against the full lora_requests list on purpose. If we filtered zero-scale adapters first, the same list [A, B] would pass under max_loras=1 with scales [1.0, 0.0] but fail with [1.0, 1.0] — a state-dependent capacity error that's harder to reason about than rejecting up front.

To release a slot, callers should remove the adapter from the request rather than pass 0.0. I've added an inline comment on the check to make this contract explicit.

ultism and others added 2 commits March 29, 2026 22:34
Enable composing multiple LoRA adapters per request (e.g. pose + style + control),
which is standard practice in CivitAI/ComfyUI workflows. Each adapter occupies its
own slot in the pre-allocated lora_a_stacked/lora_b_stacked buffers, and apply()
accumulates deltas from all active slots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
…erate()

When callers pass lora_requests through the generate/generate_batch
convenience parameter without explicitly setting lora_scales in
sampling_params, the worker would hit a length mismatch ValueError.
Default to scale 1.0 for each adapter when scales are not provided.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
ultism and others added 2 commits March 29, 2026 23:38
Remove extra blank line and collapse single-expression ValueError
raises to satisfy the ruff formatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Expand multi-argument LoRALayerWeights() calls and long type
annotations to satisfy the ruff formatter line-length rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>


@pytest.mark.parametrize("model_name", models)
def test_diffusion_model(model_name: str, tmp_path: Path):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add @pytest.mark.diffusion @pytest.mark.advanced_model
For more information about tags, please refer to docs/contributing/ci/tests_markers.md

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both markers are in place — @pytest.mark.diffusion and @pytest.mark.advanced_model are applied to test_diffusion_model at test_diffusion_lora.py:93-94, matching docs/contributing/ci/tests_markers.md. Thanks for flagging.

Copy link
Copy Markdown
Collaborator

@SamitHuang SamitHuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test plan, please give the whole test script or add an example

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also update the lora feature docs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — docs/user_guide/diffusion/lora.md has been reworked to cover the new flows:

  • 355ecea5 — rewrote the guide around init-time vs per-request LoRA.
  • 5be2773f — clarified that sampling_params_list is stage-indexed (not per-prompt) and noted CLI auto-sizes max_loras.
  • 7f5f7fd8 — documented the --axis XYZ system for Cartesian-product grids across lora_scale[i]/prompt/guidance_scale/num_inference_steps/seed.

The guide now has separate "Init-time LoRA" and "Per-request LoRA" sections with zero-scale slot semantics, max_loras sizing, multi-LoRA composition examples, and CLI snippets. Let me know if anything's still missing.

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments. the zero-scale filtering ordering bug is the main one.

if len(lora_requests) > self.max_loras:
raise ValueError(f"Requested {len(lora_requests)} adapters but max_loras={self.max_loras}")

# Filter out zero-scale adapters
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the codex bot's comment — max_loras is enforced before zero-scale adapters are filtered out a few lines below, so [a, b] with scales [1.0, 0.0] blows up when max_loras=1 even though only one adapter would actually be active.

Move the len(lora_requests) > self.max_loras check to after the zero-scale filter loop, operating on active_requests instead.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, but this ordering is intentional and I'd like to keep it. The key semantic distinction is that scale=0.0 is not equivalent to omitting the adapter — a zero-scale adapter is still registered and occupies a LoRA cache slot, it just contributes nothing to this forward pass. Releasing the slot requires removing the adapter from the request entirely.

If we moved the max_loras check after the zero-scale filter, [A, B] with scales [1.0, 0.0] would pass under max_loras=1, but the same list with scales [1.0, 1.0] (or even [1.0, 0.5]) would fail. That state-dependent capacity behavior is more surprising than the current upfront rejection.

I've pushed a commit that adds an inline comment on the check, so the intent is obvious at the call site. Happy to revisit if you see a concrete use case where zero-scale "soft disable" without a slot is required — but that feels like a distinct feature (e.g. an explicit "reserved but inactive" flag) rather than overloading 0.0.

Will address the _are_active_at_scales ordering TODO and the dead _parse_lora_request cleanup in a follow-up commit.

return False
if self._get_rounded_scale(scale) != self._get_rounded_scale(active_scale):
return False
return True
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_are_active_at_scales is order-sensitive — requesting [A, B] then [B, A] with the same scales will trigger a full re-activation even though the composed result is identical (addition is commutative). Probably fine for now but worth a TODO.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — added the TODO at manager.py:508-511 inline above the length check, noting the multiset comparison as the follow-up. Keeping the order-sensitive behavior for now since it's correct, just suboptimal.


def _parse_lora_requests(lora_body: dict[str, Any] | list[dict[str, Any]] | None):
try:
return parse_lora_requests(lora_body)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: is _parse_lora_request (singular) still called anywhere? Both call sites now use _parse_lora_requests. If it's dead code, remove it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up in the multi-LoRA refactor — the singular _parse_lora_request no longer exists anywhere in the tree (rg _parse_lora_request returns nothing). What's at api_server.py:1672 now is the plural _parse_lora_requests, which is a thin wrapper that translates ValueError from parse_lora_requests (in entrypoints/openai/utils.py) into HTTPException(400) — kept because the util is also called from non-HTTP paths (serving_video.py, serving_chat.py) that shouldn't raise HTTP errors.

ultism and others added 16 commits April 14, 2026 23:43
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Resolve conflicts from upstream refactor (PR vllm-project#2006 "Separate
AR/Generation/Diffusion stage processes", RFC vllm-project#967):

- Accept deletion of vllm_omni/entrypoints/async_omni_diffusion.py;
  AsyncOmniDiffusion is gone, replaced by StageDiffusionProc subprocess.
- Migrate lora_scales defaulting from AsyncOmniDiffusion.generate() to
  OmniDiffusionSamplingParams.__post_init__ so any caller setting
  lora_requests without lora_scales still gets 1.0 per adapter.
- Merge stage_engine_args refactor in async_omni_engine.py; add
  max_loras=kwargs.get("max_loras", 1) into the unified dict.
- Accept upstream's test conftest (FakeLinearBase / DummyBaseLayerWithLoRA
  public names); upgrade conftest fixtures to multi-slot signatures so
  both upstream and multi-LoRA tests pass.
- Migrate tests/e2e/offline_inference/test_diffusion_lora.py to
  OmniRunner context manager.
- Keep parse_lora_requests (plural) in serving_chat/serving_video/utils;
  drop upstream's isinstance(lora_body, dict) guard so list-of-dicts
  still works.
- Fix stale self._active_adapter_id (singular) ref in
  DiffusionLoRAManager._deactivate_all_adapters uncovered during merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
…nearLayerWithLoRA

Commit 86553dc (multi-LoRA composition) inadvertently replaced the
original Punica explanation — including the Shrink/Expand semantics
matching PunicaWrapperGPU.add_lora_linear() — with a multi-LoRA blurb.
Restore the original text from vllm-project#758 and append the multi-LoRA note as
a separate paragraph so both pieces of context coexist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
- Add @pytest.mark.diffusion @pytest.mark.advanced_model to
  test_diffusion_model and test_diffusion_multi_lora_composition
  (yenuo26).
- Add TODO on _are_active_at_scales flagging order-sensitivity —
  LoRA delta composition is commutative so same-set-different-order
  re-requests needlessly trigger full re-activation (lishunyang12).
- Remove dead singular _parse_lora_request wrapper and its import
  from api_server.py; only _parse_lora_requests (plural) is called
  (lishunyang12).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
…s API

PR vllm-project#2309 renamed DiffusionLoRAManager.set_active_adapter (singular)
to set_active_adapters (plural) with list signatures. LTX2 distilled
stage was added to upstream in vllm-project#2260 after vllm-project#2309 branched, so its two
call sites were written against the old singular API and broke when
this branch merged upstream/main.

Wrap the single LoRARequest / scale in one-element lists to match the
new signature; behavior is identical.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
… plot

The existing CLI only exposed init-time LoRA (--lora-path / --lora-scale)
which pre-loads a single adapter in the engine. Add per-request flags
and a grid-plot mode to surface the multi-LoRA composition added in
PR vllm-project#2309 and to give reviewers an end-to-end example script (SamitHuang
asked for either a full test script or an example).

New CLI args:
- --prompts (nargs='+')     batched generation in one omni.generate() call
- --lora-paths (nargs='+')  per-request adapters; mutex with --lora-path
- --lora-scales             matching scales (default 1.0 each)
- --max-loras               sizes the adapter cache; auto-derived from
                            len(--lora-paths) when omitted
- --xyz                     render {baseline, each adapter alone, composed}
                            x prompts matrix, stitch to grid.png
- --output-dir              required whenever the run produces >1 image;
                            cells saved as p{p}_c{c}_n{n}.png

Behavior:
- No --lora-paths: one implicit combo, same path as before.
- --xyz: resolves to baseline + each-adapter + composed (when >=2 adapters).
- Per combo, a fresh torch.Generator(seed) is constructed so seed is the
  only source of noise variance across combos (matches the assertion
  pattern in tests/e2e/offline_inference/test_diffusion_lora.py).

README.md updated with Init-time / Per-request / XYZ subsections under
Advanced Features > LoRA and new rows in the Key Arguments table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
The previous LoRA guide documented an Omni.generate(lora_request=...,
lora_scale=...) signature that never existed on the unified
AsyncOmni.generate API — callers following the docs hit
"unexpected keyword argument". It also pre-dated multi-LoRA composition
(PR vllm-project#2309) and treated LoRA as a single feature.

Restructure into two top-level sections that match the cpu_offload_diffusion
style:

- Init-time LoRA: Omni(lora_path=..., lora_scale=...) pre-loads a single
  adapter and applies it to every request. Lowest overhead, no runtime
  swap.
- Per-request LoRA: sampling_params.lora_requests=[...] and .lora_scales
  attach 0..N adapters per request, enabling multi-LoRA composition
  and per-request adapter switching. Includes the zero-scale semantics
  (scale=0.0 occupies a slot but contributes nothing, distinct from
  omitting the adapter) and the max_loras cache-size constraint.

Multi-LoRA is folded into the Per-request section as the natural
list-length-greater-than-one case rather than a separate chapter.

All code examples reference real APIs and are runnable. The Wan2.2
LightX2V appendix and See Also links are preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Previously when the user passed --max-loras smaller than the number of
--lora-paths together with --xyz, the composed combo would fail at
request time with a confusing cache-full rejection. Add a pre-flight
check so the failure mode is obvious.

Also tighten the --xyz help text to call out that the composed column
is skipped for a single-adapter run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
The previous lora.md claim that different requests in one omni.generate
batch can carry distinct adapter sets via sampling_params_list= is
incorrect. sampling_params_list on omni.generate is stage-indexed —
one entry per pipeline stage — and is shared across all prompts in the
batch (see vllm_omni/entrypoints/omni.py:131). Per-prompt adapter
variance must be done via separate generate() calls.

Also annotate the max_loras paragraph with the example CLI's
auto-sizing behavior so readers don't think they must always set it
explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
_reconstruct_sampling_params only reconverted the legacy singular
lora_request key; the new plural lora_requests list was left as raw
msgspec-serialized items after IPC, so worker-side iteration surfaced
each element as list and crashed at manager.set_active_adapters with
'list' object has no attribute lora_int_id. Iterate and convert every
element; drop the dead singular branch since the field is gone from
OmniDiffusionSamplingParams.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
cell_images was keyed as (combo_idx, p_idx) but _compose_grid and the
num_rows/num_cols call site both treat the key as (row=prompt, col=combo).
With 4 combos × 2 prompts the combo index overflowed the canvas height
and the right half of grid.png came out blank. Key as (p_idx, combo_idx)
and unpack accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
--xy-sweep SCALE... runs an N×N grid for exactly two --lora-paths,
with the first adapter's scale varying along X (columns) and the
second along Y (rows). Cells are saved as xy_x{x}_y{y}_n{n}.png
and stitched into a labeled grid.png.

_compose_grid gains optional row_labels/col_labels, so --xyz mode's
grid now shows combo labels (adapter name + scale) as column headers
and prompts as row headers instead of an unlabeled 2D paste.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Any parameter can go on any axis via NAME=TYPE:v1|v2|... — types are
prompt, lora_scale[i], guidance_scale, num_inference_steps, seed.
X/Y compose a 2D grid; Z produces one grid_z{k}.png per Z value.
Cells emit as cell_x{x}_y{y}_z{z}_n{n}.png.

Grid composer gains an optional title banner (used for Z slice);
row labels honor explicit newlines from axis labels so LoRA-scale
headers like "lora_chardesign\n1.00" wrap cleanly in the margin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Replace --xyz / --xy-sweep references with the unified --axis syntax.
Update output filename convention (cell_x{x}_y{y}_z{z}.png and
grid_z{k}.png) and add an example showing two LoRA scales on X/Y plus
prompt on Z to demonstrate the Cartesian-product semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
Annotate the four weight-shape cases (missing / packed / fused multi-slice /
fused single-slice) and the non-obvious sub-suffix lookup for packed layers,
so the branch structure is readable without tracing call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
@ultism
Copy link
Copy Markdown
Author

ultism commented Apr 17, 2026

Re: test plan — added both:

  • E2E test: tests/e2e/offline_inference/test_diffusion_lora.py covers init-time LoRA, per-request single/multi-LoRA, zero-scale slot semantics, and max_loras rejection, using a synthetic Z-Image PEFT adapter (tagged @pytest.mark.diffusion @pytest.mark.advanced_model).
  • Runnable example: examples/offline_inference/text_to_image/text_to_image.py now exposes --lora-path (init-time), --lora-paths + --lora-scales (per-request composition), and --axis x=…/y=…/z=… for Cartesian-product XYZ grids over lora_scale[i] / prompt / guidance_scale / num_inference_steps / seed.

Full walkthroughs with CLI snippets are in docs/user_guide/diffusion/lora.md. cc @SamitHuang

Signed-off-by: ultranationalism <www913363043@gmail.com>
Upstream main added new call sites that reference the singular
`_parse_lora_request` / `parse_lora_request` helpers and the
corresponding singular `.lora_request` / `.lora_scale` fields, both of
which were removed by the multi-LoRA refactor. Update the three new
callers in `api_server.py` and `serving_chat.py` to use the plural
`_parse_lora_requests` / `parse_lora_requests` helpers and assign the
plural `.lora_requests` / `.lora_scales` fields, widening the type
guard to accept `list` so multi-LoRA bodies are not silently dropped.

Also apply `ruff format` touch-ups to `text_to_image.py` and
`stage_diffusion_proc.py` that became due after the merge.

Signed-off-by: ultranationalism <www913363043@gmail.com>
ultism and others added 2 commits April 25, 2026 22:12
Signed-off-by: ultranationalism <www913363043@gmail.com>

# Conflicts:
#	examples/offline_inference/text_to_image/text_to_image.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
@ultism
Copy link
Copy Markdown
Author

ultism commented Apr 25, 2026

@hsliuustc0106 Could you take a look when you have time? The branch has been rebased onto latest upstream/main and pre-commit is now passing. Ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Multi-LoRA Composition for Diffusion Models

4 participants