
reafator pipeline stage/step pipeline#1368

Merged
wtomin merged 5 commits into
vllm-project:mainfrom
omni-nicelab:pr/pipeline
Mar 20, 2026

Conversation

@asukaqaq-s
Contributor

@asukaqaq-s asukaqaq-s commented Feb 13, 2026


Purpose

Introduce step-level execution capability at the runner/pipeline layer:
prepare_encode → denoise_step × N → step_scheduler × N → post_decode
QwenImagePipeline will be the first implementation.

Out of scope (no changes): Engine, Executor, Worker entrypoint, external APIs

This PR is strictly limited to the runner/pipeline layer and maintains full backward compatibility.
Related RFC: #874
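
The call order described under Purpose can be sketched roughly as below. This is a toy illustration rather than the PR's code: `ToyPipeline`, its scalar "latents", and `run_stepwise` are hypothetical stand-ins, and only the four method names are taken from the description.

```python
class ToyPipeline:
    """Hypothetical stand-in; a real pipeline would operate on tensors."""

    def prepare_encode(self, prompt: str) -> dict:
        # One-time setup: "encode" the prompt and create initial latents.
        return {"embed": float(len(prompt)), "latents": 100.0, "timesteps": [3, 2, 1]}

    def denoise_step(self, state: dict, t: int) -> float:
        # Predict the "noise" for the current timestep.
        return state["latents"] * 0.1 * t

    def step_scheduler(self, state: dict, noise_pred: float) -> None:
        # Apply the scheduler update to the latents.
        state["latents"] -= noise_pred

    def post_decode(self, state: dict) -> str:
        # One-time finish: "decode" the latents into an output.
        return f"image({state['latents']:.2f})"


def run_stepwise(pipeline: ToyPipeline, prompt: str) -> str:
    # prepare_encode once, then denoise_step + step_scheduler per timestep,
    # then post_decode once.
    state = pipeline.prepare_encode(prompt)
    for t in state["timesteps"]:
        noise_pred = pipeline.denoise_step(state, t)
        pipeline.step_scheduler(state, noise_pred)
    return pipeline.post_decode(state)
```

Because each phase is a separate call, a runner could in principle pause between two steps of one request, which is the capability the RFC is after.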

Test Plan

Test Result

Prompts used:

"a cup of coffee on a wooden table, morning light, photorealistic"
"a red panda sitting on a tree branch in a bamboo forest, soft focus background"
"an astronaut riding a horse on the surface of Mars, cinematic lighting"
"a cozy cabin in the snowy mountains at sunset, warm glow from windows"
"a futuristic cityscape with flying cars and neon lights, cyberpunk style"

No performance degradation

Average generation time across 5 prompts per resolution: stepwise execution introduces no measurable overhead (< 0.1% difference, within noise).

bit-for-bit identical output

All 15 image pairs (5 prompts x 3 resolutions) produce identical MD5 checksums between stepwise and non-stepwise modes, confirming that the stepwise refactoring does not alter the generation output in any way.

512x512/prompt_0: IDENTICAL
512x512/prompt_1: IDENTICAL
512x512/prompt_2: IDENTICAL
512x512/prompt_3: IDENTICAL
512x512/prompt_4: IDENTICAL
768x768/prompt_0: IDENTICAL
768x768/prompt_1: IDENTICAL
768x768/prompt_2: IDENTICAL
768x768/prompt_3: IDENTICAL
768x768/prompt_4: IDENTICAL
1024x1024/prompt_0: IDENTICAL
1024x1024/prompt_1: IDENTICAL
1024x1024/prompt_2: IDENTICAL
1024x1024/prompt_3: IDENTICAL
1024x1024/prompt_4: IDENTICAL
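
A comparison like the one above could be scripted along these lines. The actual test script is not shown in this PR, so this is only a sketch: `md5_hex` and `compare_outputs` are hypothetical names, and the images are represented as in-memory bytes instead of files on disk.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    # MD5 is used here only as a cheap equality fingerprint, not for security.
    return hashlib.md5(data).hexdigest()


def compare_outputs(stepwise: dict[str, bytes], baseline: dict[str, bytes]) -> dict[str, str]:
    # Compare each baseline image against the stepwise image with the same key.
    report = {}
    for key in sorted(baseline):
        same = md5_hex(stepwise[key]) == md5_hex(baseline[key])
        report[key] = "IDENTICAL" if same else "DIFFERENT"
    return report


if __name__ == "__main__":
    baseline = {"512x512/prompt_0": b"\x89PNG-fake-bytes"}
    stepwise = {"512x512/prompt_0": b"\x89PNG-fake-bytes"}
    for name, verdict in compare_outputs(stepwise, baseline).items():
        print(f"{name}: {verdict}")
```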



@asukaqaq-s asukaqaq-s changed the title reafator pipeline stage/step api reafator pipeline stage/step pipeline Feb 13, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment

Thanks for the contribution. The step-level decomposition is a solid architectural direction. A few observations and questions inline before merge.

prompt_embeds_mask: torch.Tensor | None = None,
negative_prompt_embeds: torch.Tensor | None = None,
negative_prompt_embeds_mask: torch.Tensor | None = None,
attention_kwargs: dict[str, Any] | None = None,
Collaborator

I was wondering about a subtle difference from the existing forward() method. In forward(), the prompt/negative_prompt extraction from req.prompts happens unconditionally:

prompt = [p if isinstance(p, str) else (p.get("prompt") or "") for p in req.prompts] or prompt

But here in prepare_encode, it's guarded by if prompt is None. This means if someone passes an explicit prompt argument alongside a req that also contains prompts, forward() would override with req.prompts while prepare_encode would keep the explicit argument. Could you help me understand whether this divergence is intentional? If not, it might be worth making the behavior identical to forward() to avoid subtle bugs when the two code paths are used side by side.
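
To make the divergence concrete, here is a small sketch of the two extraction styles. The list comprehension is the one quoted above; the function names and the plain `req_prompts` list are hypothetical stand-ins for the real methods and `req.prompts`.

```python
def extract_unconditional(req_prompts, prompt=None):
    # forward()-style: req.prompts wins whenever it yields a non-empty list.
    return [p if isinstance(p, str) else (p.get("prompt") or "") for p in req_prompts] or prompt


def extract_guarded(req_prompts, prompt=None):
    # Reviewed prepare_encode-style: req.prompts is consulted only when
    # no explicit prompt argument was passed.
    if prompt is None:
        prompt = [p if isinstance(p, str) else (p.get("prompt") or "") for p in req_prompts] or prompt
    return prompt


req_prompts = ["a cat", {"prompt": "a dog"}]
explicit = ["explicit override"]

# With an explicit prompt, the two styles disagree.
print(extract_unconditional(req_prompts, explicit))
print(extract_guarded(req_prompts, explicit))
```

With `prompt=None` both functions return the same list, so the difference only shows up when a caller passes both `req` and an explicit `prompt`.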

scheduler_override: Any | None = None,
):
if prompt is None:
prompt = [p if isinstance(p, str) else (p.get("prompt") or "") for p in req.prompts] or prompt
Collaborator

Small nit -- the negative_prompt handling here also diverges from forward() in a similar way to the positive prompt. In forward(), the negative prompt logic runs unconditionally:

if all(isinstance(p, str) or p.get("negative_prompt") is None for p in req.prompts):
    negative_prompt = None

But here it's guarded by if negative_prompt is None. This means if negative_prompt is explicitly passed as an empty string "" (which is falsy but not None), the behavior would differ between the two code paths. I might be overthinking this, but it seemed worth flagging for consistency.


def prepare_encode(
self,
req: OmniDiffusionRequest,
Collaborator

The return type annotation is missing from prepare_encode. Since this method returns a 15-element tuple that the caller (execute_stepwise) unpacks positionally, I was wondering if it might help maintainability to either:

  1. Add a return type annotation (even a tuple[...]), or
  2. Return a NamedTuple or @dataclass so callers don't need to rely on positional unpacking of 15 values?

Positional unpacking of large tuples can be fragile when someone later adds/removes/reorders a return value. Just a thought -- happy to hear your perspective!

height // self.vae_scale_factor // 2,
width // self.vae_scale_factor // 2,
)
]
Collaborator

I noticed that prepare_encode does not call self.prepare_timesteps() (which already exists as a method on the class) but instead inlines the timestep preparation logic directly. The existing forward() method uses self.prepare_timesteps(). Could this lead to divergence if someone later modifies prepare_timesteps()? Would it make sense to reuse that helper here for DRY-ness?

Contributor Author

I fixed this in 868e40a. prepare_encode() now reuses self.prepare_timesteps(...) (the same path as forward()), so the timestep/mu logic is centralized and won’t drift.

To do this, I also updated the scheduler_step* signatures in cfg_parallel.py to accept an optional scheduler argument. This allows stepwise execution with per-request scheduler state instead of temporarily rebinding self.scheduler. The change is backward-compatible: scheduler=None falls back to self.scheduler, so existing call sites are unaffected.
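
The fallback described in this reply might look roughly like the sketch below. `ToyScheduler` and `ToyMixin` are hypothetical, and the real `scheduler_step*` helpers in cfg_parallel.py take different arguments; only the `scheduler=None` fallback to `self.scheduler` is taken from the comment.

```python
from typing import Optional


class ToyScheduler:
    """Hypothetical scheduler that counts how often it is stepped."""

    def __init__(self, step_size: float) -> None:
        self.step_size = step_size
        self.num_calls = 0

    def step(self, noise_pred: float, latents: float) -> float:
        self.num_calls += 1
        return latents - self.step_size * noise_pred


class ToyMixin:
    def __init__(self, scheduler: ToyScheduler) -> None:
        self.scheduler = scheduler

    def scheduler_step(
        self,
        noise_pred: float,
        latents: float,
        scheduler: Optional[ToyScheduler] = None,
    ) -> float:
        # Use the explicitly passed scheduler if given; otherwise fall back
        # to the pipeline-level one, so existing call sites keep working.
        active = scheduler if scheduler is not None else self.scheduler
        return active.step(noise_pred, latents)
```

No instance attribute is rebound here, so there is nothing to restore in a try/finally.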

Collaborator

looks good now

latents: torch.Tensor,
img_shapes: list,
txt_seq_lens: list[int] | None,
negative_txt_seq_lens: list[int] | None,
Collaborator

@lishunyang12 lishunyang12 Feb 21, 2026

Good reuse of predict_noise_maybe_with_cfg from the CFGParallelMixin. I noticed one difference though: in diffuse(), the additional_transformer_kwargs (which include return_dict and attention_kwargs) are spread into both positive_kwargs and negative_kwargs via **additional_transformer_kwargs. Here in denoise_step, these are set directly as individual keys. The behavior should be equivalent, but I wanted to confirm -- are the kwargs identical in both cases? Specifically, diffuse() passes attention_kwargs as a nested key via the spread, while here it's set as "attention_kwargs": self.attention_kwargs. They look the same to me, just wanted to double-check.

"hidden_states": latent_model_input,
"timestep": t_for_model / 1000,
"guidance": guidance,
"encoder_hidden_states_mask": prompt_embeds_mask,
Collaborator

@lishunyang12 lishunyang12 Feb 21, 2026

Handling the scheduler override by temporarily swapping self.scheduler with a try/finally is clean and safe.

One small thing I noticed: the mixin's scheduler_step_maybe_with_cfg presumably calls self.scheduler.step(...) internally. If an exception occurs during that call, the finally block will restore the original scheduler, which is great. But could there be thread-safety concerns if multiple requests are processed concurrently? In that case, temporarily mutating self.scheduler on the instance might cause race conditions. Is the current design single-threaded per pipeline instance? If so, this is perfectly fine.

with set_forward_context(vllm_config=self.vllm_config, omni_diffusion_config=self.od_config):
with record_function("pipeline_forward"):
output = self.pipeline.forward(req)
if isinstance(self.pipeline, SupportsStepExecution):
Collaborator

@lishunyang12 lishunyang12 Feb 21, 2026

The dispatch logic here is clean. One minor question: since SupportsStepExecution is a @runtime_checkable Protocol, the isinstance check will verify that the pipeline has the supports_step_execution class variable and the required methods. However, I noticed that the helper function supports_step_execution() from interface.py is not used here. Would it be slightly cleaner to use the helper function instead of the raw isinstance check? That way the logic is in one place:

from vllm_omni.diffusion.models.interface import supports_step_execution
if supports_step_execution(self.pipeline):

Minor style suggestion -- the current code works fine too.
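
For readers unfamiliar with the pattern, a runtime-checkable protocol plus a helper might be sketched as follows. The body of the real helper in interface.py is not shown in this thread, so the implementation here is an assumption; `StepPipeline` and `LegacyPipeline` are hypothetical example classes.

```python
from typing import Any, ClassVar, Protocol, runtime_checkable


@runtime_checkable
class SupportsStepExecution(Protocol):
    supports_step_execution: ClassVar[bool]

    def prepare_encode(self, *args: Any, **kwargs: Any) -> Any: ...
    def denoise_step(self, *args: Any, **kwargs: Any) -> Any: ...
    def step_scheduler(self, *args: Any, **kwargs: Any) -> Any: ...
    def post_decode(self, *args: Any, **kwargs: Any) -> Any: ...


def supports_step_execution(pipeline: object) -> bool:
    # Assumed helper body: structural check plus the opt-in flag.
    return isinstance(pipeline, SupportsStepExecution) and bool(
        getattr(pipeline, "supports_step_execution", False)
    )


class StepPipeline:
    supports_step_execution = True

    def prepare_encode(self, state): return state
    def denoise_step(self, state): return 0.0
    def step_scheduler(self, state, noise_pred): return state
    def post_decode(self, state): return state


class LegacyPipeline:
    def forward(self, req): return req
```

Note that the `isinstance` check on a runtime-checkable protocol only confirms that the named members exist; it does not validate their signatures.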

from vllm_omni.platforms import current_omni_platform

logger = init_logger(__name__)

Collaborator

Very minor: it looks like an extra blank line was introduced here (there are now two blank lines between the logger assignment and the class definition, where there was one before). Not a blocker at all, just flagging in case you want to keep formatting consistent.

Contributor Author

Thanks for flagging this. I checked and the spacing is enforced by our lint/CI — top-level classes require two blank lines, so reducing it to one would fail the style check.

I’ll keep it as-is to stay consistent with the formatter.

Collaborator

fair enough

Comment thread vllm_omni/diffusion/models/interface.py Outdated

supports_step_execution: ClassVar[bool] = True

def prepare_encode(self, req: "OmniDiffusionRequest", **kwargs: Any) -> Any:
Collaborator

@lishunyang12 lishunyang12 Feb 21, 2026

Clean protocol design -- intentionally permissive with *args, **kwargs for denoise_step, step_scheduler, and post_decode.

One thought: prepare_encode takes a concrete OmniDiffusionRequest parameter, which couples the protocol to that specific request type. If future pipelines might use a different request type, would it make sense to loosen this to Any as well (matching the other methods), or is OmniDiffusionRequest intentionally the canonical request type for all diffusion pipelines? Curious about the intent.

.to(latents.device, latents.dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
latents.device, latents.dtype
Collaborator

@lishunyang12 lishunyang12 Feb 21, 2026

In post_decode, when output_type == "latent", the original forward() assigns image = latents and then wraps it in DiffusionOutput(output=image). Here, you return DiffusionOutput(output=latents) directly, which is equivalent.

However, I noticed that forward() sets self._current_timestep = None after the diffuse loop ends, but neither post_decode nor execute_stepwise resets it. This could leave self._current_timestep pointing to the last timestep value after generation completes. Could this cause issues if something inspects current_timestep between requests? It might be worth adding self._current_timestep = None at the end of post_decode or in execute_stepwise to match forward()'s behavior.

Contributor Author

Thanks, that’s a great point. You’re right that the stepwise path should mirror forward() and clear self._current_timestep after generation.

I fixed this in the latest commit — self._current_timestep is now reset to None at the end of stepwise decoding, so it won’t retain the last timestep across requests.

DiffusionOutput(output=latents) is behaviorally unchanged compared to assigning image = latents first.

Collaborator

thanks

@hsliuustc0106
Collaborator

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: Refactor Pipeline Stage/Step Pipeline

1. Overview

This PR introduces step-level execution capability at the runner/pipeline layer for diffusion models, implementing a new execution flow: prepare_encode → denoise_step × N → step_scheduler × N → post_decode. The changes are well-scoped to the runner/pipeline layer and maintain backward compatibility through protocol-based detection.

Overall Assessment: Positive with suggestions - The architectural approach is sound, but there are several areas that need attention before merging.


2. Code Quality

Strengths

  • Good use of TYPE_CHECKING to avoid circular imports
  • Protocol-based design allows for clean duck typing
  • Methods have helpful docstrings

Issues

a) Fragile tuple unpacking in execute_stepwise (diffusion_model_runner.py:203-208)

The 15-element tuple returned from prepare_encode is fragile and error-prone. If the return order changes, it will break silently.

(
    prompt_embeds, prompt_embeds_mask,
    negative_prompt_embeds, negative_prompt_embeds_mask,
    latents, img_shapes, txt_seq_lens, negative_txt_seq_lens,
    timesteps, do_true_cfg, guidance, true_cfg_scale,
    height, width, scheduler,
) = self.pipeline.prepare_encode(req=req)

Recommendation: Use a dataclass or NamedTuple for the return value:

@dataclass
class StepExecutionContext:
    prompt_embeds: torch.Tensor
    prompt_embeds_mask: torch.Tensor
    negative_prompt_embeds: torch.Tensor | None
    negative_prompt_embeds_mask: torch.Tensor | None
    latents: torch.Tensor
    img_shapes: list
    txt_seq_lens: list[int] | None
    negative_txt_seq_lens: list[int] | None
    timesteps: torch.Tensor
    do_true_cfg: bool
    guidance: torch.Tensor | None
    true_cfg_scale: float
    height: int
    width: int
    scheduler: Any

b) Missing return type annotation (pipeline_qwen_image.py:536)

prepare_encode lacks a return type annotation, making it difficult to understand the expected output.

c) Interrupt handling incomplete (diffusion_model_runner.py:211-218)

denoise_step can return None when interrupted, but the loop continues without checking:

for _i, t in enumerate(timesteps):
    noise_pred = self.pipeline.denoise_step(...)  # Can return None
    latents = self.pipeline.step_scheduler(...)   # Uses None?

Recommendation:

for _i, t in enumerate(timesteps):
    noise_pred = self.pipeline.denoise_step(...)
    if noise_pred is None:
        break  # Handle interrupt
    latents = self.pipeline.step_scheduler(...)

3. Architecture & Design

Strengths

  • Clean separation of concerns with distinct methods for each phase
  • Backward compatibility maintained via isinstance check
  • Protocol pattern allows gradual adoption by other pipelines

Issues

a) Scheduler state mutation (pipeline_qwen_image.py:766-776)

The temporary binding of self.scheduler in step_scheduler modifies instance state, which could cause issues in concurrent scenarios:

if scheduler is not None and scheduler is not self.scheduler:
    saved = self.scheduler
    self.scheduler = scheduler
    try:
        return self.scheduler_step_maybe_with_cfg(...)
    finally:
        self.scheduler = saved

Recommendation: Pass the scheduler explicitly to scheduler_step_maybe_with_cfg or refactor the mixin to accept scheduler as a parameter.

b) Parameter redundancy in prepare_encode (pipeline_qwen_image.py:536-564)

The method accepts both req and individual parameters that overlap with req attributes. This creates ambiguity about which takes precedence.

Recommendation: Consider either:

  1. Only accepting req and extracting values internally
  2. Making individual parameters override req values consistently (document this behavior)

c) Missing protocol enforcement (interface.py:31-47)

The SupportsStepExecution protocol methods have no actual signature enforcement due to the permissive *args, **kwargs pattern. While documented as intentional, this reduces type safety.


4. Security & Safety

Issues

a) No input validation in execute_stepwise (diffusion_model_runner.py:203)

No validation that prepare_encode returned valid data before proceeding to denoise loop.

b) Resource cleanup

The scheduler state mutation pattern uses try/finally correctly, but consider what happens if an exception occurs mid-mutation in a multi-threaded context.


5. Testing & Documentation

Issues

a) Duplicate test command in PR description

The same test command appears twice in the PR description - appears to be a copy-paste error.

b) Missing test coverage

  • No tests for the new SupportsStepExecution protocol
  • No tests for execute_stepwise method
  • No tests for interrupt handling during step execution
  • No tests for scheduler override functionality

c) Missing documentation for new protocol

The SupportsStepExecution protocol lacks documentation on:

  • Expected call sequence
  • State management between calls
  • Thread-safety considerations

6. Specific Suggestions

vllm_omni/diffusion/models/interface.py

Line 31-47: Add more detailed docstrings explaining the contract:

@runtime_checkable
class SupportsStepExecution(Protocol):
    """Step-level execution protocol for diffusion pipelines.
    
    Implementations must support the following call sequence:
    1. prepare_encode() - called once before the denoise loop
    2. denoise_step() + step_scheduler() - called N times in sequence
    3. post_decode() - called once after the denoise loop
    
    State between calls should be managed via instance attributes
    or returned values from prepare_encode().
    """

vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py

Line 536: Add return type annotation:

def prepare_encode(
    self,
    req: OmniDiffusionRequest,
    ...
) -> tuple[...]:  # Or return StepExecutionContext

Line 766-776: Consider refactoring to avoid state mutation:

def step_scheduler(
    self,
    noise_pred: torch.Tensor | None,
    timestep: torch.Tensor | float | int,
    latents: torch.Tensor,
    do_true_cfg: bool = False,
    scheduler: Any | None = None,
) -> torch.Tensor:
    if self.interrupt:
        return latents
    
    active_scheduler = scheduler if scheduler is not None else self.scheduler
    # Pass active_scheduler to a refactored scheduler_step_maybe_with_cfg
    return self._scheduler_step_with_scheduler(
        active_scheduler, noise_pred, timestep, latents, do_true_cfg
    )

vllm_omni/diffusion/worker/diffusion_model_runner.py

Line 203-220: Add interrupt handling and use a context object:

def execute_stepwise(self, req: OmniDiffusionRequest) -> DiffusionOutput:
    """Execute via step-level Protocol: prepare_encode → denoise × N → post_decode."""
    ctx = self.pipeline.prepare_encode(req=req)
    
    for _i, t in enumerate(ctx.timesteps):
        noise_pred = self.pipeline.denoise_step(
            ctx.prompt_embeds, ctx.prompt_embeds_mask,
            ctx.negative_prompt_embeds, ctx.negative_prompt_embeds_mask,
            ctx.latents, ctx.img_shapes, ctx.txt_seq_lens, ctx.negative_txt_seq_lens,
            t, ctx.do_true_cfg, ctx.guidance, ctx.true_cfg_scale,
        )
        if noise_pred is None:
            logger.info("Denoise interrupted at step %d", _i)
            break
        ctx.latents = self.pipeline.step_scheduler(
            noise_pred, t, ctx.latents, ctx.do_true_cfg, ctx.scheduler,
        )
    
    return self.pipeline.post_decode(ctx.latents, ctx.height, ctx.width)

Line 37: Remove extra blank line added between imports and class.


7. Approval Status

Changes Requested

While the architectural approach is sound and the PR achieves its stated goals, I recommend addressing the following before merging:

  1. Critical: Fix interrupt handling in execute_stepwise - currently broken
  2. Important: Replace 15-element tuple with a named structure for maintainability
  3. Important: Add return type annotation to prepare_encode
  4. Minor: Fix duplicate test command in PR description
  5. Minor: Remove extra blank line in diffusion_model_runner.py

The backward compatibility approach is excellent, and the protocol-based design will make it easy for other pipelines to adopt step-level execution in the future. Once the issues above are addressed, this will be a solid addition to the codebase.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

@asukaqaq-s
Contributor Author

asukaqaq-s commented Mar 1, 2026

Thanks for the detailed review. I’ve pushed an update addressing the concerns raised and included additional self-review fixes.
Main changes:

  1. Replaced the 15-element positional tuple with DiffusionRequestState (dataclass).
     • Step interfaces are now state-driven: prepare_encode, denoise_step, step_scheduler, and post_decode.
     • The runner constructs and manages the state lifecycle, removing fragile positional unpacking.
     • The Protocol is now decoupled from direct OmniDiffusionRequest method signatures.
  2. Removed temporary self.scheduler swapping.
     • Added an explicit scheduler: Any | None = None parameter in CFG mixin scheduler helpers.
     • The step path now passes state.scheduler explicitly.
     • Backward compatibility is preserved by falling back to self.scheduler when not provided.

Also fixed:
  • Prompt / negative prompt extraction in prepare_encode now matches forward() behavior (unconditional extraction).
  • Reused self.prepare_timesteps() to avoid divergence and duplicated logic.
  • Runner dispatch now uses the supports_step_execution(...) helper.
  • Added _current_timestep = None reset in post_decode for parity with forward().
  • Fixed a cfg_normalize regression in the stepwise path by restoring forward-equivalent default behavior.

Note on concurrency:
The current runner execution model is synchronous per call. However, we still removed shared self.scheduler mutation to avoid potential race conditions in future continuous batching or concurrent execution models.

I also updated the test results in the PR description.
Please re-review when you have a chance, thanks! @lishunyang12

@asukaqaq-s
Contributor Author

Rebased onto main, resolved the conflicts, and addressed the review comments.

Collaborator

@lishunyang12 lishunyang12 left a comment

nice rework — previous concerns addressed. one thing inline


try:
self.pipeline.prepare_encode(state)

Collaborator

denoise_step returns None on interrupt but the loop keeps going. Worth breaking early:

Suggested change
for _i, _t in enumerate(state.timesteps):
    noise_pred = self.pipeline.denoise_step(state)
    if noise_pred is None:
        break
    # TODO: continuous batching should step per-request state.
    self.pipeline.step_scheduler(state, noise_pred)

Contributor Author

Good catch, thanks. I applied the early break in this PR so we don’t call step_scheduler with None. I’m keeping the change minimal for now; a follow-up with proper abort support will make the interrupt/abort path explicit.
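
The early-break behavior discussed in this thread can be illustrated with a toy loop. `ToyState`, `InterruptiblePipeline`, and `run_denoise_loop` are hypothetical names, and the real state object (DiffusionRequestState) carries tensors and other fields not modeled here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyState:
    latents: float
    timesteps: list = field(default_factory=lambda: [4, 3, 2, 1])
    steps_applied: int = 0


class InterruptiblePipeline:
    def __init__(self, interrupt_at: Optional[int] = None) -> None:
        self.interrupt_at = interrupt_at

    def denoise_step(self, state: ToyState, t: int) -> Optional[float]:
        # Returning None signals that the request was interrupted.
        if t == self.interrupt_at:
            return None
        return 1.0

    def step_scheduler(self, state: ToyState, noise_pred: float) -> None:
        state.latents -= noise_pred
        state.steps_applied += 1


def run_denoise_loop(pipeline: InterruptiblePipeline, state: ToyState) -> ToyState:
    for t in state.timesteps:
        noise_pred = pipeline.denoise_step(state, t)
        if noise_pred is None:
            break  # never hand None to step_scheduler
        pipeline.step_scheduler(state, noise_pred)
    return state
```

Without the `break`, `step_scheduler` would receive `None` and, in this toy version, fail on the subtraction.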

"attention_kwargs": self.attention_kwargs,
"return_dict": False,
}
if state.do_true_cfg:
Contributor

Can we extract a common function to avoid this massive code duplication?

Contributor Author

Makes sense. I’ll refactor this by adding a private helper like _build_denoise_kwargs(...) inside qwen-image to build positive_kwargs, negative_kwargs, and output_slice, rather than pushing it into the generic CFGParallelMixin.

Contributor

not only cfg. prepare_encode also have duplicate code.

noise_pred: torch.Tensor,
t: torch.Tensor,
latents: torch.Tensor,
scheduler: Any | None = None,
Contributor

Why do we need to add this param, given state.scheduler = self.scheduler in prepare_encode?

Contributor Author

We need this because with step-level switching, batches may be at different progress, and a request may be switched between different execution states. So we cache the scheduler in RequestStateCache to make sure the request continues with the correct local scheduling state.

Contributor

I know we need to keep the scheduling state (e.g., timesteps). However, state.scheduler = self.scheduler is just a reference assignment. When self.scheduler changes, request_state.scheduler will also change. Have you tested this?

Contributor Author

state.scheduler is used in the later scheduler step here:

  state.latents = self.scheduler_step_maybe_with_cfg(
      noise_pred,
      t,
      state.latents,
      state.do_true_cfg,
      scheduler=state.scheduler,
  )

state.scheduler is meant to keep per-request scheduler state, not just timesteps. In stepwise execution and future continuous batching, different requests may be resumed at different denoise progress, so they should not share the same pipeline scheduler instance. self.scheduler is the base template, while state.scheduler is the request-local scheduler used later in scheduler_step_maybe_with_cfg(...).

You're right. Assigning self.scheduler directly only keeps a shared reference. I will change this to create a request-local scheduler instance, e.g. state.scheduler = FlowMatchEulerDiscreteScheduler.from_config(self.scheduler.config), so the later scheduler_step_maybe_with_cfg(..., scheduler=state.scheduler) uses per-request scheduler state.

@asukaqaq-s
Contributor Author

Review 1 — Bounty-hunter: "not only cfg. prepare_encode also have duplicate code."

Addressed. Extracted _extract_prompts() and _prepare_generation_context() as shared helpers in pipeline_qwen_image.py. Both forward() and prepare_encode() now delegate to _prepare_generation_context() for input validation, prompt encoding, latent preparation, timestep computation, and guidance setup. Also extracted _build_denoise_kwargs() for the denoise kwargs construction used by denoise_step(), and _decode_latents() shared by forward() and post_decode().

Review 2 — Bounty-hunter: "state.scheduler = self.scheduler is just a reference assignment"

Fixed. prepare_encode() now does copy.deepcopy(self.scheduler) after _prepare_generation_context() (which calls prepare_timesteps() and materializes the timestep state on self.scheduler), so the per-request scheduler carries correct dynamic-shifting state without sharing the pipeline instance.
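
The aliasing problem raised in the review, and the effect of `copy.deepcopy`, can be shown with a minimal sketch. `ToyScheduler` and `demo` are hypothetical; a real scheduler holds timesteps and other tensors, but the reference semantics are the same.

```python
import copy


class ToyScheduler:
    """Hypothetical scheduler with a single piece of mutable state."""

    def __init__(self) -> None:
        self.step_index = 0

    def step(self) -> None:
        self.step_index += 1


def demo() -> tuple:
    pipeline_scheduler = ToyScheduler()

    alias = pipeline_scheduler                 # reference assignment: same object
    local = copy.deepcopy(pipeline_scheduler)  # request-local copy: independent object

    # Stepping the pipeline-level scheduler is visible through the alias
    # but does not touch the deep copy.
    pipeline_scheduler.step()
    return alias.step_index, local.step_index


print(demo())
```

Taking the copy after the timestep state has been prepared, as the reply describes, means the copy starts from the already-configured scheduler.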

Comment thread vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py
asukaqaq-s added a commit to omni-nicelab/vllm-omni-batching that referenced this pull request Mar 11, 2026
Refactor from pr/pipeline branch, related to vllm-project#1368.
- Restructure pipeline stage/step API
- Pass scheduler explicitly in stepwise pipeline and CFG mixin

Signed-off-by: asukaqaq-s <1311722138@qq.com>

Signed-off-by: asukaqaq <1311722138@qq.com>
Comment thread vllm_omni/diffusion/models/interface.py
noise_pred: torch.Tensor,
t: torch.Tensor,
latents: torch.Tensor,
scheduler: Any | None = None,
Collaborator

@wtomin wtomin Mar 13, 2026

A better name is needed for this new argument, for example per_request_scheduler or per_state_scheduler, to differentiate it from self.scheduler. Please also provide a more detailed docstring for this argument. This new argument takes effect only when step-wise execution is enabled, right? Please add some parameter checks.

Maybe in a future PR: a future design document is needed for step-wise execution and continuous batching in diffusion pipelines.

Contributor Author

Updated this API for clarity.

Comment thread vllm_omni/diffusion/worker/utils.py
Comment thread vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py
@asukaqaq-s asukaqaq-s force-pushed the pr/pipeline branch 2 times, most recently from ccb79fd to ff49bde Compare March 15, 2026 16:54
Comment thread docs/design/feature/diffusion_step_execution.md
Comment thread docs/models/supported_models.md Outdated


## List of Supported Models for Step-Execution

Collaborator

I find the Step-Execution section a bit odd here.

I have another docs PR, #1928. Can you check docs/user_guide/diffusion_features.md? I think this section fits better in that doc, presented as a feature for diffusion models.

Contributor Author

Thanks, agreed. I removed the Step-Execution section from docs/models/supported_models.md. For now, the user-facing content is documented in docs/user_guide/diffusion/step_execution.md. After docs PR #1928 is merged, I can rebase and further align it with docs/user_guide/diffusion_features.md if needed.

@wtomin
Collaborator

wtomin commented Mar 18, 2026

@asukaqaq-s There is one bug refactor PR, #1908, that will be merged in just a few days. After it's merged, I think your PR will need to be rebased and its functionality verified again.

@asukaqaq-s asukaqaq-s force-pushed the pr/pipeline branch 3 times, most recently from 164f6dd to 7a16a57 Compare March 18, 2026 16:57
@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Mar 19, 2026
Signed-off-by: asukaqaq-s <1311722138@qq.com>
Signed-off-by: asukaqaq-s <1311722138@qq.com>
Move the step-execution docs into the diffusion feature docs structure, add a user-facing step execution page, and remove the feature-specific section from supported models.

Signed-off-by: asukaqaq-s <1311722138@qq.com>
Signed-off-by: asukaqaq-s <1311722138@qq.com>
@asukaqaq-s
Contributor Author

I resolved the merge conflicts introduced by the rebase, and also fixed the earlier CI failure caused by the missing step_execution field in OmniDiffusionConfig.

Collaborator

@wtomin wtomin left a comment

Since all of my comments were addressed, I will approve this PR.

if current_omni_platform.get_device_count() < world_size:
    pytest.skip(f"Test requires {world_size} devices")

torch.multiprocessing.spawn(
Collaborator

@yenuo26 @congw729 Can you check if we should include this test script in CI?
Currently this test script requires a minimum of 2 GPU devices. It runs simple mocked test cases, so it doesn't download or run large-scale diffusion models.

Collaborator

@asukaqaq-s How long does it take to run this test script on your local machine?

Contributor Author

Around 5~10s for each parallel test on my local machine. The other ones finish almost immediately.

Collaborator

https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_markers/#example-usage-for-markers
The parallel marker means the test is related to a parallel feature. For your test, use:

from tests.utils import hardware_test

@hardware_test(
   res={"cuda": "L4"},
   num_cards=2,
)

@congw729
Collaborator

congw729 commented Mar 20, 2026

May I know the total time cost for your test file? I think adding one extra test step in CI would be suitable for this test @wtomin @yenuo26. Maybe place this test in test-ready.yml if it runs very fast?

@asukaqaq-s
Contributor Author

May I know the total time cost for your test file? I think adding one extra test step in CI would be suitable for this test @wtomin @yenuo26. Maybe place this test in test-ready.yml if it runs very fast?

Thanks! I've replaced @pytest.mark.parallel with @hardware_test(res={"cuda": "L4"}, num_cards=2) on the three distributed tests.

Test durations with --durations=0 (2x L4):

test_execute_stepwise_with_ulysses_parallel: 10.84s
test_execute_stepwise_with_ring_parallel: 11.97s
test_execute_stepwise_with_cfg_parallel: 11.98s
CPU tests: < 1s each
Total: ~36s for all 10 tests.

Signed-off-by: asukaqaq-s <1311722138@qq.com>
Labels

ready label to trigger buildkite CI

7 participants