Merged
2 changes: 2 additions & 0 deletions docs/.nav.yml
@@ -60,6 +60,7 @@ nav:
- FP8: user_guide/diffusion/quantization/fp8.md
- Int8: user_guide/diffusion/quantization/int8.md
- GGUF: user_guide/diffusion/quantization/gguf.md
- Step Execution: user_guide/diffusion/step_execution.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- LoRA: user_guide/diffusion/lora.md
@@ -91,6 +92,7 @@ nav:
- design/feature/cache_dit.md
- design/feature/teacache.md
- design/feature/async_chunk_design.md
- design/feature/diffusion_step_execution.md
- Module Design:
- design/module/ar_module.md
- design/module/dit_module.md
16 changes: 16 additions & 0 deletions docs/contributing/model/adding_diffusion_model.md
@@ -739,6 +739,22 @@ See detailed guide: [How to add Sequence Parallel support](../../design/feature/
omni = Omni(model="your-model", ulysses_degree=2, ring_degree=2)
```

### Step Execution

See detailed design guide: [How to add step execution support](../../design/feature/diffusion_step_execution.md)

Use this only when your pipeline can be split into stable request-scoped and
step-scoped phases. The reference implementation is
`QwenImagePipeline`, which maps its request-level `forward()` into:

1. `prepare_encode()` for prompt encoding, latent init, timestep prep, and per-request scheduler setup.
2. `denoise_step()` for one transformer/noise prediction.
3. `step_scheduler()` for one scheduler update and `step_index` advance.
4. `post_decode()` for the final VAE decode.

Do not enable `step_execution=True` until those four methods are implemented
and validated against the request-level path.

### Cache Acceleration

#### TeaCache
121 changes: 121 additions & 0 deletions docs/design/feature/diffusion_step_execution.md
@@ -0,0 +1,121 @@
# Diffusion Step Execution

This guide documents vLLM-Omni's stepwise diffusion contract for model authors
and contributors implementing `step_execution=True` support for a diffusion
pipeline.

For end-user enablement, supported models, and current limitations, see
[Step Execution](../../user_guide/diffusion/step_execution.md).

## Current Support Scope

`step_execution` is **not** a generic diffusion toggle. It only works for
pipelines that implement the segmented stateful contract in
[`vllm_omni/diffusion/models/interface.py`](gh-file:vllm_omni/diffusion/models/interface.py).

Current in-tree support:

| Pipeline | Example models | Step execution |
|----------|----------------|----------------|
| `QwenImagePipeline` | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | Yes |
| All other diffusion pipelines | `QwenImageEditPipeline`, `QwenImageEditPlusPipeline`, `QwenImageLayeredPipeline`, GLM-Image, Wan, Flux, etc. | No |

Current engine/runtime limitations:

- `StepScheduler` only schedules `batch_size=1`.
- `cache_backend` is not supported in step mode.
- Request-mode extras such as KV transfer are not wired into step mode yet.
- Unsupported pipelines fail early during model loading instead of failing on the first request.

## Execution Contract

Step mode is driven by four pipeline methods plus a shared, mutable request
state object:

- `prepare_encode(state)`: one-time request preparation.
- `denoise_step(state)`: compute the noise prediction for the current step.
- `step_scheduler(state, noise_pred)`: mutate latents and advance step state.
- `post_decode(state)`: decode the final output after denoising is complete.

The state lives in
[`vllm_omni/diffusion/worker/utils.py`](gh-file:vllm_omni/diffusion/worker/utils.py)
as `DiffusionRequestState`. Store request-scoped tensors there, or use
`state.extra` for model-specific fields that do not justify extending the core
dataclass.

The worker-side step loop lives in
[`vllm_omni/diffusion/worker/diffusion_model_runner.py`](gh-file:vllm_omni/diffusion/worker/diffusion_model_runner.py):

1. `prepare_encode()` runs once for a new request.
2. `denoise_step()` runs every scheduler tick.
3. `step_scheduler()` mutates `state.latents` and advances `state.step_index`.
4. `post_decode()` runs exactly once after `state.denoise_completed` becomes true.
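
The loop above can be sketched as follows. This is a minimal illustration, not the code in `diffusion_model_runner.py`: `ToyRequestState`, `ToyPipeline`, and `run_request` are hypothetical names, the state fields are assumptions, and the real runner advances one step per scheduler tick, whereas this sketch collapses the ticks into a single loop.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyRequestState:
    """Hypothetical stand-in for DiffusionRequestState (fields are assumptions)."""

    num_steps: int
    latents: float = 0.0
    step_index: int = 0
    extra: dict[str, Any] = field(default_factory=dict)

    @property
    def denoise_completed(self) -> bool:
        return self.step_index >= self.num_steps


class ToyPipeline:
    """Toy pipeline exposing the four contract methods; records the call order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def prepare_encode(self, state: ToyRequestState) -> None:
        self.calls.append("prepare_encode")
        state.latents = 1.0  # one-time latent init

    def denoise_step(self, state: ToyRequestState) -> float:
        self.calls.append("denoise_step")
        return 0.5 * state.latents  # pretend noise prediction; no state mutation

    def step_scheduler(self, state: ToyRequestState, noise_pred: float) -> None:
        self.calls.append("step_scheduler")
        state.latents -= noise_pred  # the only place latents change
        state.step_index += 1  # the only place the step advances

    def post_decode(self, state: ToyRequestState) -> float:
        self.calls.append("post_decode")
        return state.latents


def run_request(pipeline: ToyPipeline, state: ToyRequestState) -> float:
    """Drive one request through the four phases in contract order."""
    pipeline.prepare_encode(state)
    while not state.denoise_completed:
        noise_pred = pipeline.denoise_step(state)
        pipeline.step_scheduler(state, noise_pred)
    return pipeline.post_decode(state)
```

Calling `run_request(ToyPipeline(), ToyRequestState(num_steps=3))` records one `prepare_encode`, three `denoise_step`/`step_scheduler` pairs, and one `post_decode`, which is the ordering the contract requires.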

## Recommended Split

When converting an existing request-level `forward()` pipeline, keep the split
strict and mechanical:

| Request-level phase | Stepwise method | What belongs there |
|---------------------|-----------------|--------------------|
| Input validation, prompt encoding, latent init, timestep prep, per-request scheduler creation | `prepare_encode()` | Anything that should happen once per request |
| Transformer forward / noise prediction | `denoise_step()` | Pure denoise computation for the current timestep |
| `scheduler.step(...)` and `step_index += 1` | `step_scheduler()` | Only latent/state mutation for one step |
| VAE decode / postprocess | `post_decode()` | Final decode only |

Where possible, have the stepwise path reuse the same helpers as the
request-level path. Reimplementing the denoise loop from scratch is the easiest
way to introduce behavioral drift.
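
As a rough illustration of that split, the toy pipeline below routes both `forward()` and the four stepwise methods through the same private helpers. Every name except the four contract methods and `forward()` is hypothetical (`ToyState`, `ToySplitPipeline`, `_prepare_context`, `_predict_noise`, `_scheduler_step`), and the arithmetic is a placeholder for real model code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyState:
    """Hypothetical request state; the real DiffusionRequestState differs."""

    prompt: str
    num_steps: int
    latents: list[float] = field(default_factory=list)
    timesteps: list[float] = field(default_factory=list)
    step_index: int = 0
    extra: dict[str, Any] = field(default_factory=dict)


class ToySplitPipeline:
    # --- Shared helpers: the single source of truth for both paths ---
    def _prepare_context(self, prompt: str, num_steps: int):
        latents = [float(len(prompt)), 1.0]  # stand-in for encoding + latent init
        timesteps = [1.0 - i / num_steps for i in range(num_steps)]
        return latents, timesteps

    def _predict_noise(self, latents: list[float], t: float) -> list[float]:
        return [0.1 * t * x for x in latents]  # stand-in for the transformer

    def _scheduler_step(self, latents: list[float], noise: list[float]) -> list[float]:
        return [x - n for x, n in zip(latents, noise)]

    def _decode_latents(self, latents: list[float]) -> list[float]:
        return [round(x, 6) for x in latents]  # stand-in for the VAE decode

    # --- Request-level path ---
    def forward(self, prompt: str, num_steps: int) -> list[float]:
        latents, timesteps = self._prepare_context(prompt, num_steps)
        for t in timesteps:
            latents = self._scheduler_step(latents, self._predict_noise(latents, t))
        return self._decode_latents(latents)

    # --- Stepwise path: thin wrappers over the same helpers ---
    def prepare_encode(self, state: ToyState) -> None:
        state.latents, state.timesteps = self._prepare_context(
            state.prompt, state.num_steps
        )

    def denoise_step(self, state: ToyState) -> list[float]:
        return self._predict_noise(state.latents, state.timesteps[state.step_index])

    def step_scheduler(self, state: ToyState, noise_pred: list[float]) -> None:
        state.latents = self._scheduler_step(state.latents, noise_pred)
        state.step_index += 1

    def post_decode(self, state: ToyState) -> list[float]:
        return self._decode_latents(state.latents)
```

Because each stepwise method only forwards to a helper that `forward()` also calls, any later change to the denoise math lands in both paths at once.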

## Qwen-Image Reference

[`pipeline_qwen_image.py`](gh-file:vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py)
is the reference implementation and is split correctly for the current
contract:

- `prepare_encode()` reuses `_prepare_generation_context()` so prompt encoding,
latent init, timestep creation, CFG setup, and shape bookkeeping stay aligned
with `forward()`.
- `prepare_encode()` deep-copies `self.scheduler` **after**
`prepare_timesteps()` so request-specific scheduler state is isolated.
- `denoise_step()` reuses `_build_denoise_kwargs()` plus
`predict_noise_maybe_with_cfg()`, so sequential CFG, CFG-parallel, and
non-CFG behavior stay identical to the request-level path.
- `step_scheduler()` only calls
`scheduler_step_maybe_with_cfg(..., per_request_scheduler=state.scheduler)`
and increments `state.step_index`.
- `post_decode()` reuses `_decode_latents()`, so the final image decode matches
the normal `forward()` path.

That decomposition is the target pattern for future models.

## Rules For New Pipelines

- Do not keep request-scoped scheduler state on `self.scheduler`. Copy it into
`state.scheduler` during `prepare_encode()`.
- Do not mutate `state.step_index` inside `denoise_step()`. Only
`step_scheduler()` should advance the step.
- Do not decode partial outputs in `denoise_step()` or `step_scheduler()`.
- If the request-level pipeline has condition latents, masks, or edit-specific
tensors, store them in `state` or `state.extra`, not in global pipeline
attributes.
- Preserve CFG behavior by sharing the same helper path used by `forward()`.
- Keep `post_decode()` equivalent to the tail of `forward()`.
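
The first rule is the one most easily broken, so here is a small sketch of why the copy matters. `ToyScheduler` and `ToyPipeline` are hypothetical stand-ins (the `set_timesteps`/`step` methods are assumptions for illustration, not the vLLM-Omni scheduler API), and `SimpleNamespace` replaces the real request state.

```python
import copy
from types import SimpleNamespace


class ToyScheduler:
    """Hypothetical stateful scheduler with per-request timesteps and a counter."""

    def __init__(self) -> None:
        self.timesteps: list[int] = []
        self._step = 0

    def set_timesteps(self, num_steps: int) -> None:
        self.timesteps = list(range(num_steps, 0, -1))
        self._step = 0

    def step(self, latents: float, noise_pred: float) -> float:
        self._step += 1
        return latents - noise_pred


class ToyPipeline:
    def __init__(self) -> None:
        self.scheduler = ToyScheduler()  # shared template, never stepped directly

    def prepare_encode(self, state) -> None:
        # Configure the template, then snapshot it so this request owns its copy.
        self.scheduler.set_timesteps(state.num_steps)
        state.scheduler = copy.deepcopy(self.scheduler)
        state.latents = 1.0
        state.step_index = 0

    def step_scheduler(self, state, noise_pred: float) -> None:
        # Step the per-request scheduler, not self.scheduler.
        state.latents = state.scheduler.step(state.latents, noise_pred)
        state.step_index += 1


pipe = ToyPipeline()
req_a = SimpleNamespace(num_steps=4)
req_b = SimpleNamespace(num_steps=2)

pipe.prepare_encode(req_a)
pipe.prepare_encode(req_b)  # reconfigures the template; req_a must be unaffected
pipe.step_scheduler(req_a, 0.1)
```

After this sequence `req_a.scheduler` still holds its own four timesteps and has advanced once, while `req_b.scheduler` and the shared template have not advanced at all. Without the `deepcopy`, the second `prepare_encode()` would have overwritten the timesteps that `req_a` was about to use.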

## Validation Checklist

Before marking a pipeline as `supports_step_execution = True`, verify:

- Stepwise output matches request-level output for the same seed and sampling params.
- Per-request scheduler state is isolated across concurrent requests.
- Abort during denoise does not leak cached state.
- `step_index` reported by `RunnerOutput` matches the scheduler progress.
- CFG-parallel and non-CFG paths both work if the request-level pipeline supports them.
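
The first checklist item can be automated with a small parity harness. The sketch below is an assumption about how such a check might look, not an existing vLLM-Omni test utility: `check_parity`, `max_abs_diff`, and the two toy runners are hypothetical, and in a real test the runners would call the pipeline with and without `step_execution=True`.

```python
import random
from typing import Callable, Dict, Iterable, Sequence

# A runner maps a seed to a flat sequence of output values.
Runner = Callable[[int], Sequence[float]]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def check_parity(
    request_level: Runner,
    stepwise: Runner,
    seeds: Iterable[int],
    tol: float = 1e-6,
) -> Dict[int, float]:
    """Return seed -> max abs difference; raise if any seed exceeds tol."""
    diffs: Dict[int, float] = {}
    for seed in seeds:
        diff = max_abs_diff(request_level(seed), stepwise(seed))
        if diff > tol:
            raise AssertionError(f"seed {seed}: stepwise output drifted by {diff}")
        diffs[seed] = diff
    return diffs


def toy_request_level(seed: int, num_steps: int = 8) -> list:
    """Toy stand-in for the request-level forward() path."""
    rng = random.Random(seed)
    latents = [rng.gauss(0.0, 1.0) for _ in range(4)]
    for _ in range(num_steps):
        latents = [0.9 * x for x in latents]
    return latents


def toy_stepwise(seed: int, num_steps: int = 8) -> list:
    """Toy stand-in for the stepwise path, driven by an explicit step index."""
    rng = random.Random(seed)
    latents = [rng.gauss(0.0, 1.0) for _ in range(4)]
    step_index = 0
    while step_index < num_steps:
        latents = [0.9 * x for x in latents]
        step_index += 1
    return latents


diffs = check_parity(toy_request_level, toy_stepwise, seeds=range(3))
```

Running the check over several seeds, rather than one, makes it less likely that a drift is hidden by a lucky initial latent. The tolerance is a judgment call; real pipelines running in reduced precision may need a looser bound than the default used here.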

## Related Files

- Contract: [`vllm_omni/diffusion/models/interface.py`](gh-file:vllm_omni/diffusion/models/interface.py)
- State: [`vllm_omni/diffusion/worker/utils.py`](gh-file:vllm_omni/diffusion/worker/utils.py)
- Runner loop: [`vllm_omni/diffusion/worker/diffusion_model_runner.py`](gh-file:vllm_omni/diffusion/worker/diffusion_model_runner.py)
- Scheduler transport: [`vllm_omni/diffusion/sched/interface.py`](gh-file:vllm_omni/diffusion/sched/interface.py)
- Reference pipeline: [`vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py`](gh-file:vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py)
61 changes: 61 additions & 0 deletions docs/user_guide/diffusion/step_execution.md
@@ -0,0 +1,61 @@
# Step Execution

Step execution is an opt-in diffusion execution mode enabled with
`step_execution=True` when constructing `Omni`.

It is not a generic diffusion toggle for every pipeline. Only pipelines that
implement the stepwise contract support it today.

## Quick Start

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Qwen/Qwen-Image",
    step_execution=True,
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(
        num_inference_steps=50,
    ),
)
```

## Supported Pipelines

| Pipeline | Example models | Step execution |
|----------|----------------|----------------|
| `QwenImagePipeline` | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | Yes |
| All other diffusion pipelines | `QwenImageEditPipeline`, `QwenImageEditPlusPipeline`, `QwenImageLayeredPipeline`, GLM-Image, Wan, Flux, etc. | No |

## Current Limitations

- `step_execution` currently supports `batch_size=1` only.
- `cache_backend` is not supported together with step execution.
- Unsupported pipelines fail early during model loading.
- Request-mode extras such as KV transfer are not wired into step mode yet.

## When To Use It

Enable step execution only when your workflow needs the pipeline to run through
its stepwise request state machine. For ordinary diffusion inference, leave it
disabled.

If you are looking for general diffusion speedups, see
[Diffusion Acceleration Overview](../diffusion_acceleration.md).

## Troubleshooting

If model loading fails with a message mentioning `prepare_encode()`,
`denoise_step()`, `step_scheduler()`, and `post_decode()`, the selected
pipeline does not support step execution.

## For Model Authors

If you want to add step execution support to a new diffusion pipeline, see the
implementation guide:
[Diffusion Step Execution Design](../../design/feature/diffusion_step_execution.md).