[feat] Streaming diffusion video generation output by fhfuih · Pull Request #3737 · vllm-project/vllm-omni

fhfuih · 2026-05-19T09:34:04Z

Purpose

#3632 Phase 1

Gradio example

examples/online_serving/streaming_video_generation/gradio_demo.py

output.mp4

Test Plan

Unit test
- Engine / scheduler (step-execution streaming and chunk delivery) (should be auto-captured by existing ready (L1) CI yaml)
  - tests/diffusion/test_diffusion_model_runner.py: tests that DiffusionModelRunner.execute_stepwise() returns no result on non-decode steps and emits streaming DiffusionOutput chunks at chunk boundaries.
  - tests/diffusion/test_diffusion_scheduler.py: tests StepScheduler request lifecycle, batching, LoRA-compatible batching, incompatible sampling separation, step-count priority, invalid initial step state rejection, streaming chunk notification, empty/aborted streaming completion notification, and selecting StepScheduler when step_execution=True.
  - tests/diffusion/test_diffusion_engine_cleanup.py: tests that closing the diffusion engine completes pending streaming waiters with a terminal error output.
  - tests/diffusion/test_multiproc_engine_concurrency.py: tests that multiproc step execution allows streaming output mode and returns the worker RunnerOutput, while preserving existing normal RPC behavior.
  - tests/diffusion/test_stage_diffusion_proc.py: tests StageDiffusionProc yielding every streaming engine chunk with request metadata preserved.
  - tests/diffusion/test_inline_stage_diffusion_client.py: tests inline stage client delivery of multiple streaming diffusion chunks.
  - tests/engine/test_orchestrator.py: tests that the orchestrator forwards intermediate diffusion streaming chunks before the final output.
- Entrypoint (should be auto-captured by existing ready (L1) CI yaml)
  - tests/entrypoints/test_async_omni.py: tests that AsyncOmni.generate() yields intermediate streaming diffusion chunks before the final chunk.
  - tests/entrypoints/openai_api/test_serving_video_output_stream.py: adds WebSocket /v1/videos/stream protocol tests for video.start, binary chunks, session.done, invalid format, final encoder delta, and generation errors.
  - tests/entrypoints/openai_api/test_video_api_utils.py: (testing common video utility functions used in video related endpoints) adds fragmented MP4 streaming encoder/finalization tests.
Integration test (should be auto-captured by existing ready (L1) CI yaml)
- tests/diffusion/test_diffusion_streaming_output.py: adds mock pipeline streaming integration coverage through ZMQ StageDiffusionClient, inline stage -> orchestrator -> AsyncOmni, and /v1/videos/stream; also tests midway pipeline errors, Helios step-execution streaming support, and rejecting unsupported pipelines.
E2E (Smoke) test (Added in L4 .buildkite/test-nightly.yml)
- tests/e2e/accuracy/test_video_streaming_output_similarity.py: adds Helios full-model smoke test comparing streaming vs non-streaming video output with matching metadata plus SSIM/PSNR thresholds.
- tests/e2e/accuracy/helpers.py: adds shared ffmpeg/ffprobe video similarity helpers; Wan2.2 I2V test is refactored to reuse them.
- tests/helpers/runtime.py: adds an OpenAI client helper for native /v1/videos/stream WebSocket requests used by the E2E smoke test.
The following unit test files are edited only to add the new od_config field.

M       tests/diffusion/test_diffusion_model_runner.py
M       tests/diffusion/test_diffusion_scheduler.py
M       tests/diffusion/test_diffusion_step_pipeline.py
M       tests/diffusion/test_inline_stage_diffusion_client.py
M       tests/diffusion/test_multiproc_engine_concurrency.py
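The E2E smoke test above compares streaming and non-streaming output against SSIM/PSNR thresholds. As a reminder of what a PSNR threshold measures, here is a minimal pure-Python sketch for two 8-bit frames given as flat lists of pixel values. It is illustrative only: the real helpers in tests/e2e/accuracy/helpers.py are ffmpeg/ffprobe based, and the `psnr` function and the sample frames below are hypothetical.

```python
import math


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel frames, standing in for decoded video frames.
reference = [10, 20, 30, 40]
streamed = [11, 19, 30, 42]
print(f"PSNR: {psnr(reference, streamed):.2f} dB")
```

A higher PSNR means the streamed frame is closer to the reference, so a test of this kind asserts that the value stays above a chosen threshold.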

Test Result

Unit and integration test

> pytest -sv tests -m 'core_model and cpu'

passed

Smoke test

> pytest -s -v tests/e2e/accuracy/test_video_streaming_output_similarity.py -m full_model --run-level full_model

1 passed

Performance

See #3737 (comment) below. The performance test has not been added to CI.


hsliuustc0106

BLOCKING:

Test Coverage — The test results section says "To be updated" for unit/integration tests, and smoke test/E2E are marked "Todo". Please paste actual test outputs and add at least one smoke test showing streaming video chunks are emitted correctly.

VERDICT: REQUEST_CHANGES

fhfuih · 2026-05-22T01:38:10Z

Performance Check (currently not on CI)

TLDR

For Helios, it seems that:

- The first chunk takes the most time.
- The first request's first-chunk forward is slower than that of subsequent requests, and the current dummy (warm-up) run cannot mitigate this delay.

As a result:

- The time to stream the first chunk (analogous to "TTFT") is longer than the time to stream later chunks.
- The "TTFT" of the first request is longer than the "TTFT" of later requests.

Note:

Streaming mode, first chunk: roughly 2.5-3 s
Non-streaming mode, entire forward: ~9.9 s (first request), roughly 2.5-3 s (subsequent requests)

Test

Serve with vllm-omni serve BestWishYsh/Helios-Distilled --omni --streaming-output --enable-diffusion-pipeline-profiler (or without --streaming-output for non-streaming mode)
Client request with python examples/online_serving/streaming_video_generation/streaming_video_client.py --port 8000 --num-inference-steps 50 --num-frames 200 (or the equivalent in non-streaming mode)

An important request parameter for speeding up generation (already the default in both examples and noted in the README) is pyramid_num_inference_steps_list, set to [1,1,1]. Only with this setting can we reach 19+ fps.

Streaming, first time

Step execution currently cannot log step-wise diffusion latency. Until that feature lands in a separate PR, I have to "guess" the latency from the logger timestamps.

Server side: From the end of text encoder to the last chunk's VAE decode: 13s. Token-To-First-Chunk: 3s

INFO 05-26 09:54:36 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.text_encoder.forward took 0.021119s
INFO 05-26 09:54:39 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.712498s
INFO 05-26 09:54:40 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.703715s
INFO 05-26 09:54:42 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.701766s
INFO 05-26 09:54:44 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.703230s
INFO 05-26 09:54:46 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.701384s
INFO 05-26 09:54:48 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.708978s
INFO 05-26 09:54:49 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.707673s
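The "guess from logger timestamps" above can be reproduced with a small script. This is only a sketch: it assumes the log format shown above, works at the logger's one-second resolution, and assumes the run does not cross midnight. The names `stage_times`, `LOG`, and `PREFIX` are made up for this example.

```python
import re
from datetime import datetime

PREFIX = "[diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline."
# Server log lines copied from the first streaming run above.
LOG = f"""\
INFO 05-26 09:54:36 {PREFIX}text_encoder.forward took 0.021119s
INFO 05-26 09:54:39 {PREFIX}vae.decode took 0.712498s
INFO 05-26 09:54:40 {PREFIX}vae.decode took 0.703715s
INFO 05-26 09:54:42 {PREFIX}vae.decode took 0.701766s
INFO 05-26 09:54:44 {PREFIX}vae.decode took 0.703230s
INFO 05-26 09:54:46 {PREFIX}vae.decode took 0.701384s
INFO 05-26 09:54:48 {PREFIX}vae.decode took 0.708978s
INFO 05-26 09:54:49 {PREFIX}vae.decode took 0.707673s
"""

LINE_RE = re.compile(r"INFO \d\d-\d\d (\d\d:\d\d:\d\d) .*HeliosPipeline\.(\S+) took")


def stage_times(log):
    """Return (timestamp, stage_name) pairs for every profiler line."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


events = stage_times(LOG)
encoder_done = next(t for t, stage in events if stage == "text_encoder.forward")
decodes = [t for t, stage in events if stage == "vae.decode"]

first_chunk_s = (decodes[0] - encoder_done).total_seconds()
last_chunk_s = (decodes[-1] - encoder_done).total_seconds()
print(f"text encoder -> first VAE decode: {first_chunk_s:.0f}s")
print(f"text encoder -> last VAE decode:  {last_chunk_s:.0f}s")
```

On the lines above this gives the same 3 s and 13 s figures quoted for the server side.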

Client side: Similar. e2e is around 13s. Input-To-First-Chunk: 3.18s

Video session started: request_id=video_stream-225fb4bdd4674d638f328f034c8e835e format=m4s
[chunk 001] bytes=1170479 frames=32 elapsed=3.18s total_bytes=1170479 total_frames=32 total_elapsed=3.18s
[chunk 002] bytes=1126176 frames=33 elapsed=1.70s total_bytes=2296655 total_frames=65 total_elapsed=4.89s
[chunk 003] bytes=1058700 frames=33 elapsed=1.76s total_bytes=3355355 total_frames=98 total_elapsed=6.65s
[chunk 004] bytes=979011 frames=33 elapsed=1.77s total_bytes=4334366 total_frames=131 total_elapsed=8.42s
[chunk 005] bytes=975686 frames=33 elapsed=1.76s total_bytes=5310052 total_frames=164 total_elapsed=10.18s
[chunk 006] bytes=945379 frames=33 elapsed=1.79s total_bytes=6255431 total_frames=197 total_elapsed=11.97s
[chunk 007] bytes=980796 frames=33 elapsed=1.77s total_bytes=7236227 total_frames=230 total_elapsed=13.74s
[chunk 008] bytes=35285 frames=1 elapsed=0.10s total_bytes=7271512 total_frames=231 total_elapsed=13.84s
Session complete: {"type": "session.done", "request_id": "video_stream-225fb4bdd4674d638f328f034c8e835e", "chunks": 8, "stopped": false}
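For readers unfamiliar with the message flow, here is a transport-agnostic sketch of how a client might consume such a session: a `video.start` event, a run of binary chunks, then `session.done`. The event names and the `request_id`, `chunks`, and `stopped` fields come from the test plan and the log above; everything else is an assumption, including that `video.start` is a JSON text message with a `type` field, that `chunks` counts the binary messages, and the `consume_video_stream` helper itself. It is not the example client shipped in this PR.

```python
import json


def consume_video_stream(messages):
    """Collect binary chunks from an iterable of WebSocket-style messages.

    Text messages are assumed to be JSON events; bytes are media chunks.
    Returns (concatenated_bytes, session_done_event).
    """
    chunks = []
    started = False
    for message in messages:
        if isinstance(message, (bytes, bytearray)):
            if not started:
                raise RuntimeError("binary chunk received before video.start")
            chunks.append(bytes(message))
            continue
        event = json.loads(message)
        kind = event.get("type")
        if kind == "video.start":
            started = True
        elif kind == "session.done":
            # Assumption: "chunks" reports how many binary messages were sent.
            if event.get("chunks") != len(chunks):
                raise RuntimeError("chunk count mismatch")
            return b"".join(chunks), event
        else:
            raise RuntimeError(f"unexpected event: {event!r}")
    raise RuntimeError("stream ended without session.done")


# A fake two-chunk session standing in for real WebSocket traffic.
fake_session = [
    json.dumps({"type": "video.start", "request_id": "demo", "format": "m4s"}),
    b"\x00\x01",
    b"\x02",
    json.dumps({"type": "session.done", "request_id": "demo", "chunks": 2, "stopped": False}),
]
video_bytes, done_event = consume_video_stream(fake_session)
print(len(video_bytes), done_event["stopped"])
```

Keeping the handler independent of the WebSocket library makes it easy to unit test with plain lists, as done here.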

Streaming (subsequent run)

**TLDR: the same as the first run**

INFO 05-26 09:55:40 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.text_encoder.forward took 0.021588s
INFO 05-26 09:55:42 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.697072s
INFO 05-26 09:55:44 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.701112s
INFO 05-26 09:55:46 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.703722s
INFO 05-26 09:55:47 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.701735s
INFO 05-26 09:55:49 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.706170s
INFO 05-26 09:55:51 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.703662s
INFO 05-26 09:55:53 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.703435s

Video session started: request_id=video_stream-c5dd3606576c442d94007e7be1d62ad3 format=m4s
[chunk 001] bytes=1170479 frames=32 elapsed=3.02s total_bytes=1170479 total_frames=32 total_elapsed=3.02s
[chunk 002] bytes=1126176 frames=33 elapsed=1.75s total_bytes=2296655 total_frames=65 total_elapsed=4.77s
[chunk 003] bytes=1058700 frames=33 elapsed=1.76s total_bytes=3355355 total_frames=98 total_elapsed=6.53s
[chunk 004] bytes=979011 frames=33 elapsed=1.76s total_bytes=4334366 total_frames=131 total_elapsed=8.30s
[chunk 005] bytes=975686 frames=33 elapsed=1.78s total_bytes=5310052 total_frames=164 total_elapsed=10.08s
[chunk 006] bytes=945379 frames=33 elapsed=1.76s total_bytes=6255431 total_frames=197 total_elapsed=11.84s
[chunk 007] bytes=980796 frames=33 elapsed=1.76s total_bytes=7236227 total_frames=230 total_elapsed=13.60s
[chunk 008] bytes=35285 frames=1 elapsed=0.09s total_bytes=7271512 total_frames=231 total_elapsed=13.69s
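To turn the client log into comparable numbers (time to first chunk, steady-state fps), a short parser is enough. This is a sketch that assumes the exact line format printed by the example client above; `parse_chunks` and the variable names are made up for illustration.

```python
import re

# Client log lines copied from the subsequent streaming run above.
CLIENT_LOG = """\
[chunk 001] bytes=1170479 frames=32 elapsed=3.02s total_bytes=1170479 total_frames=32 total_elapsed=3.02s
[chunk 002] bytes=1126176 frames=33 elapsed=1.75s total_bytes=2296655 total_frames=65 total_elapsed=4.77s
[chunk 003] bytes=1058700 frames=33 elapsed=1.76s total_bytes=3355355 total_frames=98 total_elapsed=6.53s
[chunk 004] bytes=979011 frames=33 elapsed=1.76s total_bytes=4334366 total_frames=131 total_elapsed=8.30s
[chunk 005] bytes=975686 frames=33 elapsed=1.78s total_bytes=5310052 total_frames=164 total_elapsed=10.08s
[chunk 006] bytes=945379 frames=33 elapsed=1.76s total_bytes=6255431 total_frames=197 total_elapsed=11.84s
[chunk 007] bytes=980796 frames=33 elapsed=1.76s total_bytes=7236227 total_frames=230 total_elapsed=13.60s
[chunk 008] bytes=35285 frames=1 elapsed=0.09s total_bytes=7271512 total_frames=231 total_elapsed=13.69s
"""

CHUNK_RE = re.compile(r"\[chunk (\d+)\] bytes=(\d+) frames=(\d+) elapsed=([\d.]+)s")


def parse_chunks(log):
    """Return (frames, elapsed_seconds) for each chunk line in the log."""
    return [(int(m.group(3)), float(m.group(4))) for m in CHUNK_RE.finditer(log)]


chunks = parse_chunks(CLIENT_LOG)
time_to_first_chunk = chunks[0][1]

# Steady state: skip the slower first chunk and the 1-frame tail chunk.
steady = chunks[1:-1]
steady_fps = sum(f for f, _ in steady) / sum(t for _, t in steady)
overall_fps = sum(f for f, _ in chunks) / sum(t for _, t in chunks)

print(f"time to first chunk: {time_to_first_chunk:.2f}s")
print(f"steady-state fps:    {steady_fps:.2f}")
print(f"overall fps:         {overall_fps:.2f}")
```

Separating the first chunk from the steady-state chunks makes the "first chunk takes the most time" observation explicit instead of averaging it away.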

Non streaming (baseline, first run)

INFO 05-26 10:01:24 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.text_encoder.forward took 0.016749s
INFO 05-26 10:01:34 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.711054s
INFO 05-26 10:01:34 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.forward took 9.821887s
(APIServer pid=3841887) INFO 05-26 10:01:34 [serving_video.py:257] Video response encoding (MP4 bytes): 186.37 ms

Non streaming (baseline, subsequent run)

INFO 05-26 10:02:08 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.text_encoder.forward took 0.018973s
INFO 05-26 10:02:10 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.vae.decode took 0.698272s
INFO 05-26 10:02:10 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] HeliosPipeline.forward took 2.520728s
(APIServer pid=3841887) INFO 05-26 10:02:11 [serving_video.py:257] Video response encoding (MP4 bytes): 176.82 ms

Why non-streaming mode has different performance across two runs

For Helios-Distilled, the dummy (warm-up) run uses num_steps=1, and HeliosScheduler.set_timesteps() then trims the last timestep, so the dummy run ends up with len(timesteps) == 0. _stage1_sample() is entered with progress_bar(total=0) and never calls scheduler.step(...), so the warm-up never actually completes.

This behavior is fixed in step-execution mode (otherwise that mode would fail to run at all). The non-streaming implementation is intentionally left unchanged in this PR, both to minimize unnecessary changes and to allow an accuracy comparison between the two modes.
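The effect is easy to reproduce with a toy model. The sketch below is not the real HeliosScheduler or _stage1_sample(); `toy_set_timesteps` and `toy_sample` are hypothetical stand-ins that only mimic "build the schedule, trim the last timestep, loop over what is left".

```python
def toy_set_timesteps(num_steps):
    """Toy schedule: build num_steps timesteps, then trim the last one."""
    timesteps = [1000.0 * (1 - i / num_steps) for i in range(num_steps)]
    return timesteps[:-1]  # the trimming step described above


def toy_sample(timesteps):
    """Count how many times the per-step update would run."""
    step_calls = 0
    for _ in timesteps:
        step_calls += 1  # stands in for scheduler.step(...)
    return step_calls


warmup_steps = toy_sample(toy_set_timesteps(1))   # dummy run with num_steps=1
regular_steps = toy_sample(toy_set_timesteps(4))  # any run with more steps
print(warmup_steps, regular_steps)
```

With a single requested step the trimmed schedule is empty, so the loop body never executes and nothing in the denoising path gets warmed up.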

chatgpt-codex-connector · 2026-05-22T09:36:34Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copilot

Pull request overview

Adds diffusion video streaming output support across the diffusion pipeline/executor/engine stack and exposes a new OpenAI-style WebSocket endpoint (/v1/videos/stream) that streams fragmented MP4 bytes as generation proceeds.

Changes:

Introduces streaming_output mode for diffusion requests, propagating chunked DiffusionOutput with finished semantics across workers, executors, engine, and orchestrator.
Adds a WebSocket video output streaming endpoint plus incremental fMP4 encoding/finalization utilities.
Expands unit/integration/e2e coverage for streamed chunk forwarding and streaming-vs-non-streaming similarity.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 4 comments.

File	Description
vllm_omni/outputs.py	Plumbs `finished` flag into diffusion outputs.
vllm_omni/entrypoints/openai/video_api_utils.py	Adds fMP4 streaming encoder + finalization helpers.
vllm_omni/entrypoints/openai/serving_video_output_stream.py	Implements `/v1/videos/stream` WebSocket handler.
vllm_omni/entrypoints/openai/api_server.py	Wires streaming video output handler into API server.
vllm_omni/entrypoints/omni_base.py	Forwards `finished` to request outputs.
vllm_omni/entrypoints/cli/serve.py	Adds `--streaming-output` CLI flag.
vllm_omni/engine/async_omni_engine.py	Passes `streaming_output` into diffusion stage config.
vllm_omni/diffusion/worker/diffusion_worker.py	Streams chunked outputs from workers when enabled.
vllm_omni/diffusion/worker/diffusion_model_runner.py	Supports pipeline generators for streaming output.
vllm_omni/diffusion/utils/media_utils.py	Adds incremental fragmented MP4 muxer + remux finalizer.
vllm_omni/diffusion/stage_diffusion_proc.py	Streams multiple ZMQ result envelopes per request.
vllm_omni/diffusion/sched/request_scheduler.py	Uses `finished` to decide request completion.
vllm_omni/diffusion/profiler/diffusion_pipeline_profiler.py	Adds generator-aware profiling wrapper.
vllm_omni/diffusion/models/interface.py	Defines streaming-output pipeline protocol.
vllm_omni/diffusion/models/helios/pipeline_helios.py	Implements chunk-yielding Helios forward path.
vllm_omni/diffusion/inline_stage_diffusion_client.py	Supports streaming chunk delivery inline.
vllm_omni/diffusion/executor/multiproc_executor.py	Adds `execute_streaming_request` generator RPC path.
vllm_omni/diffusion/executor/abstract.py	Adds streaming execution abstract method.
vllm_omni/diffusion/diffusion_engine.py	Adds async streaming step + streaming output queues.
vllm_omni/diffusion/data.py	Adds `streaming_output` config + streaming fields on outputs.
tests/helpers/runtime.py	Adds WS client helper for `/v1/videos/stream`.
tests/helpers/assertions.py	Adjusts Helios frame-count assertions.
tests/entrypoints/test_async_omni.py	Tests AsyncOmni yields intermediate diffusion chunks.
tests/entrypoints/openai_api/test_video_server.py	Updates extra_params expectations for Helios presets.
tests/entrypoints/openai_api/test_video_api_utils.py	Tests fMP4 encoder + finalization utilities.
tests/entrypoints/openai_api/test_serving_video_output_stream.py	Adds WebSocket protocol/unit tests for streaming.
tests/engine/test_orchestrator.py	Tests orchestrator forwards intermediate diffusion chunks.
tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py	Refactors to shared video similarity helpers.
tests/e2e/accuracy/test_video_streaming_output_similarity.py	Adds streaming vs non-streaming video similarity smoke test.
tests/e2e/accuracy/test_diffusers_backend_similarity.py	Imports shared ffmpeg similarity helper.
tests/e2e/accuracy/helpers.py	Adds ffmpeg/ffprobe-based video similarity helpers.
tests/diffusion/test_stage_diffusion_proc.py	Tests StageDiffusionProc yields every streaming chunk.
tests/diffusion/test_multiproc_engine_concurrency.py	Adds multiproc streaming behavior tests.
tests/diffusion/test_inline_stage_diffusion_client.py	Tests inline client streaming chunk delivery.
tests/diffusion/test_diffusion_streaming_output.py	Adds end-to-end streaming integration tests.
tests/diffusion/test_diffusion_step_pipeline.py	Adjusts configs/tests for new streaming flag.
tests/diffusion/test_diffusion_scheduler.py	Tests scheduler/engine streaming completion semantics.
tests/diffusion/test_diffusion_model_runner.py	Tests model runner forwards streaming generator outputs.
tests/diffusion/test_diffusion_engine_cleanup.py	Tests engine close completes streaming waiters.
pyproject.toml	Adds `websockets` dev dependency.
examples/online_serving/streaming_video_generation/video-stream-view.js	Browser MSE player for streamed fMP4 chunks.
examples/online_serving/streaming_video_generation/video-stream-view.html	HTML view for browser streaming player.
examples/online_serving/streaming_video_generation/streaming_video_client.py	CLI WS client example for streaming endpoint.
examples/online_serving/streaming_video_generation/README.md	Documents endpoint protocol + usage.
examples/online_serving/streaming_video_generation/gradio_demo.py	Gradio demo for browser-based streaming playback.
.buildkite/test-nightly.yml	Adds nightly e2e streaming similarity test.

hsliuustc0106 · 2026-05-22T09:44:51Z

> First forward chunk, first request: 10.433816s

Does the latency of the first forward chunk come from the warm-up overhead?

fhfuih · 2026-05-22T09:46:16Z

> First forward chunk, first request: 10.433816s
>
> does the latency of the first forward chunk come from the warmup overhead?

Yes, I think so.

hsliuustc0106

Left inline comments on the streaming timeout, streaming/RPC queue interaction, and streaming exec-time metric.

hsliuustc0106 · 2026-05-22T09:50:40Z

+                                for req_id in sched_output.scheduled_req_ids
+                            ]
+                        )
+                        met_error = True


In streaming mode this drains generic RPCs between chunks while the active execute_model generator is still running. With the multiproc executor, a queued RPC will send another worker RPC and wait for one result on the same result queue that is also carrying streaming chunks. Since the worker cannot process the new RPC until the generator finishes, this can block streaming or let the RPC path consume a video chunk as its result. I think we should avoid processing generic RPCs during an active streaming request, or separate/correlate streaming replies from RPC replies.

The current version should have resolved this issue, since it requires the step-execution scheduler in order to enable streaming output. In this mode, both the diffusion engine and the worker already handle RPC requests and interleaving between two denoise steps, which is a finer granularity than between video chunks.
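To make the hazard in the review comment concrete, here is a minimal single-threaded sketch of a result queue shared by streaming chunks and RPC replies, followed by one possible way to correlate replies. It is not the multiproc executor's code: `ReplyRouter`, the keys, and the payload strings are all hypothetical, and the reworked PR addresses the issue through step execution rather than this routing scheme.

```python
import queue

# Hazard: one FIFO carries both streaming chunks and RPC replies.
shared_results = queue.Queue()
shared_results.put("video-chunk-0")            # worker emits a streaming chunk
misrouted_reply = shared_results.get_nowait()  # RPC caller reads "its" result
# The RPC caller has just consumed a video chunk instead of an RPC reply.


class ReplyRouter:
    """Route each reply to a per-key queue instead of one shared FIFO."""

    def __init__(self):
        self._queues = {}

    def put(self, key, payload):
        self._queues.setdefault(key, queue.Queue()).put(payload)

    def get(self, key):
        # Raises queue.Empty if nothing has arrived for this key yet.
        return self._queues.setdefault(key, queue.Queue()).get_nowait()


router = ReplyRouter()
router.put(("stream", "req-1"), "video-chunk-0")
router.put(("rpc", 7), "rpc-result")

rpc_reply = router.get(("rpc", 7))               # only sees RPC replies
stream_chunk = router.get(("stream", "req-1"))   # only sees its own chunks
print(misrouted_reply, rpc_reply, stream_chunk)
```

Tagging replies by kind and request id is the "separate/correlate" option from the review; the other option, not processing generic RPCs while a stream is active, needs no routing at all.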

fhfuih · 2026-05-22T18:23:47Z

> In streaming mode this drains generic RPCs between chunks while the active execute_model generator is still running. With the multiproc executor, a queued RPC will send another worker RPC and wait for one result on the same result queue that is also carrying streaming chunks. Since the worker cannot process the new RPC until the generator finishes, this can block streaming or let the RPC path consume a video chunk as its result. I think we should avoid processing generic RPCs during an active streaming request, or separate/correlate streaming replies from RPC replies.

Hmm, this is indeed a critical comment. After some investigation, I can confirm that this can happen. Such blocking could also nullify the follow-up feature of submitting in-flight prompt changes and conflict with the overall vision in #3632.

Therefore, I will convert this PR to a draft and rethink the diffusion engine architecture. Maybe it should use step execution, and maybe this PR should also depend on #3099. (In that case, the changes should mainly be on the diffusion engine/executor/scheduler side; the layers above should be fine.) I will reopen it once it is ready.

fhfuih · 2026-05-26T13:09:53Z

This PR is ready. CC-ing the following people -- reviews are much appreciated!

@Gaohan123 @wtomin : general architecture and design

@asukaqaq-s : For the schedulers: I have extended the step execution scheduler to handle intermediate chunk output. I wonder whether that fits your original step execution design.

@fake0fan @yinpeiqi Any comments on the diffusion engine, worker, and stuff?

@princepride @SHYuanBest Any comments on the streaming-and-step-execution mode of the Helios pipeline? (I admit much of it is vibe-coded 😆 ) One notable new piece is the handling of a single num_inference_steps, which the step-execution mode relies on heavily.

SHYuanBest · 2026-05-26T13:18:28Z

LGTM! Thanks for your great effort!

Gaohan123 · 2026-05-28T14:01:28Z

@fhfuih Please resolve conflicts. Thanks

Gaohan123

In a follow-up PR, I suggest moving the streaming capability from DiffusionEngine to a new Engine subclass to avoid intrusive modifications.

Gaohan123 · 2026-05-28T14:30:35Z

@@ -0,0 +1,9 @@
+<div id="vllm-streaming-video-view" style="display:flex; flex-direction:column; gap:10px;">


Is this needed for the Gradio demo?

Yes, in my current implementation, because Gradio's own video component doesn't fully support streaming input. It claims to support another format, but I could not get it running for some unknown reason (the Video component has minimal documentation and error messages, which makes it harder to maintain and adapt). So I'd rather use native HTML, which supports the fragmented MP4 format for streaming.

Another reason not to go with the Gradio-style file format: I surveyed common video file formats for streaming, and (supposedly?) fragmented MP4 is the most modern go-to choice. Adding another format means adding extra logic to the API layer, and I don't see much benefit in supporting a Gradio-specific format.

Gaohan123 · 2026-05-28T14:33:56Z


    def prepare_encode(self, state: DiffusionRequestState, **kwargs: Any) -> DiffusionRequestState:
        """Prepare request-level inputs and return initialized state."""
+        ...


What is it for?

This is the same as pass, i.e. an empty function placeholder. It is better to add it explicitly to these interface definitions, which was not done before. The current syntax is valid only because there are docstrings; without them, an empty function that has neither ... nor pass would raise IndentationError.
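A quick way to check this claim about Python syntax, independent of the vllm-omni interfaces (the function names below are made up for the demonstration):

```python
def with_docstring_only():
    """A docstring alone is already a valid function body."""


def with_ellipsis():
    """Docstring plus an explicit placeholder, as added in this PR."""
    ...


def body_is_missing(source):
    """Return True if compiling `source` raises IndentationError."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError:
        return True
    return False


# A def with no indented body at all is rejected by the compiler.
no_body = "def f():\nx = 1\n"
docstring_body = 'def f():\n    """doc"""\n'

print(with_docstring_only(), with_ellipsis())
print(body_is_missing(no_body), body_is_missing(docstring_body))
```

Both placeholder styles return None when called, so adding `...` changes readability rather than behavior.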

Gaohan123 · 2026-05-28T14:40:26Z

        )
+        # Diffusion model (mainly video generation models) streaming output mode
+        omni_config_group.add_argument(
+            "--streaming-output",


I think the argument name is unclear. It can easily be confused with the streaming of the LLM stage.

Agreed, since AR and diffusion share this interface. I changed the CLI arg to --diffusion-streaming-output. When it is converted to an OmniDiffusionConfig field, I keep it as od_config.streaming_output, since this config is diffusion-internal.
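The "rename the flag, keep the field" idea maps directly onto argparse's `dest`. The snippet below is a standalone sketch, not the code in vllm_omni/entrypoints/cli/serve.py: it assumes the flag is a boolean switch (it is used without a value in the serve command earlier in this thread), and the `build_parser` helper, group title, and help text are invented for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="serve-sketch")
    group = parser.add_argument_group("omni")
    # User-facing name is diffusion-specific; the parsed attribute keeps the
    # shorter internal name via dest.
    group.add_argument(
        "--diffusion-streaming-output",
        dest="streaming_output",
        action="store_true",
        help="Stream diffusion (video) output chunks as they are decoded.",
    )
    return parser


enabled = build_parser().parse_args(["--diffusion-streaming-output"])
disabled = build_parser().parse_args([])
print(enabled.streaming_output, disabled.streaming_output)
```

This keeps downstream code that reads `streaming_output` untouched while making the CLI name unambiguous next to LLM-stage streaming options.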

Gaohan123

For the video demo, I suggest also showing video generation without streaming, so that the improvement is visually clear.

fhfuih · 2026-05-29T02:49:38Z

> In the following PR, I suggest we can move the streaming capacity from DiffusionEngine to a new inherited Engine to avoid the intrusive modification

Agreed. Since this PR may not make it into this version, if a new engine implementation lands in the next iteration, I can wait for it to merge first and then adapt this PR to it.

> For the video demo, I suggest we can show the video generation w/o streaming to visually present the improvement saliently.

For the Helios model with the current streaming implementation, the streaming "TTFC" seems to roughly equal the full video generation time in non-streaming mode, so there isn't really an interaction-level speedup. If this gap is identified and resolved in the future, I can add such a comparison.

fhfuih changed the title ~~Streaming video~~ [feat] Streaming diffusion video generation output & mid-way prompt update May 19, 2026

fhfuih mentioned this pull request May 19, 2026

[RFC]: Real-time video generation demo JiusiServe/vllm-omni#230

Open

1 task

hsliuustc0106 requested changes May 19, 2026

Gaohan123 added this to the v0.22.0 milestone May 19, 2026

fhfuih mentioned this pull request May 20, 2026

[RFC]: Streaming diffusion video generation output & mid-way prompt update #3632

Open

1 task

fhfuih force-pushed the streaming-video branch from ee5ea72 to ddacda3 Compare May 21, 2026 10:07

hsliuustc0106 mentioned this pull request May 22, 2026

[Bugfix] Set separate CFG flag in Helios for CacheDiT #3756

Merged

fhfuih added 2 commits May 22, 2026 17:36

streaming video output

5ba6542

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix failing unit tests

d9bda4d

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fhfuih force-pushed the streaming-video branch from 2966e0c to d9bda4d Compare May 22, 2026 09:36

fhfuih marked this pull request as ready for review May 22, 2026 09:36

Copilot AI review requested due to automatic review settings May 22, 2026 09:36

fhfuih requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, congw729, david6666666, princepride, tzhouam, wtomin, yenuo26 and ywang96 as code owners May 22, 2026 09:36

Copilot started reviewing on behalf of fhfuih May 22, 2026 09:37 View session

fhfuih changed the title ~~[feat] Streaming diffusion video generation output & mid-way prompt update~~ [feat] Streaming diffusion video generation output May 22, 2026

Copilot AI reviewed May 22, 2026

Comment thread vllm_omni/diffusion/diffusion_engine.py Outdated

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated

Comment thread vllm_omni/diffusion/executor/multiproc_executor.py Outdated

Comment thread tests/entrypoints/openai_api/test_video_api_utils.py Outdated

[🤖 AI comment] Potential fix for pull request finding

3fe9ba3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com>

[🤖 AI comment] fix docstring

c56294a

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

hsliuustc0106 reviewed May 22, 2026

fhfuih marked this pull request as draft May 22, 2026 18:24

fhfuih added 9 commits May 26, 2026 15:09

[breaking 💥] reimplement streaming output to depend on step execution

b31fe3e

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

remove stale logic from the old implementation

ce86e34

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

Update websocket example

713e22d

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix one-timestamp issue in Helios step execution

5bd36b1

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix cont'd

cf833e1

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix cont'd

a5d1a7e

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix cont'd

6f15736

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix cont'd (revert unexpected change in non-stremaing mode)

e5bbb02

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

Make stall-timeout depend on engine's last output and user's ping

1242b8b

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fhfuih marked this pull request as ready for review May 26, 2026 12:59

fhfuih mentioned this pull request May 27, 2026

[Entrypoint] Add realtime OpenPI robot serving API #3673

Merged

Gaohan123 reviewed May 28, 2026

fhfuih added 5 commits June 3, 2026 14:49

Merge remote-tracking branch 'origin/main' into streaming-video

d7ab163

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

adapt updated video similarity helper from vllm-project#3852

fab3c64

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

cont'd

bf15971

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

rename CLI arg to --diffusion-streaming-output, keep ODConfig field

0204249

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

change endpoint to /v1/realtime/video

95711e7

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
