Add AR and Diffusion GPU Model Runners to vllm-omni (vLLM v1-Compatible) #4

tzhouam wants to merge 2 commits into vllm-project:dev-ztc from
Conversation
**Summary of Changes**

Hello @tzhouam, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands vllm-omni's capabilities by introducing dedicated GPU model runners and workers for both autoregressive (AR) models that output hidden states and non-autoregressive diffusion models. The core objective is to enable multimodal and non-autoregressive tasks within the vLLM v1 framework by concentrating customization at the Model Runner layer, thereby minimizing changes to the existing control-plane logic for scheduling, batching, and distributed infrastructure. This ensures vLLM's robust performance and scalability can be leveraged for a broader range of AI workloads.
Pull Request Overview
This PR introduces Autoregressive (AR) and Diffusion GPU model runners to vllm-omni that align with vLLM v1's Worker/Runner architecture. The implementation enables multimodal and non-autoregressive tasks while reusing vLLM's existing scheduling, batching, and distributed infrastructure.
- Adds ARGPUModelRunner for autoregressive models that returns hidden states via pooler_output while maintaining token sampling
- Adds DiffusionGPUModelRunner for non-autoregressive diffusion models that skip sampling and return diffusion tensors
- Introduces corresponding GPU workers that instantiate the appropriate model runners
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_omni/worker/AR_gpu_model_runner.py | AR model runner extending GPUModelRunner with hidden state output support |
| vllm_omni/worker/AR_gpu_worker.py | GPU worker wrapper that instantiates ARGPUModelRunner |
| vllm_omni/worker/diffusion_model_runner.py | Diffusion model runner for non-autoregressive models |
| vllm_omni/worker/diffusion_gpu_worker.py | GPU worker wrapper that instantiates DiffusionGPUModelRunner |
| vllm_omni/worker/__init__.py | Module initialization exposing the new classes |
| docs/architecture/diffusion_executor_worker_runner.md | Architecture documentation for the diffusion components |
```python
# For Qwen 2.5 Omni's current implementation, we only support the forward method
if hasattr(self.model, "forward"):
    return self.model.forward(**kwargs)

# if hasattr(self.model, "sample"):
#     return self.model.sample(**kwargs)
# if hasattr(self.model, "forward"):
#     return self.model.forward(**kwargs)
# if hasattr(self.model, "diffuse"):
#     return self.model.diffuse(**kwargs)

raise RuntimeError(
    "The loaded model does not expose diffusion interfaces 'sample', "
    "'forward', or 'diffuse'. Please implement one of them or adapt the runner.")
```
The commented code contains duplicated logic and inconsistent ordering compared to the docstring. The docstring mentions trying methods in order: sample, forward, diffuse - but the commented code checks forward twice. Remove the commented code or fix the ordering to match the documented behavior.
```python
class ARGPUWorker(GPUWorker):
    def init_device(self):
```
The init_device method is duplicated between ARGPUWorker and DiffusionGPUWorker with identical implementations except for the model runner type. Consider extracting a base class or using a factory pattern to reduce code duplication.
| "The loaded model does not expose diffusion interfaces 'sample', " | ||
| "'forward', or 'diffuse'. Please implement one of them or adapt the runner.") |
The error message mentions three interfaces ('sample', 'forward', 'diffuse') but the current code only checks for 'forward'. The message is misleading since it suggests all three methods are checked when only one is actually verified.
| "The loaded model does not expose diffusion interfaces 'sample', " | |
| "'forward', or 'diffuse'. Please implement one of them or adapt the runner.") | |
| "The loaded model does not expose the required diffusion interface 'forward'. " | |
| "Please implement it or adapt the runner.") |
Code Review
This pull request introduces ARGPUModelRunner and DiffusionGPUModelRunner to support autoregressive models with hidden state outputs and non-autoregressive diffusion models within the vllm-omni framework. The changes are well-aligned with the vLLM v1 architecture. My review identifies a critical issue in the pipeline parallelism logic, suggests refactoring to reduce code duplication in the new worker classes, and provides several minor fixes in the documentation and model runner implementations to improve robustness and maintainability.
```python
if not get_pp_group().is_last_rank:
    assert isinstance(text_hidden_states, IntermediateTensors)
    text_hidden_states.kv_connector_output = kv_connector_output
    return text_hidden_states
```
This `if not get_pp_group().is_last_rank:` block causes an early return for all non-last pipeline-parallel ranks, which makes the subsequent `broadcast_pp_output` handling for non-last ranks (lines 158-168) unreachable. This appears to be a copy-paste error; the block should be removed so the correct pipeline-parallelism logic can execute.
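A minimal sketch of the ordering the reviewer is describing, assuming a `broadcast_pp_output` flag is in scope as in the upstream v1 runner (the exact surrounding code is not shown in this PR):

```python
# Sketch: only return early when PP outputs are NOT broadcast; otherwise fall
# through so the broadcast_pp_output handling below stays reachable.
if not get_pp_group().is_last_rank:
    assert isinstance(text_hidden_states, IntermediateTensors)
    if not broadcast_pp_output:
        text_hidden_states.kv_connector_output = kv_connector_output
        return text_hidden_states
    # ...broadcast_pp_output path for non-last ranks continues here...
```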
> #### 1) Inherited and overridden
>
> The parts that rely on the KV cache are omitted if we do not register the model in the vLLM config; the engine core then treats the model as not requiring a KV cache and handles it accordingly.
>
> Reuse `vllm/v1/outputs.py::ModelRunnerOutput`:
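A sketch of what such a reuse could look like for a diffusion batch; apart from `pooler_output`, the constructor fields shown here are assumptions and vary across vLLM versions:

```python
import torch
from vllm.v1.outputs import ModelRunnerOutput

req_ids = ["req-0", "req-1"]                          # request ids in this batch
diffusion_tensors = [torch.zeros(4), torch.zeros(4)]  # stand-in results

# Hypothetical packaging of per-request diffusion tensors into the standard
# v1 output type; field names other than pooler_output are assumed.
output = ModelRunnerOutput(
    req_ids=req_ids,
    req_id_to_index={rid: i for i, rid in enumerate(req_ids)},
    sampled_token_ids=[[] for _ in req_ids],  # diffusion: nothing is sampled
    logprobs=None,
    prompt_logprobs_dict={},
    pooler_output=list(diffusion_tensors),    # one tensor per request
)
```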
| model_name = "Qwen/Qwen-Image" | ||
|
|
||
| self.pipe = DiffusionPipeline.from_pretrained(model_name, torch_dtype=dtype) | ||
| self.pipe = pipe.to(device) |
|
|
||
| # Generate image | ||
| prompt_embeds = self._get_and_process_prompt_embeds(scheduler_output, positive_magic) | ||
| negtive_prompt_embeds = self.pipe.embed_prompt(" ") |
| image = pipe( | ||
| prompt_embeds=prompt_embeds, | ||
| negtive_prompt_embeds=negtive_prompt_embeds, |
There are a couple of issues in this code block:
- `pipe` is used instead of `self.pipe`.
- `negtive_prompt_embeds` has a typo and should be `negative_prompt_embeds`.
Suggested change:

```diff
-image = pipe(
-    prompt_embeds=prompt_embeds,
-    negtive_prompt_embeds=negtive_prompt_embeds,
+image = self.pipe(
+    prompt_embeds=prompt_embeds,
+    negative_prompt_embeds=negtive_prompt_embeds,
```
```python
class ARGPUWorker(GPUWorker):

    def init_device(self):
        if self.device_config.device.type == "cuda":
            # torch.distributed.all_reduce does not free the input tensor until
            # the synchronization point. This causes the memory usage to grow
            # as the number of all_reduce calls increases. This env var disables
            # this behavior.
            # Related issue:
            # https://discuss.pytorch.org/t/cuda-allocation-lifetime-for-inputs-to-distributed-all-reduce/191573
            os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

            # This env var set by Ray causes exceptions with graph building.
            os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
            self.device = torch.device(f"cuda:{self.local_rank}")
            current_platform.set_device(self.device)

            _check_if_gpu_supports_dtype(self.model_config.dtype)
            gc.collect()
            torch.cuda.empty_cache()

            # take current memory snapshot
            self.init_snapshot = MemorySnapshot()
            self.requested_memory = (self.init_snapshot.total_memory *
                                     self.cache_config.gpu_memory_utilization)
            if self.init_snapshot.free_memory < self.requested_memory:
                GiB = lambda b: round(b / GiB_bytes, 2)
                raise ValueError(
                    f"Free memory on device "
                    f"({GiB(self.init_snapshot.free_memory)}/"
                    f"{GiB(self.init_snapshot.total_memory)} GiB) on startup "
                    f"is less than desired GPU memory utilization "
                    f"({self.cache_config.gpu_memory_utilization}, "
                    f"{GiB(self.requested_memory)} GiB). Decrease GPU memory "
                    f"utilization or reduce GPU memory used by other processes."
                )
        else:
            raise RuntimeError(
                f"Not support device type: {self.device_config.device}")
        # Initialize the distributed environment.
        init_worker_distributed_environment(self.vllm_config, self.rank,
                                            self.distributed_init_method,
                                            self.local_rank,
                                            current_platform.dist_backend)
        # Set random seed.
        set_random_seed(self.model_config.seed)

        # Construct the model runner
        self.model_runner: ARGPUModelRunner = ARGPUModelRunner(
            self.vllm_config, self.device)

        if self.rank == 0:
            # If usage stat is enabled, collect relevant info.
            report_usage_stats(self.vllm_config)
```
The init_device method is almost entirely duplicated in ARGPUWorker and DiffusionGPUWorker. The only difference is the type of model runner being instantiated. To avoid this code duplication and improve maintainability, consider creating a common base worker class that contains the shared init_device logic and uses a class attribute to specify the model runner class.
For example, you could create a base class `OmniGPUWorker`:

```python
# In a new file, e.g., vllm_omni/worker/base_worker.py
from vllm.v1.worker.gpu_worker import Worker as GPUWorker
# ... other imports


class OmniGPUWorker(GPUWorker):
    _model_runner_cls = None

    def init_device(self):
        # All the common code from init_device goes here
        # ...
        if self._model_runner_cls is None:
            raise NotImplementedError(
                "Subclasses of OmniGPUWorker must set _model_runner_cls")

        # Construct the model runner
        self.model_runner = self._model_runner_cls(
            self.vllm_config, self.device)

        if self.rank == 0:
            report_usage_stats(self.vllm_config)
```

Then `ARGPUWorker` and `DiffusionGPUWorker` would inherit from it and just set `_model_runner_cls`, removing the need to override `init_device`.
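Following that suggestion, the two workers would each reduce to a couple of lines, a sketch using the classes introduced in this PR:

```python
class ARGPUWorker(OmniGPUWorker):
    _model_runner_cls = ARGPUModelRunner


class DiffusionGPUWorker(OmniGPUWorker):
    _model_runner_cls = DiffusionGPUModelRunner
```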
```python
class DiffusionGPUWorker(GPUWorker):

    def init_device(self):
        if self.device_config.device.type == "cuda":
            # torch.distributed.all_reduce does not free the input tensor until
            # the synchronization point. This causes the memory usage to grow
            # as the number of all_reduce calls increases. This env var disables
            # this behavior.
            # Related issue:
            # https://discuss.pytorch.org/t/cuda-allocation-lifetime-for-inputs-to-distributed-all-reduce/191573
            os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

            # This env var set by Ray causes exceptions with graph building.
            os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
            self.device = torch.device(f"cuda:{self.local_rank}")
            current_platform.set_device(self.device)

            _check_if_gpu_supports_dtype(self.model_config.dtype)
            gc.collect()
            torch.cuda.empty_cache()

            # take current memory snapshot
            self.init_snapshot = MemorySnapshot()
            self.requested_memory = (self.init_snapshot.total_memory *
                                     self.cache_config.gpu_memory_utilization)
            if self.init_snapshot.free_memory < self.requested_memory:
                GiB = lambda b: round(b / GiB_bytes, 2)
                raise ValueError(
                    f"Free memory on device "
                    f"({GiB(self.init_snapshot.free_memory)}/"
                    f"{GiB(self.init_snapshot.total_memory)} GiB) on startup "
                    f"is less than desired GPU memory utilization "
                    f"({self.cache_config.gpu_memory_utilization}, "
                    f"{GiB(self.requested_memory)} GiB). Decrease GPU memory "
                    f"utilization or reduce GPU memory used by other processes."
                )
        else:
            raise RuntimeError(
                f"Not support device type: {self.device_config.device}")
        # Initialize the distributed environment.
        init_worker_distributed_environment(self.vllm_config, self.rank,
                                            self.distributed_init_method,
                                            self.local_rank,
                                            current_platform.dist_backend)
        # Set random seed.
        set_random_seed(self.model_config.seed)

        # Construct the model runner
        self.model_runner: DiffusionGPUModelRunner = DiffusionGPUModelRunner(
            self.vllm_config, self.device)

        if self.rank == 0:
            # If usage stat is enabled, collect relevant info.
            report_usage_stats(self.vllm_config)
```
This init_device method is almost entirely duplicated between DiffusionGPUWorker and ARGPUWorker. To improve maintainability and reduce code duplication, consider introducing a shared base class that handles the common initialization logic. Please see my comment on vllm_omni/worker/AR_gpu_worker.py for a detailed suggestion on how to refactor this.
```python
if hasattr(self.model, "forward"):
    return self.model.forward(**kwargs)

# if hasattr(self.model, "sample"):
#     return self.model.sample(**kwargs)
# if hasattr(self.model, "forward"):
#     return self.model.forward(**kwargs)
# if hasattr(self.model, "diffuse"):
#     return self.model.diffuse(**kwargs)

raise RuntimeError(
    "The loaded model does not expose diffusion interfaces 'sample', "
    "'forward', or 'diffuse'. Please implement one of them or adapt the runner.")
```
The implementation of `_run_diffusion` currently only supports models with a `forward` method. However, the docstring and the commented-out code suggest a more flexible approach with fallbacks to `sample` and `diffuse` methods. To make this runner more generally applicable to different diffusion models and align with the design goal of being future-compatible, it would be beneficial to implement the fallback logic.
| if hasattr(self.model, "forward"): | |
| return self.model.forward(**kwargs) | |
| # if hasattr(self.model, "sample"): | |
| # return self.model.sample(**kwargs) | |
| # if hasattr(self.model, "forward"): | |
| # return self.model.forward(**kwargs) | |
| # if hasattr(self.model, "diffuse"): | |
| # return self.model.diffuse(**kwargs) | |
| raise RuntimeError( | |
| "The loaded model does not expose diffusion interfaces 'sample', " | |
| "'forward', or 'diffuse'. Please implement one of them or adapt the runner.") | |
| if hasattr(self.model, "sample"): | |
| return self.model.sample(**kwargs) | |
| if hasattr(self.model, "forward"): | |
| return self.model.forward(**kwargs) | |
| if hasattr(self.model, "diffuse"): | |
| return self.model.diffuse(**kwargs) | |
| raise RuntimeError( | |
| "The loaded model does not expose diffusion interfaces 'sample', " | |
| "'forward', or 'diffuse'. Please implement one of them or adapt the runner.") |
|
other PRs finished |
Wzliu bagel dev
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Review: 5 issues identified, all fixed:

- vllm-project#1 (hard) KV cache ghost allocation: add init-time assertions that SKIP_SCAFFOLD requires `enable_prefix_caching=false` and `max_num_seqs=1`. These implicit preconditions are now explicit `ValueError` checks.
- vllm-project#2 (hard) `_scaffold_dummy` device mismatch: recreate the dummy tensor when the input device changes (multi-GPU / PP scenario).
- vllm-project#3 (hard) `is_active_decode` false positive: add a `_prefill_completed` flag set at the end of `_forward_prefill()`. Scaffold skip now requires both `_prefill_completed=True` AND an existing `_prev_feat_embed`, preventing stale state from a previous request from triggering the skip.
- vllm-project#4 (soft) `_FREE_SCAFFOLD` + `_SKIP_SCAFFOLD` interaction: skip the zero-out operation when SKIP_SCAFFOLD is set (no point zeroing weights that will never be read).
- vllm-project#5 (soft) Perf timer on skip path: removed the timer from the skip branch so the `scaffold_forward` metric only reflects real scaffold runs.
P0 fixes:

- vllm-project#1: `_free_scaffold_weights` now shrinks storage to zero (actually releases VRAM). Only runs when SKIP_SCAFFOLD is also set. Called lazily after the first prefill, not at load time.
- vllm-project#2: Sliding VAE default OFF (the splice algorithm had an alignment bug). `_sliding_vae_decode` now falls back to full decode until proper overlap-add is implemented.
- vllm-project#3: Complete per-request state reset in preprocess: now clears `_curr_prefix_feat_cond`, `_last_audio_patch_gpu`, `_prev_audio`, `_prev_audio_len`, `_decode_step_count`, `_precomputed_stop_logits`.
- vllm-project#4: `compute_logits` fallback forces stop (not continue) when `_prefill_completed=True`, preventing runaway generation.
- vllm-project#5: Scaffold VRAM: `load_weights` no longer frees immediately; `_free_scaffold_weights` is called after the first prefill completes, so the scaffold is available for prefill and then released.

P1 fixes:

- vllm-project#6: Log all active config flags at load time.
- vllm-project#7: Remove dead `_STOP_CHECK_INTERVAL` code.
- vllm-project#8: Remove broken `audio_duration` formula from postprocess.
- vllm-project#9 / vllm-project#14: Move `from einops import rearrange` to module top level.
- vllm-project#11: Remove `torch.no_grad()` context from `_forward_decode_graphable` (incompatible with CUDA Graph capture).
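For context on the vllm-project#1 fix, "shrinks storage to zero" typically means resizing a tensor's backing storage. A minimal sketch: the function name `_free_scaffold_weights` is from the commit message, but the body below is an assumption, not the actual patch:

```python
import torch


def _free_scaffold_weights(module: torch.nn.Module) -> None:
    # Sketch: release VRAM by shrinking each parameter's backing storage to
    # zero bytes; the tensor objects stay alive but hold no device memory.
    for param in module.parameters():
        param.data.untyped_storage().resize_(0)
```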
Refactor scheduler to centralize shared flow and add batching support
## Summary

This PR adds two model runners and their corresponding workers to vllm-omni, aligning with vLLM v1's Worker/Runner abstractions:

- **ARGPUModelRunner**: returns hidden states via `ModelRunnerOutput.pooler_output` while still producing sampled tokens/logprobs.
- **DiffusionGPUModelRunner**: returns diffusion tensors via `ModelRunnerOutput.pooler_output` without logits/sampling.

Both are drop-in replacements within the existing EngineCore / Executor / Worker loop.
## Motivation

- `vllm-omni` aims to also support multimodal and non-autoregressive tasks under vLLM v1.
- To minimize changes to control-plane logic (`EngineCore`/`Executor`/`Worker`), `vllm-omni` concentrates customization in the Model Runner layer while reusing vLLM's scheduling, batching, and distributed infrastructure.

## Key Changes
### 1. ARGPUModelRunner (`vllm_omni/worker/AR_gpu_model_runner.py`)

- Extends `vllm.v1.worker.gpu_model_runner.GPUModelRunner`
- Preserves the existing sampling path (sampled tokens/logprobs)
- Adds hidden-state output via `ModelRunnerOutput.pooler_output`
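A minimal sketch of this pattern; the override point matches vLLM v1's `execute_model`, but `self._last_hidden_states` is a hypothetical cache, not a name confirmed by this PR:

```python
from vllm.v1.outputs import ModelRunnerOutput
from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class ARGPUModelRunner(GPUModelRunner):
    def execute_model(self, scheduler_output, intermediate_tensors=None):
        # Run vLLM's normal v1 path: forward pass, sampling, logprobs.
        output = super().execute_model(scheduler_output, intermediate_tensors)
        if not isinstance(output, ModelRunnerOutput):
            return output  # IntermediateTensors on non-last PP ranks
        # Hypothetical: surface per-request hidden states captured during the
        # forward pass through the standard pooler_output field.
        output.pooler_output = list(self._last_hidden_states)
        return output
```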
### 2. DiffusionGPUModelRunner (`vllm_omni/worker/diffusion_model_runner.py`)

- Returns diffusion tensors via `ModelRunnerOutput.pooler_output` (no logits/sampling)
- Handles `IntermediateTensors` for pipeline parallelism
- `_run_diffusion` currently prefers `model.forward(...)` (Qwen 2.5 Omni path); future-compatible with `sample`/`diffuse` if exposed
### 3. New Workers

- `vllm_omni/worker/AR_gpu_worker.py` → `self.model_runner = ARGPUModelRunner(...)`
- `vllm_omni/worker/diffusion_gpu_worker.py` → `self.model_runner = DiffusionGPUModelRunner(...)`
### 4. Documentation

- Updated `docs/architecture/vllm_omni_design.md` to reflect the v1 path and components for AR and Diffusion.
## Design & Compatibility

### vLLM v1 Alignment

- Follows the `EngineCore → Executor → Worker` RPC flow
- Returns the standard `ModelRunnerOutput`; emphasizes `pooler_output` for non-text tensors

### Diffusion Integration (Minimal Changes)

- Reuses `SchedulerOutput` and batching
- Results flow back through `pooler_output` (no Engine/Executor modifications required)

### Multimodal + PP / TP / DP

### Robustness

- The diffusion path only invokes `model.forward(...)` for now
## Usage

### Autoregressive (AR)

- Hidden states are returned in `pooler_output`

### Diffusion

- Use `DiffusionGPUWorker`; diffusion tensors are returned in `pooler_output`
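A purely illustrative consumption sketch: the entry point, keyword names, and output attribute layout below are assumptions, not APIs confirmed by this PR; only the presence of `pooler_output` is described above:

```python
# Hypothetical sketch; vllm_omni.LLM, worker_cls, and out.pooler_output
# are assumed names for illustration only.
from vllm_omni import LLM

llm = LLM(model="Qwen/Qwen2.5-Omni-7B", worker_cls="DiffusionGPUWorker")
outputs = llm.generate(["a photo of a cat"])
for out in outputs:
    tensor = out.pooler_output  # diffusion tensor (or AR hidden states)
```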
## Backward Compatibility

- `pooler_output` is optional / opt-in
## Files Changed

- `vllm_omni/worker/AR_gpu_model_runner.py`
- `vllm_omni/worker/AR_gpu_worker.py`
- `vllm_omni/worker/diffusion_model_runner.py`
- `vllm_omni/worker/diffusion_gpu_worker.py`
- `vllm_omni/worker/__init__.py`
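Assuming `__init__.py` re-exports the new classes (as the file summary above indicates), downstream code could then import them from the package:

```python
# Assumes vllm_omni/worker/__init__.py re-exports these names.
from vllm_omni.worker import (
    ARGPUModelRunner,
    ARGPUWorker,
    DiffusionGPUModelRunner,
    DiffusionGPUWorker,
)
```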