diff --git a/.claude/skills/vllm-omni-npu-upgrade/SKILL.md b/.claude/skills/vllm-omni-npu-upgrade/SKILL.md new file mode 100644 index 0000000000..1ef7ab3930 --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/SKILL.md @@ -0,0 +1,300 @@ +--- +name: vllm-omni-npu-model-runner-upgrade +description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic." +--- + +# vLLM-Omni NPU Model Runner Upgrade Skill + +## Overview + +This skill guides the process of upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners are designed to run omni multimodal models (like Qwen3-Omni, Bagel, MiMoAudio) on Ascend NPUs. + +## File Structure + +### NPU Model Runner Files +``` +vllm-omni/vllm_omni/platforms/npu/worker/ +├── __init__.py +├── npu_model_runner.py # OmniNPUModelRunner (base class) +├── npu_ar_model_runner.py # NPUARModelRunner (autoregressive) +├── npu_ar_worker.py # AR worker +├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR) +└── npu_generation_worker.py # Generation worker +``` + +### GPU Reference Files (for omni-specific logic sync) +``` +vllm-omni/vllm_omni/worker/ +├── __init__.py +├── gpu_model_runner.py # OmniGPUModelRunner +├── gpu_ar_model_runner.py # GPUARModelRunner +├── gpu_ar_worker.py +├── gpu_generation_model_runner.py +├── gpu_generation_worker.py +├── mixins.py +├── base.py +└── gpu_memory_utils.py +``` + +### vllm-ascend Reference Files +``` +vllm-ascend/vllm_ascend/worker/ +├── model_runner_v1.py # NPUModelRunner (base class to copy from) +├── npu_input_batch.py +├── block_table.py +├── pcp_utils.py +└── worker.py +``` + +## Inheritance Hierarchy + +``` + GPUModelRunner (vllm) + | + +----------------+----------------+ + | | + OmniGPUModelRunner NPUModelRunner (vllm-ascend) + (vllm_omni/worker) 
(vllm_ascend/worker) + | | + +----------- OmniNPUModelRunner --+ + (multiple inheritance) + | + +---------------+---------------+ + | | + NPUARModelRunner NPUGenerationModelRunner + (autoregressive) (non-autoregressive/diffusion) +``` + +## Omni-Specific Comment Markers + +Omni-specific logic is marked with comment blocks: +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# ... omni-specific code ... +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +Or simpler variations: +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# ------------------------------------------------------------------------------------------------ +``` + +**Important**: +- Always preserve and add these markers when modifying code. +- **The reference documents (`references/omni-specific-blocks.md`) may not be up-to-date.** Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks. +- When you discover new omni-specific code that is not documented in the references, please update the reference files. 
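The marker convention above can be checked mechanically before and after an upgrade. The following is a minimal sketch, not part of the skill itself: `find_omni_blocks`, `OPEN_RE`, and `RULE_RE` are hypothetical names, and the regular expressions assume that a marker is a comment line of dashes around `Omni-new` and that a plain dashed comment line may also close a block. It pairs markers into line ranges, so the block list of a GPU runner can be compared against the corresponding NPU runner.

```python
import re

# Assumption: an opening marker is a comment of dashes around "Omni-new"
# (with or without spaces); a plain dashed comment line may close a block.
OPEN_RE = re.compile(r"^\s*#\s*-{3,}\s*Omni-new\s*-{3,}\s*$")
RULE_RE = re.compile(r"^\s*#\s*-{10,}\s*$")


def find_omni_blocks(source: str) -> list[tuple[int, int]]:
    """Return 1-indexed (start_line, end_line) pairs for each marked block."""
    blocks: list[tuple[int, int]] = []
    start: int | None = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        is_marker = OPEN_RE.match(line) is not None
        is_rule = RULE_RE.match(line) is not None
        if start is None:
            # Outside a block: only an "Omni-new" marker opens one.
            if is_marker:
                start = lineno
        elif is_marker or is_rule:
            # Inside a block: either marker style closes it.
            blocks.append((start, lineno))
            start = None
    if start is not None:
        # Unclosed marker: report it as a one-line block so it is not lost.
        blocks.append((start, start))
    return blocks
```

Calling `find_omni_blocks(Path(p).read_text())` on a GPU runner and on its NPU counterpart gives two lists whose lengths can be compared as a first sanity check. Single-line annotations such as `# [Omni] ...` are not detected by this sketch and still need a manual grep.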
+ +## Key Methods Requiring Attention + +### OmniNPUModelRunner (npu_model_runner.py) + +| Method | Description | Omni-Specific Logic | +|--------|-------------|---------------------| +| `load_model` | Load model and initialize talker_mtp | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`, initializes talker buffers | +| `_dummy_run` | Warmup/profiling run | talker_mtp dummy forward, `extract_multimodal_outputs` | +| `_model_forward` | Forward pass wrapper | Injects `model_kwargs_extra`, wraps with `OmniOutput`, NPU-specific graph updates | +| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` | + +### NPUARModelRunner (npu_ar_model_runner.py) + +| Method | Description | Omni-Specific Logic | +|--------|-------------|---------------------| +| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup | +| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` | +| `sample_tokens` | Token sampling | Hidden states extraction, multimodal outputs processing, `OmniModelRunnerOutput` | +| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference | + +### NPUGenerationModelRunner (npu_generation_model_runner.py) + +| Method | Description | Omni-Specific Logic | +|--------|-------------|---------------------| +| `_update_request_states` | Update request states for async chunk | async_chunk handling | +| `execute_model` | Generation forward | async_chunk, `seq_token_counts`, `_run_generation_model` | +| `sample_tokens` | Output processing | multimodal output packaging to `OmniModelRunnerOutput` | +| `_dummy_run` | Dummy run override | model_kwargs initialization, multimodal extraction | +| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler | + +## Upgrade Workflow + +### Step 1: Preparation + +1. 
**Identify target versions** (use the `gh` CLI to check): + - We're using the vllm-omni main branch + - Check the last release of vllm-omni + - Target vllm-ascend version (use the latest local vllm-ascend code directly) + +2. **Check GPU-side changes** (since last release): + ```bash + cd /root/vllm-workspace/vllm-omni + git log --oneline --since="" -- vllm_omni/worker/ + ``` + +3. **Read latest vllm-ascend code**: + - We don't track vllm-ascend changes; use the latest code directly from `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` + - Copy the relevant methods and re-insert omni-specific blocks + +### Step 2: Analyze Omni-Specific Logic + +For each NPU model runner file: + +1. **Extract existing omni-specific blocks**: + ```bash + grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py + ``` + +2. **Document each omni block**: + - Which method it belongs to + - What functionality it provides + - Dependencies on other omni code + +### Step 3: Update Base Class (OmniNPUModelRunner) + +**Note**: Always check the GPU implementation `gpu_model_runner.py` for any new omni logic not yet documented in references. + +1. **Read the latest vllm-ascend `NPUModelRunner.load_model`** +2. **Copy the method, keeping the structure** +3. **Re-insert omni-specific logic** (check GPU `gpu_model_runner.py` for the authoritative list): + - Replace `CUDAGraphWrapper` with `ACLGraphWrapper` + - Keep talker_mtp initialization + - Preserve buffer allocations for talker + - Check for any new omni blocks added since last sync + +4. **Update `_dummy_run`**: + - Copy from vllm-ascend + - Compare with GPU `_dummy_run` for omni-specific blocks + - Re-insert all `Omni-new` marked code from GPU version + +5. **Update `_model_forward`**: + - Keep the omni wrapper logic + - Update NPU-specific parts (graph params, SP all-gather) + - Check GPU version for any new omni logic + +### Step 4: Update AR Model Runner + +1. 
**Compare with GPU `gpu_ar_model_runner.py`** for any new omni features +2. **Copy `execute_model` from vllm-ascend** +3. **Re-insert omni blocks** (reference `references/omni-specific-blocks.md`, but note it may be incomplete): + - **IMPORTANT**: Always check the GPU implementation `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks + - The reference doc may not include newly added omni logic - treat it as a starting point, not exhaustive + - When discovering new omni code blocks, please update `references/omni-specific-blocks.md` + - Common omni blocks include but are not limited to: KV transfer, multimodal outputs, sampling_metadata handling, etc. + +4. **Update `sample_tokens`** (also compare with GPU implementation): + - Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method + - Identify all `Omni-new` marked code blocks + - Ensure NPU version includes all omni-specific logic + +### Step 5: Update Generation Model Runner + +**Note**: Generation model runner may have unique omni logic for diffusion/non-AR models. + +1. **Compare with GPU `gpu_generation_model_runner.py`** - grep for all `Omni-new` blocks +2. **Update `execute_model`**: + - Check GPU version for all omni-specific blocks + - Keep async_chunk handling + - Keep `seq_token_counts` injection + - Update forward/context setup from vllm-ascend + - Look for any new omni logic not documented in references + +3. 
**Update `_dummy_run`**: + - Copy from vllm-ascend base + - Compare with GPU `_dummy_run` if it exists + - Re-insert all omni-specific logic + +### Step 6: Update Imports + +Check and update imports at the top of each file: + +```python +# Common vllm-ascend imports +from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context +from vllm_ascend.attention.attention_v1 import AscendAttentionState +from vllm_ascend.attention.utils import using_paged_attention +from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params +from vllm_ascend.ops.rotary_embedding import update_cos_sin +from vllm_ascend.utils import enable_sp, lmhead_tp_enable +from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner + +# Omni-specific imports +from vllm_omni.model_executor.models.output_templates import OmniOutput +from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner +from vllm_omni.outputs import OmniModelRunnerOutput +from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager +``` + +### Step 7: Sync GPU-Side Omni Changes + +1. **Check recent GPU worker changes**: + ```bash + git diff .. -- vllm_omni/worker/gpu_model_runner.py + git diff .. -- vllm_omni/worker/gpu_ar_model_runner.py + ``` + +2. **Identify new omni features** that need to be ported to NPU + +3. **Apply corresponding changes** to NPU runners + +### Step 8: Validation + +1. **Run a syntax check** (`py_compile` only confirms that each file compiles; it does not check types): + ```bash + cd /root/vllm-workspace/vllm-omni + python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py + python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py + python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py + ``` + +2. **Run import test**: + ```bash + python -c "from vllm_omni.platforms.npu.worker import *" + ``` + +3. 
**Run model serving test** (if hardware available): + ```bash + vllm serve --trust-remote-code + ``` + +## Common Pitfalls + +### 1. Forward Context Differences +- GPU uses `set_forward_context` +- NPU uses `set_ascend_forward_context` +- Parameters may differ slightly + +### 2. Graph Wrapper Differences +- GPU: `CUDAGraphWrapper` +- NPU: `ACLGraphWrapper` +- Constructor parameters may differ + +### 3. Buffer Creation +- GPU: `_make_buffer` returns different structure +- NPU: May need numpy=True/False parameter + +### 4. Attention Metadata +- GPU: Uses vllm attention metadata builders +- NPU: Uses `AscendCommonAttentionMetadata` + +### 5. Sampling +- GPU: Uses vllm sampler +- NPU: Uses `AscendSampler` + +## Checklist Before Commit + +- [ ] All omni-specific comment markers preserved +- [ ] New omni logic from GPU side synced +- [ ] Imports updated to latest vllm-ascend +- [ ] No `CUDAGraphWrapper` references in NPU code +- [ ] `set_ascend_forward_context` used instead of `set_forward_context` +- [ ] `ACLGraphWrapper` used for talker_mtp wrapping +- [ ] Type hints match vllm-ascend signatures +- [ ] No duplicate code blocks +- [ ] Python syntax valid (py_compile passes) + +## Reference Files for Comparison + +When upgrading, keep these files open for reference: + +1. **vllm-ascend NPUModelRunner**: `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` +2. **vllm GPUModelRunner**: `/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py` +3. 
**vllm-omni OmniGPUModelRunner**: `/root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py` diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md b/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md new file mode 100644 index 0000000000..89067d37b2 --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md @@ -0,0 +1,335 @@ +# GPU to NPU Translation Patterns + +This document provides a quick reference for translating GPU code patterns to NPU equivalents when porting omni-specific logic. + +## Import Translations + +### Forward Context +```python +# GPU +from vllm.forward_context import set_forward_context + +# NPU +from vllm_ascend.ascend_forward_context import set_ascend_forward_context +``` + +### Graph Wrapper +```python +# GPU +from vllm.compilation.cuda_graph import CUDAGraphWrapper + +# NPU +from vllm_ascend.compilation.acl_graph import ACLGraphWrapper +``` + +### Attention State +```python +# GPU (no equivalent - uses FlashAttention states directly) + +# NPU +from vllm_ascend.attention.attention_v1 import AscendAttentionState +``` + +### Utilities +```python +# GPU +# (directly use torch.cuda functions) + +# NPU +from vllm_ascend.utils import enable_sp, lmhead_tp_enable +from vllm_ascend.ops.rotary_embedding import update_cos_sin +``` + +## Context Manager Translations + +### Forward Context Setup +```python +# GPU +with set_forward_context( + attn_metadata, + self.vllm_config, + num_tokens=num_tokens_padded, + num_tokens_across_dp=num_tokens_across_dp, + cudagraph_runtime_mode=cudagraph_mode, + batch_descriptor=batch_desc, +): + # forward pass + +# NPU +with set_ascend_forward_context( + attn_metadata, + self.vllm_config, + num_tokens=num_tokens_padded, + num_tokens_across_dp=num_tokens_across_dp, + aclgraph_runtime_mode=cudagraph_mode, # Note: 'aclgraph' not 'cudagraph' + batch_descriptor=batch_desc, + 
num_actual_tokens=scheduler_output.total_num_scheduled_tokens, + model_instance=self.model, +): + # forward pass +``` + +### Graph Capture Context +```python +# GPU +from vllm.compilation.cuda_graph import graph_capture as cuda_graph_capture +with cuda_graph_capture(self.device): + # capture + +# NPU +from vllm_ascend.worker.model_runner_v1 import graph_capture +with graph_capture(self.device): + # capture +``` + +## Graph Wrapper Usage + +### Creating Graph Wrapper +```python +# GPU +if cudagraph_mode.has_full_cudagraphs() and has_separate_talker: + self.talker_mtp = CUDAGraphWrapper( + talker_mtp, + self.vllm_config, + runtime_mode=CUDAGraphMode.FULL + ) + +# NPU +if cudagraph_mode.has_full_cudagraphs() and has_separate_talker: + self.talker_mtp = ACLGraphWrapper( + talker_mtp, + self.vllm_config, + runtime_mode=CUDAGraphMode.FULL + ) +``` + +### Checking Graph Wrapper Type +```python +# GPU +if not isinstance(self.talker_mtp, CUDAGraphWrapper): + _cudagraph_mode = CUDAGraphMode.NONE + +# NPU +if not isinstance(self.talker_mtp, ACLGraphWrapper): + _cudagraph_mode = CUDAGraphMode.NONE +``` + +## Device Operations + +### Synchronization +```python +# GPU +torch.cuda.synchronize() + +# NPU +torch.npu.synchronize() +``` + +### Stream Operations +```python +# GPU +stream = torch.cuda.Stream(device=device) +torch.cuda.current_stream() + +# NPU +stream = torch.npu.Stream(device=device) +torch.npu.current_stream() +``` + +## Attention Metadata + +### State Setting (NPU-specific) +```python +# GPU - handled internally by attention backends + +# NPU - explicit state setting required +self.attn_state = AscendAttentionState.DecodeOnly +if self.speculative_config and self.speculative_config.method == "mtp": + if self.vllm_config.model_config.use_mla: + self.attn_state = AscendAttentionState.SpecDecoding + else: + self.attn_state = AscendAttentionState.ChunkedPrefill +``` + +### Building Attention Metadata +```python +# GPU - uses vllm attention builders + +# NPU - may need 
additional parameters +(attn_metadata, spec_decode_common_attn_metadata) = self._build_attention_metadata( + num_tokens=num_tokens_unpadded, + num_tokens_padded=num_tokens_padded, + num_reqs=num_reqs, + num_reqs_padded=num_reqs_padded, + max_query_len=max_num_scheduled_tokens, + ubatch_slices=ubatch_slices_attn, + logits_indices=logits_indices, + use_spec_decode=use_spec_decode, + num_scheduled_tokens=scheduler_output.num_scheduled_tokens, + num_scheduled_tokens_np=num_scheduled_tokens_np, + cascade_attn_prefix_lens=cascade_attn_prefix_lens, +) +``` + +## Rotary Embedding + +### Update Cos/Sin Cache +```python +# GPU - typically handled inside attention + +# NPU - explicit update required before forward +from vllm_ascend.ops.rotary_embedding import update_cos_sin +update_cos_sin(positions) +``` + +## Sequence Parallelism + +### Enable SP Check +```python +# GPU - use vllm distributed utilities + +# NPU - use vllm-ascend wrapper +from vllm_ascend.utils import enable_sp + +if enable_sp(): + # sequence parallelism enabled +``` + +## Sampler + +### Sampler Type +```python +# GPU - uses vllm sampler +self.sampler = Sampler() + +# NPU - uses AscendSampler +from vllm_ascend.sample.sampler import AscendSampler +self.sampler = AscendSampler() +``` + +## Input Batch + +### Batch Class +```python +# GPU +from vllm.v1.worker.gpu_input_batch import InputBatch + +# NPU +from vllm_ascend.worker.npu_input_batch import NPUInputBatch +``` + +## Graph Parameter Updates + +### Full Graph Params Update (NPU-specific) +```python +# GPU - not needed + +# NPU - required for FULL graph mode +from vllm_ascend.compilation.acl_graph import update_full_graph_params + +forward_context = get_forward_context() +if ( + forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL + and not forward_context.capturing + and not self.use_sparse +): + update_full_graph_params( + self.attn_backend, + self.update_stream, + forward_context, + num_tokens_padded, + self.vllm_config, + 
self.speculative_config, + positions.shape[0], + ) +``` + +## Paged Attention Check + +```python +# GPU - not typically needed + +# NPU +from vllm_ascend.attention.utils import using_paged_attention + +if is_graph_capturing and using_paged_attention(num_tokens, self.vllm_config): + seq_lens = SEQ_LEN_WITH_MAX_PA_WORKSPACE +``` + +## Common Method Signature Differences + +### _dummy_run Parameters +```python +# GPU (v0.17.0) +def _dummy_run( + self, + num_tokens: int, + cudagraph_runtime_mode: CUDAGraphMode | None = None, + force_attention: bool = False, + uniform_decode: bool = False, + allow_microbatching: bool = True, + skip_eplb: bool = False, + is_profile: bool = False, + create_mixed_batch: bool = False, + remove_lora: bool = True, + is_graph_capturing: bool = False, + num_active_loras: int = 0, +) -> tuple[torch.Tensor, torch.Tensor]: + +# NPU (v0.17.0) - adds with_prefill, activate_lora +def _dummy_run( + self, + num_tokens: int, + with_prefill: bool = False, + cudagraph_runtime_mode: CUDAGraphMode | None = None, + force_attention: bool = False, + uniform_decode: bool = False, + is_profile: bool = False, + create_mixed_batch: bool = False, + allow_microbatching: bool = True, + skip_eplb: bool = False, + remove_lora: bool = True, + activate_lora: bool = False, + is_graph_capturing: bool = False, + num_active_loras: int = 0, +) -> tuple[torch.Tensor, torch.Tensor]: +``` + +### _model_forward Parameters +```python +# GPU - no num_tokens_padded +def _model_forward( + self, + input_ids: torch.Tensor | None = None, + positions: torch.Tensor | None = None, + intermediate_tensors: IntermediateTensors | None = None, + inputs_embeds: torch.Tensor | None = None, + **model_kwargs: dict[str, Any], +): + +# NPU - has num_tokens_padded as first parameter +def _model_forward( + self, + num_tokens_padded: int, + input_ids: torch.Tensor | None = None, + positions: torch.Tensor | None = None, + intermediate_tensors: IntermediateTensors | None = None, + inputs_embeds: 
torch.Tensor | None = None, + **model_kwargs: dict[str, Any], +): +``` + +## Quick Reference Table + +| Feature | GPU | NPU | +|---------|-----|-----| +| Graph wrapper | `CUDAGraphWrapper` | `ACLGraphWrapper` | +| Forward context | `set_forward_context` | `set_ascend_forward_context` | +| Runtime mode param | `cudagraph_runtime_mode` | `aclgraph_runtime_mode` | +| Device sync | `torch.cuda.synchronize()` | `torch.npu.synchronize()` | +| Stream | `torch.cuda.Stream` | `torch.npu.Stream` | +| Current stream | `torch.cuda.current_stream()` | `torch.npu.current_stream()` | +| Input batch | `InputBatch` | `NPUInputBatch` | +| Sampler | `Sampler` | `AscendSampler` | +| Attention state | N/A | `AscendAttentionState` | +| RoPE update | N/A | `update_cos_sin()` | diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md b/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md new file mode 100644 index 0000000000..8c5d32ab4c --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md @@ -0,0 +1,374 @@ +# Omni-Specific Code Blocks Reference + +This document catalogs omni-specific code blocks in the NPU model runners, making it easier to identify what needs to be preserved during upgrades. + +> **IMPORTANT**: This document may not be complete or up-to-date! 
+> +> - Always grep for `Omni-new` in the GPU implementations (`vllm_omni/worker/`) to find the authoritative list +> - New omni features may be added that are not yet documented here +> - When you discover new omni-specific blocks during an upgrade, please update this document +> - Last verified: Check git history for this file + +## OmniNPUModelRunner (npu_model_runner.py) + +### load_model - Talker MTP Initialization + +```python +def load_model(self, *args, **kwargs) -> None: + NPUModelRunner.load_model(self, *args, **kwargs) + # Initialize enable_sp cache to avoid get_current_vllm_config() error + # in _pad_for_sequence_parallelism during execute_model. + # This is a workaround for vllm-ascend not passing vllm_config to enable_sp(). + enable_sp(self.vllm_config) + # TODO move this model specific logic to a separate class + # TTS model IS the talker (no .talker sub-attr); use getattr to support both Omni and TTS. + talker_mtp = getattr(self.model, "talker_mtp", None) + if talker_mtp is not None: + self.talker_mtp = talker_mtp # type: ignore[assignment] + cudagraph_mode = self.compilation_config.cudagraph_mode + assert cudagraph_mode is not None + # Only wrap talker_mtp in CUDAGraphWrapper for Omni models that + # have a separate .talker sub-module. TTS models' code predictor + # has internal AR loops / torch.multinomial — not graph-safe. + has_separate_talker = getattr(self.model, "talker", None) is not None + if cudagraph_mode.has_full_cudagraphs() and has_separate_talker: + # NOTE: Use ACLGraphWrapper on NPU, not CUDAGraphWrapper + self.talker_mtp = ACLGraphWrapper(talker_mtp, self.vllm_config, runtime_mode=CUDAGraphMode.FULL) + # TTS exposes mtp_hidden_size; Omni uses hf_text_config.hidden_size. 
+ hidden_size = int( + getattr(self.model, "mtp_hidden_size", 0) or getattr(self.model_config.hf_text_config, "hidden_size") + ) + max_batch_size = max(self.max_num_reqs, self.compilation_config.max_cudagraph_capture_size) + self.talker_mtp_input_ids = self._make_buffer(max_batch_size, dtype=torch.int32) + self.talker_mtp_inputs_embeds = self._make_buffer( + max_batch_size, hidden_size, dtype=self.dtype, numpy=False + ) + self.last_talker_hidden = self._make_buffer(max_batch_size, hidden_size, dtype=self.dtype, numpy=False) + self.text_step = self._make_buffer(max_batch_size, hidden_size, dtype=self.dtype, numpy=False) +``` + +### _dummy_run - Talker MTP Dummy Forward + +Location: Inside `set_ascend_forward_context` block, before main model forward + +```python +# ---------------------------------------Omni-new---------------------------------------------- +if getattr(self.model, "talker", None) is not None and hasattr(self.model, "talker_mtp"): + num_tokens_padded_talker_mtp = num_tokens_padded + if num_tokens_padded_talker_mtp == self.max_num_tokens: + num_tokens_padded_talker_mtp = self.talker_mtp_input_ids.gpu.shape[0] + outputs = self.talker_mtp( + self.talker_mtp_input_ids.gpu[:num_tokens_padded_talker_mtp], + self.talker_mtp_inputs_embeds.gpu[:num_tokens_padded_talker_mtp], + self.last_talker_hidden.gpu[:num_tokens_padded_talker_mtp], + self.text_step.gpu[:num_tokens_padded_talker_mtp], + ) + self.compilation_config.cache_dir = None +# ---------------------------------------Omni-new---------------------------------------------- +``` + +### _dummy_run - Extract Multimodal Outputs + +Location: After model forward, before dummy_compute_logits + +```python +# ---------------------------------------Omni-new---------------------------------------------- +hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states) +# ---------------------------------------Omni-new---------------------------------------------- +``` + +### _model_forward - Omni 
Output Wrapping + +```python +def _model_forward( + self, + num_tokens_padded: int, + input_ids: torch.Tensor | None = None, + positions: torch.Tensor | None = None, + intermediate_tensors: IntermediateTensors | None = None, + inputs_embeds: torch.Tensor | None = None, + **model_kwargs: dict[str, Any], +): + """Override to combine NPUModelRunner's signature with OmniGPUModelRunner's logic.""" + # Omni-specific: build and inject extra model kwargs + model_kwargs_extra = self._build_model_kwargs_extra() + + # Call the model forward (same as NPUModelRunner) + assert self.model is not None + model_output = self.model( + input_ids=input_ids, + positions=positions, + intermediate_tensors=intermediate_tensors, + inputs_embeds=inputs_embeds, + **model_kwargs, + **model_kwargs_extra, + ) + + # Omni-specific: wrap output if needed + if not isinstance(model_output, OmniOutput) and hasattr(self.model, "make_omni_output"): + model_output = self.model.make_omni_output(model_output, **model_kwargs_extra) + + # Omni-specific: cache model output for later sample_tokens + self._omni_last_model_output = model_output + + # NPU-specific: update full graph params (keep from vllm-ascend) + forward_context = get_forward_context() + # ... NPU graph update logic ... 
+ + # NPU-specific: all-gather for sequence parallelism (keep from vllm-ascend) + if get_forward_context().sp_enabled and not isinstance(model_output, IntermediateTensors): + model_output = self._all_gather_hidden_states_and_aux(model_output) + + return model_output +``` + +--- + +## NPUARModelRunner (npu_ar_model_runner.py) + +### __init__ - KV Transfer Manager + +```python +def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.input_ids = self._make_buffer(self.max_num_tokens, dtype=torch.int32) + # each model stage has their own hidden size + self.hidden_size = self.model_config.hf_text_config.hidden_size + self.inputs_embeds = self._make_buffer(self.max_num_tokens, self.hidden_size, dtype=self.dtype, numpy=False) + # Initialize KV cache manager (preserve vllm_config fallback behavior) + self.kv_transfer_manager = OmniKVTransferManager.from_vllm_config(self.vllm_config, self.model_config) +``` + +### execute_model - KV Transfer Before Update States + +Location: At the very beginning of execute_model + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# [Omni] Handle KV transfer BEFORE updating states (which removes finished requests) +self.kv_extracted_req_ids = self.kv_transfer_manager.handle_finished_requests_kv_transfer( + finished_reqs=getattr(scheduler_output, "finished_requests_needing_kv_transfer", {}), + kv_caches=self.kv_caches, + block_size=self.cache_config.block_size, + cache_dtype=str(self.cache_config.cache_dtype), + request_id_resolver=self._resolve_global_request_id, +) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### execute_model - Custom _update_states Call + +Location: Inside synchronize_input_prep context + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +self._update_states(scheduler_output) +# 
------------------------------------------------------------------------------------------------ +``` + +### execute_model - Extract Multimodal Outputs + +Location: In post process section, after hidden_states assignment + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states) + +if multimodal_outputs is not None: + keys_or_type = ( + list(multimodal_outputs.keys()) + if isinstance(multimodal_outputs, dict) + else type(multimodal_outputs) + ) + logger.debug(f"[AR] execute_model: multimodal_outputs keys = {keys_or_type}") +else: + logger.debug("[AR] execute_model: multimodal_outputs is None") +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### execute_model - Compute Logits with sampling_metadata + +Location: In both broadcast_pp_output True and False branches + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# Try with sampling_metadata first; fall back to without for models that don't support it +try: + logits = self.model.compute_logits( + sample_hidden_states, sampling_metadata=self.input_batch.sampling_metadata + ) +except TypeError: + logits = self.model.compute_logits(sample_hidden_states) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### sample_tokens - KV Extracted Req IDs + +Location: At the beginning of sample_tokens + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +kv_extracted_req_ids = getattr(self, "kv_extracted_req_ids", None) +self.kv_extracted_req_ids = None +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### sample_tokens - Process Additional Information and Build Output + +Location: After 
bookkeeping sync, replacing the original output construction + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +hidden_states_cpu = hidden_states.detach().to("cpu").contiguous() +num_scheduled_tokens_np = getattr(self, "_omni_num_scheduled_tokens_np", None) +if num_scheduled_tokens_np is None: + req_ids = self.input_batch.req_ids + num_scheduled_tokens_np = np.array( + [scheduler_output.num_scheduled_tokens[rid] for rid in req_ids], + dtype=np.int32, + ) + +self._process_additional_information_updates( + hidden_states, multimodal_outputs, num_scheduled_tokens_np, scheduler_output +) + +pooler_output: list[dict[str, object]] = [] +for rid in req_ids_output_copy: + idx = req_id_to_index_output_copy[rid] + start = int(self.query_start_loc.cpu[idx]) + sched = int(num_scheduled_tokens_np[idx]) + end = start + sched + hidden_slice = hidden_states_cpu[start:end] + payload: dict[str, object] = {"hidden": hidden_slice} + if isinstance(multimodal_outputs, dict) and multimodal_outputs: + # ... multimodal output slicing logic ... 
+ pooler_output.append(payload) + +model_runner_output = OmniModelRunnerOutput( + req_ids=req_ids_output_copy, + req_id_to_index=req_id_to_index_output_copy, + sampled_token_ids=valid_sampled_token_ids, + logprobs=logprobs_lists, + prompt_logprobs_dict=prompt_logprobs_dict, + pooler_output=(pooler_output if self.vllm_config.model_config.engine_output_type != "text" else None), + kv_connector_output=kv_connector_output, +) +model_runner_output.kv_extracted_req_ids = kv_extracted_req_ids +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +--- + +## NPUGenerationModelRunner (npu_generation_model_runner.py) + +### execute_model - Async Chunk Update + +Location: Inside prepare input section, before synchronize_input_prep + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +if self.model_config.async_chunk and num_scheduled_tokens: + self._update_request_states(scheduler_output) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### execute_model - Seq Token Counts + +Location: After _preprocess call + +```python +# [Omni] Pass token counts per request for code2wav output slicing +model_kwargs["seq_token_counts"] = tokens +``` + +### execute_model - Run Generation Model + +Location: Inside forward context + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +outputs = self._run_generation_model( + num_tokens_padded=num_tokens_padded, + input_ids=input_ids, + positions=positions, + intermediate_tensors=intermediate_tensors, + inputs_embeds=inputs_embeds, + model_kwargs=model_kwargs, + logits_indices=logits_indices, +) +_, multimodal_outputs = self.extract_multimodal_outputs(outputs) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### sample_tokens - Multimodal Output Processing 
+ +The entire sample_tokens method body is omni-specific for generation models: + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +pooler_output: list[object] = [] +if isinstance(multimodal_outputs, torch.Tensor): + # ... tensor handling ... +elif isinstance(multimodal_outputs, list): + # ... list handling ... +elif isinstance(multimodal_outputs, dict): + # ... dict handling per request ... +else: + raise RuntimeError("Unsupported diffusion output type") +# [Omni] Copy req_id mappings to avoid async scheduling mutation. +req_ids_output_copy = self.input_batch.req_ids.copy() +req_id_to_index_output_copy = self.input_batch.req_id_to_index.copy() +output = OmniModelRunnerOutput( + req_ids=req_ids_output_copy, + req_id_to_index=req_id_to_index_output_copy, + sampled_token_ids=[], + logprobs=None, + prompt_logprobs_dict={}, + pooler_output=pooler_output, + kv_connector_output=kv_connector_output, + num_nans_in_logits={}, + ec_connector_output=ec_connector_output if self.supports_mm_inputs else None, +) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### _dummy_run - Model Kwargs Init and Multimodal Extract + +Location: Before model forward and after + +```python +model_kwargs = self._init_model_kwargs() # Before forward + +# ... forward ... 
+ +# -------------------------------------- Omni-new ------------------------------------------------- +hidden_states, _ = self.extract_multimodal_outputs(hidden_states) +# ------------------------------------------------------------------------------------------------- +``` + +--- + +## ExecuteModelState Extension + +The `ExecuteModelState` NamedTuple is extended for omni: + +```python +class ExecuteModelState(NamedTuple): + """Ephemeral cached state transferred between execute_model() and + sample_tokens(), after execute_model() returns None.""" + + scheduler_output: SchedulerOutput + logits: torch.Tensor + spec_decode_metadata: SpecDecodeMetadata | None + spec_decode_common_attn_metadata: AscendCommonAttentionMetadata | None + hidden_states: torch.Tensor + sample_hidden_states: torch.Tensor + aux_hidden_states: list[torch.Tensor] | None + attn_metadata: PerLayerAttnMetadata + positions: torch.Tensor + ec_connector_output: ECConnectorOutput | None + cudagraph_stats: CUDAGraphStat | None + multimodal_outputs: Any # <-- Omni extension +``` + +This extended state must be imported from `npu_ar_model_runner` in `npu_generation_model_runner`. diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md b/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md new file mode 100644 index 0000000000..4f184df0ec --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md @@ -0,0 +1,222 @@ +# NPU Model Runner Upgrade Workflow Checklist + +> **Note**: Reference documents (`omni-specific-blocks.md`) may not be complete. Always grep for `Omni-new` in GPU implementations to find all omni-specific code blocks. Update the reference docs when discovering new blocks. + +## Pre-Upgrade Preparation + +### 1. 
Version Information +- [ ] Identify current vllm-omni version: `_________` +- [ ] Identify target vllm-ascend version: `_________` +- [ ] Identify target vllm version: `_________` +- [ ] Last release date for GPU worker changes: `_________` + +### 2. Gather Git History +```bash +# GPU-side omni changes since last release +cd /root/vllm-workspace/vllm-omni +git log --oneline --since="YYYY-MM-DD" -- vllm_omni/worker/ + +# vllm-ascend NPUModelRunner changes +cd /root/vllm-workspace/vllm-ascend +git log --oneline .. -- vllm_ascend/worker/model_runner_v1.py +``` + +### 3. Backup Current Files +- [ ] Create backup of current NPU runners: + ```bash + cp -r vllm_omni/platforms/npu/worker vllm_omni/platforms/npu/worker.backup + ``` + +--- + +## OmniNPUModelRunner (npu_model_runner.py) + +### Read and Understand +- [ ] Read current `npu_model_runner.py` +- [ ] Read latest `vllm_ascend/worker/model_runner_v1.py` +- [ ] Read latest `vllm_omni/worker/gpu_model_runner.py` + +### Method: load_model +- [ ] Document existing omni-specific logic +- [ ] Copy latest NPUModelRunner.load_model structure +- [ ] Re-insert: `enable_sp(self.vllm_config)` call +- [ ] Re-insert: talker_mtp detection and setup +- [ ] Replace: `CUDAGraphWrapper` → `ACLGraphWrapper` +- [ ] Re-insert: Buffer allocations (talker_mtp_input_ids, etc.) 
+ +### Method: _dummy_run +- [ ] Document existing omni-specific logic locations +- [ ] Copy latest NPUModelRunner._dummy_run +- [ ] Re-insert: talker_mtp dummy forward block (inside context) +- [ ] Re-insert: `extract_multimodal_outputs` call +- [ ] Verify: Comment markers are present + +### Method: _model_forward +- [ ] Copy latest NPUModelRunner._model_forward structure +- [ ] Re-insert: `_build_model_kwargs_extra()` call +- [ ] Re-insert: OmniOutput wrapping logic +- [ ] Re-insert: `_omni_last_model_output` caching +- [ ] Keep: NPU graph params update +- [ ] Keep: SP all-gather logic + +### Method: _talker_mtp_forward +- [ ] Verify: Uses `set_ascend_forward_context` +- [ ] Verify: Uses `ACLGraphWrapper` check +- [ ] Sync any changes from GPU `_talker_mtp_forward` + +### Imports +- [ ] Update vllm-ascend imports to latest paths +- [ ] Verify all omni imports are present +- [ ] Remove any deprecated imports + +--- + +## NPUARModelRunner (npu_ar_model_runner.py) + +### Read and Understand +- [ ] Read current `npu_ar_model_runner.py` +- [ ] Read latest `vllm_ascend/worker/model_runner_v1.py` execute_model +- [ ] Read latest `vllm_omni/worker/gpu_ar_model_runner.py` + +### Method: __init__ +- [ ] Sync any new initialization from GPU side +- [ ] Keep: `OmniKVTransferManager` setup +- [ ] Keep: Custom buffer allocations + +### Method: execute_model +- [ ] Document all omni blocks with line numbers +- [ ] Copy latest NPUModelRunner.execute_model structure +- [ ] Re-insert: KV transfer handling (beginning) +- [ ] Re-insert: Custom `_update_states` call +- [ ] Re-insert: `extract_multimodal_outputs` +- [ ] Re-insert: `compute_logits` with sampling_metadata try/except +- [ ] Update: ExecuteModelState to include multimodal_outputs + +### Method: sample_tokens +- [ ] Document all omni blocks +- [ ] Copy latest NPUModelRunner.sample_tokens structure +- [ ] Re-insert: `kv_extracted_req_ids` handling +- [ ] Re-insert: Hidden states CPU copy +- [ ] Re-insert: 
`_process_additional_information_updates` +- [ ] Re-insert: `OmniModelRunnerOutput` construction + +### ExecuteModelState +- [ ] Verify: `multimodal_outputs` field is present +- [ ] Verify: Imported/used correctly in execute_model + +### Imports +- [ ] Update all vllm-ascend imports +- [ ] Keep omni-specific imports + +--- + +## NPUGenerationModelRunner (npu_generation_model_runner.py) + +### Read and Understand +- [ ] Read current `npu_generation_model_runner.py` +- [ ] Read latest GPU `gpu_generation_model_runner.py` + +### Method: _update_request_states +- [ ] Verify: async_chunk handling is correct +- [ ] Sync any changes from GPU side + +### Method: execute_model +- [ ] Document all omni blocks +- [ ] Copy latest NPUModelRunner.execute_model base structure +- [ ] Re-insert: async_chunk update logic +- [ ] Re-insert: `seq_token_counts` injection +- [ ] Re-insert: `_run_generation_model` call +- [ ] Re-insert: `extract_multimodal_outputs` +- [ ] Use: ExecuteModelState from npu_ar_model_runner + +### Method: sample_tokens +- [ ] Keep: Entire omni multimodal output processing +- [ ] Update: Any new output fields needed +- [ ] Keep: `OmniModelRunnerOutput` construction + +### Method: _run_generation_model +- [ ] Sync any changes from GPU side +- [ ] Keep: `_model_forward` call with sampler + +### Method: _dummy_run +- [ ] Copy latest NPUModelRunner._dummy_run +- [ ] Re-insert: `model_kwargs = self._init_model_kwargs()` +- [ ] Re-insert: `extract_multimodal_outputs` at end + +### Imports +- [ ] Import ExecuteModelState from npu_ar_model_runner +- [ ] Update vllm-ascend imports + +--- + +## Post-Upgrade Validation + +### Syntax Validation +- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py` +- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py` +- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py` + +### Import Validation +- [ ] `python -c "from 
vllm_omni.platforms.npu.worker.npu_model_runner import OmniNPUModelRunner"` +- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_ar_model_runner import NPUARModelRunner"` +- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_generation_model_runner import NPUGenerationModelRunner"` + +### Comment Markers +- [ ] Grep for "Omni-new" in all three files +- [ ] Verify all omni blocks have closing markers + +### Code Review +- [ ] No `CUDAGraphWrapper` references +- [ ] All `set_forward_context` replaced with `set_ascend_forward_context` +- [ ] Parameter names correct (`aclgraph_runtime_mode` not `cudagraph_runtime_mode`) +- [ ] No duplicate code blocks +- [ ] No missing imports + +--- + +## Git Commit + +### Commit Message Template +``` +[NPU] Upgrade model runners to align with vllm-ascend vX.Y.Z + +- Update OmniNPUModelRunner with latest NPUModelRunner base +- Update NPUARModelRunner execute_model and sample_tokens +- Update NPUGenerationModelRunner for async_chunk changes +- Sync GPU-side omni changes from vX.Y.Z release +- Preserve all omni-specific logic (marked with Omni-new comments) + +Changes from vllm-ascend: +- + +Changes synced from GPU: +- +``` + +### Files to Stage +- [ ] `vllm_omni/platforms/npu/worker/npu_model_runner.py` +- [ ] `vllm_omni/platforms/npu/worker/npu_ar_model_runner.py` +- [ ] `vllm_omni/platforms/npu/worker/npu_generation_model_runner.py` +- [ ] Any other modified files + +--- + +## Troubleshooting + +### Import Errors +- Check if vllm-ascend module paths have changed +- Verify PYTHONPATH includes both vllm-ascend and vllm-omni + +### Type Errors +- Check method signatures match between GPU and NPU +- Verify NamedTuple fields match expected structure + +### Runtime Errors +- Enable debug logging: `export VLLM_LOGGING_LEVEL=DEBUG` +- Check graph capture issues: try `--enforce-eager` +- Check attention issues: verify AscendAttentionState usage + +### Performance Regression +- Compare with previous version on same model +- Check if 
graph capture is working: look for ACLGraph logs +- Verify SP/EP configurations are correct diff --git a/.gitignore b/.gitignore index b5e002235e..35dc7571ee 100644 --- a/.gitignore +++ b/.gitignore @@ -263,3 +263,5 @@ tmp_test vllm_omni/_version.py # output files *.wav +# CI overlay yamls materialized from tests/utils.py:_CI_OVERLAYS at test time +tests/.ci_generated/ diff --git a/benchmarks/qwen3-tts/README.md b/benchmarks/qwen3-tts/README.md index 9c01f29aa9..a1c2ebe12f 100644 --- a/benchmarks/qwen3-tts/README.md +++ b/benchmarks/qwen3-tts/README.md @@ -35,8 +35,8 @@ MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice bash run_benchmark.sh --async-only # Use a Voice Clone model MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base TASK_TYPE=Base bash run_benchmark.sh --async-only -# Use bs16 config for higher throughput -STAGE_CONFIG=vllm_omni/configs/qwen3_tts_bs16.yaml bash run_benchmark.sh --async-only +# Use batch size 16 for higher throughput +BATCH_SIZE=16 bash run_benchmark.sh --async-only # Custom GPU, prompt count, concurrency levels GPU_DEVICE=1 NUM_PROMPTS=20 CONCURRENCY="1 4" bash run_benchmark.sh @@ -50,7 +50,8 @@ GPU_DEVICE=1 NUM_PROMPTS=20 CONCURRENCY="1 4" bash run_benchmark.sh CUDA_VISIBLE_DEVICES=0 python -m vllm_omni.entrypoints.cli.main serve \ "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice" \ --omni --host 127.0.0.1 --port 8000 \ - --stage-configs-path benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs1.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ + --stage-overrides '{"0":{"max_num_seqs":1,"gpu_memory_utilization":0.3,"max_num_batched_tokens":512},"1":{"max_num_seqs":1,"gpu_memory_utilization":0.3,"max_num_batched_tokens":8192}}' \ --trust-remote-code ``` @@ -84,16 +85,19 @@ python benchmarks/qwen3-tts/plot_results.py \ --output results/comparison.png ``` -## Stage Configs +## Batch-size presets -| Config | max_num_seqs | Description | -|--------|:------------:|-------------| -| `vllm_omni/configs/qwen3_tts_bs1.yaml` | 1 | Single-request processing (lowest 
latency) | -| `vllm_omni/configs/qwen3_tts_bs16.yaml` | 16 | High-throughput concurrent processing | +The bench script loads the bundled production deploy (`vllm_omni/deploy/qwen3_tts.yaml`) and layers per-stage budgets on top via `--stage-overrides`, driven by the `BATCH_SIZE` env var. Each batch size picks compatible per-stage `max_num_seqs`, `max_num_batched_tokens`, and `gpu_memory_utilization` defaults: -All configs use a 2-stage pipeline (Talker -> Code2Wav) with `async_chunk` streaming enabled. The `SharedMemoryConnector` streams codec frames (25-frame chunks with 25-frame context overlap) between stages. +| `BATCH_SIZE` | Description | +|:--:|-------------| +| `1` (default) | Single-request processing (lowest latency) | +| `4` | Moderate-throughput concurrent processing | +| `16` | High-throughput concurrent processing | -The model is specified via the CLI `--model` flag (or `MODEL` env var), so the same configs work for both the 0.6B and 1.7B model variants. +The 2-stage pipeline (Talker -> Code2Wav) runs with `async_chunk` streaming enabled via the prod deploy; the `SharedMemoryConnector` streams codec frames (25-frame chunks with 25-frame context overlap) between stages. + +The model is specified via the CLI `--model` flag (or `MODEL` env var), so the same bench script works for both the 0.6B and 1.7B model variants. 
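
As a rough illustration of how `BATCH_SIZE` turns into the `--stage-overrides` payload described above, the sketch below mirrors the inline Python helper in `run_benchmark.sh`. The function names `build_stage_overrides` and `default_gpu_mem` are hypothetical and exist only for this example; the thresholds and defaults are copied from the bench script and may change with it.

```python
import json


def default_gpu_mem(batch_size: int) -> float:
    """Default gpu_memory_utilization per stage: 0.3 at bs=1, else 0.2 (as in run_benchmark.sh)."""
    return 0.3 if batch_size == 1 else 0.2


def build_stage_overrides(batch_size: int, mem_talker: float, mem_code2wav: float) -> str:
    """Return the JSON string passed to --stage-overrides (hypothetical helper name)."""
    # The prefill budget grows with batch size on both stages.
    talker_batched = 512 if batch_size <= 4 else 4096
    code2wav_batched = 8192 if batch_size <= 4 else 32768
    overrides = {
        # Stage 0 is the Talker, stage 1 is Code2Wav.
        "0": {
            "max_num_seqs": batch_size,
            "gpu_memory_utilization": mem_talker,
            "max_num_batched_tokens": talker_batched,
        },
        "1": {
            "max_num_seqs": batch_size,
            "gpu_memory_utilization": mem_code2wav,
            "max_num_batched_tokens": code2wav_batched,
        },
    }
    return json.dumps(overrides)


# Example: the payload the script would build for BATCH_SIZE=16 with default memory settings.
mem = default_gpu_mem(16)
print(build_stage_overrides(16, mem, mem))
```

Setting `GPU_MEM_TALKER` or `GPU_MEM_CODE2WAV` explicitly replaces the corresponding default; the token budgets depend only on `BATCH_SIZE`.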
## Metrics diff --git a/benchmarks/qwen3-tts/run_benchmark.sh b/benchmarks/qwen3-tts/run_benchmark.sh index 283b6b844c..8c3e46903c 100755 --- a/benchmarks/qwen3-tts/run_benchmark.sh +++ b/benchmarks/qwen3-tts/run_benchmark.sh @@ -26,8 +26,8 @@ # # Use Voice Clone model # MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base TASK_TYPE=Base bash run_benchmark.sh --async-only # -# # Use batch_size=4 config: -# STAGE_CONFIG=vllm_omni/configs/qwen3_tts_bs4.yaml bash run_benchmark.sh --async-only +# # Use batch_size=4: +# BATCH_SIZE=4 bash run_benchmark.sh --async-only # # Environment variables: # GPU_DEVICE - GPU index to use (default: 0) @@ -35,9 +35,9 @@ # CONCURRENCY - Space-separated concurrency levels (default: "1 4 10") # MODEL - Model name (default: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice) # PORT - Server port (default: 8000) -# GPU_MEM_TALKER - gpu_memory_utilization for talker stage (default: 0.3) -# GPU_MEM_CODE2WAV - gpu_memory_utilization for code2wav stage (default: 0.2) -# STAGE_CONFIG - Path to stage config YAML (default: configs/qwen3_tts_bs1.yaml) +# BATCH_SIZE - Per-stage ``max_num_seqs`` for both talker and code2wav (default: 1) +# GPU_MEM_TALKER - gpu_memory_utilization for talker stage (default: 0.3 at bs=1, else 0.2) +# GPU_MEM_CODE2WAV - gpu_memory_utilization for code2wav stage (default: 0.3 at bs=1, else 0.2) # TASK_TYPE - Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice) set -euo pipefail @@ -51,14 +51,36 @@ NUM_PROMPTS="${NUM_PROMPTS:-50}" CONCURRENCY="${CONCURRENCY:-1 4 10}" MODEL="${MODEL:-Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice}" PORT="${PORT:-8000}" -GPU_MEM_TALKER="${GPU_MEM_TALKER:-0.3}" -GPU_MEM_CODE2WAV="${GPU_MEM_CODE2WAV:-0.2}" +BATCH_SIZE="${BATCH_SIZE:-1}" +DEFAULT_MEM=$([ "${BATCH_SIZE}" = "1" ] && echo "0.3" || echo "0.2") +GPU_MEM_TALKER="${GPU_MEM_TALKER:-${DEFAULT_MEM}}" +GPU_MEM_CODE2WAV="${GPU_MEM_CODE2WAV:-${DEFAULT_MEM}}" NUM_WARMUPS="${NUM_WARMUPS:-3}" -STAGE_CONFIG="${STAGE_CONFIG:-vllm_omni/configs/qwen3_tts_bs1.yaml}" 
+DEPLOY_CONFIG="vllm_omni/deploy/qwen3_tts.yaml" RESULT_DIR="${SCRIPT_DIR}/results" TIMESTAMP="$(date +%Y%m%d_%H%M%S)" TASK_TYPE="${TASK_TYPE:-CustomVoice}" +# Build --stage-overrides JSON from BATCH_SIZE + GPU_MEM_*. +STAGE_OVERRIDES=$( + BATCH_SIZE="${BATCH_SIZE}" \ + GPU_MEM_TALKER="${GPU_MEM_TALKER}" \ + GPU_MEM_CODE2WAV="${GPU_MEM_CODE2WAV}" \ + python - <<'PYEOF' +import json, os +bs = int(os.environ["BATCH_SIZE"]) +mem_t = float(os.environ["GPU_MEM_TALKER"]) +mem_c = float(os.environ["GPU_MEM_CODE2WAV"]) +# Prefill budget grows with batch size on both stages. +talker_batched = 512 if bs <= 4 else 4096 +code2wav_batched = 8192 if bs <= 4 else 32768 +print(json.dumps({ + "0": {"max_num_seqs": bs, "gpu_memory_utilization": mem_t, "max_num_batched_tokens": talker_batched}, + "1": {"max_num_seqs": bs, "gpu_memory_utilization": mem_c, "max_num_batched_tokens": code2wav_batched}, +})) +PYEOF +) + # Parse args RUN_ASYNC=true RUN_HF=true @@ -75,41 +97,27 @@ mkdir -p "${RESULT_DIR}" echo "============================================================" echo " Qwen3-TTS Benchmark" echo "============================================================" -echo " GPU: ${GPU_DEVICE}" -echo " Model: ${MODEL}" -echo " Prompts: ${NUM_PROMPTS}" -echo " Concurrency: ${CONCURRENCY}" -echo " Port: ${PORT}" -echo " Stage config: ${STAGE_CONFIG}" -echo " Results: ${RESULT_DIR}" -echo " Task type: ${TASK_TYPE}" +echo " GPU: ${GPU_DEVICE}" +echo " Model: ${MODEL}" +echo " Prompts: ${NUM_PROMPTS}" +echo " Concurrency: ${CONCURRENCY}" +echo " Port: ${PORT}" +echo " Deploy config: ${DEPLOY_CONFIG}" +echo " Batch size: ${BATCH_SIZE}" +echo " GPU mem T/C: ${GPU_MEM_TALKER} / ${GPU_MEM_CODE2WAV}" +echo " Results: ${RESULT_DIR}" +echo " Task type: ${TASK_TYPE}" echo "============================================================" -# Prepare stage config with correct GPU device and memory settings -prepare_config() { - local config_template="$1" - local config_name="$2" - local 
output_path="${RESULT_DIR}/${config_name}_stage_config.yaml" - - # Use sed to patch GPU device and memory utilization - sed \ - -e "s/devices: \"0\"/devices: \"${GPU_DEVICE}\"/g" \ - -e "s/gpu_memory_utilization: 0.3/gpu_memory_utilization: ${GPU_MEM_TALKER}/g" \ - -e "s/gpu_memory_utilization: 0.2/gpu_memory_utilization: ${GPU_MEM_CODE2WAV}/g" \ - "${config_template}" > "${output_path}" - - echo "${output_path}" -} - # Start server and wait for it to be ready start_server() { - local stage_config="$1" - local config_name="$2" + local config_name="$1" local log_file="${RESULT_DIR}/server_${config_name}_${TIMESTAMP}.log" echo "" echo "Starting server with config: ${config_name}" - echo " Stage config: ${stage_config}" + echo " Deploy config: ${DEPLOY_CONFIG}" + echo " Stage overrides: ${STAGE_OVERRIDES}" echo " Log file: ${log_file}" VLLM_WORKER_MULTIPROC_METHOD=spawn \ @@ -118,7 +126,8 @@ start_server() { --omni \ --host 127.0.0.1 \ --port "${PORT}" \ - --stage-configs-path "${stage_config}" \ + --deploy-config "${DEPLOY_CONFIG}" \ + --stage-overrides "${STAGE_OVERRIDES}" \ --stage-init-timeout 120 \ --trust-remote-code \ --disable-log-stats \ @@ -175,17 +184,13 @@ trap 'stop_server' EXIT # Run benchmark for a given config run_bench() { local config_name="$1" - local config_template="$2" echo "" echo "============================================================" echo " Benchmarking: ${config_name}" echo "============================================================" - local stage_config - stage_config=$(prepare_config "${config_template}" "${config_name}") - - start_server "${stage_config}" "${config_name}" + start_server "${config_name}" # Convert concurrency string to args local conc_args="" @@ -212,7 +217,7 @@ run_bench() { # Run vllm-omni benchmark if [ "${RUN_ASYNC}" = true ]; then - run_bench "async_chunk" "${SCRIPT_DIR}/${STAGE_CONFIG}" + run_bench "async_chunk" fi # Run HuggingFace baseline benchmark diff --git 
a/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs1.yaml b/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs1.yaml deleted file mode 100644 index ca441d286d..0000000000 --- a/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs1.yaml +++ /dev/null @@ -1,93 +0,0 @@ -# Qwen3-TTS batch_size=1 config (streaming with async_chunk) -# 2-stage pipeline: Talker -> Code2Wav -async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - max_num_seqs: 1 - model_stage: qwen3_tts - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - max_num_seqs: 1 - model_stage: code2wav - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 8192 - max_model_len: 32768 - engine_input_source: [0] - final_output: true - final_output_type: audio - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - 
default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 1 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - codec_streaming: true - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - codec_chunk_frames: 25 - codec_left_context_frames: 25 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs16.yaml b/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs16.yaml deleted file mode 100644 index 2cc5cf5353..0000000000 --- a/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs16.yaml +++ /dev/null @@ -1,94 +0,0 @@ -# Qwen3-TTS max_num_seqs=16 config (streaming with async_chunk) -# High-throughput concurrent request processing -# 2-stage pipeline: Talker -> Code2Wav -async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - max_num_seqs: 16 - model_stage: qwen3_tts - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 4096 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - 
engine_args: - max_num_seqs: 16 - model_stage: code2wav - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.2 - distributed_executor_backend: "mp" - max_num_batched_tokens: 16384 - max_model_len: 32768 - engine_input_source: [0] - final_output: true - final_output_type: audio - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 16 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - codec_streaming: true - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - codec_chunk_frames: 25 - codec_left_context_frames: 25 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs4.yaml b/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs4.yaml deleted file mode 100644 index 5de107d497..0000000000 --- a/benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs4.yaml +++ /dev/null @@ -1,94 +0,0 @@ -# Qwen3-TTS batch_size=4 config (streaming with async_chunk) -# Enables concurrent request processing -# 2-stage pipeline: Talker -> Code2Wav -async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - max_num_seqs: 4 - model_stage: qwen3_tts - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - 
async_scheduling: true - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - max_num_seqs: 4 - model_stage: code2wav - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.2 - distributed_executor_backend: "mp" - max_num_batched_tokens: 8192 - max_model_len: 32768 - engine_input_source: [0] - final_output: true - final_output_type: audio - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 4 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - codec_streaming: true - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - codec_chunk_frames: 25 - codec_left_context_frames: 25 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/benchmarks/qwen3-tts/vllm_omni/run_async_chunk_benchmark.sh b/benchmarks/qwen3-tts/vllm_omni/run_async_chunk_benchmark.sh index 61cf7757a9..0ede359ea3 100755 --- 
a/benchmarks/qwen3-tts/vllm_omni/run_async_chunk_benchmark.sh +++ b/benchmarks/qwen3-tts/vllm_omni/run_async_chunk_benchmark.sh @@ -31,8 +31,11 @@ PORT_OFF="${PORT_OFF:-8001}" RESULT_DIR="${SCRIPT_DIR}/results" TIMESTAMP="$(date +%Y%m%d_%H%M%S)" -STAGE_CONFIG_ON="vllm_omni/model_executor/stage_configs/qwen3_tts.yaml" -STAGE_CONFIG_OFF="vllm_omni/model_executor/stage_configs/qwen3_tts_no_async_chunk.yaml" +# The bundled ``vllm_omni/deploy/qwen3_tts.yaml`` is auto-loaded by the model +# registry; no ``--deploy-config`` flag needed on the default (ON) path. +# async_chunk OFF is selected by the ``--no-async-chunk`` CLI flag — +# the single ``qwen3_tts`` pipeline dispatches to the end-to-end codec +# processor when ``deploy.async_chunk`` is false. mkdir -p "${RESULT_DIR}" @@ -77,7 +80,6 @@ wait_for_server() { echo "" echo "[Phase 1] Starting async_chunk ON server on port ${PORT_ON}..." CUDA_VISIBLE_DEVICES=${GPU_DEVICE} vllm-omni serve "${MODEL}" \ - --stage-configs-path "${STAGE_CONFIG_ON}" \ --host 0.0.0.0 --port "${PORT_ON}" \ --trust-remote-code --enforce-eager --omni \ > "${RESULT_DIR}/server_on_${TIMESTAMP}.log" 2>&1 & @@ -104,7 +106,7 @@ sleep 5 echo "" echo "[Phase 2] Starting async_chunk OFF server on port ${PORT_OFF}..." CUDA_VISIBLE_DEVICES=${GPU_DEVICE} vllm-omni serve "${MODEL}" \ - --stage-configs-path "${STAGE_CONFIG_OFF}" \ + --no-async-chunk \ --host 0.0.0.0 --port "${PORT_OFF}" \ --trust-remote-code --enforce-eager --omni \ > "${RESULT_DIR}/server_off_${TIMESTAMP}.log" 2>&1 & diff --git a/docs/assets/WeChat.jpg b/docs/assets/WeChat.jpg index 416439f7eb..83252b7569 100644 Binary files a/docs/assets/WeChat.jpg and b/docs/assets/WeChat.jpg differ diff --git a/docs/configuration/README.md b/docs/configuration/README.md index b5761a7f1b..390176e9ce 100644 --- a/docs/configuration/README.md +++ b/docs/configuration/README.md @@ -6,7 +6,7 @@ For options within a vLLM Engine. 
Please refer to [vLLM Configuration](https://d Currently, the main options are maintained by stage configs for each model. -For specific example, please refer to [Qwen2.5-omni stage config](stage_configs/qwen2_5_omni.yaml) +For a specific example, see the [Qwen2.5-Omni deploy config](gh-file:vllm_omni/deploy/qwen2_5_omni.yaml). The matching frozen pipeline topology lives at [vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py](gh-file:vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py). For introduction, please check [Introduction for stage config](./stage_configs.md) diff --git a/docs/configuration/pd_disaggregation.md b/docs/configuration/pd_disaggregation.md index 1cf6189e60..9196bdb024 100644 --- a/docs/configuration/pd_disaggregation.md +++ b/docs/configuration/pd_disaggregation.md @@ -11,7 +11,7 @@ deployment-specific values usually change per environment: - connector backend and connector ports - connector IPs or bootstrap addresses -Start from the [default Qwen3-Omni stage config](gh-file:vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml) +Start from the [default Qwen3-Omni stage config](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml) and copy it to your own file, for example `qwen3_omni_pd.yaml`. Then apply the changes below. @@ -145,19 +145,13 @@ Compared with the default Qwen3-Omni config: ```yaml runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 edges: - from: 0 to: 1 - window_size: -1 - from: 1 to: 2 - window_size: -1 - from: 2 to: 3 - window_size: -1 ``` ## 4. Launch with your custom config diff --git a/docs/configuration/stage_configs.md b/docs/configuration/stage_configs.md index 95c42afcc7..55b4053cc7 100644 --- a/docs/configuration/stage_configs.md +++ b/docs/configuration/stage_configs.md @@ -3,7 +3,147 @@ In vLLM-Omni, the target model is separated into multiple stages, which are processed by different LLMEngines, DiffusionEngines or other types of engines. 
Depending on different types of stages, such as Autoregressive (AR) stage or Diffusion transformer (DiT) stage, each can choose corresponding schedulers, model workers to load with the Engines in a plug-in fashion. !!! note - Default stage config YAMLs (for example, `vllm_omni/model_executor/stage_configs/qwen2_5_omni.yaml` and `vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml`) are bundled and loaded automatically when `stage_configs_path` is not provided. They have been verified to work on 1xH100 for Qwen2.5-Omni and 2xH100 for Qwen3-Omni. + Default deploy config YAMLs (for example, `vllm_omni/deploy/qwen2_5_omni.yaml`, `vllm_omni/deploy/qwen3_omni_moe.yaml`, and `vllm_omni/deploy/qwen3_tts.yaml`) are bundled and loaded automatically when neither `--stage-configs-path` nor `--deploy-config` is provided — the model registry resolves the right pipeline + deploy YAML by `model_type`. The bundled defaults have been verified on 1xH100 for Qwen2.5-Omni and 2xH100 for Qwen3-Omni. Models that have not yet migrated to the new schema continue to use the legacy `vllm_omni/model_executor/stage_configs/.yaml` files via `--stage-configs-path`. + +## New deploy schema reference + +The new deploy schema lives under `vllm_omni/deploy/` and is paired with a frozen `PipelineConfig` registered by the model's `pipeline.py`. Each deploy YAML has these top-level fields: + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `base_config` | str (path) | optional | — | Overlay parent (relative or absolute). `stages:` / `platforms:` deep-merged by stage_id; other scalars overlay-wins. Intended for user-authored overlays; prod yamls stay flat. | +| `async_chunk` | bool | optional | `true` | Enable chunked streaming between stages. Pin to `false` if the pipeline runs end-to-end. | +| `connectors` | dict | optional | `null` | Named connector specs (`{name, extra}`). 
Referenced by each stage's `input_connectors` / `output_connectors`. See [Connector schema](#connector-schema). | +| `edges` | list | optional | `null` | Explicit edge list for the KV transfer graph. Auto-derived from stage inputs if omitted. | +| `stages` | list | required | — | Per-stage engine args + wiring (see [Stage fields](#stage-fields)). | +| `platforms` | dict | optional | `null` | Keyed by `npu` / `rocm` / `xpu`, each contains a `stages:` list with per-platform overrides applied on top of the CUDA defaults. | +| `pipeline` | str | optional | `null` | Override the auto-detected pipeline registry key (used for structural variants like `qwen2_5_omni_thinker_only`). | +| `trust_remote_code` | bool | optional | `true` | **Pipeline-wide.** Trust HF remote code on model load; applies to every stage. | +| `distributed_executor_backend` | str | optional | `"mp"` | **Pipeline-wide.** Executor backend (`"mp"` or `"ray"`). | +| `dtype` | str \| null | optional | `null` | **Pipeline-wide.** Model dtype for every stage. | +| `quantization` | str \| null | optional | `null` | **Pipeline-wide.** Quantization method for every stage. | +| `enable_prefix_caching` | bool | optional | `false` | **Pipeline-wide.** Prefix cache toggle applied to every stage. | +| `enable_chunked_prefill` | bool \| null | optional | `null` | **Pipeline-wide.** Chunked prefill toggle applied to every stage. | +| `data_parallel_size` | int | optional | `1` | **Pipeline-wide.** DP degree for every stage. | +| `pipeline_parallel_size` | int | optional | `1` | **Pipeline-wide.** PP degree for every stage. | + +### Stage fields + +Each entry under `stages:` accepts any `StageDeployConfig` field directly (no nested `engine_args:`). Only fields whose value legitimately varies across stages live here; pipeline-wide settings (trust_remote_code, distributed_executor_backend, dtype, quantization, prefix/chunked prefill, DP/PP sizes) are declared at the top level and applied to every stage. 
Unknown keys fall through to `engine_extras:` and are forwarded to the engine. + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `stage_id` | int | required | — | Stage identity; matched against `PipelineConfig.stages[*].stage_id`. | +| `max_num_seqs` | int | optional | `64` | Max concurrent sequences per stage. | +| `gpu_memory_utilization` | float | optional | `0.9` | Per-stage memory budget. | +| `tensor_parallel_size` | int | optional | `1` | TP degree for this stage. | +| `enforce_eager` | bool | optional | `false` | Disable CUDA graphs. | +| `max_num_batched_tokens` | int | optional | `32768` | Prefill budget. | +| `max_model_len` | int \| null | optional | `null` | Per-stage context length (auto-sets `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` when larger than HF default). | +| `async_scheduling` | bool \| null | optional | `null` | Per-stage async scheduling toggle. | +| `devices` | str | optional | `"0"` | `CUDA_VISIBLE_DEVICES`-style device list. | +| `output_connectors` | dict \| null | optional | `null` | Keyed by `to_stage_`; values are names registered under top-level `connectors:`. | +| `input_connectors` | dict \| null | optional | `null` | Keyed by `from_stage_`; values are names registered under top-level `connectors:`. | +| `default_sampling_params` | dict \| null | optional | `null` | Baseline sampling params. Deep-merged with pipeline `sampling_constraints` (pipeline wins). | +| `engine_extras` | dict | optional | `{}` | Catch-all for keys not listed above; deep-merged across overlays. Also carries per-stage overrides of pipeline-wide settings (e.g. stage-specific `dtype`). | + +### Connector schema + +Each entry under top-level `connectors:` follows this shape: + +```yaml +connectors: + : + name: # required — class registered in vllm_omni.distributed + extra: # optional — forwarded to the connector's __init__ + : + ... 
+``` + +| Connector class | Use case | `extra` keys | +|-----------------|----------|--------------| +| `SharedMemoryConnector` | Same-host KV transfer between stages (default for bundled YAMLs). | `shm_threshold_bytes` (int, default `65536`). | +| `MooncakeStoreConnector` | Cross-host KV transfer over TCP. Required for multi-node deployments. | `host`, `metadata_server`, `master`, `segment` (int bytes), `localbuf` (int bytes), `proto` (`"tcp"` / `"rdma"`). | + +A stage references a connector by name in its `input_connectors` / `output_connectors`: + +```yaml +connectors: + shm: + name: SharedMemoryConnector + +stages: + - stage_id: 0 + output_connectors: {to_stage_1: shm} + - stage_id: 1 + input_connectors: {from_stage_0: shm} +``` + +### CLI flags introduced in this refactor + +| Flag | Description | +|------|-------------| +| `--deploy-config PATH` | Load a new-schema deploy YAML. Takes precedence over `--stage-configs-path`. **Optional** — when omitted, the bundled `vllm_omni/deploy/.yaml` is auto-loaded by the model registry. | +| `--stage-overrides JSON` | Per-stage JSON overrides, e.g. `'{"0":{"gpu_memory_utilization":0.5}}'`. Per-stage values always win over global flags. | +| `--async-chunk` / `--no-async-chunk` | Flip the deploy YAML's `async_chunk:` bool. Unset (default) leaves the YAML value in force. | +| `--stage-configs-path` | **Deprecated.** Accepts legacy `stage_args` yamls and (auto-detected) new deploy yamls; emits a deprecation warning. Migrate to `--deploy-config`. To be removed in a follow-up PR. | + +### Precedence + +From highest to lowest: + +1. Per-stage flags (`--stage-overrides` JSON, `--stage--` if registered) +2. Explicit global CLI flags (`--gpu-memory-utilization 0.85`, etc.) +3. Platform section (`platforms.npu.stages`, etc.) on top of the base `stages:` +4. Overlay YAML (via `base_config:`) on top of the base YAML +5. 
Parser defaults + +### Worked override example + +Starting from the bundled `vllm_omni/deploy/qwen3_omni_moe.yaml`: + +```yaml +# vllm_omni/deploy/qwen3_omni_moe.yaml (excerpt) +async_chunk: true +stages: + - stage_id: 0 + gpu_memory_utilization: 0.9 + max_num_seqs: 32 + - stage_id: 1 + gpu_memory_utilization: 0.7 + max_num_seqs: 16 +``` + +A user-authored overlay that inherits the base and overrides only stage 1: + +```yaml +# my_overrides.yaml +base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml +stages: + - stage_id: 1 + gpu_memory_utilization: 0.5 # smaller GPU +``` + +Launched with both an explicit global flag and a per-stage override: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --deploy-config my_overrides.yaml \ + --max-model-len 16384 \ + --stage-overrides '{"0": {"max_num_seqs": 8}}' +``` + +Effective config per stage after the merge: + +| Stage | Field | Final value | Source | +|-------|-------|-------------|--------| +| 0 | `gpu_memory_utilization` | `0.9` | base YAML (overlay didn't touch stage 0) | +| 0 | `max_num_seqs` | `8` | per-stage CLI (`--stage-overrides`) — wins over base `32` | +| 0 | `max_model_len` | `16384` | global CLI | +| 1 | `gpu_memory_utilization` | `0.5` | overlay YAML — wins over base `0.7` | +| 1 | `max_num_seqs` | `16` | base YAML (overlay didn't touch this field) | +| 1 | `max_model_len` | `16384` | global CLI | +| 2 | (all defaults) | — | base YAML (no overrides apply) | Therefore, as a core part of vLLM-Omni, the stage configs for a model have several main functions: @@ -35,7 +175,7 @@ stage_args: - stage_id: 0 # mark the unique id for each stage runtime: # The disaggregated configuration process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) + devices: "0" # Logical device index for this stage (mapped through CUDA_VISIBLE_DEVICES / ASCEND_RT_VISIBLE_DEVICES if set) engine_args: # Engine arguments 
for a certain engine model_stage: thinker max_num_seqs: 1 @@ -114,16 +254,12 @@ stage_args: # Top-level runtime config (concise): default windows and stage edges runtime: enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage + edges: - from: 0 # thinker → talker: trigger only after receiving full input (-1) to: 1 - window_size: -1 - from: 1 # talker → code2wav: trigger only after receiving full input (-1) to: 2 - window_size: -1 ``` @@ -155,7 +291,9 @@ Default: `true` #### `runtime.devices` -Visible devices for this stage, specified as a string. This controls which GPU devices are available to the stage process, similar to setting `CUDA_VISIBLE_DEVICES` or using `torch.cuda.set_device()`. For example, `"0"` uses GPU 0, `"1"` uses GPU 1, and `"0,1"` makes both GPUs 0 and 1 visible. +Logical device indices for this stage, specified as a string. Values are **logical indices** (`0`, `1`, `2`, ...) — not physical GPU IDs — and are mapped through the platform's visibility env var (`CUDA_VISIBLE_DEVICES` on CUDA, `ASCEND_RT_VISIBLE_DEVICES` on NPU) before being applied via `torch.cuda.set_device()` (or the equivalent). + +Example: if `CUDA_VISIBLE_DEVICES=0,2,4` is set in the environment, then `devices: "0"` selects physical GPU 0 (the first visible), `devices: "1"` selects physical GPU 2, and `devices: "0,1"` makes physical GPUs 0 and 2 available to the stage. If no visibility env var is set, logical and physical IDs coincide. Default: `"0"` diff --git a/docs/configuration/stage_configs/qwen2_5_omni.yaml b/docs/configuration/stage_configs/qwen2_5_omni.yaml deleted file mode 100644 index 690577b84a..0000000000 --- a/docs/configuration/stage_configs/qwen2_5_omni.yaml +++ /dev/null @@ -1,94 +0,0 @@ -# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime. 
-stage_args: - - stage_id: 0 - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - stage_id: 2 - runtime: - process: true - devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.15 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - engine_input_source: [1] - 
final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/docs/contributing/ci/CI_5levels.md b/docs/contributing/ci/CI_5levels.md index b0428ddd7d..2452ef5d4a 100644 --- a/docs/contributing/ci/CI_5levels.md +++ b/docs/contributing/ci/CI_5levels.md @@ -231,8 +231,7 @@ vllm_omni/ tests/ │ ├── test_qwen3_omni_expansion.py │ ├── test_mimo_audio.py │ ├── test_image_gen_edit.py - │ ├── test_images_generations_lora.py - │ └── stage_configs/ + │ └── test_images_generations_lora.py └── offline_inference/ ✅ ├── test_qwen2_5_omni.py ├── test_qwen3_omni.py @@ -248,11 +247,12 @@ vllm_omni/ tests/ ├── test_diffusion_layerwise_offload.py ├── test_diffusion_lora.py ├── test_sequence_parallel.py - └── stage_configs/ - ├── qwen2_5_omni_ci.yaml - ├── qwen3_omni_ci.yaml - ├── bagel_*.yaml - └── npu/, rocm/, etc. + └── stage_configs/ (legacy schema, still + ├── bagel_*.yaml present for unmigrated + └── npu/, rocm/, etc. models) + +# Migrated models (qwen3_omni_moe, qwen2_5_omni, qwen3_tts) live under +# vllm_omni/deploy/ instead — see docs/configuration/stage_configs.md. 
``` diff --git a/docs/contributing/ci/tests_style.md b/docs/contributing/ci/tests_style.md index 69d5b16d7a..392f004721 100644 --- a/docs/contributing/ci/tests_style.md +++ b/docs/contributing/ci/tests_style.md @@ -135,8 +135,7 @@ vllm_omni/ tests/ │ ├── test_qwen3_omni_expansion.py │ ├── test_mimo_audio.py │ ├── test_image_gen_edit.py - │ ├── test_images_generations_lora.py - │ └── stage_configs/ + │ └── test_images_generations_lora.py └── offline_inference/ ✅ ├── test_qwen2_5_omni.py ├── test_qwen3_omni.py @@ -153,11 +152,12 @@ vllm_omni/ tests/ ├── test_diffusion_lora.py ├── test_sequence_parallel.py ├── test_qwen_image_edit_expansion.py - └── stage_configs/ - ├── qwen2_5_omni_ci.yaml - ├── qwen3_omni_ci.yaml - ├── bagel_*.yaml + └── stage_configs/ (legacy schema, still present + ├── bagel_*.yaml for unmigrated models) └── npu/, rocm/, etc. + +# Migrated models (qwen3_omni_moe, qwen2_5_omni, qwen3_tts) live under +# vllm_omni/deploy/ instead — see docs/configuration/stage_configs.md. examples/ tests │ └── examples ├── online_serving/ → ├── online_serving/ @@ -229,6 +229,7 @@ from tests.conftest import ( generate_synthetic_video, merge_base64_and_convert_to_text, ) +from tests.utils import get_deploy_config_path from vllm_omni.platforms import current_omni_platform # Edit: model name and stage config path @@ -236,7 +237,7 @@ models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"] #If you use the default configuration file, you can directly use the following address. def get_default_config(): - return str(Path(__file__).parent.parent / "stage_configs" / "qwen3_omni_ci.yaml") + return get_deploy_config_path("ci/qwen3_omni_moe.yaml") #If you need to modify the configuration file, you can use modify_stage_config. 
def get_chunk_config(): diff --git a/docs/contributing/model/adding_omni_model.md b/docs/contributing/model/adding_omni_model.md index a0619e3381..478e77c7d5 100644 --- a/docs/contributing/model/adding_omni_model.md +++ b/docs/contributing/model/adding_omni_model.md @@ -313,7 +313,7 @@ The registry uses lazy loading, so the model class is imported only when needed. ## Stage Configuration -Create a YAML configuration file in `vllm_omni/model_executor/stage_configs/`. For a complete example, see the [Qwen3-Omni configuration file](gh-file:vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml). +Create a YAML configuration file in `vllm_omni/deploy/`. For a complete example, see the [Qwen3-Omni configuration file](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml). ### Key Configuration Fields @@ -614,7 +614,7 @@ For a complete reference implementation, see: - **Thinker**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_thinker.py` - **Talker**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py` - **Code2Wav**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_code2wav.py` -- **Stage config**: `vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml` +- **Stage config**: `vllm_omni/deploy/qwen3_omni_moe.yaml` - **Input processors**: `vllm_omni/model_executor/stage_input_processors/qwen3_omni.py` - **Registry**: `vllm_omni/model_executor/models/registry.py` - **Testing**: `vllm_omni/tests/e2e/offline_inference/test_qwen3_omni.py` diff --git a/docs/contributing/model/adding_tts_model.md b/docs/contributing/model/adding_tts_model.md index e48ae5049f..66da1749ce 100644 --- a/docs/contributing/model/adding_tts_model.md +++ b/docs/contributing/model/adding_tts_model.md @@ -120,8 +120,18 @@ vllm_omni/model_executor/stage_configs/ | `models/qwen3_tts/qwen3_tts.py` | Unified model class | | `models/qwen3_tts/qwen3_tts_code_predictor_vllm.py` | Stage 0 - optimized AR | | `models/qwen3_tts/qwen3_tts_code2wav.py` | Stage 1 - decoder | -| 
`stage_configs/qwen3_tts.yaml` | Stage config (async_chunk enabled) | -| `stage_configs/qwen3_tts_batch.yaml` | Batch mode config | +| `deploy/qwen3_tts.yaml` (new schema) | Deploy config (async_chunk enabled) — paired with `models/qwen3_tts/pipeline.py` for the frozen topology | + +> **Chunked vs end-to-end modes**: `qwen3_tts` registers a single +> pipeline whose stage 1 declares alternate processor functions — an +> `async_chunk_process_next_stage_input_func` (per-chunk streaming, used +> when `deploy.async_chunk=True`) and a `sync_process_input_func` +> (batch-end, used when `deploy.async_chunk=False`). The loader selects +> one at merge time based on the bool, so `--no-async-chunk` alone +> switches modes — no variant yaml or variant pipeline registration is +> needed. Pipelines that only make sense in one mode (e.g. +> `qwen3_omni_moe` is always chunked) can keep using the unconditional +> `custom_process_*` fields. | `stage_input_processors/qwen3_tts.py` | Stage transition processors | ## Step-by-Step Implementation @@ -574,7 +584,8 @@ Adding a TTS model to vLLM-Omni involves: | `models/qwen3_tts/qwen3_tts.py` | Unified model class | | `models/qwen3_tts/qwen3_tts_code_predictor_vllm.py` | AR stage with vLLM fused ops | | `models/qwen3_tts/qwen3_tts_code2wav.py` | Decoder stage with `chunked_decode_streaming()` | -| `stage_configs/qwen3_tts.yaml` | Stage configuration | +| `models/qwen3_tts/pipeline.py` | Frozen pipeline topology (registered at import time) | +| `deploy/qwen3_tts.yaml` | Deploy config (user-editable, async_chunk + SharedMemoryConnector) | | `stage_input_processors/qwen3_tts.py` | Stage transition processors | For more information, see: diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index 418fb707ae..6c209e5659 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -127,10 +127,11 @@ Multi-stage omni serving: ```bash vllm serve Qwen/Qwen2.5-Omni-7B \ --omni \ - --stage-configs-path 
qwen2_5_omni.yaml \ --port 8091 ``` +(The default deploy config at `vllm_omni/deploy/qwen2_5_omni.yaml` is loaded automatically. Pass `--deploy-config /path/to/custom.yaml` to override.) + Single-stage diffusion serving with torch profiler: ```bash diff --git a/docs/design/feature/teacache.md b/docs/design/feature/teacache.md index 9fa315cee7..8577cff1f0 100644 --- a/docs/design/feature/teacache.md +++ b/docs/design/feature/teacache.md @@ -326,9 +326,41 @@ for prompt in tqdm(prompts, desc="Collecting data"): # Estimate coefficients coeffs = estimator.estimate(poly_order=4) -print(f"Estimated coefficients: {coeffs.tolist()}") +print(f"Estimated coefficients: {coeffs}") ``` +Note: some models may require the vLLM context and config to be initialized to initialize vLLM modules. To this end, you may need a workaround like the following to be able to run coefficient estimation. +```python +from vllm_omni.diffusion.forward_context import set_forward_context +from vllm_omni.diffusion.distributed.parallel_state import ( + init_distributed_environment, + initialize_model_parallel, +) +from vllm.config import VllmConfig +... + +if __name__ == "__main__": + os.environ["MASTER_ADDR"] = "localhost" + os.environ["MASTER_PORT"] = "8192" + os.environ["LOCAL_RANK"] = "0" + os.environ["RANK"] = "0" + os.environ["WORLD_SIZE"] = "1" + + vllm_config = VllmConfig() + init_distributed_environment() + initialize_model_parallel() + + # NOTE: you may have to pass an initialized OmniDiffusionConfig as a kwarg + # here to make current sp checks happy; if this is the case, just create one + # .from_kwargs() with the model name to get around this check for now, + # since your estimator subclass should handle the actual model configuration. 
+ # + # This will be cleaned up in the future + with set_forward_context(vllm_config): + +``` + + **Data Statistics Guide:** | Metric | Good Range | Warning Signs | diff --git a/docs/design/figures/omni/E2EL_s_vllm_omni_vs_transformers.png b/docs/design/figures/omni/E2EL_s_vllm_omni_vs_transformers.png new file mode 100644 index 0000000000..15112d5862 Binary files /dev/null and b/docs/design/figures/omni/E2EL_s_vllm_omni_vs_transformers.png differ diff --git a/docs/design/figures/omni/Mean_AUDIO_RTF_Baseline_vs_Batch.png b/docs/design/figures/omni/Mean_AUDIO_RTF_Baseline_vs_Batch.png new file mode 100644 index 0000000000..2f0615f77b Binary files /dev/null and b/docs/design/figures/omni/Mean_AUDIO_RTF_Baseline_vs_Batch.png differ diff --git a/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_CUDA_Graph_vs_Async_Chunk.png b/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_CUDA_Graph_vs_Async_Chunk.png new file mode 100644 index 0000000000..62d8bc79b6 Binary files /dev/null and b/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_CUDA_Graph_vs_Async_Chunk.png differ diff --git a/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_vs_Batch_CUDA_Graph.png b/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_vs_Batch_CUDA_Graph.png new file mode 100644 index 0000000000..5838b45319 Binary files /dev/null and b/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_vs_Batch_CUDA_Graph.png differ diff --git a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Baseline_vs_Batch.png b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Baseline_vs_Batch.png new file mode 100644 index 0000000000..24be814b7e Binary files /dev/null and b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Baseline_vs_Batch.png differ diff --git a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_CUDA_Graph_vs_Async_Chunk.png b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_CUDA_Graph_vs_Async_Chunk.png new file mode 100644 index 0000000000..c8df58ebcd Binary files /dev/null and 
b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_CUDA_Graph_vs_Async_Chunk.png differ diff --git a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_vs_Batch_CUDA_Graph.png b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_vs_Batch_CUDA_Graph.png new file mode 100644 index 0000000000..2d1a04e9c2 Binary files /dev/null and b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_vs_Batch_CUDA_Graph.png differ diff --git a/docs/design/figures/omni/Mean_E2EL_ms_Baseline_vs_Batch.png b/docs/design/figures/omni/Mean_E2EL_ms_Baseline_vs_Batch.png new file mode 100644 index 0000000000..e598b54343 Binary files /dev/null and b/docs/design/figures/omni/Mean_E2EL_ms_Baseline_vs_Batch.png differ diff --git a/docs/design/figures/omni/Mean_E2EL_ms_Batch_CUDA_Graph_vs_Async_Chunk.png b/docs/design/figures/omni/Mean_E2EL_ms_Batch_CUDA_Graph_vs_Async_Chunk.png new file mode 100644 index 0000000000..54452013eb Binary files /dev/null and b/docs/design/figures/omni/Mean_E2EL_ms_Batch_CUDA_Graph_vs_Async_Chunk.png differ diff --git a/docs/design/figures/omni/Mean_E2EL_ms_Batch_vs_Batch_CUDA_Graph.png b/docs/design/figures/omni/Mean_E2EL_ms_Batch_vs_Batch_CUDA_Graph.png new file mode 100644 index 0000000000..04c5ad7396 Binary files /dev/null and b/docs/design/figures/omni/Mean_E2EL_ms_Batch_vs_Batch_CUDA_Graph.png differ diff --git a/docs/design/figures/omni/RTF_vllm_omni_vs_transformers.png b/docs/design/figures/omni/RTF_vllm_omni_vs_transformers.png new file mode 100644 index 0000000000..d93ba0b2af Binary files /dev/null and b/docs/design/figures/omni/RTF_vllm_omni_vs_transformers.png differ diff --git a/docs/design/figures/omni/Summary_E2EL_ms_vs_features.png b/docs/design/figures/omni/Summary_E2EL_ms_vs_features.png new file mode 100644 index 0000000000..04087b5910 Binary files /dev/null and b/docs/design/figures/omni/Summary_E2EL_ms_vs_features.png differ diff --git a/docs/design/figures/omni/Summary_RTF_vs_features.png b/docs/design/figures/omni/Summary_RTF_vs_features.png new 
file mode 100644 index 0000000000..c2c8ad4083 Binary files /dev/null and b/docs/design/figures/omni/Summary_RTF_vs_features.png differ diff --git a/docs/design/figures/omni/Summary_TTFP_ms_vs_features.png b/docs/design/figures/omni/Summary_TTFP_ms_vs_features.png new file mode 100644 index 0000000000..3dcc1c5537 Binary files /dev/null and b/docs/design/figures/omni/Summary_TTFP_ms_vs_features.png differ diff --git a/docs/design/figures/omni/TTFP_s_vllm_omni_vs_transformers.png b/docs/design/figures/omni/TTFP_s_vllm_omni_vs_transformers.png new file mode 100644 index 0000000000..9a5b6c9bda Binary files /dev/null and b/docs/design/figures/omni/TTFP_s_vllm_omni_vs_transformers.png differ diff --git a/docs/design/figures/tts/Mean_AUDIO_RTF_vllm_omni_vs_transformers.png b/docs/design/figures/tts/Mean_AUDIO_RTF_vllm_omni_vs_transformers.png new file mode 100644 index 0000000000..68f0ef17e8 Binary files /dev/null and b/docs/design/figures/tts/Mean_AUDIO_RTF_vllm_omni_vs_transformers.png differ diff --git a/docs/design/figures/tts/Mean_AUDIO_TTFP_(ms)_vllm_omni_vs_transformers.png b/docs/design/figures/tts/Mean_AUDIO_TTFP_(ms)_vllm_omni_vs_transformers.png new file mode 100644 index 0000000000..44be96e96d Binary files /dev/null and b/docs/design/figures/tts/Mean_AUDIO_TTFP_(ms)_vllm_omni_vs_transformers.png differ diff --git a/docs/design/figures/tts/Mean_E2EL_(ms)_vllm_omni_vs_transformers.png b/docs/design/figures/tts/Mean_E2EL_(ms)_vllm_omni_vs_transformers.png new file mode 100644 index 0000000000..2e5d1482bd Binary files /dev/null and b/docs/design/figures/tts/Mean_E2EL_(ms)_vllm_omni_vs_transformers.png differ diff --git a/docs/design/figures/tts/Mean_mean_e2e_ms_baseline_vs_batch.png b/docs/design/figures/tts/Mean_mean_e2e_ms_baseline_vs_batch.png new file mode 100644 index 0000000000..04d8f0bac5 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_e2e_ms_baseline_vs_batch.png differ diff --git 
a/docs/design/figures/tts/Mean_mean_e2e_ms_batch_vs_cuda_graph.png b/docs/design/figures/tts/Mean_mean_e2e_ms_batch_vs_cuda_graph.png new file mode 100644 index 0000000000..eb85ec0dd4 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_e2e_ms_batch_vs_cuda_graph.png differ diff --git a/docs/design/figures/tts/Mean_mean_e2e_ms_cuda_graph_vs_async_chunk.png b/docs/design/figures/tts/Mean_mean_e2e_ms_cuda_graph_vs_async_chunk.png new file mode 100644 index 0000000000..6f0e0e2529 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_e2e_ms_cuda_graph_vs_async_chunk.png differ diff --git a/docs/design/figures/tts/Mean_mean_rtf_baseline_vs_batch.png b/docs/design/figures/tts/Mean_mean_rtf_baseline_vs_batch.png new file mode 100644 index 0000000000..89ea30a864 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_rtf_baseline_vs_batch.png differ diff --git a/docs/design/figures/tts/Mean_mean_rtf_batch_vs_cuda_graph.png b/docs/design/figures/tts/Mean_mean_rtf_batch_vs_cuda_graph.png new file mode 100644 index 0000000000..2b207b8898 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_rtf_batch_vs_cuda_graph.png differ diff --git a/docs/design/figures/tts/Mean_mean_rtf_cuda_graph_vs_async_chunk.png b/docs/design/figures/tts/Mean_mean_rtf_cuda_graph_vs_async_chunk.png new file mode 100644 index 0000000000..f5f7ad72c8 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_rtf_cuda_graph_vs_async_chunk.png differ diff --git a/docs/design/figures/tts/Mean_mean_ttfp_ms_baseline_vs_batch.png b/docs/design/figures/tts/Mean_mean_ttfp_ms_baseline_vs_batch.png new file mode 100644 index 0000000000..6f8c1da4a5 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_ttfp_ms_baseline_vs_batch.png differ diff --git a/docs/design/figures/tts/Mean_mean_ttfp_ms_batch_vs_cuda_graph.png b/docs/design/figures/tts/Mean_mean_ttfp_ms_batch_vs_cuda_graph.png new file mode 100644 index 0000000000..b0fe1d02a9 Binary files /dev/null and 
b/docs/design/figures/tts/Mean_mean_ttfp_ms_batch_vs_cuda_graph.png differ diff --git a/docs/design/figures/tts/Mean_mean_ttfp_ms_cuda_graph_vs_async_chunk.png b/docs/design/figures/tts/Mean_mean_ttfp_ms_cuda_graph_vs_async_chunk.png new file mode 100644 index 0000000000..008ba9bf78 Binary files /dev/null and b/docs/design/figures/tts/Mean_mean_ttfp_ms_cuda_graph_vs_async_chunk.png differ diff --git a/docs/design/figures/tts/Summary_mean_e2e_ms_vs_features.png b/docs/design/figures/tts/Summary_mean_e2e_ms_vs_features.png new file mode 100644 index 0000000000..7c65aa1177 Binary files /dev/null and b/docs/design/figures/tts/Summary_mean_e2e_ms_vs_features.png differ diff --git a/docs/design/figures/tts/Summary_mean_rtf_vs_features.png b/docs/design/figures/tts/Summary_mean_rtf_vs_features.png new file mode 100644 index 0000000000..71bb2c5468 Binary files /dev/null and b/docs/design/figures/tts/Summary_mean_rtf_vs_features.png differ diff --git a/docs/design/figures/tts/Summary_mean_ttfp_ms_vs_features.png b/docs/design/figures/tts/Summary_mean_ttfp_ms_vs_features.png new file mode 100644 index 0000000000..cef2546d6f Binary files /dev/null and b/docs/design/figures/tts/Summary_mean_ttfp_ms_vs_features.png differ diff --git a/docs/design/qwen3_omni_tts_performance_optimization.md b/docs/design/qwen3_omni_tts_performance_optimization.md new file mode 100644 index 0000000000..2f18a1b1bc --- /dev/null +++ b/docs/design/qwen3_omni_tts_performance_optimization.md @@ -0,0 +1,539 @@ +# Speech Generation on vLLM-Omni: Performance Optimizations for Qwen3-Omni and Qwen3-TTS + +## Summary + +vLLM-Omni supports end-to-end serving for speech-generating models, including both **Qwen3-Omni** (multimodal understanding + speech) and **Qwen3-TTS** (text-to-speech). Despite their different architectures, both models share the same multi-stage pipeline design and benefit from the same set of stacked optimizations: + +1. 
**Batching** improves GPU utilization stage by stage and increases overall throughput. +2. **CUDA Graph** reduces CPU launch overhead and decode-time jitter on stable shapes. +3. **Async Chunk and Streaming Output** overlap compute and communication across stages and emit audio incrementally, improving both TTFP and E2E. + +### Model architectures + +**Qwen3-Omni** is a native multimodal model that understands text, audio, image, and video inputs, and generates both text and speech outputs. Its pipeline has three stages: + +- **Thinker**: multimodal understanding and text generation +- **Talker (+ Talker-MTP / code predictor path)**: converts semantic/text representations into codec tokens +- **Code2Wav**: decodes codec tokens into waveform audio + +**Qwen3-TTS** is a lightweight, high-quality text-to-speech model. Its pipeline has two stages: + +- **Talker (AR decoder)**: auto-regressively generates codec tokens from text input +- **Code2Wav (vocoder)**: decodes codec tokens into waveform audio + +The optimizations described in this post apply to both models. We present results for each side by side. + +### vLLM-Omni vs HF Transformers + +Compared with **HF Transformers** (offline, single request), vLLM-Omni with the full optimization stack delivers dramatically lower latency and higher efficiency for both models. + +**Qwen3-Omni** (A100): + + + + + +
Qwen3-Omni E2EL: vLLM vs HFQwen3-Omni TTFP: vLLM vs HFQwen3-Omni RTF: vLLM vs HF
+ +| Metric | vLLM-Omni | HF Transformers | Improvement | +| --- | --- | --- | --- | +| E2E latency (s) | 23.78 | 336.10 | ~93% reduction | +| TTFP (s) | 0.934 | 336.10 | ~99.7% reduction | +| RTF | 0.32 | 3.776 | ~91% reduction (~12× faster) | + +- **E2E latency**: 23.78 s vs 336.10 s - **~93%** reduction +- **TTFP**: 0.934 s vs 336.10 s - **~99.7%** reduction +- **RTF**: 0.32 vs 3.776 - **~91%** reduction (~12x faster) + +**Qwen3-TTS** (H200, concurrency 1): + + + + + +
Qwen3-TTS E2EL: vLLM vs HFQwen3-TTS TTFP: vLLM vs HFQwen3-TTS RTF: vLLM vs HF
+ +| Metric | vLLM-Omni | HF Transformers | Improvement | +| --- | --- | --- | --- | +| E2E latency (ms) | 941 | 15,513 | ~94% reduction | +| TTFP (ms) | 64 | 15,513 | ~99.6% reduction (242× faster) | +| RTF | 0.16 | 2.64 | ~94% reduction (~16.5× faster) | + +- **E2E latency**: 941 ms vs 15,513 ms - **~94%** reduction +- **TTFP**: 64 ms vs 15,513 ms - **~99.6%** reduction (242x faster) +- **RTF**: 0.16 vs 2.64 - **~94%** reduction (~16.5x faster) + +### Stacked optimization summary + +Each optimization stacks on the previous one. The summary plots below show the cumulative effect at each step, with one line per concurrency level (1, 4, 10). + +**Qwen3-Omni** (A100): + + + + + +
Qwen3-Omni E2EL: stacked optimizationQwen3-Omni TTFP: stacked optimizationQwen3-Omni RTF: stacked optimization
+ +- **E2EL reduction**: ~74% at concurrency 10 (410,054 ms -> 104,901 ms); ~90% at concurrency 1 (426,529 ms -> 41,216 ms) +- **TTFP reduction**: ~96% at concurrency 10 (409,705 ms -> 16,482 ms); ~99.7% at concurrency 1 (426,078 ms -> 1,164 ms) +- **RTF reduction**: ~74% at concurrency 10 (2.83 -> 0.74); ~90% at concurrency 1 (2.08 -> 0.21) + +**Qwen3-TTS** (H200): + + + + + +
Qwen3-TTS E2EL: stacked optimizationQwen3-TTS TTFP: stacked optimizationQwen3-TTS RTF: stacked optimization
+ +- **E2EL reduction**: ~85% at concurrency 10 (12,141 ms -> 1,767 ms); ~29% at concurrency 1 (1,323 ms -> 941 ms) +- **TTFP reduction**: ~96.5% at concurrency 10 (12,141 ms -> 425 ms); ~95% at concurrency 1 (1,323 ms -> 64 ms) +- **RTF reduction**: ~86% at concurrency 10 (2.19 -> 0.31); ~30% at concurrency 1 (0.23 -> 0.16) + +**Benchmark environment:** + +| | Qwen3-Omni | Qwen3-TTS | +| --- |-----------------------------| --- | +| **GPU** | A100 | H200 | +| **Model** | Qwen3-Omni-30B-A3B-Instruct | Qwen3-TTS-12Hz-1.7B-CustomVoice | +| **vLLM** | v0.17.0 | v0.18.0 | +| **vllm-omni** | commit 199f7832 | v0.18.0rc2 | +| **CUDA** | 12.9 | 12.8 | + +This post walks through each optimization in the same order they are typically enabled in practice, then ends with deployment playbooks for both models. + +--- + +## Pipeline Batching + +### How stage-wise batching works + +For both Qwen3-Omni and Qwen3-TTS, batching is a pipeline-level optimization: + +- Requests are grouped per stage using `runtime.max_batch_size` +- Each stage executes batch inference with its own scheduler/worker +- Stage outputs are routed to downstream stages with per-request mapping preserved + +**Batching strategy by stage:** The understanding and decode stages (Thinker for Omni, Talker for both) use **continuous batching**: requests can join and leave the batch over time. Code2Wav uses **static batching**: once a batch is formed, the stage runs the whole batch before starting the next. This matches the decode pattern of Code2Wav and keeps implementation simple while still improving throughput. + +### Batching results (Baseline vs. Batch) + +Batching alone greatly reduces E2EL and RTF across all concurrencies. The biggest gains appear at high concurrency where requests share GPU resources. + +**Qwen3-Omni** (A100): + + + + + +
Qwen3-Omni E2EL: Baseline vs BatchQwen3-Omni TTFP: Baseline vs BatchQwen3-Omni RTF: Baseline vs Batch
+ +| Metric | Concurrency | Baseline | + Batch | Improvement | +| --- | --- | --- | --- | --- | +| E2EL (ms) | 1 | 426,529 | 307,719 | 1.4× | +| E2EL (ms) | 4 | 407,213 | 376,934 | 1.1× | +| E2EL (ms) | 10 | 410,054 | 234,844 | 1.7× | +| TTFP (ms) | 1 | 426,078 | 307,262 | 1.4× | +| TTFP (ms) | 4 | 406,843 | 376,466 | 1.1× | +| TTFP (ms) | 10 | 409,705 | 234,557 | 1.7× | +| RTF | 1 | 2.08 | 1.51 | 1.4× | +| RTF | 4 | 2.55 | 1.83 | 1.4× | +| RTF | 10 | 2.83 | 2.28 | 1.2× | + +At concurrency 10, E2EL drops from ~410 s to ~235 s; at concurrency 1, from ~427 s to ~308 s. + +**Qwen3-TTS** (H200): + + + + + +
Qwen3-TTS E2EL: Baseline vs BatchQwen3-TTS TTFP: Baseline vs BatchQwen3-TTS RTF: Baseline vs Batch
+ +| Metric | Concurrency | Baseline | + Batch | Improvement | +| --- | --- | --- | --- | --- | +| E2EL (ms) | 1 | 1,323 | 1,339 | 1.0× | +| E2EL (ms) | 4 | 5,171 | 1,471 | 3.5× | +| E2EL (ms) | 10 | 12,141 | 1,705 | 7.1× | +| RTF | 1 | 0.230 | 0.234 | 1.0× | +| RTF | 4 | 0.908 | 0.255 | 3.6× | +| RTF | 10 | 2.186 | 0.292 | 7.5× | +| Throughput (audio-s/wall-s) | 10 | 3.99 | 33.53 | 8.4× | + +At concurrency 10, batching alone brings Qwen3-TTS RTF from 2.19 (slower than realtime) down to 0.29 (faster than realtime), and throughput from 4.0 to 33.5 audio-sec/wall-sec. + +--- + +## CUDA Graph on the Critical Decode Path + +### Why CUDA Graph helps here + +In decode-heavy serving, repeatedly launching many small kernels from CPU can become a visible overhead. CUDA Graph reduces this overhead by capturing and replaying stable execution graphs. + +In stage configs, this is represented by `enforce_eager: false` for stages where graph capture is desired (Thinker/Talker), while Code2Wav keeps eager mode depending on stage behavior. + +### CUDA Graph results on top of batching + +**Qwen3-Omni** (A100): + + + + + +
Qwen3-Omni E2EL: Batch vs CUDA GraphQwen3-Omni TTFP: Batch vs CUDA GraphQwen3-Omni RTF: Batch vs CUDA Graph
+ +| Metric | Concurrency | Batch | + CUDA Graph | Improvement | +| --- | --- | --- | --- | --- | +| E2EL (ms) | 1 | 307,719 | 61,613 | 5.0× | +| E2EL (ms) | 4 | 376,934 | 79,019 | 4.8× | +| E2EL (ms) | 10 | 234,844 | 126,867 | 1.9× | +| TTFP (ms) | 1 | 307,262 | 61,257 | 5.0× | +| TTFP (ms) | 4 | 376,466 | 78,634 | 4.8× | +| TTFP (ms) | 10 | 234,557 | 126,534 | 1.9× | +| RTF | 1 | 1.51 | 0.32 | 4.7× | +| RTF | 4 | 1.83 | 0.43 | 4.3× | +| RTF | 10 | 2.28 | 0.90 | 2.5× | + +For the larger Qwen3-Omni model (30B-A3B), CUDA Graph provides a significant improvement. At concurrency 1, E2EL drops from ~308 s to ~62 s; at concurrency 10, from ~235 s to ~127 s. + +**Qwen3-TTS** (H200): + + + + + +
TTS E2EL: Batch vs +CGTTS TTFP: Batch vs +CGTTS RTF: Batch vs +CG
+ +| Metric | Concurrency | Batch | + CUDA Graph | Improvement | +| --- | --- | --- | --- | --- | +| E2EL (ms) | 1 | 1,339 | 733 | 1.8× | +| E2EL (ms) | 4 | 1,471 | 987 | 1.5× | +| E2EL (ms) | 10 | 1,705 | 1,197 | 1.4× | +| RTF | 1 | 0.234 | 0.124 | 1.9× | +| RTF | 10 | 0.292 | 0.203 | 1.4× | +| Throughput (audio-s/wall-s) | 10 | 33.53 | 47.15 | 1.4× | + +At concurrency 1, CUDA Graph reduces E2EL from 1,339 ms to 733 ms and RTF from 0.234 to 0.124 - nearly a 2x improvement. The benefit is consistent across all concurrency levels. + +--- + +## Async Chunk and Streaming Output: Earlier Audio and Cross-Stage Overlap + +### Why this step matters for first-packet latency + +Two mechanisms work together to improve user-visible latency: + +- **Streaming output**: audio streaming emits audio chunks as soon as they are decoded (lower **TTFP**). Without streaming, the client waits for larger buffers or end-of-sequence. +- **Async chunk** is the main enabler for *earlier* audio: instead of handing off whole-request results between stages, each stage forwards **chunks** so the next stage can start as soon as the first chunk is ready. For Omni: Thinker -> Talker forwards hidden-state chunks; for both: Talker -> Code2Wav forwards codec chunks; Code2Wav decodes and emits packets incrementally. This **overlaps compute and communication** across stages and directly reduces time-to-first-audio-packet (TTFP) and end-to-end latency (E2EL). + +So in practice: streaming output defines *how* bytes are sent to the client; async chunk defines *when* the pipeline can produce the first bytes. + +**Dependency between the two:** Async chunk and audio streaming output are mutually dependent. Without async chunk, **audio streaming output cannot truly take effect**. Without audio streaming output, async chunk's **TTFP advantage is not fully realized**: the client would still wait for larger buffers or end-of-sequence instead of hearing the first packet as soon as it is ready. 
We therefore recommend enabling **both** on top of batching + CUDA Graph; the benchmarks in this post use both. + +### Results: Batch + CUDA Graph vs. Batch + CUDA Graph + Async Chunk + Streaming Output + +**Qwen3-Omni** (A100): + + + + + +
Qwen3-Omni E2EL: CG vs Async ChunkQwen3-Omni TTFP: CG vs Async ChunkQwen3-Omni RTF: CG vs Async Chunk
+ +| Metric | Concurrency | Batch + CG | + Async Chunk | Improvement | +| --- | --- | --- | --- | --- | +| E2EL (ms) | 1 | 61,613 | 41,216 | 1.5× | +| E2EL (ms) | 4 | 79,019 | 67,584 | 1.2× | +| E2EL (ms) | 10 | 126,867 | 104,901 | 1.2× | +| TTFP (ms) | 1 | 61,257 | 1,164 | 53× | +| TTFP (ms) | 4 | 78,634 | 3,152 | 24.9× | +| TTFP (ms) | 10 | 126,534 | 16,482 | 7.7× | +| RTF | 1 | 0.32 | 0.21 | 1.5× | +| RTF | 4 | 0.43 | 0.34 | 1.3× | +| RTF | 10 | 0.90 | 0.74 | 1.2× | + +Enabling both brings TTFP down sharply (concurrency 1: 61,257 ms -> 1,164 ms, **~98% reduction**; concurrency 4: 78,634 ms -> 3,152 ms, **~96% reduction**). E2EL and RTF also improve at every concurrency. + +**Qwen3-TTS** (H200): + + + + + +
Qwen3-TTS E2EL: CG vs Async ChunkQwen3-TTS TTFP: CG vs Async ChunkQwen3-TTS RTF: CG vs Async Chunk
+ +| Metric | Concurrency | Batch + CG | + Async Chunk | Improvement | +| --- | --- | --- | --- | --- | +| TTFP (ms) | 1 | 733 | **64** | **11.5×** | +| TTFP (ms) | 4 | 987 | **119** | **8.3×** | +| TTFP (ms) | 10 | 1,197 | **425** | **2.8×** | +| E2EL (ms) | 1 | 733 | 941 | 0.8× | +| E2EL (ms) | 10 | 1,197 | 1,767 | 0.7× | +| RTF | 1 | 0.124 | 0.160 | 0.8× | +| RTF | 10 | 0.203 | 0.314 | 0.6× | + +The TTFP improvement is the headline result for both models. For Qwen3-TTS at concurrency 1, users hear the first audio in **64 ms** instead of 733 ms - an **11.5x reduction**. For Qwen3-Omni at concurrency 1, TTFP drops from 61 s to 1.2 s - a **53x reduction**. + +### Why E2EL and RTF are higher with async chunk (TTS) + +The table above shows that enabling async chunk + streaming *increases* E2EL and RTF for TTS compared to CUDA Graph alone. This is expected - the two configurations optimize for fundamentally different metrics: + +- **CUDA Graph (no async chunk)** generates the entire audio end-to-end before returning. No chunking overhead, so total compute is minimized. +- **Async Chunk + Streaming** splits the pipeline into incremental chunks, adding overhead from chunked transport, context overlap in Code2Wav (`codec_left_context_frames=25`), and smaller effective batch sizes per chunk. + +**The tradeoff is intentional.** Async chunk trades ~30% higher total compute for **11x faster time-to-first-audio**. For interactive applications (voice assistants, chatbots), TTFP determines perceived responsiveness. For offline batch processing, CUDA Graph without async chunk is the better choice. + +--- + +## TTS-Specific: Code Predictor Re-prefill + `torch.compile` + +Qwen3-TTS has a **code predictor** - a small 5-layer transformer that generates residual codebook tokens (groups 1 through Q-1) autoregressively. Each AR step operates on very short sequences (2 to ~16 tokens). + +The naive approach uses a KV cache for this small transformer, similar to the main Talker. 
But the KV cache machinery (block tables, slot mappings, paged attention) introduces significant overhead relative to the tiny model. Two optimizations replace that: + +### Re-prefill (stateless forward, no KV cache) + +Instead of maintaining a KV cache across steps, the code predictor **re-feeds the full growing sequence** at each AR step using `F.scaled_dot_product_attention`. With sequences of at most ~16 tokens through 5 layers, the O(T^2) attention cost is negligible - and removing the KV cache machinery (block table management, `set_forward_context`, slot mapping) saves far more time than it costs. + +### `torch.compile` on the code predictor forward + +The 5-layer transformer forward pass launches ~60 small CUDA kernels per step. `torch.compile(mode="default", dynamic=True)` fuses these into fewer kernels via Inductor: + +```python +self._compiled_model_fwd = torch.compile( + self.model.forward, + mode="default", # no Inductor CUDA graphs, avoids conflict with vLLM's CUDAGraphWrapper + dynamic=True, # sequence length grows each step (2, 3, ..., num_groups+1) +) +``` + +`mode="default"` is used instead of `mode="reduce-overhead"` to avoid conflicts with vLLM's own CUDA graph capture on the main Talker model. `dynamic=True` handles the growing sequence length without recompilation. + +These optimizations are always-on in the current codebase - all Qwen3-TTS benchmark results in this post include them. + +--- + +## TTS-Specific: Dynamic Initial Chunk for Faster First Audio + +In the async chunk pipeline, the standard `codec_chunk_frames` is 25 (each chunk = ~2 seconds of audio at 12 Hz). Waiting for 25 frames before forwarding the first chunk to Code2Wav adds unnecessary TTFP. The **initial codec chunk** optimization sends a smaller first chunk so Code2Wav can start decoding earlier. + +**Dynamic initial chunk sizing (default behavior):** + +Rather than using a fixed initial chunk size, vLLM-Omni dynamically selects it based on current server load. 
The initial chunk size is chosen from power-of-2 steps [2, 4, 8, 16] based on load factor (`active_requests / max_batch_size`): + +| Server load | Initial chunk frames | Rationale | +| --- | --- | --- | +| Low (e.g. 1/10 active) | **2** (~167 ms of audio) | Minimize TTFP when there's headroom | +| Medium (e.g. 5/10 active) | **4-8** | Balance TTFP vs decode efficiency | +| High (e.g. 10/10 active) | **16** | Larger first chunk to amortize decode cost | + +After the initial chunk, all subsequent chunks use the standard `codec_chunk_frames` (25) size. + +**How it works in the pipeline:** + +1. Talker generates codec tokens auto-regressively +2. The stage input processor checks current load and picks an initial chunk size (e.g. **2 frames** at low load) +3. After that many frames, the first chunk is forwarded to Code2Wav +4. Code2Wav decodes this small chunk and emits the first audio packet +5. Subsequent chunks use the standard 25-frame size for efficient batch decoding + +**Per-request override:** Clients can also set a fixed initial chunk size via the API: + +```json +{"initial_codec_chunk_frames": 2} +``` + +This overrides the dynamic calculation for that request. + +**Config (server-side):** + +```yaml +runtime: + connectors: + connector_of_shared_memory: + name: SharedMemoryConnector + extra: + codec_streaming: true + codec_chunk_frames: 25 # standard chunk size (~2s of audio) + codec_left_context_frames: 25 + # initial chunk is computed dynamically by default + # set initial_codec_chunk_frames: 2 to force a fixed value +``` + +The 64 ms TTFP result reported above for Qwen3-TTS at concurrency 1 uses the dynamic initial chunk, which picks `initial_codec_chunk_frames=2` at low load. At higher concurrency the dynamic sizing increases the initial chunk to maintain decode efficiency. 
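
The load-based selection described in this section can be sketched as follows. This is a minimal illustration, not the vLLM-Omni implementation: the function names, the linear bucketing of the load factor, and the 12 Hz frame rate in the duration helper are assumptions made for the example. Only the step list [2, 4, 8, 16], the load factor definition (`active_requests / max_batch_size`), and the per-request override come from the text.

```python
from typing import Optional

# Power-of-2 steps named in the text; everything else here is hypothetical.
INITIAL_CHUNK_STEPS = (2, 4, 8, 16)


def pick_initial_chunk_frames(
    active_requests: int,
    max_batch_size: int,
    override: Optional[int] = None,
) -> int:
    """Pick the first codec chunk size from the current load factor.

    `override` models a per-request `initial_codec_chunk_frames` value,
    which takes precedence over the dynamic choice.
    """
    if override is not None:
        return override
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    # Clamp the load factor to [0, 1], then bucket it linearly (an assumption).
    load = min(max(active_requests / max_batch_size, 0.0), 1.0)
    index = min(int(load * len(INITIAL_CHUNK_STEPS)), len(INITIAL_CHUNK_STEPS) - 1)
    return INITIAL_CHUNK_STEPS[index]


def frames_to_ms(frames: int, frame_rate_hz: float = 12.0) -> float:
    """Audio duration covered by `frames` codec frames, in milliseconds."""
    return 1000.0 * frames / frame_rate_hz
```

With this bucketing, 1 of 10 active requests maps to 2 frames, 5 of 10 maps to 8 frames, and 10 of 10 maps to 16 frames, which is in line with the load table above. The real thresholds may differ.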
+ +--- + +## Live Demo: Streaming TTS over WebSocket + +vLLM-Omni supports real-time streaming audio output for Qwen3-TTS over WebSocket ([PR #1719](https://github.com/vllm-project/vllm-omni/pull/1719)). With `stream_audio: true`, the server sends chunked PCM audio frames as they are generated, so clients can start playback before full sentence synthesis completes. + +The WebSocket protocol uses `audio.start` / binary PCM chunks / `audio.done` framing per sentence: + +```json +// Client sends: +{"type":"session.config","voice":"Vivian","response_format":"pcm","stream_audio":true} +{"type":"input.text","text":"Hello world. This is a streaming demo."} +{"type":"input.done"} + +// Server streams back per sentence: +{"type":"audio.start","sentence_index":0,"sentence_text":"Hello world.","format":"pcm","sample_rate":24000} + + +... +{"type":"audio.done","sentence_index":0,"total_bytes":96000,"error":false} +{"type":"audio.start","sentence_index":1,"sentence_text":"This is a streaming demo.","format":"pcm","sample_rate":24000} + +... +{"type":"audio.done","sentence_index":1,"total_bytes":72000,"error":false} +{"type":"session.done","total_sentences":2} +``` + + + +--- + +## Deployment Playbook + +### Qwen3-Omni + +#### 1) Serve with the default 3-stage config + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \ + --omni \ + --port 8091 +``` + +Notes: + +- `runtime.max_batch_size` controls stage-level batching. +- Thinker/Talker commonly use `enforce_eager: false` for CUDA Graph paths. +- Code2Wav often remains eager (`enforce_eager: true`) depending on runtime behavior. 
+ +#### 2) Enable async chunk + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \ + --omni \ + --port 8091 \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml +``` + +#### 3) Key config knobs + +```yaml +async_chunk: true +stage_args: + - stage_id: 0 # thinker + runtime: + max_batch_size: 64 + engine_args: + enforce_eager: false + max_num_batched_tokens: 32768 + custom_process_next_stage_input_func: >- + vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk + + - stage_id: 1 # talker + runtime: + max_batch_size: 64 + engine_args: + enforce_eager: false + max_num_batched_tokens: 32768 + custom_process_next_stage_input_func: >- + vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk + + - stage_id: 2 # code2wav + runtime: + max_batch_size: 64 + engine_args: + enforce_eager: true + max_num_batched_tokens: 51200 +``` + +#### Reproduce Qwen3-Omni benchmarks + +```bash +vllm bench serve \ + --dataset-name random \ + --port ${PORT} \ + --model ${MODEL_PATH} \ + --endpoint /v1/chat/completions \ + --backend openai-chat-omni \ + --max-concurrency ${MAX_CONCURRENCY} \ + --num-prompts ${NUM_PROMPTS} \ + --random-input-len 2500 \ + --ignore-eos \ + --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \ + --random-output-len 900 \ + --extra_body '{"modalities": ["text","audio"]}' +``` + +### Qwen3-TTS + +#### 1) Serve with async chunk (recommended) + +```bash +vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ + --omni \ + --port 8000 +``` + +The default config (`qwen3_tts.yaml`) enables the full optimization stack: + +- Batching with `max_batch_size: 10` on the Talker stage +- CUDA Graph on the Talker (`enforce_eager: false`) +- Async chunk with streaming transport + +#### 2) Serve without async chunk (for comparison) + +```bash +vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ + --omni \ + --port 8000 \ + --stage-configs-path 
vllm_omni/model_executor/stage_configs/qwen3_tts_no_async_chunk.yaml +``` + +#### 3) Key config knobs + +```yaml +async_chunk: true +stage_args: + - stage_id: 0 # Talker (AR decoder) + runtime: + max_batch_size: 10 + engine_args: + enforce_eager: false + max_num_batched_tokens: 512 + custom_process_next_stage_input_func: >- + vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk + + - stage_id: 1 # Code2Wav (vocoder) + runtime: + max_batch_size: 1 + engine_args: + enforce_eager: true + max_num_batched_tokens: 8192 + +runtime: + connectors: + connector_of_shared_memory: + name: SharedMemoryConnector + extra: + codec_streaming: true + codec_chunk_frames: 25 + codec_left_context_frames: 25 +``` + +#### Reproduce Qwen3-TTS benchmarks + +```bash +GPU_DEVICE=0 \ +MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ +NUM_PROMPTS=50 \ +CONCURRENCY="1 4 10" \ +bash benchmarks/qwen3-tts/vllm_omni/run_stacked_benchmark.sh +``` + +This cycles through four configs (Baseline -> + Batch -> + CUDA Graph -> + Async Chunk + Streaming), benchmarks each at the specified concurrency levels, and generates all comparison figures automatically. 
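
When comparing your own runs against the tables in this post, note that the "% reduction" and "× faster" figures are two views of the same before/after pair. The helpers below are a small sketch for recomputing them; the function names are illustrative and not part of any benchmark script.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after` (negative if it got worse)."""
    return 100.0 * (1.0 - after / before)


def speedup(before: float, after: float) -> float:
    """Improvement factor; values below 1.0 mean the metric regressed."""
    return before / after


# Qwen3-TTS TTFP at concurrency 1: HF Transformers 15,513 ms vs vLLM-Omni 64 ms.
ttfp_reduction = reduction_pct(15513, 64)
ttfp_speedup = speedup(15513, 64)

# Qwen3-TTS E2EL at concurrency 1: Batch + CUDA Graph 733 ms vs + Async Chunk 941 ms.
e2el_speedup = speedup(733, 941)
```

The second pair reproduces the 0.8× entry in the async chunk table: a factor below 1.0 marks the E2EL cost paid for the lower TTFP.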
diff --git a/docs/serving/speech_api.md b/docs/serving/speech_api.md index ecbe8d9ac9..733811081a 100644 --- a/docs/serving/speech_api.md +++ b/docs/serving/speech_api.md @@ -15,7 +15,7 @@ Each server instance runs a single model (specified at startup via `vllm serve < ```bash # Qwen3-TTS: CustomVoice model (predefined speakers) vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ @@ -300,7 +300,7 @@ curl -X POST http://localhost:8091/v1/audio/speech \ ```bash # Start server with VoiceDesign model first vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ @@ -322,7 +322,7 @@ curl -X POST http://localhost:8091/v1/audio/speech \ ```bash # Start server with Base model first vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ @@ -517,15 +517,16 @@ for result in response.json()["results"]: All items are fanned out to `generate()` concurrently. The engine's stage worker automatically batches them up to the configured `max_batch_size` and queues the rest — no client-side throttling needed. 
-For best throughput, use a batch-optimized stage config with `max_batch_size > 1`: +For best throughput, set both stages' `max_num_seqs` to ≥4 via `--stage-overrides`: ```bash vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml \ - --omni --port 8091 --trust-remote-code --enforce-eager + --omni --port 8091 --trust-remote-code --enforce-eager \ + --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2}, + "1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}' ``` -The default `qwen3_tts.yaml` uses `max_batch_size: 1` (single request). The `qwen3_tts_batch.yaml` config sets `max_batch_size: 4` for ~4x throughput. +The bundled `qwen3_tts.yaml` uses `max_num_seqs: 1` (single request) on both stages. Bumping to 4 yields roughly 4× throughput on the talker and lets stage 1 batch chunks across in-flight requests. ## Supported Models @@ -617,7 +618,7 @@ Enable debug logging: ```bash vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ diff --git a/docs/user_guide/diffusion_features.md b/docs/user_guide/diffusion_features.md index 4e7003cce3..7bdeede446 100644 --- a/docs/user_guide/diffusion_features.md +++ b/docs/user_guide/diffusion_features.md @@ -115,8 +115,8 @@ The following tables show which models support each feature: | **FLUX.2-dev** | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | | **GLM-Image** | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | | **HunyuanImage3** | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | -| **LongCat-Image** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | -| **LongCat-Image-Edit** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | +| **LongCat-Image** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | +| **LongCat-Image-Edit** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | | **MagiHuman** | ❌ | ❌ | ❌ | ❓ | ✅ | ❌ | ✅ | ❌ | ❌ 
| ❌ | | **MammothModa2(T2I)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | **Nextstep_1(T2I)** | ❓ | ❓ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | @@ -140,10 +140,10 @@ The following tables show which models support each feature: |-------|:----------:|:-----------:|:---------------------:|:--------------:|:-----------------:|:------:|:------------------------:|:--------------------:|:--------------:|:----------------:| | **Wan2.2** | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (encode/decode) | ❌ | ❌ | | **Wan2.1-VACE** | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (decode) | ❌ | ❌ | -| **LTX-2** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | +| **LTX-2** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | | **Helios** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | | **HunyuanVideo-1.5 T2V I2V** | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ (decode) | ✅ | ❌ | -| **DreamID-Omni** | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| **DreamID-Omni** | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | **Frame Interpolation Support** diff --git a/docs/user_guide/examples/offline_inference/bagel.md b/docs/user_guide/examples/offline_inference/bagel.md index e626686872..1fb4d40457 100644 --- a/docs/user_guide/examples/offline_inference/bagel.md +++ b/docs/user_guide/examples/offline_inference/bagel.md @@ -176,8 +176,6 @@ Example configuration for TP=2 on GPUs 0 and 1: | Parameter | Value | Description | | :-------------------- | :------ | :------------------------------- | -| `window_size` | `-1` | Window size (-1 means unlimited) | -| `max_inflight` | `1` | Maximum inflight requests | | `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) | ## Using Mooncake Connector diff --git a/docs/user_guide/examples/offline_inference/qwen3_tts.md b/docs/user_guide/examples/offline_inference/qwen3_tts.md index 4ece5219d7..7226ac1fe4 100644 --- a/docs/user_guide/examples/offline_inference/qwen3_tts.md +++ b/docs/user_guide/examples/offline_inference/qwen3_tts.md @@ -144,13 +144,13 @@ completes. 
This demonstrates that audio data is available progressively rather t ## Batched Decoding -The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, provide a stage config with `max_num_seqs > 1` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`. +The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, set `max_num_seqs > 1` on both stages via `--stage-overrides` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`. ``` python end2end.py --query-type CustomVoice \ --txt-prompts benchmark_prompts.txt \ --batch-size 4 \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml + --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},"1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}' ``` **Important:** `--batch-size` must match a CUDA graph capture size (1, 2, 4, 8, 16...) because the Talker's code predictor KV cache is sized to `max_num_seqs`, and CUDA graphs pad the batch to the next capture size. Both stages need `max_num_seqs >= batch_size` in the stage config for batching to take effect. If only stage 1 has a higher `max_num_seqs`, it won't help — stage 1 can only batch chunks from requests that are in-flight simultaneously, which requires stage 0 to also process multiple requests concurrently. 
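
The constraints in the note above can be checked before launching. The following is a minimal sketch, assuming that capture sizes are exactly the powers of two listed (1, 2, 4, 8, 16, ...) and that both stages take the same values. `build_stage_overrides` is a hypothetical helper, not part of vllm-omni; it only produces the JSON string passed to `--stage-overrides`.

```python
import json


def build_stage_overrides(
    batch_size: int,
    gpu_memory_utilization: float = 0.2,
    num_stages: int = 2,
) -> str:
    """Build a --stage-overrides JSON string with max_num_seqs == batch_size on every stage."""
    # Assumption: valid CUDA graph capture sizes are powers of two.
    if batch_size < 1 or batch_size & (batch_size - 1):
        raise ValueError(f"batch_size {batch_size} is not a power of two")
    per_stage = {
        "max_num_seqs": batch_size,
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    # Stage IDs are string keys ("0", "1"), as in the command above.
    overrides = {str(stage_id): dict(per_stage) for stage_id in range(num_stages)}
    return json.dumps(overrides, separators=(",", ":"))
```

Calling `build_stage_overrides(4)` yields the same string used in the command above, and a value such as 3 is rejected before any server is started.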
diff --git a/docs/user_guide/examples/online_serving/qwen3_omni.md b/docs/user_guide/examples/online_serving/qwen3_omni.md index 6f6d9ae4a9..611eb6fd3f 100644 --- a/docs/user_guide/examples/online_serving/qwen3_omni.md +++ b/docs/user_guide/examples/online_serving/qwen3_omni.md @@ -18,12 +18,12 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 If you want to open async chunking for qwen3-omni, launch the server with command below ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /vllm_omni/deploy/qwen3_omni_moe.yaml ``` If you have custom stage configs file, launch the server with command below ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /path/to/deploy_config_file ``` ### Send Multi-modal Request @@ -187,7 +187,7 @@ The script supports the following arguments: - `--model`: Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct) - `--server-port`: Port for vLLM server (default: 8091) - `--gradio-port`: Port for Gradio demo (default: 7861) -- `--stage-configs-path`: Path to custom stage configs YAML file (optional) +- `--deploy-config`: Path to custom deploy config YAML file (optional) - `--server-host`: Host for vLLM server (default: 0.0.0.0) - `--gradio-ip`: IP for Gradio demo (default: 127.0.0.1) - `--share`: Share Gradio demo publicly (creates a public link) @@ -202,7 +202,7 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 If you have custom stage configs file: ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config 
/path/to/deploy_config_file ``` **Step 2: Run the Gradio demo** diff --git a/docs/user_guide/examples/online_serving/qwen3_tts.md b/docs/user_guide/examples/online_serving/qwen3_tts.md index 4e632d4c28..95f234f02d 100644 --- a/docs/user_guide/examples/online_serving/qwen3_tts.md +++ b/docs/user_guide/examples/online_serving/qwen3_tts.md @@ -58,7 +58,7 @@ Then open http://localhost:7860 in your browser. ```bash # CustomVoice model (predefined speakers) vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ @@ -66,7 +66,7 @@ vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ # VoiceDesign model vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ @@ -74,7 +74,7 @@ vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \ # Base model (voice cloning) vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --omni \ --port 8091 \ --trust-remote-code \ diff --git a/examples/offline_inference/bagel/README.md b/examples/offline_inference/bagel/README.md index 48517b1cda..3e653d0e3a 100644 --- a/examples/offline_inference/bagel/README.md +++ b/examples/offline_inference/bagel/README.md @@ -173,8 +173,6 @@ Example configuration for TP=2 on GPUs 0 and 1: | Parameter | Value | Description | | :-------------------- | :------ | :------------------------------- | -| `window_size` | `-1` | Window size (-1 means unlimited) | -| `max_inflight` | `1` | Maximum inflight requests | | `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) | ## Using Mooncake Connector diff --git a/examples/offline_inference/qwen2_5_omni/end2end.py 
b/examples/offline_inference/qwen2_5_omni/end2end.py index d8f1898ec9..dfe124700d 100644 --- a/examples/offline_inference/qwen2_5_omni/end2end.py +++ b/examples/offline_inference/qwen2_5_omni/end2end.py @@ -320,14 +320,7 @@ def main(args): query_result = query_func(audio_path=audio_path, sampling_rate=sampling_rate) else: query_result = query_func() - omni = Omni( - model=model_name, - log_stats=args.log_stats, - stage_init_timeout=args.stage_init_timeout, - batch_timeout=args.batch_timeout, - init_timeout=args.init_timeout, - shm_threshold_bytes=args.shm_threshold_bytes, - ) + omni = Omni.from_cli_args(args, model=model_name) thinker_sampling_params = SamplingParams( temperature=0.0, # Deterministic - no randomness top_p=1.0, # Disable nucleus sampling diff --git a/examples/offline_inference/qwen3_omni/README.md b/examples/offline_inference/qwen3_omni/README.md index d69ad6abfc..0710faa133 100644 --- a/examples/offline_inference/qwen3_omni/README.md +++ b/examples/offline_inference/qwen3_omni/README.md @@ -70,8 +70,8 @@ For true stage-level concurrency -- where downstream stages (Talker, Code2Wav) start **before** the upstream stage (Thinker) finishes -- use the async_chunk example. This requires: -1. A stage config YAML with ``async_chunk: true`` (e.g. - ``qwen3_omni_moe_async_chunk.yaml``). +1. A deploy config YAML with ``async_chunk: true`` (e.g. + ``qwen3_omni_moe.yaml``). 2. Hardware that matches the config (e.g. 2x H100 for the default 3-stage config). 
@@ -101,7 +101,7 @@ python end2end_async_chunk.py --query-type text --modalities text ```bash python end2end_async_chunk.py \ --query-type use_audio \ - --stage-configs-path /path/to/your_async_chunk.yaml + --deploy-config /path/to/your_deploy_config.yaml ``` > **Note**: The synchronous ``end2end.py`` (using ``Omni``) is still the diff --git a/examples/offline_inference/qwen3_omni/end2end.py b/examples/offline_inference/qwen3_omni/end2end.py index 02ebe9dbec..65d2779aea 100644 --- a/examples/offline_inference/qwen3_omni/end2end.py +++ b/examples/offline_inference/qwen3_omni/end2end.py @@ -294,14 +294,7 @@ def main(args): else: query_result = query_func() - omni = Omni( - model=model_name, - dtype=args.dtype, - stage_configs_path=args.stage_configs_path, - log_stats=args.log_stats, - stage_init_timeout=args.stage_init_timeout, - init_timeout=args.init_timeout, - ) + omni = Omni.from_cli_args(args, model=model_name) thinker_sampling_params = SamplingParams( temperature=0.9, diff --git a/examples/offline_inference/qwen3_omni/end2end_async_chunk.py b/examples/offline_inference/qwen3_omni/end2end_async_chunk.py index c644ab2f4d..ecb6154160 100644 --- a/examples/offline_inference/qwen3_omni/end2end_async_chunk.py +++ b/examples/offline_inference/qwen3_omni/end2end_async_chunk.py @@ -14,7 +14,7 @@ Usage ----- python end2end_async_chunk.py --query-type use_audio \ - --stage-configs-path + --deploy-config See ``--help`` for all options. """ @@ -179,20 +179,26 @@ def clone_prompt_for_request(template: dict) -> dict: return cloned -def _default_async_chunk_stage_configs_path() -> str | None: - """Best-effort default stage config for running Qwen3-Omni with async_chunk. +def _default_deploy_config_path() -> str | None: + """Best-effort default deploy config for running Qwen3-Omni with async_chunk. - When this example is executed from within the repository, we resolve the - default YAML path relative to this file. 
When installed elsewhere, the - file may not exist and callers should pass --stage-configs-path explicitly. + The default ``vllm_omni/deploy/qwen3_omni_moe.yaml`` ships with + ``async_chunk: true`` at the top level, so loading it is enough to + enable async-chunk semantics. To disable it, copy the YAML and set + ``async_chunk: false`` (or pass ``--deploy-config`` to a YAML that + overrides the flag). + + When this example is executed from within the repository, we resolve + the default YAML path relative to this file. When installed elsewhere, + the file may not exist and callers should pass ``--deploy-config`` + explicitly. """ repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../..")) candidate = os.path.join( repo_root, "vllm_omni", - "model_executor", - "stage_configs", - "qwen3_omni_moe_async_chunk.yaml", + "deploy", + "qwen3_omni_moe.yaml", ) return candidate if os.path.exists(candidate) else None @@ -380,15 +386,16 @@ async def run_all(args): prompt["modalities"] = output_modalities # Create AsyncOmni - print(f"[Info] Creating AsyncOmni with stage_configs_path={args.stage_configs_path}") + print(f"[Info] Creating AsyncOmni with deploy_config={args.deploy_config}") async_omni = None try: - async_omni = AsyncOmni( - model=args.model, - stage_configs_path=args.stage_configs_path, - log_stats=args.log_stats, - stage_init_timeout=args.stage_init_timeout, - ) + # ``from_cli_args`` expands vars(args) into kwargs and auto-captures + # ``_cli_explicit_keys`` from ``sys.argv[1:]`` so argparse defaults + # do not silently override deploy YAML values. Mirrors the + # ``EngineArgs.from_cli_args`` pattern used throughout vllm / + # vllm-omni. ``deploy_config=None`` (the default) falls through to + # the bundled ``vllm_omni/deploy/qwen3_omni_moe.yaml``. + async_omni = AsyncOmni.from_cli_args(args) # Use default sampling params from stage config (they are pre-configured # in the YAML for each stage). 
@@ -476,11 +483,11 @@ def parse_args(): help="Query type.", ) parser.add_argument( - "--stage-configs-path", + "--deploy-config", type=str, - default=_default_async_chunk_stage_configs_path(), + default=_default_deploy_config_path(), help=( - "Path to an async_chunk stage config YAML. " + "Path to a deploy config YAML. " "If not set, uses the model's default config " "(make sure it has async_chunk: true)." ), diff --git a/examples/offline_inference/qwen3_omni/run_multiple_prompts_async_chunk.sh b/examples/offline_inference/qwen3_omni/run_multiple_prompts_async_chunk.sh index 809054867c..2f2be20915 100755 --- a/examples/offline_inference/qwen3_omni/run_multiple_prompts_async_chunk.sh +++ b/examples/offline_inference/qwen3_omni/run_multiple_prompts_async_chunk.sh @@ -17,7 +17,7 @@ REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)" python "${SCRIPT_DIR}/end2end_async_chunk.py" \ --query-type text \ --txt-prompts "${SCRIPT_DIR}/text_prompts_10.txt" \ - --stage-configs-path "${REPO_ROOT}/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml" \ + --deploy-config "${REPO_ROOT}/vllm_omni/deploy/qwen3_omni_moe.yaml" \ --output-dir output_audio_async_chunk \ --max-in-flight 2 \ "$@" diff --git a/examples/offline_inference/qwen3_omni/run_single_prompt_async_chunk.sh b/examples/offline_inference/qwen3_omni/run_single_prompt_async_chunk.sh index 918c7ee4fd..9ef69293cb 100755 --- a/examples/offline_inference/qwen3_omni/run_single_prompt_async_chunk.sh +++ b/examples/offline_inference/qwen3_omni/run_single_prompt_async_chunk.sh @@ -6,13 +6,13 @@ # achieving true stage-level concurrency via chunk-level streaming. # # Prerequisites: -# - An async_chunk stage config YAML (e.g. qwen3_omni_moe_async_chunk.yaml) +# - A deploy config YAML (e.g. qwen3_omni_moe.yaml) # - Hardware matching the config (e.g. 
2x H100 for the default 3-stage config) # # Usage: # bash run_single_prompt_async_chunk.sh # bash run_single_prompt_async_chunk.sh --query-type text --modalities text -# bash run_single_prompt_async_chunk.sh --stage-configs-path /path/to/custom.yaml +# bash run_single_prompt_async_chunk.sh --deploy-config /path/to/custom.yaml set -euo pipefail @@ -21,6 +21,6 @@ REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)" python "${SCRIPT_DIR}/end2end_async_chunk.py" \ --query-type use_audio \ - --stage-configs-path "${REPO_ROOT}/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml" \ + --deploy-config "${REPO_ROOT}/vllm_omni/deploy/qwen3_omni_moe.yaml" \ --output-dir output_audio_async_chunk \ "$@" diff --git a/examples/offline_inference/qwen3_tts/README.md b/examples/offline_inference/qwen3_tts/README.md index 9c63f7c409..98432eef3f 100644 --- a/examples/offline_inference/qwen3_tts/README.md +++ b/examples/offline_inference/qwen3_tts/README.md @@ -104,13 +104,13 @@ completes. This demonstrates that audio data is available progressively rather t ## Batched Decoding -The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, provide a stage config with `max_num_seqs > 1` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`. +The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, set `max_num_seqs > 1` on both stages via `--stage-overrides` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`. 
``` python end2end.py --query-type CustomVoice \ --txt-prompts benchmark_prompts.txt \ --batch-size 4 \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml + --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},"1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}' ``` **Important:** `--batch-size` must match a CUDA graph capture size (1, 2, 4, 8, 16...) because the Talker's code predictor KV cache is sized to `max_num_seqs`, and CUDA graphs pad the batch to the next capture size. Both stages need `max_num_seqs >= batch_size` in the stage config for batching to take effect. If only stage 1 has a higher `max_num_seqs`, it won't help — stage 1 can only batch chunks from requests that are in-flight simultaneously, which requires stage 0 to also process multiple requests concurrently. diff --git a/examples/offline_inference/qwen3_tts/end2end.py b/examples/offline_inference/qwen3_tts/end2end.py index c508aab789..f5797b2c30 100644 --- a/examples/offline_inference/qwen3_tts/end2end.py +++ b/examples/offline_inference/qwen3_tts/end2end.py @@ -375,12 +375,7 @@ def main(args): output_dir = args.output_dir os.makedirs(output_dir, exist_ok=True) - omni = Omni( - model=model_name, - stage_configs_path=args.stage_configs_path, - log_stats=args.log_stats, - stage_init_timeout=args.stage_init_timeout, - ) + omni = Omni.from_cli_args(args, model=model_name) batch_size = args.batch_size for batch_start in range(0, len(inputs), batch_size): @@ -396,12 +391,7 @@ async def main_streaming(args): output_dir = args.output_dir os.makedirs(output_dir, exist_ok=True) - omni = AsyncOmni( - model=model_name, - stage_configs_path=args.stage_configs_path, - log_stats=args.log_stats, - stage_init_timeout=args.stage_init_timeout, - ) + omni = AsyncOmni.from_cli_args(args, model=model_name) for i, prompt in enumerate(inputs): request_id = str(i) diff --git a/examples/offline_inference/voxcpm2/end2end.py b/examples/offline_inference/voxcpm2/end2end.py 
index 687e596018..6b6bf78ddf 100644 --- a/examples/offline_inference/voxcpm2/end2end.py +++ b/examples/offline_inference/voxcpm2/end2end.py @@ -65,6 +65,12 @@ def parse_args(): default=None, help="Text matching --prompt-audio for continuation mode.", ) + parser.add_argument( + "--ref-text", + type=str, + default=None, + help="Optional transcript of --reference-audio (enables ref_continuation mode).", + ) return parser.parse_args() @@ -103,24 +109,40 @@ def main(): stage_configs_path=args.stage_configs_path, ) - additional: dict = {} - if args.reference_audio: - additional["reference_audio"] = args.reference_audio - if args.prompt_audio and args.prompt_text: - additional["prompt_audio"] = args.prompt_audio - additional["prompt_text"] = args.prompt_text + from transformers import AutoTokenizer - prompt: dict = {"prompt": args.text} - if additional: - prompt["additional_information"] = additional + from vllm_omni.model_executor.models.voxcpm2.voxcpm2_talker import ( + build_cjk_split_map, + build_voxcpm2_prompt, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True) + split_map = build_cjk_split_map(tokenizer) + hf_config = engine.engine.stage_vllm_configs[0].model_config.hf_config + + ref_audio_arg = args.reference_audio or args.prompt_audio + ref_text_arg = args.ref_text or args.prompt_text + ref_wav, ref_sr = (None, None) + if ref_audio_arg: + ref_wav_arr, ref_sr = sf.read(ref_audio_arg) + ref_wav = ref_wav_arr.mean(axis=-1).tolist() if ref_wav_arr.ndim > 1 else ref_wav_arr.tolist() + + prompt = build_voxcpm2_prompt( + hf_config=hf_config, + tokenizer=tokenizer, + split_map=split_map, + text=args.text, + ref_audio=ref_wav, + ref_sr=ref_sr, + ref_text=ref_text_arg, + ) print(f"Model : {args.model}") print(f"Text : {args.text}") - if args.reference_audio: - print(f"Ref audio : {args.reference_audio}") - if args.prompt_audio: - print(f"Prompt audio: {args.prompt_audio}") - print(f"Prompt text : {args.prompt_text}") + if ref_audio_arg: + 
print(f"Ref audio : {ref_audio_arg}") + if ref_text_arg: + print(f"Ref text : {ref_text_arg}") print(f"Output dir : {output_dir}") t_start = time.perf_counter() diff --git a/examples/offline_inference/x_to_video_audio/x_to_video_audio.py b/examples/offline_inference/x_to_video_audio/x_to_video_audio.py index 322b184e52..497284ceb9 100644 --- a/examples/offline_inference/x_to_video_audio/x_to_video_audio.py +++ b/examples/offline_inference/x_to_video_audio/x_to_video_audio.py @@ -58,6 +58,11 @@ def parse_args() -> argparse.Namespace: default=False, help="Enable CPU offloading for diffusion models.", ) + parser.add_argument( + "--enable-layerwise-offload", + action="store_true", + help="Enable layerwise (blockwise) offloading on DiT modules.", + ) return parser.parse_args() @@ -126,6 +131,7 @@ def main() -> None: parallel_config=parallel_config, model_type=args.model_type, enable_cpu_offload=args.enable_cpu_offload, + enable_layerwise_offload=args.enable_layerwise_offload, ) start = time.perf_counter() outputs = omni.generate(prompt, sampling_params) diff --git a/examples/online_serving/qwen3_omni/README.md b/examples/online_serving/qwen3_omni/README.md index ff02642247..32722b3db4 100644 --- a/examples/online_serving/qwen3_omni/README.md +++ b/examples/online_serving/qwen3_omni/README.md @@ -12,17 +12,159 @@ Please refer to [README.md](../../../README.md) vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 ``` -If you want to open async chunking for qwen3-omni, launch the server with command below +The default deploy config at `vllm_omni/deploy/qwen3_omni_moe.yaml` is loaded +automatically by the model registry — no `--deploy-config` flag needed for the +common case. Async-chunk streaming is **enabled by default** in the bundled config. +NPU / ROCm / XPU per-platform deltas are merged in automatically from the +`platforms:` section of the same YAML. 
+ +**Note:** The OpenAI-style **`/v1/realtime`** WebSocket (streaming PCM audio in, audio + transcription out) is **not supported** when `async_chunk` is enabled. Because the bundled config enables `async_chunk` by default, pass `--no-async-chunk` or use a deploy config with `async_chunk: false` for realtime sessions. + +If you have a custom deploy YAML, point at it explicitly: ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --deploy-config /path/to/your_deploy_config.yaml ``` -If you have custom stage configs file, launch the server with command below +### Tuning deployment parameters + +Most engine knobs (`max_num_batched_tokens`, `max_model_len`, `enforce_eager`, +`gpu_memory_utilization`, `tensor_parallel_size`, …) can be tuned without +editing the YAML. There are three layers, in increasing specificity: + +#### 1. Global CLI flags (apply to every stage) + +```bash +# Tighter memory budget on a smaller GPU +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --gpu-memory-utilization 0.85 + +# Disable cudagraphs (e.g.
for debugging) +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --enforce-eager + +# Reduce context length +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --max-model-len 32768 + +# Toggle prefix caching on every stage (yaml default: off) +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --enable-prefix-caching +# ...or force it off if the yaml turned it on +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --no-enable-prefix-caching + +# Toggle pipeline-wide async chunked streaming between stages +# (yaml default for qwen3_omni_moe: on) +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --no-async-chunk +``` + +For the TTS counterpart (synchronous codec variant), see +[qwen3_tts README](../qwen3_tts/README.md#sync-vs-async-chunk-mode). + +Explicit CLI flags **override** the deploy YAML (which itself overrides the +parser defaults). If you don't pass a flag, the YAML value wins. + +> **Note on `--no-async-chunk`**: Flips the deploy yaml's `async_chunk:` +> bool. Pipelines that implement alternate processor functions for +> chunked vs end-to-end modes (e.g. qwen3_tts code2wav) dispatch +> automatically based on that bool — no extra flag or variant yaml is +> needed. + +> ⚠️ **For multi-stage models that share GPUs (qwen3_omni_moe by default +> shares cuda:1 between stages 1 and 2), avoid using global memory flags.** +> A global `--gpu-memory-utilization 0.85` would apply to every stage and +> oversubscribe the shared device. Use per-stage overrides instead — see +> below. + +#### 2. 
Per-stage overrides via `--stage-overrides` (recommended for memory) + ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file +# Lower stage 1's memory budget; leave others at the YAML default +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --stage-overrides '{ + "1": {"gpu_memory_utilization": 0.5}, + "2": {"max_num_batched_tokens": 65536} + }' ``` +Per-stage values are always treated as explicit and beat YAML defaults for +the named stage. Other stages keep their YAML values. + +#### 3. Custom deploy YAML + +When per-stage overrides get long, write a small overlay YAML that inherits +from the bundled default: + +```yaml +# my_qwen3_omni_overrides.yaml +base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml + +stages: + - stage_id: 0 + max_num_batched_tokens: 65536 + enforce_eager: true + - stage_id: 1 + gpu_memory_utilization: 0.5 + - stage_id: 2 + max_model_len: 8192 +``` + +Then start the server with `--deploy-config my_qwen3_omni_overrides.yaml`. +The `base_config:` line tells the loader to inherit everything else (stages, +connectors, edges, platforms section) from the bundled production YAML, so +you only need to spell out the deltas. + +#### 4. Multi-node deployment (cross-host transfer connector) + +The bundled `qwen3_omni_moe.yaml` uses `SharedMemoryConnector` between stages, +which only works when all stages run on the same physical host. For +**cross-node** deployments, write a small overlay YAML that swaps in a +network-capable connector (e.g. `MooncakeStoreConnector`) and re-points each +stage's connector wiring at it. The connector spec carries your own server +addresses — there is no checked-in default because every cluster is +different. 
+ +```yaml +# my_qwen3_omni_multinode.yaml +base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml + +connectors: + mooncake_connector: + name: MooncakeStoreConnector + extra: + host: "127.0.0.1" + metadata_server: "http://YOUR_METADATA_HOST:8080/metadata" + master: "YOUR_MASTER_HOST:50051" + segment: 512000000 # 512 MB transfer segment + localbuf: 64000000 # 64 MB local buffer + proto: "tcp" + +stages: + - stage_id: 0 + output_connectors: + to_stage_1: mooncake_connector + - stage_id: 1 + input_connectors: + from_stage_0: mooncake_connector + output_connectors: + to_stage_2: mooncake_connector + - stage_id: 2 + input_connectors: + from_stage_1: mooncake_connector +``` + +Then launch with `--deploy-config my_qwen3_omni_multinode.yaml`. Same +pattern works for Qwen2.5-Omni — replace `base_config:` with the path to +`vllm_omni/deploy/qwen2_5_omni.yaml`. + +> ⚠️ Replace `YOUR_METADATA_HOST` / `YOUR_MASTER_HOST` with the actual +> mooncake server addresses for your cluster. The `base_config:` overlay +> inherits all stage budgets, devices, and edges from the bundled prod +> YAML — you only need to spell out the connector swap. + ### Send Multi-modal Request Get into the example folder @@ -38,36 +180,43 @@ python examples/online_serving/openai_chat_completion_client_for_multimodal_gene #### Realtime WebSocket client (`openai_realtime_client.py`) -[`openai_realtime_client.py`](./openai_realtime_client.py) connects to **`ws://:/v1/realtime`**, uploads a local audio file as **PCM16 mono @ 16 kHz** chunks (OpenAI-style `input_audio_buffer.append` / `commit`), and prints **streaming transcription** (`transcription.delta` / `transcription.done`). +[`openai_realtime_client.py`](./openai_realtime_client.py) connects to **`ws://:/v1/realtime`**, streams a local WAV as **PCM16 mono @ 16 kHz** in fixed-size chunks (OpenAI-style `input_audio_buffer.append` / `commit`), and receives **`response.audio.delta`** (incremental PCM for the reply) plus **`transcription.*`** events. 
By default it concatenates audio deltas and writes **`--output-wav`** (model output is typically **24 kHz**). Optional **`--delta-dump-dir`** saves each delta as `delta_000001.wav`, … for debugging. + +Streaming input works well for translation-style use cases; if the Thinker runs while input is still incomplete, consider limiting **`max_tokens`** in your session / server defaults to avoid over-generation. **Dependencies:** ```bash -pip install websockets numpy +pip install websockets ``` **From this directory** (`examples/online_serving/qwen3_omni`): ```bash python openai_realtime_client.py \ - --host localhost \ - --port 8091 \ + --url ws://localhost:8091/v1/realtime \ --model Qwen/Qwen3-Omni-30B-A3B-Instruct \ - --audio_path /path/to/your.wav + --input-wav /path/to/input_16k_mono.wav \ + --output-wav realtime_output.wav \ + --delta-dump-dir ./rt_delta_wavs ``` -If `--audio_path` is omitted, the script uses a bundled default clip (`mary_had_lamb` via vLLM assets). - **Arguments:** | Flag | Default | Description | |------|---------|-------------| -| `--host` | `localhost` | API server host | -| `--port` | `8000` | API server port (match your `vllm serve` port, e.g. 
`8091`) | -| `--model` | `Qwen/Qwen3-Omni-30B-A3B-Instruct` | Must match the served model (also sent in `session.update`) | -| `--audio_path` | *(optional)* | Path to input audio; resampled to 16 kHz mono inside the client | - -Ensure the vLLM-Omni server is running with realtime support for this endpoint, for example: +| `--url` | `ws://localhost:8091/v1/realtime` | Full WebSocket URL including path | +| `--model` | `Qwen/Qwen3-Omni-30B-A3B-Instruct` | Must match the served model (sent in `session.update`) | +| `--input-wav` | *(required)* | Input WAV: mono, 16-bit PCM, **16 kHz** | +| `--output-wav` | `realtime_output.wav` | Output path for concatenated reply audio | +| `--output-text` | *(optional)* | If set, write final transcription text to this path | +| `--chunk-ms` | `200` | Size of each uploaded audio chunk (milliseconds of audio) | +| `--send-delay-ms` | `0` | Delay between chunk sends (simulate realtime upload) | +| `--delta-dump-dir` | *(optional)* | Directory to write per-`response.audio.delta` WAV files | +| `--num-requests` | `1` | Number of sequential sessions (see `--concurrency`) | +| `--concurrency` | `1` | Max concurrent WebSocket sessions when `--num-requests` > 1 | + +Ensure the server is running **without** `async_chunk` if you use `/v1/realtime`, for example: ```bash vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 @@ -276,7 +425,7 @@ The script supports the following arguments: - `--model`: Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct) - `--server-port`: Port for vLLM server (default: 8091) - `--gradio-port`: Port for Gradio demo (default: 7861) -- `--stage-configs-path`: Path to custom stage configs YAML file (optional) +- `--deploy-config`: Path to custom deploy config YAML file (optional) - `--server-host`: Host for vLLM server (default: 0.0.0.0) - `--gradio-ip`: IP for Gradio demo (default: 127.0.0.1) - `--share`: Share Gradio demo publicly (creates a public link) @@ -291,7 +440,7 @@ vllm serve 
Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 If you have custom stage configs file: ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /path/to/deploy_config_file ``` **Step 2: Run the Gradio demo** diff --git a/examples/online_serving/qwen3_omni/openai_realtime_client.py b/examples/online_serving/qwen3_omni/openai_realtime_client.py index 660e4ac336..79e30a3f50 100644 --- a/examples/online_serving/qwen3_omni/openai_realtime_client.py +++ b/examples/online_serving/qwen3_omni/openai_realtime_client.py @@ -1,81 +1,118 @@ -""" -This script demonstrates how to use the vLLM-Omni Realtime WebSocket API to perform -audio transcription by uploading an audio file. +"""Realtime client for vLLM-Omni /v1/realtime (audio + text events). + +This client: +1) Reads a local WAV file (must be mono, 16-bit PCM, 16kHz), +2) Streams PCM16 chunks to /v1/realtime with OpenAI-style events, +3) Receives response.audio.* and transcription.* events, +4) Saves synthesized audio to an output WAV file and optional text file. -Before running this script, you must start the vLLM-Omni server with a realtime-capable -model, for example: +By default each ``response.audio.delta`` is treated as an **incremental PCM** +chunk and all chunks are concatenated into the final ``--output-wav``. - vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni +Optional debugging: pass ``--delta-dump-dir DIR`` to write every +``response.audio.delta`` payload as ``delta_000001.wav``, ``delta_000002.wav``, … -Requirements: -- vllm with audio support -- websockets -- soundfile -- numpy +Usage: + python openai_realtime_client.py \ + --url ws://localhost:8091/v1/realtime \ + --model Qwen/Qwen3-Omni-30B-A3B-Instruct \ + --input-wav input_16k_mono.wav \ + --output-wav realtime_output.wav \ + --delta-dump-dir ./rt_delta_wavs -The script: -1. 
Connects to the Realtime WebSocket endpoint -2. Converts an audio file to PCM16 @ 16kHz -3. Sends audio chunks to the server -4. Receives and prints transcription as it streams +Dependencies: + pip install websockets """ +from __future__ import annotations + import argparse import asyncio import base64 import json +import wave +from pathlib import Path + +try: + import websockets +except ImportError: + print("Please install websockets: pip install websockets") + raise SystemExit(1) + + +def _read_wav_pcm16(path: Path) -> bytes: + with wave.open(str(path), "rb") as wf: + nchannels = wf.getnchannels() + sampwidth = wf.getsampwidth() + framerate = wf.getframerate() + comptype = wf.getcomptype() + nframes = wf.getnframes() + + if nchannels != 1: + raise ValueError(f"Input WAV must be mono (got {nchannels} channels).") + if sampwidth != 2: + raise ValueError(f"Input WAV must be 16-bit PCM (got sample width={sampwidth}).") + if framerate != 16000: + raise ValueError(f"Input WAV must be 16kHz (got {framerate} Hz).") + if comptype != "NONE": + raise ValueError(f"Input WAV must be uncompressed PCM (got comptype={comptype}).") + if nframes <= 0: + raise ValueError("Input WAV has no audio frames.") + + return wf.readframes(nframes) + + +def _write_wav_pcm16(path: Path, pcm16_bytes: bytes, sample_rate_hz: int) -> None: + with wave.open(str(path), "wb") as wf: + wf.setnchannels(1) + wf.setsampwidth(2) + wf.setframerate(sample_rate_hz) + wf.writeframes(pcm16_bytes) + + +async def run_client( + url: str, + model: str, + input_wav: Path, + output_wav: Path, + output_text: Path | None, + chunk_ms: int, + send_delay_ms: int, + delta_dump_dir: Path | None, + request_idx: int = 1, + total_requests: int = 1, +) -> None: + log_prefix = f"[req {request_idx:02d}/{total_requests:02d}] " if total_requests > 1 else "" + pcm16 = _read_wav_pcm16(input_wav) + bytes_per_ms = 16000 * 2 // 1000 # mono PCM16 at 16kHz + chunk_bytes = max(bytes_per_ms * chunk_ms, 2) -import numpy as np -import 
websockets -from vllm.assets.audio import AudioAsset -from vllm.multimodal.media.audio import load_audio - - -def audio_to_pcm16_base64(audio_path: str) -> str: - """ - Load an audio file and convert it to base64-encoded PCM16 @ 16kHz. - """ - # Load audio and resample to 16kHz mono - audio, _ = load_audio(audio_path, sr=16000, mono=True) - # Convert to PCM16 - pcm16 = (audio * 32767).astype(np.int16) - # Encode as base64 - return base64.b64encode(pcm16.tobytes()).decode("utf-8") - - -async def realtime_transcribe(audio_path: str, host: str, port: int, model: str): - """ - Connect to the Realtime API and transcribe an audio file. - """ - uri = f"ws://{host}:{port}/v1/realtime" - - async with websockets.connect(uri) as ws: - # Wait for session.created - response = json.loads(await ws.recv()) - if response["type"] == "session.created": - print(f"Session created: {response['id']}") - else: - print(f"Unexpected response: {response}") - return - - # Validate model - await ws.send(json.dumps({"type": "session.update", "model": model})) - - # Signal ready to start - await ws.send(json.dumps({"type": "input_audio_buffer.commit"})) - - # Convert audio file to base64 PCM16 - print(f"Loading audio from: {audio_path}") - audio_base64 = audio_to_pcm16_base64(audio_path) - - # Send audio in chunks (4KB of raw audio = ~8KB base64) - chunk_size = 4096 - audio_bytes = base64.b64decode(audio_base64) - total_chunks = (len(audio_bytes) + chunk_size - 1) // chunk_size - - print(f"Sending {total_chunks} audio chunks...") - for i in range(0, len(audio_bytes), chunk_size): - chunk = audio_bytes[i : i + chunk_size] + incremental_pcm_parts: list[bytes] = [] + output_sample_rate = 24000 + delta_index = 0 + text_chunks: list[str] = [] + final_text: str = "" + + if delta_dump_dir is not None: + delta_dump_dir.mkdir(parents=True, exist_ok=True) + + async with websockets.connect(url, max_size=64 * 1024 * 1024) as ws: + # 1) Validate model. 
+ await ws.send( + json.dumps( + { + "type": "session.update", + "model": model, + } + ) + ) + + # 2) Start generation once (non-final commit). + await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": False})) + + # 3) Stream audio chunks. + for i in range(0, len(pcm16), chunk_bytes): + chunk = pcm16[i : i + chunk_bytes] await ws.send( json.dumps( { @@ -84,63 +121,212 @@ async def realtime_transcribe(audio_path: str, host: str, port: int, model: str) } ) ) + if send_delay_ms > 0: + await asyncio.sleep(send_delay_ms / 1000.0) - # Signal all audio is sent + # 4) Final commit closes input stream. await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True})) - print("Audio sent. Waiting for transcription...\n") - # Receive transcription - print("Transcription: ", end="", flush=True) + # 5) Receive server events until audio done. while True: - response = json.loads(await ws.recv()) - if response["type"] == "transcription.delta": - print(response["delta"], end="", flush=True) - elif response["type"] == "transcription.done": - print(f"\n\nFinal transcription: {response['text']}") - if response.get("usage"): - print(f"Usage: {response['usage']}") - break - elif response["type"] == "error": - print(f"\nError: {response['error']}") + message = await ws.recv() + if isinstance(message, bytes): + # We only expect JSON text frames. 
+ continue + + event = json.loads(message) + event_type = event.get("type") + + if event_type == "session.created": + continue + + if event_type == "response.audio.delta": + sr = event.get("sample_rate_hz") + if isinstance(sr, int) and sr > 0: + output_sample_rate = sr + audio_b64 = event.get("audio", "") + if audio_b64: + pcm_delta = base64.b64decode(audio_b64) + incremental_pcm_parts.append(pcm_delta) + if delta_dump_dir is not None and pcm_delta: + delta_index += 1 + dump_path = delta_dump_dir / f"delta_{delta_index:06d}.wav" + _write_wav_pcm16(dump_path, pcm_delta, output_sample_rate) + print( + f"{log_prefix}delta dump #{delta_index}: {dump_path} " + f"(pcm bytes={len(pcm_delta)}, sr={output_sample_rate})" + ) + continue + + if event_type == "transcription.delta": + delta = event.get("delta", "") + if delta: + text_chunks.append(delta) + print(delta, end="", flush=True) + continue + + if event_type == "transcription.done": + final_text = event.get("text", "") or "".join(text_chunks) + usage = event.get("usage") + final_text_with_tag = f"Final transcription: {final_text}" + if text_chunks: + print() + print(f"{log_prefix}{final_text_with_tag}") + if usage: + print(f"{log_prefix}text usage: {usage}") + continue + + if event_type == "response.audio.done": break + if event_type == "error": + raise RuntimeError(f"Server error: {event}") -def main(args): - if args.audio_path: - audio_path = args.audio_path - else: - # Use default audio asset - audio_path = str(AudioAsset("mary_had_lamb").get_local_path()) - print(f"No audio path provided, using default: {audio_path}") + all_pcm16 = b"".join(incremental_pcm_parts) + if not all_pcm16: + raise RuntimeError("No audio received from server.") - asyncio.run(realtime_transcribe(audio_path, args.host, args.port, args.model)) + output_wav.parent.mkdir(parents=True, exist_ok=True) + _write_wav_pcm16(output_wav, all_pcm16, output_sample_rate) + print(f"{log_prefix}Saved realtime audio to: {output_wav} (incremental chunks 
joined)") + if output_text is not None: + text_to_save = final_text if final_text else "".join(text_chunks) + output_text.parent.mkdir(parents=True, exist_ok=True) + output_text.write_text(text_to_save, encoding="utf-8") + print(f"{log_prefix}Saved realtime text to: {output_text}") -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="Realtime WebSocket Transcription Client") + +def _indexed_output_path(path: Path | None, index: int, total: int) -> Path | None: + if path is None or total <= 1: + return path + return path.with_name(f"{path.stem}_{index:02d}{path.suffix}") + + +async def run_clients_concurrent( + *, + url: str, + model: str, + input_wav: Path, + output_wav: Path, + output_text: Path | None, + chunk_ms: int, + send_delay_ms: int, + delta_dump_dir: Path | None, + num_requests: int, + concurrency: int, +) -> None: + sem = asyncio.Semaphore(concurrency) + + async def _run_one(index: int) -> tuple[int, bool, str | None]: + per_output_wav = _indexed_output_path(output_wav, index, num_requests) + per_output_text = _indexed_output_path(output_text, index, num_requests) + per_delta_dir = None + if delta_dump_dir is not None: + per_delta_dir = delta_dump_dir / f"req_{index:02d}" + async with sem: + try: + await run_client( + url=url, + model=model, + input_wav=input_wav, + output_wav=per_output_wav, + output_text=per_output_text, + chunk_ms=chunk_ms, + send_delay_ms=send_delay_ms, + delta_dump_dir=per_delta_dir, + request_idx=index, + total_requests=num_requests, + ) + return index, True, None + except Exception as exc: + return index, False, str(exc) + + tasks = [asyncio.create_task(_run_one(i), name=f"rt-client-{i}") for i in range(1, num_requests + 1)] + results = await asyncio.gather(*tasks) + + failed = [(idx, err) for idx, ok, err in results if not ok] + succeeded = num_requests - len(failed) + print(f"[summary] succeeded={succeeded}, failed={len(failed)}, total={num_requests}") + if failed: + for idx, err in failed: + 
print(f"[summary] req {idx:02d} failed: {err}") + raise RuntimeError(f"{len(failed)} concurrent request(s) failed") + + +def main() -> None: + parser = argparse.ArgumentParser(description="Realtime audio/text client for vLLM-Omni") + parser.add_argument("--url", default="ws://localhost:8091/v1/realtime", help="WebSocket URL") parser.add_argument( "--model", - type=str, default="Qwen/Qwen3-Omni-30B-A3B-Instruct", - help="Model that is served and should be pinged.", + help="Model name for session.update", ) + parser.add_argument("--input-wav", required=True, type=Path, help="Input WAV (mono, PCM16, 16kHz)") + parser.add_argument("--output-wav", default=Path("realtime_output.wav"), type=Path, help="Output WAV path") parser.add_argument( - "--audio_path", - type=str, + "--output-text", default=None, - help="Path to the audio file to transcribe.", + type=Path, + help="Optional output text path for final transcription", ) + parser.add_argument("--chunk-ms", type=int, default=200, help="Input chunk size in milliseconds") parser.add_argument( - "--host", - type=str, - default="localhost", - help="vLLM-Omni server host (default: localhost)", + "--send-delay-ms", + type=int, + default=0, + help="Delay between chunk sends; set >0 to simulate realtime upload", ) parser.add_argument( - "--port", + "--delta-dump-dir", + type=Path, + default=None, + help="If set, each response.audio.delta is saved as delta_NNNNNN.wav under this directory", + ) + parser.add_argument("--num-requests", type=int, default=1, help="Total number of requests to send") + parser.add_argument( + "--concurrency", type=int, - default=8000, - help="vLLM-Omni server port (default: 8000)", + default=1, + help="Maximum number of concurrent websocket requests", ) args = parser.parse_args() - main(args) + + if args.num_requests <= 0: + raise ValueError("--num-requests must be >= 1") + if args.concurrency <= 0: + raise ValueError("--concurrency must be >= 1") + concurrency = min(args.concurrency, args.num_requests) 
+ + if args.num_requests == 1: + asyncio.run( + run_client( + url=args.url, + model=args.model, + input_wav=args.input_wav, + output_wav=args.output_wav, + output_text=args.output_text, + chunk_ms=args.chunk_ms, + send_delay_ms=args.send_delay_ms, + delta_dump_dir=args.delta_dump_dir, + ) + ) + else: + asyncio.run( + run_clients_concurrent( + url=args.url, + model=args.model, + input_wav=args.input_wav, + output_wav=args.output_wav, + output_text=args.output_text, + chunk_ms=args.chunk_ms, + send_delay_ms=args.send_delay_ms, + delta_dump_dir=args.delta_dump_dir, + num_requests=args.num_requests, + concurrency=concurrency, + ) + ) + + +if __name__ == "__main__": + main() diff --git a/examples/online_serving/qwen3_tts/README.md b/examples/online_serving/qwen3_tts/README.md index b48db9cf45..350fcb71ca 100644 --- a/examples/online_serving/qwen3_tts/README.md +++ b/examples/online_serving/qwen3_tts/README.md @@ -43,7 +43,7 @@ Then open http://localhost:7860 in your browser. ### Launch the Server -The default stage config is located at `vllm_omni/model_executor/stage_configs/qwen3_tts.yaml`. For other platforms (e.g., NPU), refer to `vllm_omni/platforms/npu/stage_configs/qwen3_tts.yaml`. +The default deploy config is located at `vllm_omni/deploy/qwen3_tts.yaml` and is loaded automatically by the model registry — no `--deploy-config` flag needed for default use. Platform-specific deltas (NPU, ROCm, XPU) are merged in automatically from the `platforms:` block of the same YAML based on the detected runtime. ```bash # CustomVoice model (predefined speakers) @@ -70,6 +70,22 @@ vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ --port 8091 ``` +#### Sync vs async-chunk mode + +Qwen3-TTS supports both **chunked streaming** (default, lower latency) and +**synchronous end-to-end** modes from the same deploy YAML. 
The bundled +`qwen3_tts.yaml` ships with `async_chunk: true`; flip with `--no-async-chunk` +and the pipeline automatically dispatches to the end-to-end codec processor: + +```bash +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091 \ + --no-async-chunk +``` + +No variant YAML or extra flag is needed — the `StagePipelineConfig` on each +stage declares both processor functions and the runtime picks based on the +`async_chunk:` bool. + Alternatively, use the convenience script: ```bash ./run_server.sh # Default: CustomVoice model diff --git a/examples/online_serving/qwen3_tts/batch_speech_client.py b/examples/online_serving/qwen3_tts/batch_speech_client.py index 7d48e650f8..47fdc3691c 100644 --- a/examples/online_serving/qwen3_tts/batch_speech_client.py +++ b/examples/online_serving/qwen3_tts/batch_speech_client.py @@ -5,11 +5,13 @@ batch level and generate many utterances in the cloned voice without repeating the reference for each item. -Start the server (with batch-optimized config for best throughput): +Start the server (with batch-optimized stage settings for best throughput): vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml \ - --trust-remote-code + --omni \ + --trust-remote-code \ + --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2}, + "1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}' Examples: # Batch with a predefined voice diff --git a/examples/online_serving/qwen3_tts/run_gradio_demo.sh b/examples/online_serving/qwen3_tts/run_gradio_demo.sh index bcc0ddb7cf..d79be3c2ab 100644 --- a/examples/online_serving/qwen3_tts/run_gradio_demo.sh +++ b/examples/online_serving/qwen3_tts/run_gradio_demo.sh @@ -127,7 +127,7 @@ echo "Starting vLLM server..." 
LOG_FILE="/tmp/vllm_tts_server_${SERVER_PORT}.log" vllm-omni serve "$MODEL" \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --host "$SERVER_HOST" \ --port "$SERVER_PORT" \ --gpu-memory-utilization 0.9 \ diff --git a/examples/online_serving/qwen3_tts/run_server.sh b/examples/online_serving/qwen3_tts/run_server.sh index 6f4aa83a0b..78dd2c305d 100755 --- a/examples/online_serving/qwen3_tts/run_server.sh +++ b/examples/online_serving/qwen3_tts/run_server.sh @@ -31,7 +31,7 @@ esac echo "Starting Qwen3-TTS server with model: $MODEL" vllm-omni serve "$MODEL" \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --deploy-config vllm_omni/deploy/qwen3_tts.yaml \ --host 0.0.0.0 \ --port 8091 \ --gpu-memory-utilization 0.9 \ diff --git a/examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py b/examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py index 38a2bdea92..7790fa5127 100644 --- a/examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py +++ b/examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py @@ -5,7 +5,7 @@ using SLERP and sends the result to the /v1/audio/speech API. 
Requirements: - pip install torch resampy soundfile numpy httpx + pip install torch soundfile numpy httpx Examples: # Extract and save an embedding @@ -143,11 +143,12 @@ def _load_speaker_encoder_weights(encoder: torch.nn.Module, model_path: str) -> def compute_mel_spectrogram(audio: np.ndarray, sr: int = 24000) -> torch.Tensor: """Compute 128-bin mel spectrogram matching Qwen3-TTS's extraction pipeline.""" - from vllm.multimodal.audio import resample_audio_resampy + from vllm.multimodal.audio import AudioResampler # Resample to 24kHz if needed if sr != 24000: - audio = resample_audio_resampy(audio.astype(np.float32), orig_sr=sr, target_sr=24000) + resampler = AudioResampler(target_sr=24000) + audio = resampler.resample(audio.astype(np.float32), orig_sr=sr) y = torch.from_numpy(audio).unsqueeze(0).float() diff --git a/recipes/Qwen/Qwen3-Omni.md b/recipes/Qwen/Qwen3-Omni.md new file mode 100644 index 0000000000..081e1453d3 --- /dev/null +++ b/recipes/Qwen/Qwen3-Omni.md @@ -0,0 +1,90 @@ +# Qwen3-Omni for multimodal chat on 1x A100 80GB + +## Summary + +- Vendor: Qwen +- Model: `Qwen/Qwen3-Omni-30B-A3B-Instruct` +- Task: Multimodal chat with text, image, audio, or video input +- Mode: Online serving with the OpenAI-compatible API +- Maintainer: Community + +## When to use this recipe + +Use this recipe when you want a known-good starting point for serving +`Qwen/Qwen3-Omni-30B-A3B-Instruct` with vLLM-Omni on a single 80 GB A100 and +want to validate the deployment with the existing multimodal client examples in this +repository.
+ +## References + +- Upstream or canonical docs: + [`docs/user_guide/examples/online_serving/qwen3_omni.md`](../../docs/user_guide/examples/online_serving/qwen3_omni.md) +- Related example under `examples/`: + [`examples/online_serving/qwen3_omni/README.md`](../../examples/online_serving/qwen3_omni/README.md) +- Related issue or discussion: + [RFC: add recipes folder](https://github.com/vllm-project/vllm-omni/issues/2645) + +## Hardware Support + +This recipe currently documents one tested-style reference configuration for +CUDA GPU serving. Add more sections for other hardware as community validation +lands. + +## GPU + +### 1x A100 80GB + +#### Environment + +- OS: Linux +- Python: 3.10+ +- Driver / runtime: NVIDIA CUDA environment with an A100 80 GB GPU +- vLLM version: Match the repository requirements for your checkout +- vLLM-Omni version or commit: Use the commit you are deploying from + +#### Command + +Start the server from the repository root: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 +``` + +To enable async chunking, use the bundled stage config: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \ + --omni \ + --port 8091 \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml +``` + +#### Verification + +Run one of the existing example clients after the server is ready: + +```bash +python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \ + --model Qwen/Qwen3-Omni-30B-A3B-Instruct \ + --query-type use_image \ + --port 8091 \ + --host localhost +``` + +For a quick API smoke test, request text-only output: + +```bash +curl http://localhost:8091/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct", + "messages": [{"role": "user", "content": "Describe vLLM in brief."}], + "modalities": ["text"] + }' +``` + +#### Notes + +- Memory usage: Size depends on runtime options and output 
modalities; leave headroom for multimodal workloads. +- Key flags: `--omni` is required; `--stage-configs-path` is optional for custom or async-chunk stage configs. +- Known limitations: This starter recipe is intentionally narrow and focuses on the single-GPU online-serving path already documented in the repo examples. diff --git a/recipes/README.md b/recipes/README.md new file mode 100644 index 0000000000..5b3dfb5430 --- /dev/null +++ b/recipes/README.md @@ -0,0 +1,35 @@ +# Community Recipes + +This directory contains community-maintained recipes for answering a +practical user question: + +> How do I run model X on hardware Y for task Z? + +Add recipes for this repository under this in-repo `recipes/` directory. To +keep naming and layout consistent, organize recipes by model vendor in a way +that is aligned with +[`vllm-project/recipes`](https://github.com/vllm-project/recipes), but treat +that external repository as a reference for structure rather than the place to +add files for this repo. Use one Markdown file per model family by default. + +Example layout: + +```text +recipes/ + Qwen/ + Qwen3-Omni.md + Qwen3-TTS.md + Tencent-Hunyuan/ + HunyuanVideo.md +``` + +## Available Recipes + +- [`Qwen/Qwen3-Omni.md`](./Qwen/Qwen3-Omni.md): online serving recipe for + multimodal chat on `1x A100 80GB` + +Within a single recipe file, include different hardware support sections such +as `GPU`, `ROCm`, and `NPU`, and add concrete tested configurations like +`1x A100 80GB` or `2x L40S` inside those sections when applicable. + +See [TEMPLATE.md](./TEMPLATE.md) for the recommended format. 
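
The vendor-per-directory layout described in `recipes/README.md` above can be checked mechanically. The following is a minimal sketch, not part of this change: the function names `group_recipes_by_vendor` and `find_layout_violations` are hypothetical, and it assumes that `README.md` and `TEMPLATE.md` are the only Markdown files allowed at the top level of `recipes/`.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: these are the only non-recipe Markdown files at the top level.
TOP_LEVEL_DOCS = {"README.md", "TEMPLATE.md"}


def group_recipes_by_vendor(recipes_root: Path) -> dict[str, list[str]]:
    """Map each vendor directory name to its recipe file names."""
    grouped: dict[str, list[str]] = {}
    for path in recipes_root.rglob("*.md"):
        rel = path.relative_to(recipes_root)
        # A conforming recipe sits exactly one directory (the vendor) deep.
        if len(rel.parts) == 2:
            grouped.setdefault(rel.parts[0], []).append(rel.name)
    return {vendor: sorted(names) for vendor, names in sorted(grouped.items())}


def find_layout_violations(recipes_root: Path) -> list[str]:
    """Return Markdown files that are not at recipes/<Vendor>/<Model>.md."""
    violations: list[str] = []
    for path in recipes_root.rglob("*.md"):
        rel = path.relative_to(recipes_root)
        if len(rel.parts) == 1 and rel.name in TOP_LEVEL_DOCS:
            continue  # README.md / TEMPLATE.md are expected here
        if len(rel.parts) != 2:
            violations.append(rel.as_posix())
    return sorted(violations)
```

Calling `find_layout_violations(Path("recipes"))` from the repository root would list any stray or too deeply nested Markdown files; an empty list means every recipe follows the `recipes/<Vendor>/<Model>.md` convention.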
diff --git a/recipes/TEMPLATE.md b/recipes/TEMPLATE.md new file mode 100644 index 0000000000..9bf8cb9c75 --- /dev/null +++ b/recipes/TEMPLATE.md @@ -0,0 +1,82 @@ +# Recipe Title + +> Example: Qwen3-Omni for speech chat on 1x A100 80GB + +## Summary + +- Vendor: +- Model: +- Task: +- Mode: +- Maintainer: + +## When to use this recipe + +Briefly describe the concrete scenario this recipe covers. + +## References + +- Upstream or canonical docs: +- Related example under `examples/`: +- Related issue or discussion: + +## Hardware Support + +Add one section per platform, such as `GPU`, `ROCm`, or `NPU`. Under each +platform section, document one or more tested hardware configurations. + +## GPU + +### 1x A100 80GB + +#### Environment + +- OS: +- Python: +- Driver / runtime: +- vLLM version: +- vLLM-Omni version or commit: + +#### Command + +```bash +# Add the exact command(s) here +``` + +#### Verification + +```bash +# Add a quick validation command or expected output here +``` + +#### Notes + +- Memory usage: +- Key flags: +- Known limitations: + +### 2x L40S + +Repeat the same structure for other hardware setups as needed. 
+ +## ROCm + +### Example hardware configuration + +Repeat the same nested structure for ROCm setups as needed: + +- `#### Environment` +- `#### Command` +- `#### Verification` +- `#### Notes` + +## NPU + +### Example hardware configuration + +Repeat the same nested structure for NPU setups as needed: + +- `#### Environment` +- `#### Command` +- `#### Verification` +- `#### Notes` diff --git a/requirements/common.txt b/requirements/common.txt index 1f44d343c6..63e16d580f 100644 --- a/requirements/common.txt +++ b/requirements/common.txt @@ -1,7 +1,6 @@ # Common dependencies for all platforms av>=14.0.0 omegaconf>=2.3.0 -resampy>=0.4.3 diffusers>=0.36.0 accelerate==1.12.0 soundfile>=0.13.1 diff --git a/tests/comfyui/test_comfyui_integration.py b/tests/comfyui/test_comfyui_integration.py index 80e86d8241..5164f3b9ac 100644 --- a/tests/comfyui/test_comfyui_integration.py +++ b/tests/comfyui/test_comfyui_integration.py @@ -523,6 +523,7 @@ def run_server(): "Qwen/Qwen-Image-Edit", True, id="image-to-image-dalle-endpoint", + marks=pytest.mark.skip(reason="Temporarily disabled due to failure."), ), pytest.param( ServerCase( diff --git a/tests/config/__init__.py b/tests/config/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/tests/config/test_pipeline_registry.py b/tests/config/test_pipeline_registry.py new file mode 100644 index 0000000000..3483d530c6 --- /dev/null +++ b/tests/config/test_pipeline_registry.py @@ -0,0 +1,111 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Tests for the central pipeline registry (2.5/N).""" + +from __future__ import annotations + +import pytest + +from vllm_omni.config.pipeline_registry import ( + _DIFFUSION_PIPELINES, + _OMNI_PIPELINES, + _VLLM_OMNI_PIPELINES, +) +from vllm_omni.config.stage_config import ( + _PIPELINE_REGISTRY, + PipelineConfig, + StageExecutionType, + StagePipelineConfig, + register_pipeline, +) + + +class 
TestCentralRegistryDeclarations: + """Every in-tree pipeline must be declared exactly once in the central registry.""" + + def test_union_contains_all_omni(self): + for key in _OMNI_PIPELINES: + assert key in _VLLM_OMNI_PIPELINES + + def test_union_contains_all_diffusion(self): + for key in _DIFFUSION_PIPELINES: + assert key in _VLLM_OMNI_PIPELINES + + def test_no_duplicate_model_type_between_omni_and_diffusion(self): + overlap = set(_OMNI_PIPELINES) & set(_DIFFUSION_PIPELINES) + assert not overlap, f"Duplicate model_types across omni/diffusion: {overlap}" + + def test_expected_omni_pipelines_present(self): + # Guard against accidental removal during future refactors. + assert "qwen2_5_omni" in _OMNI_PIPELINES + assert "qwen2_5_omni_thinker_only" in _OMNI_PIPELINES + assert "qwen3_omni_moe" in _OMNI_PIPELINES + assert "qwen3_tts" in _OMNI_PIPELINES + + +class TestLazyLoading: + """Pipelines are imported only on first access.""" + + def test_contains_without_import(self): + # ``in`` hits the lazy map, not the loaded cache. 
+ assert "qwen3_omni_moe" in _PIPELINE_REGISTRY + + def test_getitem_loads_correct_pipeline(self): + pipeline = _PIPELINE_REGISTRY["qwen3_omni_moe"] + assert pipeline.model_type == "qwen3_omni_moe" + assert pipeline.model_arch == "Qwen3OmniMoeForConditionalGeneration" + + def test_unknown_model_type_returns_none_via_get(self): + assert _PIPELINE_REGISTRY.get("not_a_real_pipeline") is None + + def test_unknown_model_type_raises_keyerror_via_getitem(self): + with pytest.raises(KeyError): + _PIPELINE_REGISTRY["not_a_real_pipeline"] + + def test_iteration_yields_registered_pipelines(self): + keys = set(_PIPELINE_REGISTRY) + assert "qwen2_5_omni" in keys + assert "qwen3_omni_moe" in keys + + +class TestDynamicRegistration: + """``register_pipeline()`` still works for plugins and tests.""" + + def test_register_adds_to_registry(self): + custom = PipelineConfig( + model_type="_test_dynamic_registration", + model_arch="DynamicTestModel", + stages=( + StagePipelineConfig( + stage_id=0, + model_stage="test", + execution_type=StageExecutionType.LLM_AR, + input_sources=(), + final_output=True, + ), + ), + ) + register_pipeline(custom) + try: + assert "_test_dynamic_registration" in _PIPELINE_REGISTRY + assert _PIPELINE_REGISTRY["_test_dynamic_registration"] is custom + finally: + # Don't leak the test registration into other tests. + if "_test_dynamic_registration" in _PIPELINE_REGISTRY: + del _PIPELINE_REGISTRY["_test_dynamic_registration"] + + def test_dynamic_registration_overrides_lazy_entry(self): + # Build a substitute for qwen3_omni_moe that we can distinguish. + original = _PIPELINE_REGISTRY["qwen3_omni_moe"] + override = PipelineConfig( + model_type="qwen3_omni_moe", + model_arch="OverriddenArch", + stages=original.stages, + ) + register_pipeline(override) + try: + assert _PIPELINE_REGISTRY["qwen3_omni_moe"].model_arch == "OverriddenArch" + finally: + # Remove the dynamic override so later tests see the original. 
+ if "qwen3_omni_moe" in _PIPELINE_REGISTRY._loaded: + del _PIPELINE_REGISTRY["qwen3_omni_moe"] diff --git a/tests/conftest.py b/tests/conftest.py index 3434eb0aed..83752521f2 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -47,6 +47,7 @@ from vllm.logger import init_logger from vllm.utils.network_utils import get_open_port +from vllm_omni.config.stage_config import resolve_deploy_yaml from vllm_omni.entrypoints.omni import Omni from vllm_omni.inputs.data import OmniSamplingParams from vllm_omni.outputs import OmniRequestOutput @@ -1339,12 +1340,14 @@ def delete_by_path(config_dict: dict, path: str) -> None: else: print(f"Path {path} does not exist") + _stage_key = "stages" if "stages" in config else "stage_args" + # Apply deletions first if deletes: for key, value in deletes.items(): - if key == "stage_args": + if key in ("stage_args", "stages"): if value and isinstance(value, dict): - stage_args = config.get("stage_args", []) + stage_args = config.get(_stage_key, []) if not stage_args: raise ValueError("stage_args does not exist in config") @@ -1377,9 +1380,9 @@ def delete_by_path(config_dict: dict, path: str) -> None: # Apply updates if updates: for key, value in updates.items(): - if key == "stage_args": + if key in ("stage_args", "stages"): if value and isinstance(value, dict): - stage_args = config.get("stage_args", []) + stage_args = config.get(_stage_key, []) if not stage_args: raise ValueError("stage_args does not exist in config") @@ -1585,32 +1588,46 @@ def __init__( self.stage_config_path = stage_config_path self.master_port = get_open_port() self.visible_device_list = self._load_visible_device_list(env_dict) - self.stage_runtime_devices = self._load_stage_runtime_devices(stage_config_path) - self.stage_ids = stage_ids or self._load_stage_ids(stage_config_path) + resolved_cfg = resolve_deploy_yaml(stage_config_path) + # Dump the resolved deploy config so CI logs show each stage's + # gpu_memory_utilization / max_model_len / max_num_seqs after 
+ # base_config inheritance and overlay merge — essential when + # diagnosing OOMs that depend on the merged values. + print( + f"[OmniServerStageCli] Resolved deploy config from {stage_config_path}:\n" + f"{yaml.safe_dump(resolved_cfg, sort_keys=False, default_flow_style=False)}", + flush=True, + ) + self.stage_runtime_devices = self._load_stage_runtime_devices(resolved_cfg) + self.stage_ids = stage_ids or self._load_stage_ids(resolved_cfg) if 0 not in self.stage_ids: raise ValueError(f"Stage CLI test requires stage_id=0 in config: {stage_config_path}") self.stage_procs: dict[int, subprocess.Popen] = {} self.proc = None @staticmethod - def _load_stage_ids(stage_config_path: str) -> list[int]: - with open(stage_config_path, encoding="utf-8") as f: - cfg = yaml.safe_load(f) or {} + def _stage_entries(cfg: dict) -> list[dict]: + """Return the list of stage entries from either legacy (``stage_args``) + or new-schema (``stages``) deploy YAMLs.""" + return cfg.get("stage_args") or cfg.get("stages") or [] - stage_ids = [stage["stage_id"] for stage in cfg.get("stage_args", []) if "stage_id" in stage] + @staticmethod + def _load_stage_ids(resolved_config: dict) -> list[int]: + stage_ids = [ + stage["stage_id"] for stage in OmniServerStageCli._stage_entries(resolved_config) if "stage_id" in stage + ] if not stage_ids: - raise ValueError(f"No stage IDs found in config: {stage_config_path}") + raise ValueError("No stage IDs found in resolved config") return stage_ids @staticmethod - def _load_stage_runtime_devices(stage_config_path: str) -> dict[int, str]: - with open(stage_config_path, encoding="utf-8") as f: - cfg = yaml.safe_load(f) or {} - + def _load_stage_runtime_devices(resolved_config: dict) -> dict[int, str]: runtime_devices: dict[int, str] = {} - for stage in cfg.get("stage_args", []): + for stage in OmniServerStageCli._stage_entries(resolved_config): stage_id = stage.get("stage_id") - devices = stage.get("runtime", {}).get("devices") + # New schema: stage.devices 
is flat at stage level. + # Legacy schema: stage.runtime.devices is nested. + devices = stage.get("devices") or stage.get("runtime", {}).get("devices") if stage_id is not None and devices: runtime_devices[int(stage_id)] = str(devices) return runtime_devices @@ -1696,10 +1713,21 @@ def _launch_stage(self, stage_id: int, *, headless: bool) -> None: cmd = self._build_stage_cmd(stage_id, headless=headless) print(f"Launching OmniServerStageCli stage {stage_id}: {' '.join(cmd)}") + # Capture each subprocess's stdout+stderr to a per-stage log file so + # debugging "Stage N exited before API server ready" doesn't rely on + # guessing; the file is surfaced in the RuntimeError message. + log_path = Path(tempfile.gettempdir()) / f"omni_stage_{stage_id}_{self.master_port}.log" + self._stage_log_paths = getattr(self, "_stage_log_paths", {}) + self._stage_log_paths[stage_id] = log_path + log_fh = open(log_path, "w", buffering=1) # noqa: SIM115 - closed in __exit__ + self._stage_log_files = getattr(self, "_stage_log_files", {}) + self._stage_log_files[stage_id] = log_fh proc = subprocess.Popen( cmd, env=env, cwd=os.path.dirname(os.path.dirname(os.path.abspath(__file__))), + stdout=log_fh, + stderr=subprocess.STDOUT, ) self.stage_procs[stage_id] = proc if stage_id == 0: @@ -1709,7 +1737,18 @@ def _ensure_stage_processes_alive(self) -> None: for stage_id, proc in self.stage_procs.items(): ret = proc.poll() if ret is not None: - raise RuntimeError(f"Stage {stage_id} exited with code {ret} before API server became ready.") + log_path = getattr(self, "_stage_log_paths", {}).get(stage_id) + tail = "" + if log_path and log_path.exists(): + try: + with open(log_path, encoding="utf-8", errors="replace") as f: + lines = f.readlines() + tail = "\n=== Last 60 lines of stage {} log ({}) ===\n{}".format( + stage_id, log_path, "".join(lines[-60:]) or "" + ) + except Exception as exc: # pragma: no cover - diagnostic only + tail = f"\n" + raise RuntimeError(f"Stage {stage_id} exited with code 
{ret} before API server became ready.{tail}") def _start_server(self) -> None: ordered_stage_ids = [0, *[stage_id for stage_id in self.stage_ids if stage_id != 0]] @@ -1735,7 +1774,46 @@ def _start_server(self) -> None: raise RuntimeError(f"OmniServerStageCli failed to start within {max_wait} seconds") + def _dump_stage_logs_for_debug(self, head_lines: int = 300, tail_lines: int = 500) -> None: + """Tail each stage's subprocess log back to stdout on teardown. + + Stage subprocesses redirect stdout/stderr to ``/tmp/omni_stage_*.log`` + so we don't spam the main CI stream while tests run; but that also + hides engine init (KV cache size, Available KV cache memory, vLLM + engine config) when things go wrong. Dump them here so buildkite + captures them post-run. Head covers engine init; tail covers + whatever state the stage was in when it was torn down. + """ + log_paths = getattr(self, "_stage_log_paths", {}) or {} + for stage_id in sorted(log_paths): + log_path = log_paths[stage_id] + if not log_path or not log_path.exists(): + continue + try: + with open(log_path, encoding="utf-8", errors="replace") as f: + lines = f.readlines() + except Exception as exc: # pragma: no cover - diagnostic only + print(f"[OmniServerStageCli] stage {stage_id} log read failed: {exc}", flush=True) + continue + total = len(lines) + if total <= head_lines + tail_lines: + head_chunk = lines + tail_chunk = [] + elided = 0 + else: + head_chunk = lines[:head_lines] + tail_chunk = lines[-tail_lines:] + elided = total - head_lines - tail_lines + print(f"\n=== stage {stage_id} log HEAD ({log_path}) ===", flush=True) + print("".join(head_chunk).rstrip("\n"), flush=True) + if tail_chunk: + print(f"\n... 
[{elided} lines elided] ...", flush=True) + print(f"\n=== stage {stage_id} log TAIL ({log_path}) ===", flush=True) + print("".join(tail_chunk).rstrip("\n"), flush=True) + print(f"=== end stage {stage_id} log ===\n", flush=True) + def __exit__(self, exc_type, exc_val, exc_tb): + self._dump_stage_logs_for_debug() for stage_id in sorted(self.stage_procs, reverse=True): proc = self.stage_procs[stage_id] if proc.poll() is None: @@ -1781,10 +1859,18 @@ def omni_server(request: pytest.FixtureRequest, run_level: str, model_prefix: st if run_level == "advanced_model" and stage_config_path is not None: with open(stage_config_path, encoding="utf-8") as f: cfg = yaml.safe_load(f) or {} - stage_ids = [stage["stage_id"] for stage in cfg.get("stage_args", []) if "stage_id" in stage] + # Strip ``load_format: dummy`` (CI overlay default) so advanced_model + # tests use real weights. New schema (``stages:``) writes the field + # flat at stage level; legacy schema (``stage_args:``) nests it as + # ``engine_args.load_format``. Handle both. 
+ new_schema_stages = cfg.get("stages") + stage_key = "stages" if new_schema_stages is not None else "stage_args" + delete_path = "load_format" if new_schema_stages is not None else "engine_args.load_format" + stage_entries = cfg.get(stage_key, []) + stage_ids = [stage["stage_id"] for stage in stage_entries if "stage_id" in stage] stage_config_path = modify_stage_config( stage_config_path, - deletes={"stage_args": {stage_id: ["engine_args.load_format"] for stage_id in stage_ids}}, + deletes={stage_key: {stage_id: [delete_path] for stage_id in stage_ids}}, ) server_args = params.server_args or [] @@ -1801,6 +1887,7 @@ def omni_server(request: pytest.FixtureRequest, run_level: str, model_prefix: st raise ValueError("omni_server with use_stage_cli=True requires use_omni=True") if stage_config_path is None: raise ValueError("omni_server with use_stage_cli=True requires a stage_config_path") + server_args += ["--stage-configs-path", stage_config_path] with OmniServerStageCli( model, @@ -3291,7 +3378,7 @@ def omni_runner(request, model_prefix): with _omni_server_lock: model, stage_config_path = request.param model = model_prefix + model - with OmniRunner(model, seed=42, stage_configs_path=stage_config_path) as runner: + with OmniRunner(model, seed=42, stage_configs_path=stage_config_path, stage_init_timeout=300) as runner: print("OmniRunner started successfully") yield runner print("OmniRunner stopping...") diff --git a/tests/core/sched/test_omni_scheduler_mixin.py b/tests/core/sched/test_omni_scheduler_mixin.py new file mode 100644 index 0000000000..e04a9c39fb --- /dev/null +++ b/tests/core/sched/test_omni_scheduler_mixin.py @@ -0,0 +1,129 @@ +"""Unit tests for OmniSchedulerMixin streaming session replacement. + +These tests pin the behavior of `_replace_session_with_streaming_update` against +current vLLM `Request` / `StreamingUpdate` (and Omni patches). 
When upgrading +vLLM, failures here should highlight incompatible changes to request state or +update payloads early. +""" + +from __future__ import annotations + +from dataclasses import replace + +import pytest + +# Imports must run in this order: vllm_omni applies patches to vllm.v1.request before +# Request / StreamingUpdate are bound in this module. Ruff isort would reorder them. +# isort: off +import vllm_omni # noqa: F401 - import for side effects (patch vLLM) +from vllm.sampling_params import SamplingParams +from vllm.v1.engine import EngineCoreEventType +from vllm.v1.request import Request, RequestStatus, StreamingUpdate +from vllm_omni.core.sched.omni_scheduler_mixin import OmniSchedulerMixin + +# isort: on + +pytestmark = [pytest.mark.core_model, pytest.mark.cpu] + + +class _SchedulerStub(OmniSchedulerMixin): + """Minimal scheduler surface required by OmniSchedulerMixin.""" + + def __init__(self, *, log_stats: bool = False) -> None: + self.num_waiting_for_streaming_input = 0 + self.log_stats = log_stats + + +def _make_request(**kwargs) -> Request: + sp = SamplingParams(max_tokens=8) + defaults = dict( + request_id="req-mixin-test", + prompt_token_ids=[1, 2, 3], + sampling_params=sp, + pooling_params=None, + arrival_time=100.0, + block_hasher=None, + ) + defaults.update(kwargs) + return Request(**defaults) + + +def _make_update(**kwargs) -> StreamingUpdate: + sp_new = SamplingParams(max_tokens=16) + defaults = dict( + mm_features=None, + prompt_token_ids=[10, 20], + max_tokens=32, + arrival_time=200.0, + sampling_params=sp_new, + ) + defaults.update(kwargs) + return StreamingUpdate(**defaults) + + +class TestReplaceSessionWithStreamingUpdate: + def test_resets_tokens_and_prompt_from_update(self) -> None: + sched = _SchedulerStub() + session = _make_request() + session.append_output_token_ids([7, 8]) + session.num_computed_tokens = 99 + session.status = RequestStatus.WAITING_FOR_STREAMING_REQ + + update = _make_update(prompt_token_ids=[40, 41, 42]) + 
sched.num_waiting_for_streaming_input = 3 + sched._replace_session_with_streaming_update(session, update) + + assert session._output_token_ids == [] + assert list(session._all_token_ids) == [40, 41, 42] + assert session.prompt_token_ids == [40, 41, 42] + assert session.num_computed_tokens == 0 + assert session.num_prompt_tokens == 3 + assert session.arrival_time == 200.0 + assert session.sampling_params is update.sampling_params + assert session.status == RequestStatus.WAITING + assert sched.num_waiting_for_streaming_input == 2 + + def test_none_prompt_token_ids_becomes_empty(self) -> None: + sched = _SchedulerStub() + session = _make_request() + session.status = RequestStatus.RUNNING + update = _make_update(prompt_token_ids=None) + sched._replace_session_with_streaming_update(session, update) + + assert session.prompt_token_ids == () + assert list(session._all_token_ids) == [] + assert session.num_prompt_tokens == 0 + assert sched.num_waiting_for_streaming_input == 0 + + def test_additional_information_cleared_when_update_omits_it(self) -> None: + sched = _SchedulerStub() + session = _make_request() + if not hasattr(session, "additional_information"): + pytest.skip("Request has no additional_information (Omni patch inactive?)") + session.additional_information = {"keep": True} + session.status = RequestStatus.RUNNING + + base = _make_update() + if not hasattr(base, "additional_information"): + pytest.skip("StreamingUpdate has no additional_information (Omni patch inactive?)") + update = replace(base, additional_information=None) + + sched._replace_session_with_streaming_update(session, update) + assert session.additional_information is None + + def test_does_not_decrement_waiting_when_not_streaming_status(self) -> None: + sched = _SchedulerStub() + session = _make_request() + session.status = RequestStatus.RUNNING + sched.num_waiting_for_streaming_input = 5 + sched._replace_session_with_streaming_update(session, _make_update()) + assert 
sched.num_waiting_for_streaming_input == 5 + + def test_records_queued_event_when_log_stats_enabled(self) -> None: + sched = _SchedulerStub(log_stats=True) + session = _make_request() + session.status = RequestStatus.WAITING_FOR_STREAMING_REQ + sched._replace_session_with_streaming_update(session, _make_update()) + + assert session.events + assert session.events[-1].type == EngineCoreEventType.QUEUED diff --git a/tests/dfx/conftest.py b/tests/dfx/conftest.py index 997f25e6e5..b8edeba9d5 100644 --- a/tests/dfx/conftest.py +++ b/tests/dfx/conftest.py @@ -40,22 +40,32 @@ def modify_stage(default_path, updates, deletes): def create_unique_server_params( configs: list[dict[str, Any]], stage_configs_dir: Path, -) -> list[tuple[str, str, str]]: +) -> list[tuple[str, str, str | None, str | None, tuple[str, ...]]]: unique_params = [] seen = set() for config in configs: test_name = config["test_name"] - model = config["server_params"]["model"] - stage_config_name = config["server_params"].get("stage_config_name") + server_params = config["server_params"] + model = server_params["model"] + stage_config_name = server_params.get("stage_config_name") if stage_config_name: stage_config_path = str(stage_configs_dir / stage_config_name) - delete = config["server_params"].get("delete", None) - update = config["server_params"].get("update", None) + delete = server_params.get("delete", None) + update = server_params.get("update", None) stage_config_path = modify_stage(stage_config_path, update, delete) else: stage_config_path = None - server_param = (test_name, model, stage_config_path) + stage_overrides = server_params.get("stage_overrides") + stage_overrides_json = json.dumps(stage_overrides) if stage_overrides else None + + # ``extra_cli_args`` passes raw CLI flags straight through to + # ``vllm_omni.entrypoints.cli.main serve`` — used for flags that + # don't map to stage-level overrides, e.g. ``--async-chunk`` / + # ``--no-async-chunk`` toggling the deploy-level async_chunk bool. 
+ extra_cli_args = tuple(server_params.get("extra_cli_args") or ()) + + server_param = (test_name, model, stage_config_path, stage_overrides_json, extra_cli_args) if server_param not in seen: seen.add(server_param) unique_params.append(server_param) diff --git a/tests/dfx/perf/scripts/run_benchmark.py b/tests/dfx/perf/scripts/run_benchmark.py index bea46f684b..0de60c6a54 100644 --- a/tests/dfx/perf/scripts/run_benchmark.py +++ b/tests/dfx/perf/scripts/run_benchmark.py @@ -48,8 +48,8 @@ def _get_config_file_from_argv() -> str | None: OMNI_RESULT_TEMPLATE_PATH = Path(__file__).parent / "result_omni_template.json" -STAGE_CONFIGS_DIR = Path(__file__).parent.parent / "stage_configs" -test_params = create_unique_server_params(BENCHMARK_CONFIGS, STAGE_CONFIGS_DIR) +DEPLOY_CONFIGS_DIR = Path(__file__).parent.parent / "deploy" +test_params = create_unique_server_params(BENCHMARK_CONFIGS, DEPLOY_CONFIGS_DIR) server_to_benchmark_mapping = create_test_parameter_mapping(BENCHMARK_CONFIGS) _omni_server_lock = threading.Lock() @@ -62,13 +62,19 @@ def omni_server(request): Multi-stage initialization can take 10-20+ minutes. """ with _omni_server_lock: - test_name, model, stage_config_path = request.param + test_name, model, stage_config_path, stage_overrides, extra_cli_args = request.param print(f"Starting OmniServer with test: {test_name}, model: {model}") server_args = ["--stage-init-timeout", "600", "--init-timeout", "900"] + # --deploy-config and --stage-overrides compose at the CLI (see vllm_omni/entrypoints/utils.py): + # deploy-config sets the base; stage-overrides are applied on top. Both can be set. 
if stage_config_path: - server_args = ["--stage-configs-path", stage_config_path] + server_args + server_args = ["--deploy-config", stage_config_path] + server_args + if stage_overrides: + server_args = ["--stage-overrides", stage_overrides] + server_args + if extra_cli_args: + server_args = list(extra_cli_args) + server_args with OmniServer(model, server_args) as server: server.test_name = test_name print("OmniServer started successfully") diff --git a/tests/dfx/perf/stage_configs/qwen3_omni.yaml b/tests/dfx/perf/stage_configs/qwen3_omni.yaml deleted file mode 100644 index 2add22b873..0000000000 --- a/tests/dfx/perf/stage_configs/qwen3_omni.yaml +++ /dev/null @@ -1,101 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. 
-async_chunk: false -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 1 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - 
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 100000 - hf_config_name: thinker_config - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/tests/dfx/perf/stage_configs/qwen3_tts.yaml b/tests/dfx/perf/stage_configs/qwen3_tts.yaml deleted file mode 100644 index 97b3090560..0000000000 --- a/tests/dfx/perf/stage_configs/qwen3_tts.yaml +++ /dev/null @@ -1,96 +0,0 @@ -# Stage config for running Qwen3-TTS with 2-stage architecture -# Stage 0: Talker (text -> 8-layer RVQ codec codes) -# Stage 1: Code2Wav (codec codes -> audio waveform) -# -# The following config has been verified on 1x H100-80G GPU. 
-async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - max_num_seqs: 4 - model_stage: qwen3_tts - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - max_num_seqs: 4 - model_stage: code2wav - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.2 - distributed_executor_backend: "mp" - max_num_batched_tokens: 8192 - max_model_len: 32768 - engine_input_source: [0] - final_output: true - final_output_type: audio - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 4 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - codec_streaming: true - 
connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - codec_chunk_frames: 25 - codec_left_context_frames: 72 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/tests/dfx/perf/tests/test_qwen_omni.json b/tests/dfx/perf/tests/test_qwen_omni.json index 4662f8c0c7..39fd266544 100644 --- a/tests/dfx/perf/tests/test_qwen_omni.json +++ b/tests/dfx/perf/tests/test_qwen_omni.json @@ -2,8 +2,7 @@ { "test_name": "test_qwen3_omni", "server_params": { - "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct", - "stage_config_name": "qwen3_omni.yaml" + "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct" }, "benchmark_params": [ { @@ -109,25 +108,7 @@ "test_name": "test_qwen3_omni_chunk", "server_params": { "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct", - "stage_config_name": "qwen3_omni.yaml", - "update": { - "async_chunk": true, - "stage_args": { - "0": { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk" - }, - "1": { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk" - } - } - }, - "delete": { - "stage_args": { - "2": [ - "custom_process_input_func" - ] - } - } + "extra_cli_args": ["--async-chunk"] }, "benchmark_params": [ { diff --git a/tests/dfx/stability/scripts/test_benchmark_stability.py b/tests/dfx/stability/scripts/test_benchmark_stability.py index a9faae8ab8..3d6b41e762 100644 --- a/tests/dfx/stability/scripts/test_benchmark_stability.py +++ b/tests/dfx/stability/scripts/test_benchmark_stability.py @@ -35,7 +35,7 @@ from tests.dfx.perf.scripts.run_benchmark import run_benchmark STABILITY_DIR = Path(__file__).resolve().parent.parent -STAGE_CONFIGS_DIR = STABILITY_DIR / "stage_configs" +DEPLOY_CONFIGS_DIR = STABILITY_DIR / "deploy" CONFIG_FILE_PATH = str(STABILITY_DIR / "tests" / "test.json") DEFAULT_NUM_PROMPTS_PER_BATCH = 20 @@ -45,7 +45,7 @@ except 
FileNotFoundError: BENCHMARK_CONFIGS = [] -test_params = create_unique_server_params(BENCHMARK_CONFIGS, STAGE_CONFIGS_DIR) if BENCHMARK_CONFIGS else [] +test_params = create_unique_server_params(BENCHMARK_CONFIGS, DEPLOY_CONFIGS_DIR) if BENCHMARK_CONFIGS else [] server_to_benchmark_mapping = create_test_parameter_mapping(BENCHMARK_CONFIGS) if BENCHMARK_CONFIGS else {} _omni_server_lock = threading.Lock() @@ -219,11 +219,20 @@ def omni_server(request): Multi-stage initialization can take 10-20+ minutes. """ with _omni_server_lock: - test_name, model, stage_config_path = request.param + test_name, model, stage_config_path, stage_overrides, extra_cli_args = request.param print(f"Starting OmniServer with test: {test_name}, model: {model}") - with OmniServer(model, ["--stage-configs-path", stage_config_path, "--stage-init-timeout", "120"]) as server: + server_args = ["--stage-init-timeout", "120"] + # --deploy-config and --stage-overrides compose at the CLI (see vllm_omni/entrypoints/utils.py): + # deploy-config sets the base; stage-overrides are applied on top. Both can be set. 
+ if stage_config_path: + server_args = ["--deploy-config", stage_config_path] + server_args + if stage_overrides: + server_args = ["--stage-overrides", stage_overrides] + server_args + if extra_cli_args: + server_args = list(extra_cli_args) + server_args + with OmniServer(model, server_args) as server: server.test_name = test_name print("OmniServer started successfully") yield server diff --git a/tests/dfx/stability/stage_configs/qwen3_omni.yaml b/tests/dfx/stability/stage_configs/qwen3_omni.yaml deleted file mode 100644 index 802f8dd249..0000000000 --- a/tests/dfx/stability/stage_configs/qwen3_omni.yaml +++ /dev/null @@ -1,101 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. -async_chunk: false -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type to launch OmniLLM - runtime: - devices: "0" - max_batch_size: 64 - engine_args: - model_stage: thinker - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 1 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - stage_type: llm # Use llm stage type to launch OmniLLM - runtime: - devices: "1" - max_batch_size: 64 - engine_args: - model_stage: talker - model_arch: 
Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - stage_type: llm # Use llm stage type to launch OmniLLM - runtime: - devices: "1" - max_batch_size: 64 - engine_args: - model_stage: code2wav - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/tests/dfx/stability/tests/test.json b/tests/dfx/stability/tests/test.json index 95993c9c55..255cd5b109 100644 --- a/tests/dfx/stability/tests/test.json +++ b/tests/dfx/stability/tests/test.json @@ -3,7 +3,11 @@ "test_name": "test_qwen3_omni_stability", "server_params": { "model": 
"Qwen/Qwen3-Omni-30B-A3B-Instruct", - "stage_config_name": "qwen3_omni.yaml" + "stage_overrides": { + "2": { + "max_num_batched_tokens": 1000000 + } + } }, "benchmark_params": [ { @@ -36,25 +40,12 @@ "test_name": "test_qwen3_omni_stability_async_chunk", "server_params": { "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct", - "stage_config_name": "qwen3_omni.yaml", - "update": { - "async_chunk": true, - "stage_args": { - "0": { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk" - }, - "1": { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk" - } + "stage_overrides": { + "2": { + "max_num_batched_tokens": 1000000 } }, - "delete": { - "stage_args": { - "2": [ - "custom_process_input_func" - ] - } - } + "extra_cli_args": ["--async-chunk"] }, "benchmark_params": [ { diff --git a/tests/diffusion/layers/test_rotary_emb_equivalence.py b/tests/diffusion/layers/test_rotary_emb_equivalence.py new file mode 100644 index 0000000000..2fbb7a31f5 --- /dev/null +++ b/tests/diffusion/layers/test_rotary_emb_equivalence.py @@ -0,0 +1,112 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +Numerical equivalence tests for rotary embedding implementations (#2436). + +Verifies that the optimized stack+flatten RoPE produces bit-identical results +to the original strided-slice implementation across various tensor shapes and +dtypes, ensuring the refactor is safe. 
+""" + +from __future__ import annotations + +import pytest +import torch + + +def _apply_rotary_emb_helios_original( + hidden_states: torch.Tensor, + freqs_cis: torch.Tensor, +) -> torch.Tensor: + """Original Helios RoPE using strided slice assignment (pre-#2436).""" + x_1, x_2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1) + cos, sin = freqs_cis.unsqueeze(-2).chunk(2, dim=-1) + out = torch.empty_like(hidden_states) + out[..., 0::2] = x_1 * cos[..., 0::2] - x_2 * sin[..., 1::2] + out[..., 1::2] = x_1 * sin[..., 1::2] + x_2 * cos[..., 0::2] + return out.type_as(hidden_states) + + +def _apply_rotary_emb_helios_optimized( + hidden_states: torch.Tensor, + freqs_cis: torch.Tensor, +) -> torch.Tensor: + """Optimized Helios RoPE using stack+flatten (post-#2436).""" + x_1, x_2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1) + cos, sin = freqs_cis.unsqueeze(-2).chunk(2, dim=-1) + rotated = torch.stack( + ( + x_1 * cos[..., 0::2] - x_2 * sin[..., 1::2], + x_1 * sin[..., 1::2] + x_2 * cos[..., 0::2], + ), + dim=-1, + ) + return rotated.flatten(-2, -1).type_as(hidden_states) + + +def _make_inputs( + batch: int, + seq_len: int, + num_heads: int, + head_dim: int, + dtype: torch.dtype = torch.float32, +) -> tuple[torch.Tensor, torch.Tensor]: + """Generate random hidden_states and freqs_cis for testing.""" + torch.manual_seed(42) + hidden_states = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype) + # freqs_cis: [B, seq, head_dim*2] — cos and sin concatenated along last dim + freqs_cis = torch.randn(batch, seq_len, head_dim * 2, dtype=dtype) + return hidden_states, freqs_cis + + +class TestHeliosRoPEEquivalence: + """Verify optimized Helios RoPE is numerically identical to original.""" + + @pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16]) + def test_equivalence_across_dtypes(self, dtype: torch.dtype) -> None: + """Optimized output must be bit-identical to original across dtypes.""" + hidden, freqs = _make_inputs(2, 16, 8, 64, 
dtype=dtype) + original = _apply_rotary_emb_helios_original(hidden, freqs) + optimized = _apply_rotary_emb_helios_optimized(hidden, freqs) + torch.testing.assert_close(optimized, original, atol=0, rtol=0) + + @pytest.mark.parametrize( + "batch,seq_len,num_heads,head_dim", + [ + (1, 8, 1, 32), # minimal: single batch, single head + (2, 16, 8, 64), # typical transformer config + (1, 8192, 4, 64), # video-scale patch tokens (720p DiT) + (4, 32, 16, 128), # large head_dim + ], + ) + def test_equivalence_across_shapes(self, batch: int, seq_len: int, num_heads: int, head_dim: int) -> None: + """Equivalence must hold across different tensor shapes.""" + hidden, freqs = _make_inputs(batch, seq_len, num_heads, head_dim) + original = _apply_rotary_emb_helios_original(hidden, freqs) + optimized = _apply_rotary_emb_helios_optimized(hidden, freqs) + torch.testing.assert_close(optimized, original, atol=0, rtol=0) + + def test_output_contiguous(self) -> None: + """Optimized output should be contiguous in memory.""" + hidden, freqs = _make_inputs(2, 16, 8, 64) + optimized = _apply_rotary_emb_helios_optimized(hidden, freqs) + assert optimized.is_contiguous() + + def test_output_shape_preserved(self) -> None: + """Output shape must match input shape.""" + hidden, freqs = _make_inputs(2, 16, 8, 64) + optimized = _apply_rotary_emb_helios_optimized(hidden, freqs) + assert optimized.shape == hidden.shape + + def test_output_dtype_preserved(self) -> None: + """Output dtype must match input dtype.""" + hidden, freqs = _make_inputs(2, 16, 8, 64, dtype=torch.float16) + optimized = _apply_rotary_emb_helios_optimized(hidden, freqs) + assert optimized.dtype == hidden.dtype + + def test_odd_head_dim_raises(self) -> None: + """Odd head_dim should fail at unflatten (not a valid RoPE config).""" + hidden = torch.randn(1, 4, 2, 63) + freqs = torch.randn(1, 4, 126) + with pytest.raises(RuntimeError): + _apply_rotary_emb_helios_optimized(hidden, freqs) diff --git 
a/tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py b/tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py index 3cdda1f9ff..bec82e0257 100644 --- a/tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py +++ b/tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py @@ -567,6 +567,7 @@ def test_wan22_i2v_diffusers_offline_generates_video( @pytest.mark.benchmark @pytest.mark.diffusion @hardware_test(res={"cuda": "H100"}, num_cards=2) +@pytest.mark.skip(reason="issue: #2874") @pytest.mark.parametrize("omni_server", SERVER_CASES, indirect=True) def test_wan22_i2v_online_serving_generates_video( omni_server, diff --git a/tests/e2e/offline_inference/stage_configs/bagel_mooncake_ci.yaml b/tests/e2e/offline_inference/stage_configs/bagel_mooncake_ci.yaml index 1f0d06cb8c..b7768c071f 100644 --- a/tests/e2e/offline_inference/stage_configs/bagel_mooncake_ci.yaml +++ b/tests/e2e/offline_inference/stage_configs/bagel_mooncake_ci.yaml @@ -64,9 +64,6 @@ stage_args: # Top-level runtime config with Mooncake connector runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: mooncake_connector: name: MooncakeConnector @@ -80,4 +77,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/tests/e2e/offline_inference/stage_configs/bagel_sharedmemory_ci.yaml b/tests/e2e/offline_inference/stage_configs/bagel_sharedmemory_ci.yaml index 36b1d2bbe4..504f3c98e9 100644 --- a/tests/e2e/offline_inference/stage_configs/bagel_sharedmemory_ci.yaml +++ b/tests/e2e/offline_inference/stage_configs/bagel_sharedmemory_ci.yaml @@ -62,10 +62,6 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 - # Distributed connectors configuration (optional) # More connectors will be supported in the future. 
connectors: @@ -78,4 +74,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/tests/e2e/offline_inference/stage_configs/npu/qwen2_5_omni_ci.yaml b/tests/e2e/offline_inference/stage_configs/npu/qwen2_5_omni_ci.yaml deleted file mode 100644 index f93a6c7147..0000000000 --- a/tests/e2e/offline_inference/stage_configs/npu/qwen2_5_omni_ci.yaml +++ /dev/null @@ -1,103 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# This config is optimized for CI e2e tests. -stage_args: - - stage_id: 0 - runtime: - process: true # Run this stage in a separate process - devices: "0" - engine_args: - model_stage: thinker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 896 - max_num_batched_tokens: 896 - max_num_seqs: 1 - gpu_memory_utilization: 0.8 - skip_mm_profiling: true - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - mm_processor_cache_gb: 0 - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 128 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 896 - max_num_batched_tokens: 896 - max_num_seqs: 1 - gpu_memory_utilization: 0.8 - skip_mm_profiling: true - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 128 - 
seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - stage_id: 2 - runtime: - process: true - devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.15 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 128 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/tests/e2e/offline_inference/test_bagel_img2img.py b/tests/e2e/offline_inference/test_bagel_img2img.py index 63d2a37da7..be79aa7348 100644 --- a/tests/e2e/offline_inference/test_bagel_img2img.py +++ b/tests/e2e/offline_inference/test_bagel_img2img.py @@ -32,30 +32,30 @@ # prompt='Change the grass color to red', # input image: 2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg REFERENCE_PIXELS = [ - {"position": (100, 100), "rgb": (157, 172, 217)}, - {"position": (400, 50), "rgb": (105, 144, 218)}, - {"position": (700, 100), "rgb": (118, 159, 233)}, - {"position": (150, 400), "rgb": (195, 34, 60)}, - {"position": (512, 336), "rgb": (222, 214, 193)}, - {"position": (700, 400), "rgb": (197, 15, 43)}, - 
{"position": (100, 600), "rgb": (105, 13, 18)}, - {"position": (400, 600), "rgb": (169, 33, 44)}, - {"position": (700, 600), "rgb": (101, 86, 93)}, - {"position": (256, 256), "rgb": (181, 202, 222)}, + {"position": (100, 100), "rgb": (156, 172, 217)}, + {"position": (400, 50), "rgb": (105, 144, 217)}, + {"position": (700, 100), "rgb": (118, 159, 232)}, + {"position": (150, 400), "rgb": (180, 22, 52)}, + {"position": (512, 336), "rgb": (221, 211, 194)}, + {"position": (700, 400), "rgb": (192, 10, 46)}, + {"position": (100, 600), "rgb": (102, 12, 22)}, + {"position": (400, 600), "rgb": (161, 28, 47)}, + {"position": (700, 600), "rgb": (100, 87, 94)}, + {"position": (256, 256), "rgb": (181, 201, 221)}, ] if current_omni_platform.is_rocm(): REFERENCE_PIXELS = [ - {"position": (100, 100), "rgb": (156, 172, 215)}, - {"position": (400, 50), "rgb": (106, 144, 216)}, - {"position": (700, 100), "rgb": (118, 158, 231)}, - {"position": (150, 400), "rgb": (183, 23, 48)}, - {"position": (512, 336), "rgb": (218, 215, 191)}, - {"position": (700, 400), "rgb": (194, 14, 42)}, - {"position": (100, 600), "rgb": (105, 10, 16)}, - {"position": (400, 600), "rgb": (167, 33, 46)}, - {"position": (700, 600), "rgb": (102, 86, 92)}, - {"position": (256, 256), "rgb": (181, 201, 220)}, + {"position": (100, 100), "rgb": (156, 172, 217)}, + {"position": (400, 50), "rgb": (105, 144, 217)}, + {"position": (700, 100), "rgb": (118, 159, 232)}, + {"position": (150, 400), "rgb": (180, 22, 52)}, + {"position": (512, 336), "rgb": (221, 211, 194)}, + {"position": (700, 400), "rgb": (192, 10, 46)}, + {"position": (100, 600), "rgb": (102, 12, 22)}, + {"position": (400, 600), "rgb": (161, 28, 47)}, + {"position": (700, 600), "rgb": (100, 87, 94)}, + {"position": (256, 256), "rgb": (181, 201, 221)}, ] PIXEL_TOLERANCE = 10 diff --git a/tests/e2e/offline_inference/test_bagel_text2img.py b/tests/e2e/offline_inference/test_bagel_text2img.py index e45d64f2ac..534b873068 100644 --- 
a/tests/e2e/offline_inference/test_bagel_text2img.py +++ b/tests/e2e/offline_inference/test_bagel_text2img.py @@ -37,30 +37,30 @@ # "Generated with seed=52, num_inference_steps=15, # prompt='A futuristic city skyline at twilight, cyberpunk style'" REFERENCE_PIXELS = [ - {"position": (100, 100), "rgb": (121, 118, 100)}, - {"position": (400, 50), "rgb": (163, 162, 143)}, - {"position": (700, 100), "rgb": (170, 156, 127)}, - {"position": (150, 400), "rgb": (129, 127, 112)}, - {"position": (512, 512), "rgb": (135, 61, 59)}, - {"position": (700, 400), "rgb": (205, 107, 43)}, - {"position": (100, 700), "rgb": (197, 177, 157)}, - {"position": (400, 700), "rgb": (139, 107, 86)}, - {"position": (700, 700), "rgb": (247, 205, 146)}, - {"position": (256, 256), "rgb": (171, 160, 153)}, + {"position": (100, 100), "rgb": (115, 113, 94)}, + {"position": (400, 50), "rgb": (159, 160, 144)}, + {"position": (700, 100), "rgb": (164, 151, 123)}, + {"position": (150, 400), "rgb": (120, 121, 107)}, + {"position": (512, 512), "rgb": (165, 133, 127)}, + {"position": (700, 400), "rgb": (217, 130, 66)}, + {"position": (100, 700), "rgb": (191, 168, 152)}, + {"position": (400, 700), "rgb": (130, 96, 77)}, + {"position": (700, 700), "rgb": (247, 203, 140)}, + {"position": (256, 256), "rgb": (167, 156, 150)}, ] if current_omni_platform.is_rocm(): REFERENCE_PIXELS = [ - {"position": (100, 100), "rgb": (123, 119, 100)}, - {"position": (400, 50), "rgb": (162, 161, 142)}, - {"position": (700, 100), "rgb": (171, 156, 127)}, - {"position": (150, 400), "rgb": (131, 128, 112)}, - {"position": (512, 512), "rgb": (134, 61, 59)}, - {"position": (700, 400), "rgb": (204, 107, 43)}, - {"position": (100, 700), "rgb": (201, 180, 165)}, - {"position": (400, 700), "rgb": (140, 108, 87)}, - {"position": (700, 700), "rgb": (247, 205, 145)}, - {"position": (256, 256), "rgb": (171, 160, 153)}, + {"position": (100, 100), "rgb": (115, 113, 94)}, + {"position": (400, 50), "rgb": (159, 160, 144)}, + {"position": (700, 
100), "rgb": (164, 151, 123)}, + {"position": (150, 400), "rgb": (120, 121, 107)}, + {"position": (512, 512), "rgb": (165, 133, 127)}, + {"position": (700, 400), "rgb": (217, 130, 66)}, + {"position": (100, 700), "rgb": (191, 168, 152)}, + {"position": (400, 700), "rgb": (130, 96, 77)}, + {"position": (700, 700), "rgb": (247, 203, 140)}, + {"position": (256, 256), "rgb": (167, 156, 150)}, ] # Maximum allowed difference per color channel diff --git a/tests/e2e/offline_inference/test_qwen2_5_omni.py b/tests/e2e/offline_inference/test_qwen2_5_omni.py index 4c4315aab9..4500ebfbe2 100644 --- a/tests/e2e/offline_inference/test_qwen2_5_omni.py +++ b/tests/e2e/offline_inference/test_qwen2_5_omni.py @@ -2,8 +2,6 @@ E2E tests for Qwen2.5-Omni model with mixed modality inputs, audio and text output. """ -from pathlib import Path - import pytest from tests.conftest import ( @@ -12,36 +10,31 @@ generate_synthetic_video, modify_stage_config, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test from vllm_omni.platforms import current_omni_platform models = ["Qwen/Qwen2.5-Omni-7B"] +# Single CI deploy YAML; rocm/xpu deltas are picked automatically via the +# platforms: section. NPU still uses the legacy per-platform YAML until it +# also migrates to the new schema. 
+_CI_DEPLOY = get_deploy_config_path("ci/qwen2_5_omni.yaml") + def get_cuda_graph_config(): - path = modify_stage_config( - str(Path(__file__).parent.parent / "stage_configs" / "qwen2_5_omni_ci.yaml"), + return modify_stage_config( + _CI_DEPLOY, updates={ - "stage_args": { - 0: { - "engine_args.enforce_eager": "true", - }, - 1: {"engine_args.enforce_eager": "true"}, + "stages": { + 0: {"enforce_eager": True}, + 1: {"enforce_eager": True}, }, }, ) - return path - - -# CI stage config optimized for 24GB GPU (L4/RTX3090) or NPU -if current_omni_platform.is_npu(): - stage_config = str(Path(__file__).parent / "stage_configs" / "npu" / "qwen2_5_omni_ci.yaml") -elif current_omni_platform.is_rocm(): - # ROCm stage config optimized for MI325 GPU - stage_config = str(Path(__file__).parent.parent / "stage_configs" / "rocm" / "qwen2_5_omni_ci.yaml") -elif current_omni_platform.is_xpu(): - # Intel XPU stage config optimized for B60 GPU - stage_config = str(Path(__file__).parent.parent / "stage_configs" / "xpu" / "qwen2_5_omni_ci.yaml") + + +if current_omni_platform.is_rocm() or current_omni_platform.is_xpu() or current_omni_platform.is_npu(): + stage_config = _CI_DEPLOY else: stage_config = get_cuda_graph_config() diff --git a/tests/e2e/offline_inference/test_qwen3_omni.py b/tests/e2e/offline_inference/test_qwen3_omni.py index cc0af437ec..0df89c3e88 100644 --- a/tests/e2e/offline_inference/test_qwen3_omni.py +++ b/tests/e2e/offline_inference/test_qwen3_omni.py @@ -7,41 +7,37 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import ( generate_synthetic_video, modify_stage_config, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test from vllm_omni.platforms import current_omni_platform models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"] +# Single CI deploy YAML; rocm/xpu deltas are picked automatically via the +# 
platforms: section. Only CUDA needs an extra enforce_eager tweak. +_CI_DEPLOY = get_deploy_config_path("ci/qwen3_omni_moe.yaml") + + def get_cuda_graph_config(): - path = modify_stage_config( - str(Path(__file__).parent.parent / "stage_configs" / "qwen3_omni_ci.yaml"), + return modify_stage_config( + _CI_DEPLOY, updates={ - "stage_args": { - 0: { - "engine_args.enforce_eager": "true", - }, - 1: {"engine_args.enforce_eager": "true"}, + "stages": { + 0: {"enforce_eager": True}, + 1: {"enforce_eager": True}, }, }, ) - return path -# CI stage config for 2xH100-80G GPUs or AMD GPU MI325 -if current_omni_platform.is_rocm(): - # ROCm stage config optimized for MI325 GPU - stage_configs = [str(Path(__file__).parent.parent / "stage_configs" / "rocm" / "qwen3_omni_ci.yaml")] -elif current_omni_platform.is_xpu(): - stage_configs = [str(Path(__file__).parent.parent / "stage_configs" / "xpu" / "qwen3_omni_ci.yaml")] +if current_omni_platform.is_rocm() or current_omni_platform.is_xpu(): + stage_configs = [_CI_DEPLOY] else: stage_configs = [get_cuda_graph_config()] diff --git a/tests/e2e/offline_inference/test_qwen3_tts_base.py b/tests/e2e/offline_inference/test_qwen3_tts_base.py index be7bd50a36..a706798043 100644 --- a/tests/e2e/offline_inference/test_qwen3_tts_base.py +++ b/tests/e2e/offline_inference/test_qwen3_tts_base.py @@ -13,12 +13,10 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import modify_stage_config -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-0.6B-Base" REF_AUDIO_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav" @@ -26,23 +24,31 @@ def get_cuda_graph_config(): - path = modify_stage_config( - get_stage_config(), + """Build a temp deploy yaml mirroring the deleted qwen3_tts_no_async_chunk.yaml. 
+ + Composes the synchronous (no-async-chunk) variant on top of the bundled + qwen3_tts.yaml prod default, with cudagraphs disabled. Replaces the deleted + standalone variant yaml; same effective config, no checked-in file needed. + """ + return modify_stage_config( + get_deploy_config_path("qwen3_tts.yaml"), updates={ - "stage_args": { + "async_chunk": False, + "stages": { 0: { - "engine_args.enforce_eager": "true", + "max_num_seqs": 1, + "gpu_memory_utilization": 0.2, + "enforce_eager": True, + "async_scheduling": False, + }, + 1: { + "gpu_memory_utilization": 0.2, + "enforce_eager": True, + "async_scheduling": False, }, - 1: {"engine_args.enforce_eager": "true"}, }, }, ) - return path - - -def get_stage_config(name: str = "qwen3_tts_no_async_chunk.yaml"): - """Get the no_async_chunk stage config path (async_chunk disable, cuda_graph disabled).""" - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) # Same structure as test_qwen3_omni: models, stage_configs, test_params diff --git a/tests/e2e/offline_inference/test_qwen3_tts_customvoice.py b/tests/e2e/offline_inference/test_qwen3_tts_customvoice.py index 67d72df908..cf411349c3 100644 --- a/tests/e2e/offline_inference/test_qwen3_tts_customvoice.py +++ b/tests/e2e/offline_inference/test_qwen3_tts_customvoice.py @@ -13,34 +13,40 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import modify_stage_config -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" def get_cuda_graph_config(): - path = modify_stage_config( - get_stage_config(), + """Build a temp deploy yaml mirroring the deleted qwen3_tts_no_async_chunk.yaml. + + Composes the synchronous (no-async-chunk) variant on top of the bundled + qwen3_tts.yaml prod default, with cudagraphs disabled. 
Replaces the deleted + standalone variant yaml; same effective config, no checked-in file needed. + """ + return modify_stage_config( + get_deploy_config_path("qwen3_tts.yaml"), updates={ - "stage_args": { + "async_chunk": False, + "stages": { 0: { - "engine_args.enforce_eager": "true", + "max_num_seqs": 1, + "gpu_memory_utilization": 0.2, + "enforce_eager": True, + "async_scheduling": False, + }, + 1: { + "gpu_memory_utilization": 0.2, + "enforce_eager": True, + "async_scheduling": False, }, - 1: {"engine_args.enforce_eager": "true"}, }, }, ) - return path - - -def get_stage_config(name: str = "qwen3_tts_no_async_chunk.yaml"): - """Get the no_async_chunk stage config path (async_chunk disable, cuda_graph disabled).""" - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) # Same structure as test_qwen3_omni: models, stage_configs, test_params diff --git a/tests/e2e/offline_inference/test_voxcpm2.py b/tests/e2e/offline_inference/test_voxcpm2.py index 6ec4630a45..e37d3f74df 100644 --- a/tests/e2e/offline_inference/test_voxcpm2.py +++ b/tests/e2e/offline_inference/test_voxcpm2.py @@ -100,3 +100,31 @@ def test_voxcpm2_voice_clone_002(voxcpm2_engine): audio = _extract_audio(outputs[0].outputs[0].multimodal_output) duration_s = audio.shape[0] / SAMPLE_RATE assert 0.5 < duration_s < 30.0, f"Audio duration out of range: {duration_s:.2f}s" + + +@pytest.mark.core_model +@pytest.mark.omni +@hardware_test(res={"cuda": "L4"}, num_cards=1) +def test_voxcpm2_prefill_decode_mixed_batch_003(voxcpm2_engine): + """Regression: prefill+decode mixed batch must not crash (PR #2903).""" + long_prompt = ( + "This is a deliberately long prompt that will stay in the decode " + "phase for many steps so that subsequent shorter prompts keep " + "entering prefill alongside it, reproducing the prefill plus " + "decode mixed batch scheduling pattern." 
+ ) + short_prompts = [ + "Hello one.", + "Hello two.", + "Hello three.", + "Hello four.", + ] + requests = [{"prompt": long_prompt}] + [{"prompt": p} for p in short_prompts] + + outputs = voxcpm2_engine.generate(requests) + assert len(outputs) == len(requests) + + for i, out in enumerate(outputs): + audio = _extract_audio(out.outputs[0].multimodal_output) + duration_s = audio.shape[0] / SAMPLE_RATE + assert 0.1 < duration_s < 30.0, f"Request {i} audio duration out of range: {duration_s:.2f}s" diff --git a/tests/e2e/online_serving/test_bagel_expansion.py b/tests/e2e/online_serving/test_bagel_expansion.py index e2d75e0d19..d801020c9d 100644 --- a/tests/e2e/online_serving/test_bagel_expansion.py +++ b/tests/e2e/online_serving/test_bagel_expansion.py @@ -88,7 +88,7 @@ def _get_diffusion_feature_cases(model: str): ], ), id="parallel_tp_2", - marks=PARALLEL_FEATURE_MARKS, + marks=[*PARALLEL_FEATURE_MARKS, pytest.mark.skip(reason="issue: #2862")], ), # Ulysses-SP degree=2 (2 GPUs) pytest.param( diff --git a/tests/e2e/online_serving/test_nextstep_expansion.py b/tests/e2e/online_serving/test_nextstep_expansion.py new file mode 100644 index 0000000000..cd3d7f9bca --- /dev/null +++ b/tests/e2e/online_serving/test_nextstep_expansion.py @@ -0,0 +1,71 @@ +""" +Online serving E2E for NextStep-1.1 text-to-image (tensor parallel). +""" + +import os + +import pytest + +from tests.conftest import ( + OmniServer, + OmniServerParams, + OpenAIClientHandler, + dummy_messages_from_mix_data, +) +from tests.utils import hardware_marks + +# L4: 4 GPUs + TP=4; XPU B60: 2 cards (use num_cards={"cuda": 4, "xpu": 4} if needed) +FOUR_CARD_MARKS = hardware_marks( + res={"cuda": "L4", "xpu": "B60"}, + num_cards={"cuda": 2, "xpu": 2}, +) + +POSITIVE_PROMPT = "A small red barn in a snowy field, simple illustration." 
+NEGATIVE_PROMPT = "blurry, low quality" + +_DEFAULT_MODEL = "stepfun-ai/NextStep-1.1" + + +def _get_diffusion_feature_cases(model: str): + """Single online config: TP=4, explicit pipeline class.""" + return [ + pytest.param( + OmniServerParams( + model=model, + server_args=[ + "--tensor-parallel-size", + "2", + "--model-class-name", + "NextStep11Pipeline", + ], + ), + id="nextstep_tp4_pipeline", + marks=FOUR_CARD_MARKS, + ), + ] + + +@pytest.mark.advanced_model +@pytest.mark.diffusion +@pytest.mark.parametrize( + "omni_server", + _get_diffusion_feature_cases(model=os.environ.get("VLLM_TEST_NEXTSTEP_MODEL", _DEFAULT_MODEL)), + indirect=True, +) +def test_nextstep_11(omni_server: OmniServer, openai_client: OpenAIClientHandler): + messages = dummy_messages_from_mix_data(content_text=POSITIVE_PROMPT) + request_config = { + "model": omni_server.model, + "messages": messages, + "extra_body": { + "height": 512, + "width": 512, + "num_inference_steps": 2, + "guidance_scale": 5.0, + "guidance_scale_2": 1.0, + "negative_prompt": NEGATIVE_PROMPT, + "seed": 42, + }, + } + + openai_client.send_diffusion_request(request_config) diff --git a/tests/e2e/online_serving/test_qwen2_5_omni.py b/tests/e2e/online_serving/test_qwen2_5_omni.py index e2913ce021..ba333e498c 100644 --- a/tests/e2e/online_serving/test_qwen2_5_omni.py +++ b/tests/e2e/online_serving/test_qwen2_5_omni.py @@ -3,7 +3,6 @@ """ import os -from pathlib import Path import pytest @@ -15,8 +14,7 @@ generate_synthetic_video, modify_stage_config, ) -from tests.utils import hardware_test -from vllm_omni.platforms import current_omni_platform +from tests.utils import get_deploy_config_path, hardware_test os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" @@ -24,20 +22,9 @@ models = ["Qwen/Qwen2.5-Omni-7B"] - -def get_config(): - path = modify_stage_config( - str(Path(__file__).parent.parent / "stage_configs" / "qwen2_5_omni_ci.yaml"), - ) - return path - - -# CI stage config 
for 2xH100-80G GPUs or AMD GPU MI325 -if current_omni_platform.is_rocm(): - # ROCm stage config optimized for MI325 GPU - stage_configs = [str(Path(__file__).parent.parent / "stage_configs" / "rocm" / "qwen2_5_omni_ci.yaml")] -else: - stage_configs = [get_config()] +# Single CI deploy YAML; rocm/xpu deltas are picked automatically via the +# platforms: section in vllm_omni/deploy/ci/qwen2_5_omni.yaml. +stage_configs = [modify_stage_config(get_deploy_config_path("ci/qwen2_5_omni.yaml"))] # Create parameter combinations for model and stage config test_params = [ diff --git a/tests/e2e/online_serving/test_qwen3_omni.py b/tests/e2e/online_serving/test_qwen3_omni.py index 9737fa42bd..62eca6349f 100644 --- a/tests/e2e/online_serving/test_qwen3_omni.py +++ b/tests/e2e/online_serving/test_qwen3_omni.py @@ -3,7 +3,6 @@ """ import os -from pathlib import Path import pytest @@ -15,7 +14,7 @@ generate_synthetic_video, modify_stage_config, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test from vllm_omni.platforms import current_omni_platform os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" @@ -23,32 +22,24 @@ models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"] -QWEN3_OMNI_CONFIG_PATH = str(Path(__file__).parent.parent / "stage_configs" / "qwen3_omni_ci.yaml") -QWEN3_OMNI_XPU_CONFIG_PATH = str(Path(__file__).parent.parent / "stage_configs" / "xpu" / "qwen3_omni_ci.yaml") -_STAGE_CONFIGS_DIR = Path(__file__).parent.parent / "stage_configs" -_PD_SEP_CONFIG = str(_STAGE_CONFIGS_DIR / "qwen3_omni_moe_pd_ci.yaml") +# Set VLLM_TEST_PD_MODE=1 to test PD disaggregation (follow-up — deploy overlay not yet migrated). 
+_USE_PD = os.environ.get("VLLM_TEST_PD_MODE", "0") == "1" + +_CI_DEPLOY = get_deploy_config_path("ci/qwen3_omni_moe.yaml") def get_chunk_config(config_path: str | None = None): - """Load qwen3_omni_ci.yaml with async_chunk modifications for streaming mode.""" + """Load the qwen3_omni CI deploy yaml with async_chunk modifications for streaming mode.""" if config_path is None: - config_path = str(_STAGE_CONFIGS_DIR / "qwen3_omni_ci.yaml") - return modify_stage_config( - config_path, - updates={ - "async_chunk": True, - "stage_args": { - 0: { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk" - }, - 1: { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk" - }, - }, - }, - deletes={"stage_args": {2: ["custom_process_input_func"]}}, - ) + config_path = _CI_DEPLOY + # TODO: remove this workaround once legacy `stage_args` path is deleted. + # The pipeline (qwen3_omni/pipeline.py) already wires + # thinker2talker_async_chunk / talker2code2wav_async_chunk on stage 0/1, + # so only async_chunk needs flipping. Writing nested `engine_args:` into + # the new-schema overlay trips _parse_stage_deploy's legacy branch and + # drops flat fields (load_format, max_num_seqs, ...). + return modify_stage_config(config_path, updates={"async_chunk": True}) def get_prefix_caching_config(config_path: str): @@ -64,21 +55,16 @@ def get_prefix_caching_config(config_path: str): return path -# Set VLLM_TEST_PD_MODE=1 to test PD disaggregation, default tests async_chunk mode. 
-_USE_PD = os.environ.get("VLLM_TEST_PD_MODE", "0") == "1" - -# Stage configs for H100/CUDA, ROCm MI325, and XPU platforms -if current_omni_platform.is_rocm(): - rocm_config = str(_STAGE_CONFIGS_DIR / "rocm" / "qwen3_omni_ci.yaml") - stage_configs = [rocm_config] - prefix_caching_stage_configs = [get_prefix_caching_config(rocm_config)] -elif current_omni_platform.is_xpu(): - xpu_config = str(_STAGE_CONFIGS_DIR / "xpu" / "qwen3_omni_ci.yaml") - stage_configs = [xpu_config] - prefix_caching_stage_configs = [get_prefix_caching_config(xpu_config)] -else: - stage_configs = [_PD_SEP_CONFIG if _USE_PD else get_chunk_config(QWEN3_OMNI_CONFIG_PATH)] - prefix_caching_stage_configs = [get_prefix_caching_config(QWEN3_OMNI_CONFIG_PATH)] +# Platform-specific overrides live inside the new deploy yaml's ``platforms:`` +# section, so a single ``_CI_DEPLOY`` path serves CUDA, ROCm, and XPU. +# TODO: re-add VLLM_TEST_PD_MODE branch once the PD-disaggregation deploy +# overlay has been migrated to the new schema (previously used the deleted +# ``qwen3_omni_moe_pd_ci.yaml`` stage-configs file). 
+if current_omni_platform.is_xpu(): + stage_configs = [_CI_DEPLOY] +else: # CUDA + ROCm MI325 share the same deploy config + stage_configs = [get_chunk_config()] +prefix_caching_stage_configs = [get_prefix_caching_config(_CI_DEPLOY)] # Create parameter combinations for model and stage config test_params = [ diff --git a/tests/e2e/online_serving/test_qwen3_omni_expansion.py b/tests/e2e/online_serving/test_qwen3_omni_expansion.py index 06847f3d51..acec0efde2 100644 --- a/tests/e2e/online_serving/test_qwen3_omni_expansion.py +++ b/tests/e2e/online_serving/test_qwen3_omni_expansion.py @@ -6,10 +6,7 @@ import os -from vllm_omni.platforms import current_omni_platform - os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" -from pathlib import Path import pytest @@ -21,7 +18,7 @@ generate_synthetic_video, modify_stage_config, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test model = "Qwen/Qwen3-Omni-30B-A3B-Instruct" @@ -40,47 +37,56 @@ LONG_AUDIO_DURATION_SEC = 120 -def get_chunk_config(default_path): - path = modify_stage_config( +def get_batch_token_config(default_path): + """Override stage 1's max_num_batched_tokens to exercise small-batch paths. + + Uses the new flat-stage schema (``stages..``); the legacy + ``stage_args..engine_args.`` path no longer applies because + the deploy YAML doesn't nest engine fields under ``engine_args:``. 
+ """ + return modify_stage_config( default_path, updates={ - "async_chunk": True, - "stage_args": { - 0: { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk", - "default_sampling_params.max_tokens": 2048, - }, - 1: { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk" - }, - }, + "stages": {1: {"max_num_batched_tokens": 64}}, }, - deletes={"stage_args": {2: ["custom_process_input_func"]}}, ) - return path -def get_batch_token_config(default_path): - path = modify_stage_config( +def get_async_chunk_config(default_path): + """Flip async_chunk on and bump stage 0 thinker output to 2048 tokens. + + Pipeline registry (qwen3_omni/pipeline.py) already wires + thinker2talker_async_chunk / talker2code2wav_async_chunk on stages 0/1, + so no per-stage processor override is needed. Using only flat-schema + writes so _parse_stage_deploy stays in its flat branch (nested + ``engine_args:`` would drop other overlay fields). + """ + return modify_stage_config( default_path, updates={ - "stage_args": {1: {"engine_args.max_num_batched_tokens": 64}}, + "async_chunk": True, + "stages": {0: {"default_sampling_params.max_tokens": 2048}}, }, ) - return path -# CI stage config for 2*H100-80G GPUs -default_path = str(Path(__file__).parent.parent / "stage_configs" / "qwen3_omni_ci.yaml") +# CI deploy YAML (single file; xpu deltas applied via ``platforms:`` section). +# The overlay explicitly sets ``async_chunk: False``, so ``default`` tests the +# sync path and ``async_chunk`` tests the streaming path with a longer thinker +# output — two distinct scenarios, kept as separate parametrizations. 
+default_path = get_deploy_config_path("ci/qwen3_omni_moe.yaml") -if current_omni_platform.is_xpu(): - default_path = str(Path(__file__).parent.parent / "stage_configs" / "xpu" / "qwen3_omni_ci.yaml") - -# Create parameter combinations for model and stage config test_params = [ - pytest.param(OmniServerParams(model=model, stage_config_path=default_path, use_stage_cli=True), id="default"), pytest.param( - OmniServerParams(model=model, stage_config_path=get_chunk_config(default_path), use_stage_cli=True), + OmniServerParams(model=model, stage_config_path=default_path, use_stage_cli=True), + id="default", + ), + pytest.param( + OmniServerParams( + model=model, + stage_config_path=get_async_chunk_config(default_path), + use_stage_cli=True, + ), id="async_chunk", ), ] diff --git a/tests/e2e/online_serving/test_qwen3_omni_realtime_websocket.py b/tests/e2e/online_serving/test_qwen3_omni_realtime_websocket.py new file mode 100644 index 0000000000..6a7cf1c67e --- /dev/null +++ b/tests/e2e/online_serving/test_qwen3_omni_realtime_websocket.py @@ -0,0 +1,206 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +""" +E2E online tests for Qwen3-Omni /v1/realtime WebSocket (streaming PCM in, audio out). +""" + +from __future__ import annotations + +import asyncio +import base64 +import io +import json +import os +import wave + +import pytest +import websockets + +from tests.conftest import ( + OmniServerParams, + convert_audio_bytes_to_text, + cosine_similarity_text, + generate_synthetic_audio, + modify_stage_config, +) +from tests.utils import get_deploy_config_path, hardware_test + +os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" + +MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct" + +# The new-schema CI overlay bakes in async_chunk: False and covers CUDA/ROCm/XPU +# via its ``platforms:`` section, so one path serves all three. 
+default_stage_config = get_deploy_config_path("ci/qwen3_omni_moe.yaml") + + +def _realtime_stage_config_path() -> str: + """CI omni layout without async_chunk; stage 0 thinker max_tokens=10.""" + return modify_stage_config( + default_stage_config, + updates={"stages": {0: {"default_sampling_params.max_tokens": 10}}}, + ) + + +realtime_server_params = [ + pytest.param( + OmniServerParams( + model=MODEL, + stage_config_path=_realtime_stage_config_path(), + use_stage_cli=True, + ), + id="thinker_max_tokens_10", + ), +] + + +def _pcm16_mono_16k_from_wav_bytes(wav_bytes: bytes) -> bytes: + with wave.open(io.BytesIO(wav_bytes), "rb") as wf: + if wf.getnchannels() != 1: + raise ValueError(f"Expected mono WAV, got {wf.getnchannels()} channels") + if wf.getsampwidth() != 2: + raise ValueError(f"Expected 16-bit PCM, sampwidth={wf.getsampwidth()}") + if wf.getframerate() != 16000: + raise ValueError(f"Expected 16 kHz input for /v1/realtime, got {wf.getframerate()} Hz") + if wf.getcomptype() != "NONE": + raise ValueError(f"Expected uncompressed PCM, comptype={wf.getcomptype()!r}") + return wf.readframes(wf.getnframes()) + + +def _wav_bytes_from_pcm16(pcm: bytes, sample_rate_hz: int) -> bytes: + buf = io.BytesIO() + with wave.open(buf, "wb") as wf: + wf.setnchannels(1) + wf.setsampwidth(2) + wf.setframerate(sample_rate_hz) + wf.writeframes(pcm) + return buf.getvalue() + + +async def _run_realtime_audio_roundtrip( + host: str, + port: int, + model: str, + pcm16: bytes, + *, + chunk_ms: int = 100, +) -> dict: + uri = f"ws://{host}:{port}/v1/realtime" + incremental: list[bytes] = [] + output_sr = 24000 + text_chunks: list[str] = [] + final_text = "" + delta_events = 0 + + bytes_per_ms = 16000 * 2 // 1000 + chunk_bytes = max(bytes_per_ms * chunk_ms, 2) + + async with websockets.connect(uri, max_size=64 * 1024 * 1024) as ws: + await ws.send(json.dumps({"type": "session.update", "model": model})) + await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": False})) + + 
for i in range(0, len(pcm16), chunk_bytes): + chunk = pcm16[i : i + chunk_bytes] + await ws.send( + json.dumps( + { + "type": "input_audio_buffer.append", + "audio": base64.b64encode(chunk).decode("utf-8"), + } + ) + ) + + await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True})) + + while True: + message = await asyncio.wait_for(ws.recv(), timeout=600) + if isinstance(message, bytes): + continue + + event = json.loads(message) + event_type = event.get("type") + + if event_type == "session.created": + continue + + if event_type == "response.audio.delta": + delta_events += 1 + sr = event.get("sample_rate_hz") + if isinstance(sr, int) and sr > 0: + output_sr = sr + audio_b64 = event.get("audio", "") + if audio_b64: + incremental.append(base64.b64decode(audio_b64)) + continue + + if event_type == "transcription.delta": + d = event.get("delta", "") + if d: + text_chunks.append(d) + continue + + if event_type == "transcription.done": + final_text = event.get("text", "") or "".join(text_chunks) + continue + + if event_type == "response.audio.done": + break + + if event_type == "error": + raise AssertionError(f"WebSocket error: {event}") + + raise AssertionError(f"Unexpected WebSocket event: {event}") + + out_pcm = b"".join(incremental) + return { + "output_pcm": out_pcm, + "output_sample_rate": output_sr, + "transcription_text": final_text if final_text else "".join(text_chunks), + "delta_events": delta_events, + } + + +class TestQwen3OmniRealtimeWebSocket: + @pytest.mark.advanced_model + @pytest.mark.omni + @hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2) + @pytest.mark.parametrize("omni_server", realtime_server_params, indirect=True) + def test_streaming_audio_input_pcm_output(self, omni_server) -> None: + """ + Short streamed 16 kHz mono PCM16 input; expect streamed PCM16 audio deltas and + transcription. Verify Whisper(output audio) aligns with model text (same idea + as multimodal omni e2e). 
+ """ + syn = generate_synthetic_audio(10, 1, sample_rate=16000) + wav_bytes = base64.b64decode(syn["base64"]) + pcm16 = _pcm16_mono_16k_from_wav_bytes(wav_bytes) + + result = asyncio.run( + _run_realtime_audio_roundtrip( + omni_server.host, + omni_server.port, + omni_server.model, + pcm16, + chunk_ms=100, + ) + ) + + out_pcm = result["output_pcm"] + assert result["delta_events"] >= 1 + assert out_pcm, "No output PCM from response.audio.delta" + assert len(out_pcm) % 2 == 0 + assert len(out_pcm) >= 4096, "Output audio unexpectedly small" + assert result["output_sample_rate"] > 0 + + final_text = (result["transcription_text"] or "").strip() + assert final_text, "Expected non-empty transcription (model text stream)" + + wav_out = _wav_bytes_from_pcm16(out_pcm, result["output_sample_rate"]) + whisper_text = convert_audio_bytes_to_text(wav_out).strip() + assert whisper_text, "Whisper returned empty string for synthesized output audio" + + sim = cosine_similarity_text(whisper_text.lower(), final_text.lower()) + assert sim > 0.9, ( + f"Output audio transcript should match model text (sim={sim:.3f}): " + f"whisper={whisper_text!r}, model_text={final_text!r}" + ) diff --git a/tests/e2e/online_serving/test_qwen3_tts_base.py b/tests/e2e/online_serving/test_qwen3_tts_base.py index 002f9d9972..c97fdef5bc 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_base.py +++ b/tests/e2e/online_serving/test_qwen3_tts_base.py @@ -12,12 +12,10 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import OmniServerParams -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-0.6B-Base" @@ -25,11 +23,6 @@ REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you." 
-def get_stage_config(name: str = "qwen3_tts.yaml"): - """Get the stage config path from vllm_omni model_executor stage_configs.""" - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) - - def get_prompt(prompt_type="text"): """Text prompt for text-to-audio tests (same as test_qwen3_omni - beijing test case).""" prompts = { @@ -48,7 +41,7 @@ def get_max_batch_size(size_type="few"): pytest.param( OmniServerParams( model=MODEL, - stage_config_path=get_stage_config("qwen3_tts.yaml"), + stage_config_path=get_deploy_config_path("qwen3_tts.yaml"), server_args=["--trust-remote-code", "--disable-log-stats"], ), id="async_chunk", diff --git a/tests/e2e/online_serving/test_qwen3_tts_base_expansion.py b/tests/e2e/online_serving/test_qwen3_tts_base_expansion.py index 3c33485e4f..364865d286 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_base_expansion.py +++ b/tests/e2e/online_serving/test_qwen3_tts_base_expansion.py @@ -12,12 +12,10 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import OmniServerParams -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-0.6B-Base" @@ -25,11 +23,6 @@ REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you." 
-def get_stage_config(name: str = "qwen3_tts.yaml"): - """Get the stage config path from vllm_omni model_executor stage_configs.""" - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) - - def get_prompt(prompt_type="text"): """Text prompt for text-to-audio tests (same as test_qwen3_omni - beijing test case).""" prompts = { @@ -48,16 +41,19 @@ def get_max_batch_size(size_type="few"): pytest.param( OmniServerParams( model=MODEL, - stage_config_path=get_stage_config("qwen3_tts.yaml"), + stage_config_path=get_deploy_config_path("qwen3_tts.yaml"), server_args=["--trust-remote-code", "--disable-log-stats"], ), id="async_chunk", ), + # Synchronous (no async-chunk) variant — ``--no-async-chunk`` alone + # flips the deploy yaml's bool and the pipeline dispatches to the + # end-to-end codec processor. No variant yaml / pipeline needed. pytest.param( OmniServerParams( model=MODEL, - stage_config_path=get_stage_config("qwen3_tts_no_async_chunk.yaml"), - server_args=["--trust-remote-code", "--disable-log-stats"], + stage_config_path=get_deploy_config_path("qwen3_tts.yaml"), + server_args=["--trust-remote-code", "--disable-log-stats", "--no-async-chunk"], ), id="no_async_chunk", ), diff --git a/tests/e2e/online_serving/test_qwen3_tts_batch.py b/tests/e2e/online_serving/test_qwen3_tts_batch.py index 1a453afb72..bf13884997 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_batch.py +++ b/tests/e2e/online_serving/test_qwen3_tts_batch.py @@ -27,14 +27,15 @@ convert_audio_file_to_text, cosine_similarity_text, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice" STAGE_INIT_TIMEOUT_S = 120 -def get_stage_config(name: str = "qwen3_tts.yaml"): - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) +def get_stage_config(name: str = "qwen3_tts.yaml") -> str: + """Resolve 
a deploy config path under vllm_omni/deploy/.""" + return get_deploy_config_path(name) @pytest.fixture(scope="module") diff --git a/tests/e2e/online_serving/test_qwen3_tts_customvoice.py b/tests/e2e/online_serving/test_qwen3_tts_customvoice.py index fb60df725b..d19c652689 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_customvoice.py +++ b/tests/e2e/online_serving/test_qwen3_tts_customvoice.py @@ -12,21 +12,14 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import OmniServerParams -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" -def get_stage_config(name: str = "qwen3_tts.yaml"): - """Get the stage config path from vllm_omni model_executor stage_configs.""" - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) - - def get_prompt(prompt_type="text"): """Text prompt for text-to-audio tests (same as test_qwen3_omni - beijing test case).""" prompts = { @@ -45,7 +38,7 @@ def get_max_batch_size(size_type="few"): pytest.param( OmniServerParams( model=MODEL, - stage_config_path=get_stage_config("qwen3_tts.yaml"), + stage_config_path=get_deploy_config_path("qwen3_tts.yaml"), server_args=["--trust-remote-code", "--disable-log-stats"], ), id="async_chunk", diff --git a/tests/e2e/online_serving/test_qwen3_tts_customvoice_expansion.py b/tests/e2e/online_serving/test_qwen3_tts_customvoice_expansion.py index 03a985896e..4087532d63 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_customvoice_expansion.py +++ b/tests/e2e/online_serving/test_qwen3_tts_customvoice_expansion.py @@ -12,21 +12,14 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" -from pathlib import Path - import pytest from tests.conftest import OmniServerParams -from tests.utils 
import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" -def get_stage_config(name: str = "qwen3_tts.yaml"): - """Get the stage config path from vllm_omni model_executor stage_configs.""" - return str(Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / name) - - def get_prompt(prompt_type="english"): """Text prompt for text-to-audio tests (same as test_qwen3_omni - beijing test case).""" prompts = { @@ -46,16 +39,19 @@ def get_max_batch_size(size_type="few"): pytest.param( OmniServerParams( model=MODEL, - stage_config_path=get_stage_config("qwen3_tts.yaml"), + stage_config_path=get_deploy_config_path("qwen3_tts.yaml"), server_args=["--trust-remote-code", "--disable-log-stats"], ), id="async_chunk", ), + # Synchronous (no async-chunk) variant — ``--no-async-chunk`` alone + # flips the deploy yaml's bool and the pipeline dispatches to the + # end-to-end codec processor. No variant yaml / pipeline needed. 
pytest.param( OmniServerParams( model=MODEL, - stage_config_path=get_stage_config("qwen3_tts_no_async_chunk.yaml"), - server_args=["--trust-remote-code", "--disable-log-stats"], + stage_config_path=get_deploy_config_path("qwen3_tts.yaml"), + server_args=["--trust-remote-code", "--disable-log-stats", "--no-async-chunk"], ), id="no_async_chunk", ), diff --git a/tests/e2e/online_serving/test_qwen3_tts_speaker_embedding.py b/tests/e2e/online_serving/test_qwen3_tts_speaker_embedding.py index 8c1c860819..d4212bb5b1 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_speaker_embedding.py +++ b/tests/e2e/online_serving/test_qwen3_tts_speaker_embedding.py @@ -13,13 +13,12 @@ os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" import struct -from pathlib import Path import httpx import pytest from tests.conftest import OmniServer -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test MODEL_BASE = "Qwen/Qwen3-TTS-12Hz-0.6B-Base" MODEL_BASE_1_7B = "Qwen/Qwen3-TTS-12Hz-1.7B-Base" @@ -37,10 +36,8 @@ MAX_NEW_TOKENS = 256 -def get_stage_config(): - return str( - Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / "qwen3_tts.yaml" - ) +def get_stage_config() -> str: + return get_deploy_config_path("qwen3_tts.yaml") def _server_args(): diff --git a/tests/e2e/online_serving/test_qwen3_tts_websocket.py b/tests/e2e/online_serving/test_qwen3_tts_websocket.py index 849d1c1158..dddba6e58a 100644 --- a/tests/e2e/online_serving/test_qwen3_tts_websocket.py +++ b/tests/e2e/online_serving/test_qwen3_tts_websocket.py @@ -7,13 +7,12 @@ import asyncio import json import os -from pathlib import Path import pytest import websockets from tests.conftest import OmniServer -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0" @@ -23,9 +22,7 @@ def get_stage_config() -> 
str: - return str( - Path(__file__).parent.parent.parent.parent / "vllm_omni" / "model_executor" / "stage_configs" / "qwen3_tts.yaml" - ) + return get_deploy_config_path("qwen3_tts.yaml") @pytest.fixture(scope="module") diff --git a/tests/e2e/stage_configs/dynin_omni_ci.yaml b/tests/e2e/stage_configs/dynin_omni_ci.yaml index 0240007510..525b7d888c 100644 --- a/tests/e2e/stage_configs/dynin_omni_ci.yaml +++ b/tests/e2e/stage_configs/dynin_omni_ci.yaml @@ -72,13 +72,8 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 edges: - from: 0 to: 1 - window_size: -1 - from: 1 to: 2 - window_size: -1 diff --git a/tests/e2e/stage_configs/qwen2_5_omni_ci.yaml b/tests/e2e/stage_configs/qwen2_5_omni_ci.yaml deleted file mode 100644 index a7c637d486..0000000000 --- a/tests/e2e/stage_configs/qwen2_5_omni_ci.yaml +++ /dev/null @@ -1,109 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# The following config has been verified on 2x 24GB GPU (L4/RTX3090/RTX4090). -# This config is optimized for CI e2e tests. 
-stage_args: - - stage_id: 0 - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.9 - skip_mm_profiling: true - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - mm_processor_cache_gb: 0 - load_format: dummy - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 128 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.4 - skip_mm_profiling: true - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - load_format: dummy - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 4096 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - stage_id: 2 - runtime: - process: true - devices: "2" # Example: use a different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: 
vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.5 #increase the gpu memory utilization to enable the test on H800 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - max_num_batched_tokens: 8192 - max_model_len: 8192 - load_format: dummy - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 8192 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/tests/e2e/stage_configs/qwen2_5_omni_thinker_ci.yaml b/tests/e2e/stage_configs/qwen2_5_omni_thinker_ci.yaml deleted file mode 100644 index 9401382847..0000000000 --- a/tests/e2e/stage_configs/qwen2_5_omni_thinker_ci.yaml +++ /dev/null @@ -1,31 +0,0 @@ -stage_args: - - stage_id: 0 - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.9 - skip_mm_profiling: true - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - mm_processor_cache_gb: 0 - is_comprehension: 
true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 128 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/tests/e2e/stage_configs/qwen3_omni_ci.yaml b/tests/e2e/stage_configs/qwen3_omni_ci.yaml deleted file mode 100644 index 08dd49de95..0000000000 --- a/tests/e2e/stage_configs/qwen3_omni_ci.yaml +++ /dev/null @@ -1,102 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 16-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. -stage_args: -- stage_id: 0 - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 5 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - max_num_batched_tokens: 32768 - max_model_len: 32768 - enable_prefix_caching: false - mm_processor_cache_gb: 0 - hf_config_name: thinker_config - tensor_parallel_size: 1 - load_format: dummy - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 150 - seed: 42 - ignore_eos: False - detokenize: True - repetition_penalty: 1.05 - -- stage_id: 1 - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 5 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.5 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - 
enable_prefix_caching: false - max_num_batched_tokens: 32768 - max_model_len: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - load_format: dummy - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 1000 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - -- stage_id: 2 - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 5 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 100000 - hf_config_name: thinker_config - async_scheduling: false - load_format: dummy - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2000 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/tests/e2e/stage_configs/rocm/qwen2_5_omni_ci.yaml b/tests/e2e/stage_configs/rocm/qwen2_5_omni_ci.yaml deleted file mode 100644 index 0c756ce56b..0000000000 --- a/tests/e2e/stage_configs/rocm/qwen2_5_omni_ci.yaml +++ /dev/null @@ -1,106 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# The following config has been verified on 2x 24GB GPU (L4/RTX3090/RTX4090). -# This config is optimized for CI e2e tests. 
-stage_args: - - stage_id: 0 - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.8 - skip_mm_profiling: true - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - mm_processor_cache_gb: 0 - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 128 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.8 - skip_mm_profiling: true - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 4096 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - stage_id: 2 - runtime: - process: true - devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: 
vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.15 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - max_num_batched_tokens: 4096 - max_model_len: 4096 - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 4096 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/tests/e2e/stage_configs/rocm/qwen3_omni_ci.yaml b/tests/e2e/stage_configs/rocm/qwen3_omni_ci.yaml deleted file mode 100644 index ac2b1fbd71..0000000000 --- a/tests/e2e/stage_configs/rocm/qwen3_omni_ci.yaml +++ /dev/null @@ -1,100 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 16-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. 
-stage_args: - - stage_id: 0 - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - mm_processor_cache_gb: 0 - hf_config_name: thinker_config - tensor_parallel_size: 1 - load_format: dummy - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 100 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - # tensor_parallel_size: 2 - enable_prefix_caching: false - distributed_executor_backend: "mp" - hf_config_name: talker_config - load_format: dummy - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 100 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: 
false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - load_format: dummy - async_scheduling: false - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 200 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/tests/e2e/stage_configs/xpu/qwen2_5_omni_ci.yaml b/tests/e2e/stage_configs/xpu/qwen2_5_omni_ci.yaml deleted file mode 100644 index 14ef3c3438..0000000000 --- a/tests/e2e/stage_configs/xpu/qwen2_5_omni_ci.yaml +++ /dev/null @@ -1,108 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# The following config is verified with 2 * Intel Arc Pro B60 XPU. -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage - engine_args: - model_stage: thinker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.9 # thinker weight is around 16.74GB for Qwen2.5-Omni-7B - skip_mm_profiling: true - enforce_eager: true - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - mm_processor_cache_gb: 0 - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 128 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "1" - 
engine_args: - model_stage: talker - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - max_model_len: 16384 - max_num_batched_tokens: 16384 - max_num_seqs: 1 - gpu_memory_utilization: 0.5 # talker weight is 6.03GB for Qwen2.5-Omni-7B - skip_mm_profiling: true - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 4096 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "2" - engine_args: - max_num_seqs: 1 - model_stage: code2wav - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.3 # code2wav weight is around 1.46GB for Qwen2.5-Omni-7B - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff 
--git a/tests/e2e/stage_configs/xpu/qwen3_omni_ci.yaml b/tests/e2e/stage_configs/xpu/qwen3_omni_ci.yaml deleted file mode 100644 index c4586e0664..0000000000 --- a/tests/e2e/stage_configs/xpu/qwen3_omni_ci.yaml +++ /dev/null @@ -1,109 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config is verified with 8 * Intel Arc Pro B60 XPU. -stage_args: -- stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0,1,2,3" - engine_args: - max_num_seqs: 1 - model_stage: thinker - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.85 # thinker weight is around 61.08GB for Qwen3-Omni-30B-A3B-Instruct - skip_mm_profiling: true - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - max_num_batched_tokens: 4096 - max_model_len: 4096 - enable_prefix_caching: false - hf_config_name: thinker_config - tensor_parallel_size: 4 - max_cudagraph_capture_size: 0 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 100 - seed: 42 - ignore_eos: False - detokenize: True - repetition_penalty: 1.05 - -- stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "4" - engine_args: - max_num_seqs: 1 - model_stage: talker - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 # talker weight is around 8.5GB for Qwen3-Omni-30B-A3B-Instruct - skip_mm_profiling: true - enforce_eager: true - 
trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 4096 - max_model_len: 4096 - distributed_executor_backend: "mp" - hf_config_name: talker_config - max_cudagraph_capture_size: 0 - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - -- stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "5" - engine_args: - max_num_seqs: 1 - model_stage: code2wav - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.3 # code2wav weight is around 0.4GB for Qwen3-Omni-30B-A3B-Instruct - skip_mm_profiling: true - distributed_executor_backend: "mp" - max_num_batched_tokens: 100000 - hf_config_name: thinker_config - async_scheduling: false - max_cudagraph_capture_size: 0 - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2000 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/tests/engine/test_arg_utils.py b/tests/engine/test_arg_utils.py index 565c83c1ad..4d69f24c56 100644 --- a/tests/engine/test_arg_utils.py +++ b/tests/engine/test_arg_utils.py @@ -39,21 +39,28 @@ def test_default_stage_id_is_concrete_int(): assert cfg.stage_id == 0 -def test_multimodal_kwarg_overrides(): +def 
test_multimodal_kwarg_overrides(mocker): """Ensure that overrides in the multimodal config are preserved.""" - # Get a different value than the default for a multimodal field sig = inspect.signature(OmniEngineArgs) default_mm_cache = sig.parameters["mm_processor_cache_gb"].default override_val = default_mm_cache + 1 - # NOTE: This needs to be a model that resolves to supports_multimodal=True - # in vLLM, otherwise we won't have an MM config + fake_model_config = SimpleNamespace( + multimodal_config=SimpleNamespace(mm_processor_cache_gb=override_val), + ) + + def _fake_parent_create_model_config(self): + assert self.mm_processor_cache_gb == override_val + return fake_model_config + + mocker.patch.object(EngineArgs, "create_model_config", _fake_parent_create_model_config) + mocker.patch.object(OmniModelConfig, "from_vllm_model_config", side_effect=lambda model_config, **_: model_config) + cfg = OmniEngineArgs( model="Qwen/Qwen2-VL-2B-Instruct", mm_processor_cache_gb=override_val, ).create_model_config() - # Ensure that the override was applied correctly assert cfg.multimodal_config is not None assert cfg.multimodal_config.mm_processor_cache_gb == override_val diff --git a/tests/engine/test_async_omni_engine_abort.py b/tests/engine/test_async_omni_engine_abort.py index 34fdf45ea2..e7f2bb679f 100644 --- a/tests/engine/test_async_omni_engine_abort.py +++ b/tests/engine/test_async_omni_engine_abort.py @@ -2,20 +2,24 @@ import os import sys from contextlib import ExitStack -from pathlib import Path import pytest from vllm import SamplingParams from vllm.inputs import PromptType -from tests.utils import hardware_test +# Side-effect import: registers QWEN2_5_OMNI_THINKER_ONLY_PIPELINE in the +# pipeline registry so the materialized deploy overlay below can select it +# via its top-level ``pipeline:`` field. 
+import vllm_omni.model_executor.models.qwen2_5_omni.pipeline # noqa: F401, E402 +from tests.utils import get_deploy_config_path, hardware_test from vllm_omni.entrypoints.async_omni import AsyncOmni os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" SEED = 42 -stage_config = str(Path(__file__).parent.parent / "e2e" / "stage_configs" / "qwen2_5_omni_thinker_ci.yaml") +# Single-stage thinker-only deploy, materialized from tests.utils._CI_OVERLAYS. +stage_config = get_deploy_config_path("ci/qwen2_5_omni_thinker_only.yaml") model = "Qwen/Qwen2.5-Omni-7B" diff --git a/tests/engine/test_orchestrator.py b/tests/engine/test_orchestrator.py index 7bf2eccf7f..0b549f58e9 100644 --- a/tests/engine/test_orchestrator.py +++ b/tests/engine/test_orchestrator.py @@ -70,7 +70,7 @@ def get_diffusion_output_nowait(self): def set_engine_outputs(self, outputs) -> None: return None - def process_engine_inputs(self, stage_list, prompt=None): + def process_engine_inputs(self, stage_list, prompt=None, streaming_context=None): return list(self.next_inputs) async def abort_requests_async(self, request_ids: list[str]) -> None: diff --git a/tests/entrypoints/openai_api/test_image_server.py b/tests/entrypoints/openai_api/test_image_server.py index 00282d5c77..607b3eaa81 100644 --- a/tests/entrypoints/openai_api/test_image_server.py +++ b/tests/entrypoints/openai_api/test_image_server.py @@ -204,14 +204,19 @@ def async_omni_test_client(): class FakeAsyncOmniClass(AsyncOmni): def __init__(self): - self.stage_configs = [ + stage_configs = [ SimpleNamespace(stage_type="llm", is_comprehension=True), SimpleNamespace(stage_type="diffusion", is_comprehension=False), ] - self.default_sampling_params_list = [ + default_sampling_params_list = [ SamplingParams(temperature=0.1), OmniDiffusionSamplingParams(), ] + self.engine = SimpleNamespace( + stage_configs=stage_configs, + default_sampling_params_list=default_sampling_params_list, + ) + self.default_sampling_params_list = default_sampling_params_list 
self.captured_sampling_params_list = None self.captured_prompt = None self._images = [Image.new("RGB", (64, 64), color="green")] @@ -263,14 +268,19 @@ def async_omni_rgba_test_client(): class FakeAsyncOmniClass(AsyncOmni): def __init__(self): - self.stage_configs = [ + stage_configs = [ SimpleNamespace(stage_type="llm", is_comprehension=True), SimpleNamespace(stage_type="diffusion", is_comprehension=False), ] - self.default_sampling_params_list = [ + default_sampling_params_list = [ SamplingParams(temperature=0.1), OmniDiffusionSamplingParams(), ] + self.engine = SimpleNamespace( + stage_configs=stage_configs, + default_sampling_params_list=default_sampling_params_list, + ) + self.default_sampling_params_list = default_sampling_params_list self.captured_sampling_params_list = None self.captured_prompt = None self._images = [Image.new("RGBA", (64, 64), color=(0, 255, 0, 128))] @@ -322,14 +332,19 @@ def async_omni_stage_configs_only_client(): class FakeAsyncOmniClass(AsyncOmni): def __init__(self): - self.stage_configs = [ + stage_configs = [ SimpleNamespace(stage_type="llm", is_comprehension=True), SimpleNamespace(stage_type="diffusion", is_comprehension=False), ] - self.default_sampling_params_list = [ + default_sampling_params_list = [ SamplingParams(temperature=0.1), OmniDiffusionSamplingParams(), ] + self.engine = SimpleNamespace( + stage_configs=stage_configs, + default_sampling_params_list=default_sampling_params_list, + ) + self.default_sampling_params_list = default_sampling_params_list self.captured_sampling_params_list = None self.captured_prompt = None self._images = [Image.new("RGB", (64, 64), color="green")] @@ -836,6 +851,19 @@ def test_model_field_omitted_works(test_client): assert response.status_code == 200 +def test_generate_images_rejects_model_mismatch(test_client): + response = test_client.post( + "/v1/images/generations", + json={ + "prompt": "test", + "model": "Qwen/Qwen-Image-2512", + "size": "1024x1024", + }, + ) + assert 
response.status_code == 400 + assert "model mismatch" in response.json()["detail"].lower() + + def make_test_image_bytes(size=(64, 64)) -> bytes: img = Image.new( "RGB", @@ -939,6 +967,20 @@ def test_image_edit_rejects_multiple_images_when_model_does_not_support_them(asy assert engine.captured_prompt is None +def test_image_edit_rejects_model_mismatch(test_client): + img_bytes = make_test_image_bytes((16, 16)) + response = test_client.post( + "/v1/images/edits", + files=[("image", img_bytes)], + data={ + "prompt": "edit me", + "model": "Qwen/Qwen-Image-Edit", + }, + ) + assert response.status_code == 400 + assert "model mismatch" in response.json()["detail"].lower() + + def test_image_edit_rejects_too_many_images_for_qwen_image_edit_2511(async_omni_test_client): engine = async_omni_test_client.app.state.engine_client engine.get_diffusion_od_config = lambda: SimpleNamespace( diff --git a/tests/entrypoints/openai_api/test_video_server.py b/tests/entrypoints/openai_api/test_video_server.py index 7a395bab5b..6157d82313 100644 --- a/tests/entrypoints/openai_api/test_video_server.py +++ b/tests/entrypoints/openai_api/test_video_server.py @@ -564,6 +564,18 @@ def test_missing_prompt_returns_422(test_client): assert response.status_code == 422 +def test_video_generation_rejects_model_mismatch(test_client): + response = test_client.post( + "/v1/videos", + data={ + "prompt": "bad model", + "model": "Wan-AI/Wan2.1-T2V-14B-Diffusers", + }, + ) + assert response.status_code == 400 + assert "model mismatch" in response.json()["detail"].lower() + + def test_invalid_size_parse_returns_422(test_client): response = test_client.post( "/v1/videos", diff --git a/tests/entrypoints/test_realtime_connection_helpers.py b/tests/entrypoints/test_realtime_connection_helpers.py new file mode 100644 index 0000000000..e795aa92d0 --- /dev/null +++ b/tests/entrypoints/test_realtime_connection_helpers.py @@ -0,0 +1,86 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright 
contributors to the vLLM project +"""Unit tests for realtime streaming helpers (PR #2581 /v1/realtime path).""" + +from __future__ import annotations + +import base64 + +import numpy as np +import pytest +import torch +from vllm.sampling_params import RequestOutputKind, SamplingParams + +from vllm_omni.entrypoints.async_omni import AsyncOmni +from vllm_omni.entrypoints.openai.realtime_connection import RealtimeConnection + +pytestmark = [pytest.mark.core_model, pytest.mark.cpu] + + +@pytest.fixture +def realtime_conn() -> RealtimeConnection: + return RealtimeConnection.__new__(RealtimeConnection) + + +class TestRealtimeConnectionTensorAndPcm: + def test_tensor_to_numpy_none(self) -> None: + assert RealtimeConnection._tensor_to_numpy(None) is None + + def test_tensor_to_numpy_1d_numpy(self) -> None: + arr = np.array([1.0, 2.0], dtype=np.float64) + out = RealtimeConnection._tensor_to_numpy(arr) + assert out is not None + assert out.dtype == np.float32 + assert out.shape == (2,) + + def test_tensor_to_numpy_2d_numpy_flattened(self) -> None: + arr = np.array([[0.5], [-0.5]], dtype=np.float32) + out = RealtimeConnection._tensor_to_numpy(arr) + assert out is not None + assert out.shape == (2,) + + def test_tensor_to_numpy_torch(self) -> None: + t = torch.tensor([[0.25, -0.25]], dtype=torch.float32) + out = RealtimeConnection._tensor_to_numpy(t) + assert out is not None + assert out.shape == (2,) + np.testing.assert_allclose(out, [0.25, -0.25], rtol=1e-5) + + def test_pcm16_b64_roundtrip(self) -> None: + audio = np.array([0.0, 1.0, -1.0], dtype=np.float32) + b64 = RealtimeConnection._pcm16_b64(audio) + raw = base64.b64decode(b64) + assert len(raw) == 6 + pcm = np.frombuffer(raw, dtype=np.int16) + assert pcm[0] == 0 + assert pcm[1] == 32767 + assert pcm[2] == -32767 + + +class TestAsyncOmniStreamingParamsValidation: + def test_accepts_streaming_friendly_params(self) -> None: + p = SamplingParams( + n=1, + stop=[], + output_kind=RequestOutputKind.DELTA, + ) + 
AsyncOmni._validate_streaming_input_sampling_params(p) + + def test_rejects_non_sampling_params(self) -> None: + with pytest.raises(ValueError, match="Input streaming"): + AsyncOmni._validate_streaming_input_sampling_params(object()) # type: ignore[arg-type] + + def test_rejects_n_greater_than_one(self) -> None: + p = SamplingParams(n=2, stop=[], output_kind=RequestOutputKind.DELTA) + with pytest.raises(ValueError, match="Input streaming"): + AsyncOmni._validate_streaming_input_sampling_params(p) + + def test_rejects_final_only(self) -> None: + p = SamplingParams(n=1, stop=[], output_kind=RequestOutputKind.FINAL_ONLY) + with pytest.raises(ValueError, match="Input streaming"): + AsyncOmni._validate_streaming_input_sampling_params(p) + + def test_rejects_stop_strings(self) -> None: + p = SamplingParams(n=1, stop=["\n"], output_kind=RequestOutputKind.DELTA) + with pytest.raises(ValueError, match="Input streaming"): + AsyncOmni._validate_streaming_input_sampling_params(p) diff --git a/tests/examples/online_serving/test_qwen2_5_omni.py b/tests/examples/online_serving/test_qwen2_5_omni.py index a78ccf5924..2813b2fda8 100644 --- a/tests/examples/online_serving/test_qwen2_5_omni.py +++ b/tests/examples/online_serving/test_qwen2_5_omni.py @@ -5,8 +5,6 @@ import os -from vllm_omni.platforms import current_omni_platform - os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" from pathlib import Path @@ -19,19 +17,15 @@ run_cmd, strip_trailing_audio_saved_line, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test pytestmark = [pytest.mark.advanced_model, pytest.mark.example] models = ["Qwen/Qwen2.5-Omni-7B"] - -stage_configs = [str(Path(__file__).parent.parent.parent / "e2e" / "stage_configs" / "qwen2_5_omni_ci.yaml")] - -if current_omni_platform.is_xpu(): - stage_configs = [ - str(Path(__file__).parent.parent.parent / "e2e" / "stage_configs" / "xpu" / "qwen2_5_omni_ci.yaml") - ] +# Single CI deploy YAML; rocm/xpu deltas are 
picked automatically via the +# platforms: section in vllm_omni/deploy/ci/qwen2_5_omni.yaml. +stage_configs = [get_deploy_config_path("ci/qwen2_5_omni.yaml")] example_dir = str(Path(__file__).parent.parent.parent.parent / "examples" / "online_serving") # Create parameter combinations for model and stage config diff --git a/tests/examples/online_serving/test_qwen3_omni.py b/tests/examples/online_serving/test_qwen3_omni.py index 65f99d7bf2..e9ee2763bb 100644 --- a/tests/examples/online_serving/test_qwen3_omni.py +++ b/tests/examples/online_serving/test_qwen3_omni.py @@ -5,8 +5,6 @@ import os -from vllm_omni.platforms import current_omni_platform - os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" from pathlib import Path @@ -19,17 +17,14 @@ run_cmd, strip_trailing_audio_saved_line, ) -from tests.utils import hardware_test +from tests.utils import get_deploy_config_path, hardware_test pytestmark = [pytest.mark.advanced_model, pytest.mark.example] models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"] -stage_configs = [str(Path(__file__).parent.parent.parent / "e2e" / "stage_configs" / "qwen3_omni_ci.yaml")] - -if current_omni_platform.is_xpu(): - stage_configs = [str(Path(__file__).parent.parent.parent / "e2e" / "stage_configs" / "xpu" / "qwen3_omni_ci.yaml")] +stage_configs = [get_deploy_config_path("ci/qwen3_omni_moe.yaml")] example_dir = str(Path(__file__).parent.parent.parent.parent / "examples" / "online_serving") diff --git a/tests/model_executor/models/voxcpm2/__init__.py b/tests/model_executor/models/voxcpm2/__init__.py new file mode 100644 index 0000000000..208f01a7cb --- /dev/null +++ b/tests/model_executor/models/voxcpm2/__init__.py @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project diff --git a/tests/model_executor/models/voxcpm2/test_talker_state_eviction.py b/tests/model_executor/models/voxcpm2/test_talker_state_eviction.py new file mode 100644 index 0000000000..5d8a35636b --- /dev/null 
+++ b/tests/model_executor/models/voxcpm2/test_talker_state_eviction.py @@ -0,0 +1,121 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Regression tests for VoxCPM2 talker per-request state lifecycle.""" + +from __future__ import annotations + +import pytest + +torch = pytest.importorskip("torch") +pytest.importorskip("librosa") + +from vllm_omni.model_executor.models.voxcpm2.voxcpm2_talker import ( # noqa: E402 + VoxCPM2TalkerForConditionalGeneration, + _RequestState, +) + + +def _make_bare_talker() -> VoxCPM2TalkerForConditionalGeneration: + talker = VoxCPM2TalkerForConditionalGeneration.__new__(VoxCPM2TalkerForConditionalGeneration) + talker._active_states = {} + talker._current_request_id = None + talker._pending_requests = [] + talker._results_queue = [] + talker._audio_queue = [] + talker._deferred_cleanup_ids = set() + talker._max_batch_size = 4 + talker._active_state_warn_threshold = 512 + talker._active_state_warned = False + return talker + + +def _seed_cached_decode(talker, req_id: str) -> _RequestState: + state = _RequestState(request_id=req_id) + state.prefill_completed = True + state.decode_step_count = 5 + talker._active_states[req_id] = state + return state + + +class TestStateEvictionContract: + def test_pending_requests_is_not_used_for_eviction(self) -> None: + talker = _make_bare_talker() + + cached_ids = [f"req-{i}" for i in range(4)] + for rid in cached_ids: + _seed_cached_decode(talker, rid) + + walked_so_far = ["req-new", cached_ids[0], cached_ids[1]] + talker._pending_requests = [(rid, False, None, 0) for rid in walked_so_far] + + for rid in cached_ids: + assert rid in talker._active_states + assert talker._active_states[rid].prefill_completed is True + + def test_on_requests_finished_defers_cleanup(self) -> None: + talker = _make_bare_talker() + _seed_cached_decode(talker, "req-A") + _seed_cached_decode(talker, "req-B") + + talker.on_requests_finished({"req-A"}) + + assert 
"req-A" in talker._active_states + assert "req-A" in talker._deferred_cleanup_ids + + def test_flush_deferred_cleanup_removes_only_finished(self) -> None: + talker = _make_bare_talker() + _seed_cached_decode(talker, "req-A") + _seed_cached_decode(talker, "req-B") + talker.on_requests_finished(["req-A"]) + + talker._flush_deferred_cleanup() + + assert "req-A" not in talker._active_states + assert "req-B" in talker._active_states + assert talker._deferred_cleanup_ids == set() + + def test_current_request_id_cleared_when_matching(self) -> None: + talker = _make_bare_talker() + _seed_cached_decode(talker, "req-A") + talker._current_request_id = "req-A" + + talker.on_requests_finished({"req-A"}) + talker._flush_deferred_cleanup() + + assert talker._current_request_id is None + + def test_current_request_id_preserved_when_not_finished(self) -> None: + talker = _make_bare_talker() + _seed_cached_decode(talker, "req-A") + _seed_cached_decode(talker, "req-B") + talker._current_request_id = "req-B" + + talker.on_requests_finished({"req-A"}) + talker._flush_deferred_cleanup() + + assert talker._current_request_id == "req-B" + + +class TestLeakWarnGuard: + def test_warn_fires_once_over_threshold(self, monkeypatch) -> None: + from vllm_omni.model_executor.models.voxcpm2 import voxcpm2_talker as tk + + calls: list[str] = [] + + def _capture(msg, *args, **kwargs): + calls.append(msg % args if args else msg) + + monkeypatch.setattr(tk.logger, "warning", _capture) + + talker = _make_bare_talker() + talker._active_state_warn_threshold = 3 + + for i in range(4): + talker._active_states[f"seed-{i}"] = _RequestState(request_id=f"seed-{i}") + + talker._get_or_create_state("new-1") + talker._get_or_create_state("new-2") + + leak_warnings = [m for m in calls if "cleanup path leak" in m] + assert len(leak_warnings) == 1 + assert talker._active_state_warned is True diff --git a/tests/model_executor/stage_input_processors/test_qwen3_omni_streaming_helpers.py 
b/tests/model_executor/stage_input_processors/test_qwen3_omni_streaming_helpers.py new file mode 100644 index 0000000000..18972c91d5 --- /dev/null +++ b/tests/model_executor/stage_input_processors/test_qwen3_omni_streaming_helpers.py @@ -0,0 +1,81 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Unit tests for Qwen3-Omni streaming thinker→talker / talker→codec helpers (PR #2581).""" + +from __future__ import annotations + +from types import SimpleNamespace + +import pytest + +import vllm_omni.model_executor.stage_input_processors.qwen3_omni as q3 + +pytestmark = [pytest.mark.core_model, pytest.mark.cpu] + + +@pytest.fixture(autouse=True) +def _streaming_context() -> SimpleNamespace: + return SimpleNamespace(bridge_states={}) + + +def test_get_streaming_talker_tokens_first_segment(_streaming_context: SimpleNamespace) -> None: + inc_p, inc_o, merged, thinker_in = q3._get_streaming_talker_tokens( + "r1", + [1, 2], + [10, 11], + streaming_context=_streaming_context, + ) + assert inc_p == [1, 2] + assert inc_o == [10, 11] + assert merged == [1, 2, 10, 11] + assert thinker_in == [1, 2] + + +def test_get_streaming_talker_tokens_second_segment_accumulates(_streaming_context: SimpleNamespace) -> None: + q3._get_streaming_talker_tokens("r2", [1, 2], [10, 11], streaming_context=_streaming_context) + inc_p, inc_o, merged, thinker_in = q3._get_streaming_talker_tokens( + "r2", + [1, 2, 3, 4], + [10, 11, 12, 13], + streaming_context=_streaming_context, + ) + assert inc_p == [3, 4] + assert inc_o == [12, 13] + assert merged == [1, 2, 10, 3, 4, 12, 13] + assert thinker_in == [1, 2, 10, 3, 4] + + +def test_get_streaming_talker_tokens_new_prompt_len_snapshot_truncates( + _streaming_context: SimpleNamespace, +) -> None: + inc_p, inc_o, merged, thinker_in = q3._get_streaming_talker_tokens( + "r3", + [1, 2, 3, 4, 5, 6], + [10], + new_prompt_len_snapshot=2, + streaming_context=_streaming_context, + ) + assert inc_p == [1, 
2, 3, 4] + assert inc_o == [10] + assert merged == [1, 2, 3, 4, 10] + assert thinker_in == [1, 2, 3, 4] + + +def test_get_streaming_talker_tokens_clear_state(_streaming_context: SimpleNamespace) -> None: + q3._get_streaming_talker_tokens("r4", [1], [2], streaming_context=_streaming_context, clear_state=True) + state = q3._get_qwen3_streaming_state("r4", _streaming_context).thinker2talker + assert state.last_prompt_len == 0 + assert state.last_output_len == 0 + assert state.merged_sequences == [] + + +def test_get_streaming_codec_delta_len_increments_and_finishes(_streaming_context: SimpleNamespace) -> None: + d1 = q3._get_streaming_codec_delta_len(5, "c1", SimpleNamespace(finished=False), _streaming_context) + assert d1 == 5 + d2 = q3._get_streaming_codec_delta_len(8, "c1", SimpleNamespace(finished=False), _streaming_context) + assert d2 == 2 + # After d2, stored cursor is cur_seq_len + 1 == 9; next delta uses new cur_seq_len - 9. + d3 = q3._get_streaming_codec_delta_len(10, "c1", SimpleNamespace(finished=True), _streaming_context) + assert d3 == 1 + state = q3._get_qwen3_streaming_state("c1", _streaming_context) + assert state.talker2code2wav_last_seq_len == 0 diff --git a/tests/test_arg_utils.py b/tests/test_arg_utils.py new file mode 100644 index 0000000000..dab5ed6878 --- /dev/null +++ b/tests/test_arg_utils.py @@ -0,0 +1,353 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Tests for vllm_omni.engine.arg_utils — invariants that must +hold for the orchestrator/engine/server CLI flag partition.""" + +from __future__ import annotations + +import logging +from dataclasses import dataclass, fields + +import pytest + +from vllm_omni.engine.arg_utils import ( + SHARED_FIELDS, + derive_server_dests_from_vllm_parser, + internal_blacklist_keys, + orchestrator_args_from_argparse, + orchestrator_field_names, + split_kwargs, +) + +# --------------------------------------------------------------------------- 
+# Fake engine class for unit testing — avoids pulling in the full vllm +# EngineArgs and its heavy __post_init__ at test time. +# --------------------------------------------------------------------------- + + +@dataclass +class _FakeEngineArgs: + """Stand-in for OmniEngineArgs with a representative subset of fields.""" + + model: str = "" + stage_id: int = 0 + max_num_seqs: int = 64 + gpu_memory_utilization: float = 0.9 + async_chunk: bool = False # also in OrchestratorArgs → shared + log_stats: bool = False # also in OrchestratorArgs → shared + stage_configs_path: str | None = None + + +# ============================================================================ +# Invariant 1 — OrchestratorArgs and engine must not ambiguously overlap. +# ============================================================================ + + +def test_no_ambiguous_overlap_with_fake_engine(): + """OrchestratorArgs ∩ engine fields must be ⊆ SHARED_FIELDS.""" + orch = orchestrator_field_names() + engine = {f.name for f in fields(_FakeEngineArgs)} + overlap = orch & engine + unexpected = overlap - SHARED_FIELDS + assert not unexpected, ( + f"Fields declared in both OrchestratorArgs and the engine class " + f"but not in SHARED_FIELDS: {sorted(unexpected)}. These cause " + f"double-routing — either remove the duplicate declaration or add " + f"to SHARED_FIELDS if sharing is intentional." + ) + + +def test_no_ambiguous_overlap_with_real_engine(): + """Same check, but against the real OmniEngineArgs.""" + try: + from vllm_omni.engine.arg_utils import OmniEngineArgs + except Exception as exc: + pytest.skip(f"OmniEngineArgs not importable: {exc}") + + orch = orchestrator_field_names() + engine = {f.name for f in fields(OmniEngineArgs)} + overlap = orch & engine + unexpected = overlap - SHARED_FIELDS + assert not unexpected, ( + f"Real OmniEngineArgs has ambiguous overlap with OrchestratorArgs: " + f"{sorted(unexpected)}. Update SHARED_FIELDS or remove duplication." 
+ ) + + +# ============================================================================ +# Invariant 2 — split_kwargs partitions correctly. +# ============================================================================ + + +def test_split_orchestrator_only(): + """Pure orchestrator fields go to OrchestratorArgs, not engine_kwargs.""" + raw = {"stage_init_timeout": 500, "worker_backend": "ray"} + orch, engine = split_kwargs(raw, engine_cls=_FakeEngineArgs) + assert orch.stage_init_timeout == 500 + assert orch.worker_backend == "ray" + assert "stage_init_timeout" not in engine + assert "worker_backend" not in engine + + +def test_split_engine_only(): + """Pure engine fields go to engine_kwargs, not OrchestratorArgs.""" + raw = {"max_num_seqs": 128, "gpu_memory_utilization": 0.85} + orch, engine = split_kwargs(raw, engine_cls=_FakeEngineArgs) + assert engine["max_num_seqs"] == 128 + assert engine["gpu_memory_utilization"] == 0.85 + # These fields don't exist on OrchestratorArgs at all. + + +def test_split_shared_fields_go_to_both(): + """Fields in SHARED_FIELDS are copied to both buckets.""" + raw = {"model": "Qwen/Qwen2.5-Omni-7B", "log_stats": True} + orch, engine = split_kwargs(raw, engine_cls=_FakeEngineArgs) + assert orch.log_stats is True + assert engine["model"] == "Qwen/Qwen2.5-Omni-7B" + assert engine["log_stats"] is True + + +def test_split_drops_unclassified(): + """Unclassified fields (uvicorn/server) are dropped silently.""" + raw = { + "max_num_seqs": 64, # engine + "host": "0.0.0.0", # unclassified (server) + "port": 8091, # unclassified (server) + "ssl_keyfile": "key.pem", # unclassified (server) + } + orch, engine = split_kwargs(raw, engine_cls=_FakeEngineArgs) + assert engine == {"max_num_seqs": 64} + assert "host" not in engine + assert "port" not in engine + assert "ssl_keyfile" not in engine + + +def test_split_mixed_real_world(): + """End-to-end: raw CLI kwargs with all three classes present.""" + raw = { + # orchestrator + "stage_init_timeout": 
400, + "deploy_config": "/tmp/deploy.yaml", + "worker_backend": "multi_process", + "async_chunk": True, + # engine + "max_num_seqs": 32, + "gpu_memory_utilization": 0.8, + # shared + "model": "Qwen/Qwen3-Omni", + "log_stats": False, + # server / unclassified + "host": "0.0.0.0", + "port": 8091, + "api_key": "secret", + # None values + "ray_address": None, + } + orch, engine = split_kwargs(raw, engine_cls=_FakeEngineArgs) + + # Orchestrator side + assert orch.stage_init_timeout == 400 + assert orch.deploy_config == "/tmp/deploy.yaml" + assert orch.worker_backend == "multi_process" + assert orch.async_chunk is True + assert orch.log_stats is False # shared, read from raw + assert orch.ray_address is None # default preserved + + # Engine side + assert engine["max_num_seqs"] == 32 + assert engine["gpu_memory_utilization"] == 0.8 + assert engine["model"] == "Qwen/Qwen3-Omni" + assert engine["log_stats"] is False + assert "host" not in engine + assert "port" not in engine + assert "api_key" not in engine + # orchestrator-only keys never reach engine + assert "stage_init_timeout" not in engine + assert "deploy_config" not in engine + assert "async_chunk" not in engine + + +# ============================================================================ +# Invariant 3 — user-typed unclassifiable flags warn (don't fail silently). 
+# ============================================================================ + + +def test_user_typed_unclassified_warns(caplog): + """If the user types a flag we can't route, warn — don't silently drop.""" + raw = {"bogus_flag": "value", "max_num_seqs": 64} + with caplog.at_level(logging.WARNING, logger="vllm_omni.engine.arg_utils"): + split_kwargs(raw, engine_cls=_FakeEngineArgs, user_typed={"bogus_flag"}) + assert any("bogus_flag" in rec.message for rec in caplog.records), ( + f"Expected warning mentioning 'bogus_flag', got: {[rec.message for rec in caplog.records]}" + ) + + +def test_unclassified_without_user_typed_silent(caplog): + """Without user_typed, unclassified keys drop silently (argparse defaults + for server flags shouldn't spam logs on every launch).""" + raw = {"host": "0.0.0.0", "port": 8091, "max_num_seqs": 64} + with caplog.at_level(logging.WARNING, logger="vllm_omni.engine.arg_utils"): + split_kwargs(raw, engine_cls=_FakeEngineArgs, user_typed=None) + # No warnings because we don't know these were user-typed. + assert not any("host" in rec.message or "port" in rec.message for rec in caplog.records) + + +# ============================================================================ +# Invariant 4 — CLI flag classification completeness. +# Catches new flags added without updating OrchestratorArgs or OmniEngineArgs. +# ============================================================================ + + +def test_all_omni_cli_flags_classified(): + """Every vllm-omni-added CLI flag must be classifiable. 
+ + Runs ``OmniServeCommand.subparser_init`` and checks that every new + argument (compared to vllm's base parser) is either: + - a field on OrchestratorArgs, OR + - a field on OmniEngineArgs, OR + - in SHARED_FIELDS + """ + try: + from vllm.utils.argparse_utils import FlexibleArgumentParser + + from vllm_omni.engine.arg_utils import OmniEngineArgs + from vllm_omni.entrypoints.cli.serve import OmniServeCommand + except Exception as exc: + pytest.skip(f"Cannot build parser in this environment: {exc}") + + # Build the serve parser + root = FlexibleArgumentParser() + subparsers = root.add_subparsers() + cmd = OmniServeCommand() + try: + parser = cmd.subparser_init(subparsers) + except Exception as exc: + pytest.skip(f"subparser_init failed (dev env issue): {exc}") + + all_dests = {a.dest for a in parser._actions if a.dest and a.dest not in {"help", "model_tag"}} + + orch = orchestrator_field_names() + engine = {f.name for f in fields(OmniEngineArgs)} + server_derived = derive_server_dests_from_vllm_parser() + + unclassified = all_dests - orch - engine - SHARED_FIELDS - server_derived + # Some argparse-internal dests (suppressed, private) may not match — + # filter those out. + unclassified = {d for d in unclassified if not d.startswith("_")} + + assert not unclassified, ( + f"These CLI flags are not classified as " + f"orchestrator/engine/shared/server: {sorted(unclassified)}. " + f"Add them to OrchestratorArgs (if consumed by orchestrator), " + f"OmniEngineArgs (if consumed by per-stage engine), or the known-server " + f"allowlist (if they're vllm frontend flags). " + f"If intentional (e.g. a new CLI-only flag that doesn't map to either " + f"dataclass), add it to a KNOWN_UNROUTED allowlist." + ) + + +# ============================================================================ +# argparse interop (Phase 3). 
+# ============================================================================ + + +def test_orchestrator_args_from_argparse(): + """Can build OrchestratorArgs from an argparse.Namespace.""" + import argparse + + ns = argparse.Namespace( + stage_init_timeout=500, + deploy_config="/tmp/x.yaml", + max_num_seqs=64, # engine field — ignored + host="0.0.0.0", # server field — ignored + ) + orch = orchestrator_args_from_argparse(ns) + assert orch.stage_init_timeout == 500 + assert orch.deploy_config == "/tmp/x.yaml" + assert orch.worker_backend == "multi_process" # default + + +def test_derive_server_dests_returns_frozenset(): + """Server-dest derivation returns a frozenset (possibly empty).""" + result = derive_server_dests_from_vllm_parser() + assert isinstance(result, frozenset) + + +# ============================================================================ +# internal_blacklist_keys — single source of truth for per-stage forwarding. +# ============================================================================ + + +def test_internal_blacklist_keys_derived_from_orchestrator(): + """Blacklist is exactly OrchestratorArgs fields minus SHARED_FIELDS. + + This function replaces the old hardcoded INTERNAL_STAGE_OVERRIDE_KEYS + frozenset. Asserts the contract so future changes to OrchestratorArgs + automatically propagate to the blacklist. + """ + blacklist = internal_blacklist_keys() + assert blacklist == orchestrator_field_names() - SHARED_FIELDS + # Spot-check expected entries + assert "stage_init_timeout" in blacklist + assert "deploy_config" in blacklist + assert "async_chunk" in blacklist + # Shared fields must NOT appear — they flow to both orchestrator and engine + assert "model" not in blacklist + assert "log_stats" not in blacklist + + +# ============================================================================ +# Boundary value analysis — edge cases around split_kwargs. 
+# ============================================================================ + + +def test_split_empty_kwargs(): + """Empty kwargs yields default OrchestratorArgs and empty engine dict.""" + orch, engine = split_kwargs({}, engine_cls=_FakeEngineArgs) + assert orch.stage_init_timeout == 300 # dataclass default + assert orch.worker_backend == "multi_process" # dataclass default + assert engine == {} + + +def test_split_all_none_values_preserved_on_orchestrator(): + """None values for orchestrator fields are kept (represents 'not set').""" + raw = {"ray_address": None, "deploy_config": None, "max_num_seqs": None} + orch, engine = split_kwargs(raw, engine_cls=_FakeEngineArgs) + assert orch.ray_address is None + assert orch.deploy_config is None + # Engine-side None still passes through; caller decides semantics downstream. + assert engine.get("max_num_seqs") is None + + +def test_split_user_typed_with_empty_kwargs_no_warn(caplog): + """user_typed non-empty but kwargs empty — no warnings emitted.""" + with caplog.at_level(logging.WARNING, logger="vllm_omni.engine.arg_utils"): + split_kwargs({}, engine_cls=_FakeEngineArgs, user_typed={"nothing"}) + assert not caplog.records + + +def test_ambiguous_field_strict_raises(): + """strict=True raises ValueError on overlap outside SHARED_FIELDS.""" + + # deploy_config is on OrchestratorArgs; declaring it on the engine class + # too (without adding to SHARED_FIELDS) creates an ambiguous route. 
+ @dataclass + class _AmbiguousEngine: + deploy_config: str | None = None + + with pytest.raises(ValueError, match="both OrchestratorArgs and"): + split_kwargs({"deploy_config": "x"}, engine_cls=_AmbiguousEngine, strict=True) + + +def test_ambiguous_field_non_strict_routes_to_orchestrator(caplog): + """strict=False logs ERROR but routes the ambiguous field to orchestrator.""" + + @dataclass + class _AmbiguousEngine: + deploy_config: str | None = None + + with caplog.at_level(logging.ERROR, logger="vllm_omni.engine.arg_utils"): + orch, engine = split_kwargs({"deploy_config": "x"}, engine_cls=_AmbiguousEngine, strict=False) + assert orch.deploy_config == "x" + assert "deploy_config" not in engine + assert any("both OrchestratorArgs" in r.message for r in caplog.records) diff --git a/tests/test_config_factory.py b/tests/test_config_factory.py index e284de48d0..1d65d3acd2 100644 --- a/tests/test_config_factory.py +++ b/tests/test_config_factory.py @@ -4,12 +4,26 @@ Unit tests for StageConfigFactory and related classes. 
""" +from dataclasses import dataclass +from pathlib import Path + +import pytest + from vllm_omni.config.stage_config import ( + _EXECUTION_TYPE_TO_SCHEDULER, + _PIPELINE_REGISTRY, ModelPipeline, + PipelineConfig, StageConfig, StageConfigFactory, + StageExecutionType, + StagePipelineConfig, StageType, + build_stage_runtime_overrides, + register_pipeline, + strip_parent_engine_args, ) +from vllm_omni.engine.arg_utils import SHARED_FIELDS, internal_blacklist_keys class TestStageType: @@ -241,8 +255,9 @@ def test_default_diffusion_no_yaml(self): def test_default_diffusion_with_parallel_config(self): """Test diffusion config calculates devices from parallel_config.""" + @dataclass class MockParallelConfig: - world_size = 4 + world_size: int = 4 kwargs = { "parallel_config": MockParallelConfig(), @@ -270,7 +285,7 @@ def test_cli_override_forwards_engine_registered_args(self): stage = StageConfig(stage_id=0, model_stage="thinker", input_sources=[]) cli_overrides = { "gpu_memory_utilization": 0.9, # Well-known param - "custom_engine_flag": True, # Not in _INTERNAL_KEYS, so forwarded + "custom_engine_flag": True, # Not orchestrator-owned, so forwarded } overrides = StageConfigFactory._merge_cli_overrides(stage, cli_overrides) @@ -311,6 +326,56 @@ def test_per_stage_override_excludes_internal_keys(self): assert "batch_timeout" not in overrides +class TestStageResolutionHelpers: + """Tests for shared stage override / filtering helpers.""" + + def test_build_stage_runtime_overrides_ignores_other_stage_and_internal_keys(self): + # Pass the same filter set the function uses by default + # (orchestrator-only fields plus SHARED_FIELDS so ``model`` is + # treated as not-per-stage-overridable). 
+ overrides = build_stage_runtime_overrides( + 0, + { + "gpu_memory_utilization": 0.5, + "stage_0_gpu_memory_utilization": 0.9, + "stage_1_gpu_memory_utilization": 0.1, + "stage_0_model": "should_be_ignored", + "parallel_config": {"world_size": 2}, + }, + internal_keys=internal_blacklist_keys() | SHARED_FIELDS, + ) + + assert overrides["gpu_memory_utilization"] == 0.9 + assert "model" not in overrides + assert "parallel_config" not in overrides + + def test_strip_parent_engine_args_reports_only_surprising_parent_overrides(self): + from dataclasses import fields as dc_fields + + from vllm.engine.arg_utils import EngineArgs + + parent_fields = {f.name: f for f in dc_fields(EngineArgs)} + filtered, overridden = strip_parent_engine_args( + { + "model": "some/model", + "stage_configs_path": "/tmp/stages.yaml", + "tensor_parallel_size": 4, + "worker_extension_cls": "some.Extension", + "custom_pipeline_args": {"pipeline_class": "demo.Pipeline"}, + }, + parent_fields=parent_fields, + keep_keys={"worker_extension_cls"}, + strip_keys={"stage_configs_path"}, + no_warn_keys={"model"}, + ) + + assert filtered == { + "worker_extension_cls": "some.Extension", + "custom_pipeline_args": {"pipeline_class": "demo.Pipeline"}, + } + assert overridden == ["tensor_parallel_size"] + + class TestPipelineYamlParsing: """Tests for pipeline YAML file parsing (@ZJY0516).""" @@ -609,16 +674,617 @@ def test_parse_missing_async_chunk_defaults_false(self, tmp_path): assert pipeline.async_chunk is False -class TestArchitectureFallback: - """Tests for architecture-based model detection fallback.""" +class TestPipelineDiscovery: + """Tests for the central pipeline registry (``pipeline_registry._VLLM_OMNI_PIPELINES``).""" + + def test_registry_has_known_models(self): + """Built-in pipelines are lazy-loaded from the central declaration + on first access; no eager import or discovery walk needed.""" + # ``in`` triggers the lazy-map lookup without forcing a load. 
+ assert "qwen2_5_omni" in _PIPELINE_REGISTRY + assert "qwen3_omni_moe" in _PIPELINE_REGISTRY + assert "qwen3_tts" in _PIPELINE_REGISTRY + + def test_registry_loads_pipeline_on_getitem(self): + """Looking up a registered model_type returns the matching PipelineConfig.""" + pipeline = _PIPELINE_REGISTRY["qwen3_omni_moe"] + assert pipeline.model_type == "qwen3_omni_moe" + assert len(pipeline.stages) == 3 # thinker + talker + code2wav + + def test_registry_returns_none_for_unknown(self): + """Unknown model_types aren't found; ``get()`` returns None.""" + assert "definitely_not_a_real_model" not in _PIPELINE_REGISTRY + assert _PIPELINE_REGISTRY.get("definitely_not_a_real_model") is None + + def test_pipeline_config_supports_hf_architectures(self): + """PipelineConfig accepts hf_architectures for HF-arch fallback + (replaces the old _ARCHITECTURE_MODELS dict).""" + p = PipelineConfig( + model_type="custom_collide", + hf_architectures=("SomeCollidingArch",), + ) + assert p.hf_architectures == ("SomeCollidingArch",) + + +class TestStagePipelineConfig: + def test_frozen(self): + s = StagePipelineConfig(stage_id=0, model_stage="a") + with pytest.raises(AttributeError): + s.model_stage = "changed" + + def test_defaults(self): + s = StagePipelineConfig(stage_id=0, model_stage="a") + assert s.execution_type == StageExecutionType.LLM_AR + assert s.input_sources == () + assert s.final_output is False + assert s.sampling_constraints == {} + assert s.engine_output_type is None + + +class TestPipelineConfigNew: + def test_frozen(self): + p = PipelineConfig(model_type="t", model_arch="A") + with pytest.raises(AttributeError): + p.model_type = "changed" + + def test_validate_valid(self): + p = PipelineConfig( + model_type="t", + model_arch="A", + stages=( + StagePipelineConfig(stage_id=0, model_stage="a"), + StagePipelineConfig(stage_id=1, model_stage="b", input_sources=(0,)), + ), + ) + assert p.validate() == [] + + def test_validate_no_stages(self): + p = 
PipelineConfig(model_type="t", model_arch="A") + assert any("no stages" in e.lower() for e in p.validate()) + + def test_get_scheduler_cls(self): + p = PipelineConfig( + model_type="t", + model_arch="A", + stages=( + StagePipelineConfig(stage_id=0, model_stage="a", execution_type=StageExecutionType.LLM_AR), + StagePipelineConfig( + stage_id=1, model_stage="b", execution_type=StageExecutionType.LLM_GENERATION, input_sources=(0,) + ), + ), + ) + assert "OmniARScheduler" in p.get_scheduler_cls(0) + assert "OmniGenerationScheduler" in p.get_scheduler_cls(1) + + +class TestExecutionTypeToScheduler: + def test_all_types_mapped(self): + for et in StageExecutionType: + assert et in _EXECUTION_TYPE_TO_SCHEDULER + + +class TestPipelineRegistry: + def test_register_and_lookup(self): + p = PipelineConfig( + model_type="__test_only__", + model_arch="A", + stages=(StagePipelineConfig(stage_id=0, model_stage="a"),), + ) + register_pipeline(p) + assert _PIPELINE_REGISTRY["__test_only__"] is p + del _PIPELINE_REGISTRY["__test_only__"] + + +class TestDeployConfigLoading: + def test_load_deploy_config(self): + from pathlib import Path + + from vllm_omni.config.stage_config import load_deploy_config + + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") + + deploy = load_deploy_config(deploy_path) + assert len(deploy.stages) == 3 + assert deploy.async_chunk is True + assert deploy.connectors is not None + assert deploy.platforms is not None + + def test_merge_pipeline_deploy(self): + from pathlib import Path + + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy + + pipeline = _PIPELINE_REGISTRY["qwen3_omni_moe"] + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not 
found") + + deploy = load_deploy_config(deploy_path) + stages = merge_pipeline_deploy(pipeline, deploy) + + assert len(stages) == 3 + s0 = stages[0] + assert s0.model_stage == "thinker" + assert s0.yaml_engine_args["model_arch"] == "Qwen3OmniMoeForConditionalGeneration" + assert s0.yaml_engine_args["engine_output_type"] == "latent" + assert s0.yaml_extras["default_sampling_params"]["detokenize"] is True + + +class TestQwen3OmniPipeline: + def test_registered(self): + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + + p = _PIPELINE_REGISTRY.get("qwen3_omni_moe") + assert p is not None + assert p.model_arch == "Qwen3OmniMoeForConditionalGeneration" + assert len(p.stages) == 3 + assert p.validate() == [] + + def test_thinker(self): + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen3_omni_moe"].get_stage(0) + assert s.model_stage == "thinker" + assert s.execution_type == StageExecutionType.LLM_AR + assert s.owns_tokenizer is True + assert s.engine_output_type == "latent" + assert s.sampling_constraints["detokenize"] is True + + def test_talker(self): + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen3_omni_moe"].get_stage(1) + assert s.input_sources == (0,) + assert s.sampling_constraints["stop_token_ids"] == [2150] + assert s.custom_process_input_func is not None + assert s.custom_process_next_stage_input_func is not None + + def test_code2wav(self): + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen3_omni_moe"].get_stage(2) + assert s.execution_type == StageExecutionType.LLM_GENERATION + assert s.final_output_type == "audio" + assert s.custom_process_input_func is not None + + +class TestQwen2_5OmniPipeline: + def test_registered(self): + import vllm_omni.model_executor.models.qwen2_5_omni.pipeline # noqa: F401 + + p = _PIPELINE_REGISTRY.get("qwen2_5_omni") + assert p is not 
None + assert p.model_arch == "Qwen2_5OmniForConditionalGeneration" + assert len(p.stages) == 3 + assert p.validate() == [] + + def test_thinker(self): + import vllm_omni.model_executor.models.qwen2_5_omni.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen2_5_omni"].get_stage(0) + assert s.model_stage == "thinker" + assert s.execution_type == StageExecutionType.LLM_AR + assert s.owns_tokenizer is True + assert s.engine_output_type == "latent" + assert s.requires_multimodal_data is True + + def test_talker(self): + import vllm_omni.model_executor.models.qwen2_5_omni.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen2_5_omni"].get_stage(1) + assert s.input_sources == (0,) + assert s.sampling_constraints["stop_token_ids"] == [8294] + assert s.custom_process_input_func is not None + + def test_code2wav(self): + import vllm_omni.model_executor.models.qwen2_5_omni.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen2_5_omni"].get_stage(2) + assert s.execution_type == StageExecutionType.LLM_GENERATION + assert s.final_output_type == "audio" + assert s.engine_output_type == "audio" + + +class TestQwen3TTSPipeline: + def test_registered(self): + import vllm_omni.model_executor.models.qwen3_tts.pipeline # noqa: F401 + + p = _PIPELINE_REGISTRY.get("qwen3_tts") + assert p is not None + assert p.model_arch == "Qwen3TTSTalkerForConditionalGeneration" + assert len(p.stages) == 2 + assert p.validate() == [] + + def test_talker_stage(self): + import vllm_omni.model_executor.models.qwen3_tts.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen3_tts"].get_stage(0) + assert s.model_stage == "qwen3_tts" + assert s.execution_type == StageExecutionType.LLM_AR + assert s.owns_tokenizer is True + assert s.engine_output_type == "latent" + assert s.sampling_constraints["stop_token_ids"] == [2150] + # Stage 0 inherits the pipeline-level model_arch + assert s.model_arch is None + + def test_code2wav_stage_has_per_stage_model_arch(self): + import 
vllm_omni.model_executor.models.qwen3_tts.pipeline # noqa: F401 + + s = _PIPELINE_REGISTRY["qwen3_tts"].get_stage(1) + assert s.execution_type == StageExecutionType.LLM_GENERATION + assert s.final_output_type == "audio" + assert s.engine_output_type == "audio" + # Per-stage model_arch override (different from pipeline-level talker) + assert s.model_arch == "Qwen3TTSCode2Wav" + # tts_args is passed through via extras + assert s.extras["tts_args"]["max_instructions_length"] == 500 + + def test_per_stage_model_arch_flows_through_merge(self, tmp_path): + """Verify the new ps.model_arch override survives merge_pipeline_deploy.""" + import vllm_omni.model_executor.models.qwen3_tts.pipeline # noqa: F401 + from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy + + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_tts.yaml" + if not deploy_path.exists(): + pytest.skip("qwen3_tts deploy yaml not found") + + deploy = load_deploy_config(deploy_path) + pipeline = _PIPELINE_REGISTRY["qwen3_tts"] + stages = merge_pipeline_deploy(pipeline, deploy) + + # Stage 0 inherits pipeline-level model_arch + assert stages[0].yaml_engine_args["model_arch"] == "Qwen3TTSTalkerForConditionalGeneration" + # Stage 1 uses its per-stage override + assert stages[1].yaml_engine_args["model_arch"] == "Qwen3TTSCode2Wav" + + +class TestBaseConfigInheritance: + """Test deploy YAML base_config inheritance.""" + + def test_ci_inherits_from_main(self): + from tests.utils import get_deploy_config_path + from vllm_omni.config.stage_config import load_deploy_config + + ci_path = Path(get_deploy_config_path("ci/qwen3_omni_moe.yaml")) + if not ci_path.exists(): + pytest.skip("CI deploy config not found") + + deploy = load_deploy_config(ci_path) + assert len(deploy.stages) == 3 + # CI overrides + assert deploy.stages[0].engine_extras.get("load_format") == "dummy" + assert deploy.stages[0].max_num_seqs == 5 + # Inherited from base + assert 
deploy.stages[0].gpu_memory_utilization == 0.9 + assert deploy.connectors is not None + assert "connector_of_shared_memory" in deploy.connectors + # CI overlay explicitly sets async_chunk: False (see + # tests/utils.py::_CI_OVERLAYS and PR #2383 discussion). Overlay + # bool overrides base even when the base yaml has async_chunk: true. + assert deploy.async_chunk is False + + def test_ci_sampling_merge(self): + from tests.utils import get_deploy_config_path + from vllm_omni.config.stage_config import load_deploy_config + + ci_path = Path(get_deploy_config_path("ci/qwen3_omni_moe.yaml")) + if not ci_path.exists(): + pytest.skip("CI deploy config not found") + + deploy = load_deploy_config(ci_path) + s0 = deploy.stages[0].default_sampling_params + # CI overrides max_tokens + assert s0["max_tokens"] == 150 + # Inherited from base + assert s0["temperature"] == 0.4 + assert s0["seed"] == 42 + + def test_pure_inheritance_overlay(self, tmp_path): + """An overlay with only ``base_config`` inherits everything.""" + from vllm_omni.config.stage_config import load_deploy_config + + base = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not base.exists(): + pytest.skip("Base deploy config not found") + + overlay = tmp_path / "overlay.yaml" + overlay.write_text(f"base_config: {base}\n") + + deploy = load_deploy_config(overlay) + assert len(deploy.stages) == 3 + assert deploy.stages[0].gpu_memory_utilization == 0.9 + + def test_single_field_overlay(self, tmp_path): + """An overlay overriding one stage field merges with the base.""" + from vllm_omni.config.stage_config import load_deploy_config + + base = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not base.exists(): + pytest.skip("Base deploy config not found") + + overlay = tmp_path / "overlay.yaml" + overlay.write_text(f"base_config: {base}\nstages:\n - stage_id: 2\n max_num_batched_tokens: 1000000\n") + + deploy = load_deploy_config(overlay) + assert 
deploy.stages[2].max_num_batched_tokens == 1000000 + # Rest inherited + assert deploy.stages[0].gpu_memory_utilization == 0.9 + + +class TestPlatformOverrides: + """Test platform-specific deploy config overrides.""" + + def test_npu_overrides(self): + from pathlib import Path + + from vllm_omni.config.stage_config import _apply_platform_overrides, load_deploy_config + + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") + + deploy = load_deploy_config(deploy_path) + deploy = _apply_platform_overrides(deploy, platform="npu") + + assert deploy.stages[0].gpu_memory_utilization == 0.6 + assert deploy.stages[0].tensor_parallel_size == 2 + assert deploy.stages[0].devices == "0,1" + # Stage 2 unaffected fields stay at base + assert deploy.stages[2].enforce_eager is True + + def test_xpu_overrides(self): + from pathlib import Path + + from vllm_omni.config.stage_config import _apply_platform_overrides, load_deploy_config + + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") + + deploy = load_deploy_config(deploy_path) + deploy = _apply_platform_overrides(deploy, platform="xpu") + + assert deploy.stages[0].tensor_parallel_size == 4 + assert deploy.stages[0].devices == "0,1,2,3" + assert deploy.stages[0].engine_extras.get("max_cudagraph_capture_size") == 0 + + def test_unknown_platform_noop(self): + from pathlib import Path + + from vllm_omni.config.stage_config import _apply_platform_overrides, load_deploy_config + + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") + + deploy = load_deploy_config(deploy_path) + original_mem = deploy.stages[0].gpu_memory_utilization + deploy = _apply_platform_overrides(deploy, platform="unknown_hw") + 
assert deploy.stages[0].gpu_memory_utilization == original_mem + + def test_platforms_deep_merge_inheritance(self, tmp_path): + """Overlay's platforms: block layers onto base's, per-stage.""" + from vllm_omni.config.stage_config import _apply_platform_overrides, load_deploy_config + + base = tmp_path / "base.yaml" + base.write_text( + "stages:\n" + " - stage_id: 0\n" + " gpu_memory_utilization: 0.9\n" + "platforms:\n" + " rocm:\n" + " stages:\n" + " - stage_id: 0\n" + " enforce_eager: true\n" + ) + overlay = tmp_path / "overlay.yaml" + overlay.write_text( + f"base_config: {base.name}\n" + "platforms:\n" + " rocm:\n" + " stages:\n" + " - stage_id: 0\n" + " max_num_seqs: 1\n" + ) + + deploy = load_deploy_config(overlay) + deploy = _apply_platform_overrides(deploy, platform="rocm") + # Both base's enforce_eager and overlay's max_num_seqs should apply. + assert deploy.stages[0].enforce_eager is True + assert deploy.stages[0].max_num_seqs == 1 + # Inherited stage default not touched by overlay platforms section. 
+ assert deploy.stages[0].gpu_memory_utilization == 0.9 + + +class TestCLIOverrideFlow: + """Test --stage-overrides JSON merge into StageConfig.""" + + def test_stage_overrides_merge(self): + from pathlib import Path + + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy + + pipeline = _PIPELINE_REGISTRY["qwen3_omni_moe"] + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") + + deploy = load_deploy_config(deploy_path) + stages = merge_pipeline_deploy(pipeline, deploy) + + # Simulate --stage-overrides '{"0": {"gpu_memory_utilization": 0.5}}' + overrides = {"stage_0_gpu_memory_utilization": 0.5} + stages[0].runtime_overrides = StageConfigFactory._merge_cli_overrides(stages[0], overrides) + assert stages[0].runtime_overrides["gpu_memory_utilization"] == 0.5 + + def test_global_override_applies_to_all(self): + from pathlib import Path + + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy + + pipeline = _PIPELINE_REGISTRY["qwen3_omni_moe"] + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") + + deploy = load_deploy_config(deploy_path) + stages = merge_pipeline_deploy(pipeline, deploy) + + overrides = {"enforce_eager": True} + for s in stages: + s.runtime_overrides = StageConfigFactory._merge_cli_overrides(s, overrides) + assert s.runtime_overrides["enforce_eager"] is True + + +class TestCLIExplicitPrecedence: + """Verify YAML > argparse defaults; explicit CLI args > YAML.""" + + def _stages(self, cli_overrides, cli_explicit_keys): + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + + return 
StageConfigFactory._create_from_registry( + "qwen3_omni_moe", + cli_overrides=cli_overrides, + cli_explicit_keys=cli_explicit_keys, + ) + + def test_explicit_cli_overrides_yaml(self): + """User-typed --max-num-seqs wins over the deploy YAML value.""" + stages = self._stages( + cli_overrides={"max_num_seqs": 999}, + cli_explicit_keys={"max_num_seqs"}, + ) + # Stage 2 yaml has max_num_seqs=1; explicit CLI must beat it. + assert stages[2].runtime_overrides.get("max_num_seqs") == 999 + + def test_default_cli_does_not_override_yaml(self): + """Argparse defaults must NOT clobber values that are present in YAML.""" + stages = self._stages( + cli_overrides={"max_num_seqs": 256}, + cli_explicit_keys=set(), # user typed nothing + ) + # Stage 2's YAML value (1) should win because the user didn't type --max-num-seqs. + assert stages[2].runtime_overrides.get("max_num_seqs") != 256 + + def test_default_cli_fills_missing_yaml_field(self): + """Argparse defaults still fill fields the YAML doesn't set.""" + stages = self._stages( + cli_overrides={"some_unrelated_knob": "fallback"}, + cli_explicit_keys=set(), + ) + # Field absent from YAML → CLI default flows through as a fallback. 
+ assert stages[0].runtime_overrides.get("some_unrelated_knob") == "fallback" + + def test_per_stage_overrides_always_explicit(self): + """``stage__*`` keys are always treated as explicit.""" + stages = self._stages( + cli_overrides={"stage_0_gpu_memory_utilization": 0.42}, + cli_explicit_keys=set(), # not in the explicit set, but per-stage + ) + assert stages[0].runtime_overrides.get("gpu_memory_utilization") == 0.42 + + def test_none_explicit_set_treats_all_as_explicit(self): + """Programmatic Omni() callers (cli_explicit_keys=None) keep current behavior.""" + stages = self._stages( + cli_overrides={"max_num_seqs": 999}, + cli_explicit_keys=None, + ) + assert stages[2].runtime_overrides.get("max_num_seqs") == 999 + + def test_explicit_async_chunk_false_overrides_yaml(self): + """``--no-async-chunk`` flips the deploy-level async_chunk to False even + when the YAML sets it to True. Verifies that the per-stage + ``async_chunk: True`` injection in ``merge_pipeline_deploy`` is skipped + and that ``async_chunk`` does not leak through ``_merge_cli_overrides``. + """ + stages = self._stages( + cli_overrides={"async_chunk": False}, + cli_explicit_keys={"async_chunk"}, + ) + # qwen3_omni_moe.yaml has `async_chunk: true`, so by default every + # stage's engine_args would carry it. With the explicit override, it + # must NOT show up. + for stage in stages: + assert stage.yaml_engine_args.get("async_chunk") is not True + assert stage.runtime_overrides.get("async_chunk") is None + + def test_default_async_chunk_leaves_yaml_alone(self): + """An unset ``--async-chunk`` (default None) must leave the YAML's True + in force on every stage.""" + stages = self._stages( + cli_overrides={"async_chunk": None}, + cli_explicit_keys=set(), + ) + # qwen3_omni_moe.yaml: `async_chunk: true` → injected on every stage. 
+ for stage in stages: + assert stage.yaml_engine_args.get("async_chunk") is True + + def test_explicit_enable_prefix_caching_overrides_yaml(self): + """``--enable-prefix-caching`` (global) flips every stage's + ``enable_prefix_caching`` to True regardless of the YAML default.""" + stages = self._stages( + cli_overrides={"enable_prefix_caching": True}, + cli_explicit_keys={"enable_prefix_caching"}, + ) + for stage in stages: + assert stage.runtime_overrides.get("enable_prefix_caching") is True + + def test_async_chunk_dispatches_processors(self): + """A single ``qwen3_tts`` pipeline picks per-chunk vs end-to-end + processors based on ``deploy.async_chunk``, without needing a + separate variant pipeline registration.""" + import vllm_omni.model_executor.models.qwen3_tts.pipeline # noqa: F401 + from vllm_omni.config.stage_config import ( + _PIPELINE_REGISTRY, + DeployConfig, + merge_pipeline_deploy, + ) + + pipeline = _PIPELINE_REGISTRY["qwen3_tts"] + + # async_chunk=True → stage 0's per-chunk processor wires up, stage 1 + # has no sync input processor. + async_stages = merge_pipeline_deploy(pipeline, DeployConfig(async_chunk=True)) + assert ( + async_stages[0] + .yaml_engine_args.get("custom_process_next_stage_input_func", "") + .endswith("talker2code2wav_async_chunk") + ) + assert async_stages[1].custom_process_input_func is None + + # async_chunk=False → stage 0 has no streaming processor, stage 1's + # batch-end processor wires up. 
+ sync_stages = merge_pipeline_deploy(pipeline, DeployConfig(async_chunk=False)) + assert "custom_process_next_stage_input_func" not in sync_stages[0].yaml_engine_args + assert sync_stages[1].custom_process_input_func is not None + assert sync_stages[1].custom_process_input_func.endswith("talker2code2wav") + + +class TestSamplingConstraintsPrecedence: + """Test that pipeline sampling_constraints override deploy defaults.""" + + def test_constraints_win(self): + from pathlib import Path + + import vllm_omni.model_executor.models.qwen3_omni.pipeline # noqa: F401 + from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy + + pipeline = _PIPELINE_REGISTRY["qwen3_omni_moe"] + deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "qwen3_omni_moe.yaml" + if not deploy_path.exists(): + pytest.skip("Deploy config not found") - def test_architecture_models_mapping_exists(self): - """Test that _ARCHITECTURE_MODELS contains expected entries.""" - assert "MiMoAudioForConditionalGeneration" in StageConfigFactory._ARCHITECTURE_MODELS - assert StageConfigFactory._ARCHITECTURE_MODELS["MiMoAudioForConditionalGeneration"] == "mimo_audio" - assert "HunyuanImage3ForCausalMM" in StageConfigFactory._ARCHITECTURE_MODELS - assert StageConfigFactory._ARCHITECTURE_MODELS["HunyuanImage3ForCausalMM"] == "hunyuan_image3" + deploy = load_deploy_config(deploy_path) + stages = merge_pipeline_deploy(pipeline, deploy) - def test_mimo_audio_in_pipeline_models(self): - """Test that mimo_audio is registered in PIPELINE_MODELS.""" - assert "mimo_audio" in StageConfigFactory.PIPELINE_MODELS + # Pipeline says detokenize=True for thinker, deploy can't override + assert stages[0].yaml_extras["default_sampling_params"]["detokenize"] is True + # Pipeline says stop_token_ids=[2150] for talker + assert stages[1].yaml_extras["default_sampling_params"]["stop_token_ids"] == [2150] + # Deploy temperature still flows through + assert 
stages[0].yaml_extras["default_sampling_params"]["temperature"] == 0.4 diff --git a/tests/utils.py b/tests/utils.py index 84edbbf3d1..d8137cf963 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -11,6 +11,7 @@ import time from collections.abc import Callable from contextlib import ExitStack, contextmanager, suppress +from pathlib import Path from typing import Any, Literal import cloudpickle @@ -24,6 +25,221 @@ _P = ParamSpec("_P") +_REPO_ROOT = Path(__file__).resolve().parent.parent +_DEPLOY_DIR = _REPO_ROOT / "vllm_omni" / "deploy" +_CI_GENERATED_DIR = _REPO_ROOT / "tests" / ".ci_generated" + + +# CI overlays as Python dicts (LSP-friendly). Materialized on demand to +# tests/.ci_generated/.yaml via get_deploy_config_path("ci/.yaml"). +_CI_OVERLAYS: dict[str, dict[str, Any]] = { + "qwen2_5_omni": { + "base_config": "qwen2_5_omni.yaml", + "async_chunk": False, + "stages": [ + { + "stage_id": 0, + "max_model_len": 16384, + "max_num_batched_tokens": 16384, + "max_num_seqs": 1, + "gpu_memory_utilization": 0.9, + "skip_mm_profiling": True, + "load_format": "dummy", + "default_sampling_params": {"max_tokens": 128}, + }, + { + "stage_id": 1, + "max_model_len": 16384, + "max_num_batched_tokens": 16384, + "max_num_seqs": 1, + "gpu_memory_utilization": 0.4, + "skip_mm_profiling": True, + "load_format": "dummy", + "default_sampling_params": {"max_tokens": 4096}, + }, + { + "stage_id": 2, + "max_num_seqs": 1, + "gpu_memory_utilization": 0.5, + "max_num_batched_tokens": 8192, + "max_model_len": 8192, + "load_format": "dummy", + "devices": "2", + "default_sampling_params": {"max_tokens": 8192}, + }, + ], + "platforms": { + "rocm": { + "stages": [ + {"stage_id": 0, "gpu_memory_utilization": 0.9}, + {"stage_id": 1, "gpu_memory_utilization": 0.4}, + {"stage_id": 2, "gpu_memory_utilization": 0.5, "devices": "2"}, + ], + }, + "xpu": { + "stages": [ + { + "stage_id": 0, + "gpu_memory_utilization": 0.9, + "max_num_batched_tokens": 16384, + "max_model_len": 16384, + }, + {"stage_id": 
1, "gpu_memory_utilization": 0.5}, + { + "stage_id": 2, + "gpu_memory_utilization": 0.3, + "max_num_batched_tokens": 4096, + "max_model_len": 4096, + "devices": "2", + }, + ], + }, + }, + }, + "qwen3_omni_moe": { + "base_config": "qwen3_omni_moe.yaml", + "async_chunk": False, + "stages": [ + { + "stage_id": 0, + "max_num_seqs": 5, + "max_model_len": 32768, + "mm_processor_cache_gb": 0, + "load_format": "dummy", + "default_sampling_params": {"max_tokens": 150, "ignore_eos": False}, + }, + { + "stage_id": 1, + "gpu_memory_utilization": 0.5, + "max_num_seqs": 5, + "max_model_len": 32768, + "load_format": "dummy", + "default_sampling_params": {"max_tokens": 1000}, + }, + { + "stage_id": 2, + "max_num_seqs": 5, + "max_num_batched_tokens": 100000, + "load_format": "dummy", + "default_sampling_params": {"max_tokens": 2000}, + }, + ], + "platforms": { + "rocm": { + "stages": [ + {"stage_id": 0, "max_num_seqs": 1, "default_sampling_params": {"max_tokens": 100}}, + { + "stage_id": 1, + "max_num_seqs": 1, + "enforce_eager": True, + "default_sampling_params": {"max_tokens": 100}, + }, + { + "stage_id": 2, + "max_num_seqs": 1, + "max_num_batched_tokens": 1000000, + "default_sampling_params": {"max_tokens": 200}, + }, + ], + }, + "xpu": { + "stages": [ + { + "stage_id": 0, + "gpu_memory_utilization": 0.85, + "max_num_seqs": 1, + "tensor_parallel_size": 4, + "enforce_eager": True, + "max_num_batched_tokens": 4096, + "max_model_len": 4096, + "max_cudagraph_capture_size": 0, + "skip_mm_profiling": True, + "devices": "0,1,2,3", + "default_sampling_params": {"max_tokens": 100, "ignore_eos": False}, + }, + { + "stage_id": 1, + "gpu_memory_utilization": 0.6, + "max_num_seqs": 1, + "enforce_eager": True, + "max_num_batched_tokens": 4096, + "max_model_len": 4096, + "max_cudagraph_capture_size": 0, + "skip_mm_profiling": True, + "devices": "4", + }, + { + "stage_id": 2, + "gpu_memory_utilization": 0.3, + "max_num_seqs": 1, + "max_num_batched_tokens": 100000, + 
"max_cudagraph_capture_size": 0, + "skip_mm_profiling": True, + "devices": "5", + "default_sampling_params": {"max_tokens": 2000}, + }, + ], + }, + }, + }, + # Single-stage thinker-only topology for the abort test. + "qwen2_5_omni_thinker_only": { + "async_chunk": False, + "pipeline": "qwen2_5_omni_thinker_only", + "stages": [ + { + "stage_id": 0, + "max_num_seqs": 1, + "gpu_memory_utilization": 0.9, + "enforce_eager": True, + "max_num_batched_tokens": 16384, + "max_model_len": 16384, + "skip_mm_profiling": True, + "mm_processor_cache_gb": 0, + "load_format": "dummy", + "devices": "0", + "default_sampling_params": { + "temperature": 0.0, + "top_p": 1.0, + "top_k": -1, + "max_tokens": 128, + "seed": 42, + "repetition_penalty": 1.1, + }, + }, + ], + }, +} + + +def _materialize_ci_overlay(model_type: str) -> Path: + import yaml + + if model_type not in _CI_OVERLAYS: + raise KeyError(f"No CI overlay registered for {model_type!r}. Available: {sorted(_CI_OVERLAYS)}") + + _CI_GENERATED_DIR.mkdir(parents=True, exist_ok=True) + out = _CI_GENERATED_DIR / f"{model_type}.yaml" + + overlay = {**_CI_OVERLAYS[model_type]} + base = overlay.get("base_config") + if base: + overlay["base_config"] = str(_DEPLOY_DIR / base) + + with open(out, "w", encoding="utf-8") as f: + yaml.safe_dump(overlay, f, sort_keys=False) + return out + + +def get_deploy_config_path(rel_path: str) -> str: + """Resolve a deploy yaml; ``ci/.yaml`` materializes from ``_CI_OVERLAYS``.""" + if rel_path.startswith("ci/") and rel_path.endswith(".yaml"): + model_type = rel_path[len("ci/") : -len(".yaml")] + if model_type in _CI_OVERLAYS: + return str(_materialize_ci_overlay(model_type)) + return str(_DEPLOY_DIR / rel_path) + + if current_platform.is_rocm(): from amdsmi import ( amdsmi_get_gpu_vram_usage, diff --git a/tests/worker_v2/test_init_model_state.py b/tests/worker_v2/test_init_model_state.py index f60871507b..5c3aabb1c4 100644 --- a/tests/worker_v2/test_init_model_state.py +++ 
b/tests/worker_v2/test_init_model_state.py @@ -61,6 +61,7 @@ def test_unknown_arch_delegates_to_upstream(mock_upstream): def test_omni_architectures_set_contains_expected(): expected = { "Qwen3OmniMoeForConditionalGeneration", + "Qwen2_5OmniForConditionalGeneration", "MammothModa2ForConditionalGeneration", "MiMoAudioForConditionalGeneration", "MammothModa2ARForConditionalGeneration", diff --git a/vllm_omni/config/__init__.py b/vllm_omni/config/__init__.py index 2aa236e69f..f02c075880 100644 --- a/vllm_omni/config/__init__.py +++ b/vllm_omni/config/__init__.py @@ -5,10 +5,18 @@ from vllm_omni.config.lora import LoRAConfig from vllm_omni.config.model import OmniModelConfig from vllm_omni.config.stage_config import ( + DeployConfig, ModelPipeline, + PipelineConfig, StageConfig, StageConfigFactory, + StageDeployConfig, + StageExecutionType, + StagePipelineConfig, StageType, + load_deploy_config, + merge_pipeline_deploy, + register_pipeline, ) from vllm_omni.config.yaml_util import ( create_config, @@ -24,6 +32,14 @@ "StageConfigFactory", "ModelPipeline", "StageType", + "StageExecutionType", + "StagePipelineConfig", + "PipelineConfig", + "StageDeployConfig", + "DeployConfig", + "load_deploy_config", + "merge_pipeline_deploy", + "register_pipeline", "create_config", "load_yaml_config", "merge_configs", diff --git a/vllm_omni/config/pipeline_registry.py b/vllm_omni/config/pipeline_registry.py new file mode 100644 index 0000000000..c07bc2610c --- /dev/null +++ b/vllm_omni/config/pipeline_registry.py @@ -0,0 +1,55 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +"""Central declarative registry of all vllm-omni pipelines. + +Mirrors the pattern in ``vllm/model_executor/models/registry.py``: each entry +is ``model_type -> (module_path, variable_name)``, and the module is imported +lazily on first lookup (see ``_LazyPipelineRegistry`` in +``vllm_omni/config/stage_config.py``). 
Keeping every pipeline declared in one +file makes it easy to spot a missing registration, which was the original +motivation in https://github.com/vllm-project/vllm-omni/issues/2887 (item 4). + +Per-model ``pipeline.py`` modules still define the ``PipelineConfig`` instance; +they just no longer need to self-register via ``register_pipeline(...)``. + +Adding a new pipeline: + 1. Define the ``PipelineConfig`` instance as a module-level variable in + ``vllm_omni/.../pipeline.py``. + 2. Add one line to ``_OMNI_PIPELINES`` or ``_DIFFUSION_PIPELINES`` below. + +``register_pipeline(config)`` in ``stage_config`` is still supported for +out-of-tree plugins and tests that create pipelines at runtime; those override +the entries declared here. +""" + +from __future__ import annotations + +# --- Multi-stage omni pipelines (LLM-centric; audio / video I/O) --- +_OMNI_PIPELINES: dict[str, tuple[str, str]] = { + # model_type -> (module_path, variable_name) + "qwen2_5_omni": ( + "vllm_omni.model_executor.models.qwen2_5_omni.pipeline", + "QWEN2_5_OMNI_PIPELINE", + ), + "qwen2_5_omni_thinker_only": ( + "vllm_omni.model_executor.models.qwen2_5_omni.pipeline", + "QWEN2_5_OMNI_THINKER_ONLY_PIPELINE", + ), + "qwen3_omni_moe": ( + "vllm_omni.model_executor.models.qwen3_omni.pipeline", + "QWEN3_OMNI_PIPELINE", + ), + "qwen3_tts": ( + "vllm_omni.model_executor.models.qwen3_tts.pipeline", + "QWEN3_TTS_PIPELINE", + ), +} + +# --- Single-stage diffusion pipelines (populated in PR 3/N) --- +_DIFFUSION_PIPELINES: dict[str, tuple[str, str]] = {} + +# Union view used by ``_LazyPipelineRegistry``; don't mutate at runtime. 
+_VLLM_OMNI_PIPELINES: dict[str, tuple[str, str]] = { + **_OMNI_PIPELINES, + **_DIFFUSION_PIPELINES, +} diff --git a/vllm_omni/config/stage_config.py b/vllm_omni/config/stage_config.py index a4e186c3bd..392a550be6 100644 --- a/vllm_omni/config/stage_config.py +++ b/vllm_omni/config/stage_config.py @@ -1,18 +1,13 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project -""" -Stage Configuration System for vLLM-Omni. - -Pipeline structure (stages, types, data-flow) is defined in per-model YAML -files and is set by model developers at integration time. -Runtime parameters (gpu_memory_utilization, tp_size, etc.) come from CLI. -""" +"""Stage configuration system for vLLM-Omni.""" from __future__ import annotations +import dataclasses import re import warnings -from dataclasses import asdict, dataclass, field +from dataclasses import asdict, dataclass, field, fields from enum import Enum from pathlib import Path from typing import Any @@ -20,76 +15,818 @@ from vllm.logger import init_logger from vllm_omni.config.yaml_util import create_config, load_yaml_config, to_dict +from vllm_omni.core.sched.omni_ar_scheduler import OmniARScheduler +from vllm_omni.core.sched.omni_generation_scheduler import OmniGenerationScheduler -# Pipeline YAMLs live alongside model code in model_executor/models// _MODELS_DIR = Path(__file__).resolve().parent.parent / "model_executor" / "models" def get_pipeline_path(model_dir: str, filename: str) -> Path: - """Return the full path to a pipeline YAML file. + return _MODELS_DIR / model_dir / filename + + +logger = init_logger(__name__) + + +_STAGE_OVERRIDE_PATTERN = re.compile(r"^stage_(\d+)_(.+)$") - Args: - model_dir: Model subdirectory name (e.g., "qwen3_omni"). - filename: Name of the YAML file (e.g., "pipeline.yaml"). - Returns: - Absolute path to the file. 
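Note on `_STAGE_OVERRIDE_PATTERN`: the regex captures a numeric stage id and the remaining parameter name, so a key only counts as per-stage when digits follow the `stage_` prefix. The sketch below illustrates that split in isolation. `split_stage_key` and the sample keys are hypothetical and are not part of this PR.

```python
from __future__ import annotations

import re

# Same pattern as the one added in the diff.
_STAGE_OVERRIDE_PATTERN = re.compile(r"^stage_(\d+)_(.+)$")


def split_stage_key(key: str) -> tuple[int, str] | None:
    """Hypothetical helper: return (stage_id, param_name), or None for a global key."""
    match = _STAGE_OVERRIDE_PATTERN.match(key)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(split_stage_key("stage_1_max_num_seqs"))  # (1, 'max_num_seqs')
print(split_stage_key("max_model_len"))         # None: no stage prefix
print(split_stage_key("stage_id"))              # None: no digits after "stage_"
```

A key such as `stage_id` does not match, since no digits follow the prefix, so the pattern alone treats it as a global key.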
+def build_stage_runtime_overrides( + stage_id: int, + cli_overrides: dict[str, Any], + *, + internal_keys: set[str] | frozenset[str] | None = None, +) -> dict[str, Any]: + """Build per-stage runtime overrides from global and ``stage__*`` kwargs. + + ``internal_keys`` defaults to the union of + ``arg_utils.internal_blacklist_keys()`` and ``arg_utils.SHARED_FIELDS`` + so that neither orchestrator-only fields nor shared-pipeline fields + (``model`` / ``stage_configs_path`` / ``log_stats`` / ``stage_id``) leak + into a stage's per-stage runtime overrides — the orchestrator sets those + uniformly for every stage, they are not per-stage knobs. Callers can + pass an explicit set for tests or specialized flows. """ - return _MODELS_DIR / model_dir / filename + if internal_keys is None: + from vllm_omni.engine.arg_utils import SHARED_FIELDS, internal_blacklist_keys + internal_keys = internal_blacklist_keys() | SHARED_FIELDS -logger = init_logger(__name__) + result: dict[str, Any] = {} + + for key, value in cli_overrides.items(): + if value is None or key in internal_keys: + continue + + match = _STAGE_OVERRIDE_PATTERN.match(key) + if match is not None: + override_stage_id = int(match.group(1)) + param_name = match.group(2) + if override_stage_id == stage_id and param_name not in internal_keys: + result[param_name] = value + continue + + result[key] = value + + return result + + +def strip_parent_engine_args( + kwargs: dict[str, Any], + *, + parent_fields: dict[str, dataclasses.Field], + keep_keys: set[str] | frozenset[str] = frozenset(), + strip_keys: set[str] | frozenset[str] = frozenset(), + no_warn_keys: set[str] | frozenset[str] = frozenset(), +) -> tuple[dict[str, Any], list[str]]: + """Strip parent ``EngineArgs`` fields before merging into stage YAML.""" + overridden: list[str] = [] + result: dict[str, Any] = {} + + for key, value in kwargs.items(): + if key in strip_keys: + continue + + if key not in parent_fields or key in keep_keys: + result[key] = value + 
continue + + field_def = parent_fields[key] + if field_def.default is not dataclasses.MISSING: + default = field_def.default + elif field_def.default_factory is not dataclasses.MISSING: + default = field_def.default_factory() + else: + default = dataclasses.MISSING + + if default is dataclasses.MISSING or value is None: + continue + + if dataclasses.is_dataclass(default) and not isinstance(default, type): + default = asdict(default) + + if value != default and key not in no_warn_keys: + overridden.append(key) + + return result, sorted(overridden) class StageType(str, Enum): """Type of processing stage in the Omni pipeline.""" + # TODO(@lishunyang12): remove once all models migrate to StageExecutionType LLM = "llm" DIFFUSION = "diffusion" +class StageExecutionType(str, Enum): + """Merged StageType + WorkerType — 3 combinations today.""" + + LLM_AR = "llm_ar" + LLM_GENERATION = "llm_generation" + DIFFUSION = "diffusion" + + +# Mapping class refs (not dotted-path strings) so module/class renames fail +# at import time instead of lazily at scheduler resolution. YAML overrides +# and downstream serialization still use the dotted-path string form; the +# conversion happens at the map lookup site via _scheduler_path(). +_EXECUTION_TYPE_TO_SCHEDULER: dict[StageExecutionType, type | None] = { + StageExecutionType.LLM_AR: OmniARScheduler, + StageExecutionType.LLM_GENERATION: OmniGenerationScheduler, + StageExecutionType.DIFFUSION: None, +} + + +def _scheduler_path(cls: type | None) -> str | None: + """Return the dotted import path for a scheduler class (``None`` passes through).""" + if cls is None: + return None + return f"{cls.__module__}.{cls.__qualname__}" + + +@dataclass(frozen=True) +class StagePipelineConfig: + """Fixed topology for one stage (frozen, not user-configurable).""" + + stage_id: int + model_stage: str + execution_type: StageExecutionType = StageExecutionType.LLM_AR + input_sources: tuple[int, ...] 
= () + final_output: bool = False + final_output_type: str | None = None + owns_tokenizer: bool = False + requires_multimodal_data: bool = False + hf_config_name: str | None = None + engine_output_type: str | None = None + model_arch: str | None = None + sampling_constraints: dict[str, Any] = field(default_factory=dict) + custom_process_input_func: str | None = None + custom_process_next_stage_input_func: str | None = None + # Alternates picked by ``merge_pipeline_deploy`` based on ``deploy.async_chunk``. + async_chunk_process_next_stage_input_func: str | None = None + sync_process_input_func: str | None = None + prompt_expand_func: str | None = None + cfg_kv_collect_func: str | None = None + omni_kv_config: dict[str, Any] | None = None + extras: dict[str, Any] = field(default_factory=dict) + + +@dataclass(frozen=True) +class PipelineConfig: + """Complete pipeline topology for a model (frozen).""" + + model_type: str + model_arch: str = "" + stages: tuple[StagePipelineConfig, ...] = () + # HF architecture aliases: used by StageConfigFactory when the model's + # HF config reports a generic model_type that collides with a different + # model (e.g. MiMo Audio reports model_type="qwen2"). The factory + # matches ``hf_config.architectures[*]`` against this tuple to route + # to the correct pipeline. Leave empty for models with unique model_type. + hf_architectures: tuple[str, ...] = () + + def get_stage(self, stage_id: int) -> StagePipelineConfig | None: + """Look up a stage by its ID.""" + for stage in self.stages: + if stage.stage_id == stage_id: + return stage + return None + + def get_scheduler_cls(self, stage_id: int) -> str | None: + """Return the inferred scheduler class path for a stage. + + Returns ``None`` for DIFFUSION stages (no vLLM scheduler). Raises + ``ValueError`` if ``stage_id`` doesn't exist in this pipeline, and + ``KeyError`` if ``execution_type`` isn't in the scheduler map. 
+ """ + stage = self.get_stage(stage_id) + if stage is None: + raise ValueError(f"Pipeline {self.model_type!r} has no stage with id {stage_id}") + return _scheduler_path(_EXECUTION_TYPE_TO_SCHEDULER[stage.execution_type]) + + def validate(self) -> list[str]: + """Return list of topology errors (empty if valid).""" + errors: list[str] = [] + if not self.stages: + errors.append("Pipeline has no stages defined") + return errors + stage_ids = [s.stage_id for s in self.stages] + if len(stage_ids) != len(set(stage_ids)): + errors.append("Duplicate stage IDs found") + stage_id_set = set(stage_ids) + for stage in self.stages: + for src in stage.input_sources: + if src not in stage_id_set: + errors.append(f"Stage {stage.stage_id} references non-existent input source {src}") + if src == stage.stage_id: + errors.append(f"Stage {stage.stage_id} references itself") + if not any(not s.input_sources for s in self.stages): + errors.append("No entry point (stage with empty input_sources)") + return errors + + +class _LazyPipelineRegistry: + """Dict-like registry that lazy-loads pipelines from the central declaration. + + In-tree pipelines are declared once in + ``vllm_omni/config/pipeline_registry.py`` as + ``model_type -> (module_path, variable_name)`` entries; the module is + imported only when the pipeline is first looked up. This mirrors the + pattern in ``vllm/model_executor/models/registry.py`` and addresses + https://github.com/vllm-project/vllm-omni/issues/2887 (item 4): having + every registration in one file makes a missing entry easy to spot. + + Out-of-tree / dynamic registrations via ``register_pipeline()`` are stored + directly in ``_loaded`` and take precedence over the lazy-map entry with + the same ``model_type``. + + The class exposes the subset of ``dict`` operations the rest of this + module relies on (``__contains__``, ``__getitem__``, ``__setitem__``, + ``get``, ``keys``, ``values``, ``items``, ``__iter__``), so existing call + sites don't need to change. 
+ """ + + def __init__(self) -> None: + self._loaded: dict[str, PipelineConfig] = {} + # Populated lazily to avoid a circular import at module init time. + self._lazy_map: dict[str, tuple[str, str]] | None = None + + def _get_lazy_map(self) -> dict[str, tuple[str, str]]: + if self._lazy_map is None: + from vllm_omni.config.pipeline_registry import _VLLM_OMNI_PIPELINES + + self._lazy_map = _VLLM_OMNI_PIPELINES + return self._lazy_map + + def _load_lazy(self, model_type: str) -> PipelineConfig | None: + entry = self._get_lazy_map().get(model_type) + if entry is None: + return None + module_path, var_name = entry + import importlib + + try: + module = importlib.import_module(module_path) + except ImportError as exc: + logger.error( + "Failed to import pipeline module %r for %r: %s", + module_path, + model_type, + exc, + ) + return None + pipeline = getattr(module, var_name, None) + if pipeline is None: + logger.error( + "Pipeline variable %r not found in module %r (registered for %r)", + var_name, + module_path, + model_type, + ) + return None + errors = pipeline.validate() + if errors: + logger.warning("Pipeline %s has issues: %s", pipeline.model_type, errors) + self._loaded[model_type] = pipeline + return pipeline + + def __contains__(self, model_type: str) -> bool: + if model_type in self._loaded: + return True + return model_type in self._get_lazy_map() + + def __getitem__(self, model_type: str) -> PipelineConfig: + if model_type in self._loaded: + return self._loaded[model_type] + pipeline = self._load_lazy(model_type) + if pipeline is None: + raise KeyError(model_type) + return pipeline + + def get(self, model_type: str, default: PipelineConfig | None = None) -> PipelineConfig | None: + if model_type in self._loaded: + return self._loaded[model_type] + pipeline = self._load_lazy(model_type) + return pipeline if pipeline is not None else default + + def __setitem__(self, model_type: str, pipeline: PipelineConfig) -> None: + self._loaded[model_type] = pipeline + + 
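The lookup path implemented by `_load_lazy`, `__contains__`, `__getitem__`, and `__setitem__` can be sketched in isolation as follows. This is a simplified illustration, not the PR's class: `LazyRegistry` is a hypothetical name, the entries point at standard-library modules purely so the example runs anywhere, and import failures propagate instead of being logged.

```python
from __future__ import annotations

import importlib
from typing import Any


class LazyRegistry:
    """Toy stand-in for a lazily loaded registry (illustrative only)."""

    def __init__(self, lazy_map: dict[str, tuple[str, str]]) -> None:
        self._lazy_map = lazy_map          # key -> (module_path, variable_name)
        self._loaded: dict[str, Any] = {}  # cache plus dynamic registrations

    def __contains__(self, key: str) -> bool:
        # Membership checks never trigger an import.
        return key in self._loaded or key in self._lazy_map

    def __getitem__(self, key: str) -> Any:
        if key in self._loaded:  # dynamic entries win over the lazy map
            return self._loaded[key]
        if key not in self._lazy_map:
            raise KeyError(key)
        module_path, var_name = self._lazy_map[key]
        value = getattr(importlib.import_module(module_path), var_name)
        self._loaded[key] = value  # cache so the import happens only once
        return value

    def __setitem__(self, key: str, value: Any) -> None:
        self._loaded[key] = value


registry = LazyRegistry({"pi": ("math", "pi")})
print("pi" in registry, registry._loaded)  # True {}
print(registry["pi"])                      # resolved via importlib on first access
registry["pi"] = 3                         # dynamic registration overrides the entry
print(registry["pi"])                      # 3
```

Keeping membership checks free of imports means a caller can test whether a `model_type` is registered without paying for the import until the value is needed.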
def __delitem__(self, model_type: str) -> None: + """Remove a dynamically-registered pipeline. + + Only the dynamic-cache side of the registry can be mutated; the + central declarative registry is immutable at runtime. Calling ``del`` + on a model_type that only exists in the central registry raises + ``KeyError``. + """ + if model_type in self._loaded: + del self._loaded[model_type] + return + if model_type in self._get_lazy_map(): + raise KeyError( + f"{model_type!r} is declared in the central pipeline_registry and " + "cannot be removed at runtime. Edit " + "vllm_omni/config/pipeline_registry.py to delete a built-in entry." + ) + raise KeyError(model_type) + + def keys(self) -> set[str]: + return set(self._get_lazy_map().keys()) | set(self._loaded.keys()) + + def values(self): + # Iterating values forces load of every lazy pipeline. + for key in self.keys(): + yield self[key] + + def items(self): + for key in self.keys(): + yield key, self[key] + + def __iter__(self): + return iter(self.keys()) + + +_PIPELINE_REGISTRY = _LazyPipelineRegistry() + + +def register_pipeline(pipeline: PipelineConfig) -> None: + """Register a pipeline config dynamically. + + In-tree pipelines are declared in ``pipeline_registry._VLLM_OMNI_PIPELINES`` + and loaded lazily; calling ``register_pipeline`` is only needed for + out-of-tree plugins or tests that build a ``PipelineConfig`` at runtime. + A dynamic registration overrides the central-registry entry with the same + ``model_type``. + """ + errors = pipeline.validate() + if errors: + logger.warning("Pipeline %s has issues: %s", pipeline.model_type, errors) + _PIPELINE_REGISTRY[pipeline.model_type] = pipeline + + +_DEPLOY_DIR = Path(__file__).resolve().parent.parent / "deploy" + + +@dataclass +class StageDeployConfig: + """Per-stage deployment knobs. + + Only fields whose value legitimately varies across stages of the same + pipeline live here (e.g. ``max_num_seqs`` on thinker vs talker, + ``devices`` for GPU placement). 
Pipeline-wide settings + (``trust_remote_code``, ``distributed_executor_backend``, ``dtype``, + ``quantization``, prefix/chunked prefill, DP/PP sizes) are declared at + the top level of ``DeployConfig`` and propagated to every stage. + """ + + stage_id: int + max_num_seqs: int = 64 + gpu_memory_utilization: float = 0.9 + tensor_parallel_size: int = 1 + enforce_eager: bool = False + max_num_batched_tokens: int = 32768 + max_model_len: int | None = None + async_scheduling: bool | None = None + devices: str = "0" + output_connectors: dict[str, str] | None = None + input_connectors: dict[str, str] | None = None + default_sampling_params: dict[str, Any] | None = None + engine_extras: dict[str, Any] = field(default_factory=dict) + + +@dataclass +class DeployConfig: + """Loaded from deploy/.yaml — the only config file users edit. + + Top-level fields (``trust_remote_code``, ``distributed_executor_backend``, + ``dtype``, ``quantization``, ``enable_prefix_caching``, + ``enable_chunked_prefill``, ``data_parallel_size``, + ``pipeline_parallel_size``) are pipeline-wide: they apply uniformly to + every stage. Fields that legitimately vary per stage live in the + individual ``StageDeployConfig`` entries under ``stages:``. + """ + + async_chunk: bool = True + connectors: dict[str, Any] | None = None + edges: list[dict[str, Any]] | None = None + stages: list[StageDeployConfig] = field(default_factory=list) + platforms: dict[str, Any] | None = None + # Overrides the auto-detected pipeline registry key for structural variants. 
+ pipeline: str | None = None + + # === Pipeline-wide engine settings (applied uniformly to every stage) === + trust_remote_code: bool = True + distributed_executor_backend: str = "mp" + dtype: str | None = None + quantization: str | None = None + enable_prefix_caching: bool = False + enable_chunked_prefill: bool | None = None + data_parallel_size: int = 1 + pipeline_parallel_size: int = 1 + + +_STAGE_NON_ENGINE_KEYS = frozenset( + { + "stage_id", + "devices", + "output_connectors", + "input_connectors", + "default_sampling_params", + "engine_extras", + } +) + +# Fields on StageDeployConfig that are populated from engine_args dict +_STAGE_DEPLOY_FIELDS = {f.name: f for f in fields(StageDeployConfig) if f.name not in _STAGE_NON_ENGINE_KEYS} + + +def _parse_stage_deploy(stage_data: dict[str, Any]) -> StageDeployConfig: + """Parse a single stage entry from deploy YAML into StageDeployConfig.""" + if "engine_args" in stage_data: + engine_args = dict(stage_data["engine_args"]) + devices = stage_data.get("runtime", {}).get("devices", stage_data.get("devices", "0")) + else: + engine_args = {k: v for k, v in stage_data.items() if k not in _STAGE_NON_ENGINE_KEYS and k != "stage_id"} + devices = stage_data.get("devices", "0") + + kwargs: dict[str, Any] = {"stage_id": stage_data["stage_id"], "devices": devices} + for name, f in _STAGE_DEPLOY_FIELDS.items(): + if name in engine_args: + kwargs[name] = engine_args.pop(name) + + kwargs["output_connectors"] = stage_data.get("output_connectors") + kwargs["input_connectors"] = stage_data.get("input_connectors") + kwargs["default_sampling_params"] = stage_data.get("default_sampling_params") + kwargs["engine_extras"] = engine_args + return StageDeployConfig(**kwargs) + + +_DEEP_MERGE_KEYS = frozenset({"default_sampling_params", "engine_extras", "engine_args"}) + + +def _deep_merge_stage(base: dict, overlay: dict) -> dict: + """Deep-merge ``_DEEP_MERGE_KEYS`` so thin overlays don't drop base keys.""" + merged = dict(base) + for k, v in 
overlay.items(): + if k in _DEEP_MERGE_KEYS: + base_val = merged.get(k) + if isinstance(v, dict) and isinstance(base_val, dict): + merged[k] = {**base_val, **v} + continue + # Deep-merge key but at least one side isn't a dict: surface the + # silent clobber so mismatched YAML types don't get past review. + if base_val is not None: + logger.warning( + "Deep-merge key %r has non-dict value (base=%s, overlay=%s); " + "overlay will fully replace base instead of merging.", + k, + type(base_val).__name__, + type(v).__name__, + ) + merged[k] = v + return merged + + +def _merge_stage_lists( + base_stages: list[dict[str, Any]] | None, + overlay_stages: list[dict[str, Any]] | None, +) -> list[dict[str, Any]]: + """Merge two ``stages:`` lists by ``stage_id`` (overlay wins per field).""" + by_id: dict[int, dict[str, Any]] = {s["stage_id"]: s for s in (base_stages or [])} + for overlay_stage in overlay_stages or []: + sid = overlay_stage["stage_id"] + if sid in by_id: + by_id[sid] = _deep_merge_stage(by_id[sid], overlay_stage) + else: + by_id[sid] = overlay_stage + return list(by_id.values()) + + +def _merge_platforms( + base: dict[str, Any] | None, + overlay: dict[str, Any] | None, +) -> dict[str, Any] | None: + """Deep-merge two ``platforms:`` blocks per-platform, per-stage_id.""" + if not base and not overlay: + return None + base = base or {} + overlay = overlay or {} + merged: dict[str, Any] = {} + for plat in set(base) | set(overlay): + bp = base.get(plat) or {} + op = overlay.get(plat) or {} + merged_plat = {**bp, **{k: v for k, v in op.items() if k != "stages"}} + merged_plat["stages"] = _merge_stage_lists(bp.get("stages"), op.get("stages")) + merged[plat] = merged_plat + return merged + + +def resolve_deploy_yaml(path: str | Path) -> dict[str, Any]: + """Load a deploy YAML with optional ``base_config`` inheritance.""" + raw_dict = to_dict(load_yaml_config(path)) + + base_path = raw_dict.pop("base_config", None) + if base_path is None: + return raw_dict + + # Resolve 
relative to the overlay file's directory + base_path = Path(path).parent / base_path + base_dict = resolve_deploy_yaml(base_path) + + # Merge top-level scalars: overlay wins. ``stages:`` and ``platforms:`` + # are deep-merged below so an overlay can layer on top of the base. + merged = { + **base_dict, + **{k: v for k, v in raw_dict.items() if k not in ("stages", "platforms")}, + } + merged["stages"] = _merge_stage_lists(base_dict.get("stages"), raw_dict.get("stages")) + merged_platforms = _merge_platforms(base_dict.get("platforms"), raw_dict.get("platforms")) + if merged_platforms is not None: + merged["platforms"] = merged_platforms + + return merged + + +def load_deploy_config(path: str | Path) -> DeployConfig: + """Load a deploy YAML (with optional base_config inheritance).""" + raw_dict = resolve_deploy_yaml(path) + + stages = [_parse_stage_deploy(s) for s in raw_dict.get("stages", [])] + + kwargs: dict[str, Any] = { + "async_chunk": raw_dict.get("async_chunk", True), + "connectors": raw_dict.get("connectors", None), + "edges": raw_dict.get("edges", None), + "stages": stages, + "platforms": raw_dict.get("platforms", None), + "pipeline": raw_dict.get("pipeline", None), + } + # Pipeline-wide engine settings: only set if explicitly present in YAML + # so the DeployConfig dataclass defaults take effect otherwise. 
+ for name in ( + "trust_remote_code", + "distributed_executor_backend", + "dtype", + "quantization", + "enable_prefix_caching", + "enable_chunked_prefill", + "data_parallel_size", + "pipeline_parallel_size", + ): + if name in raw_dict: + kwargs[name] = raw_dict[name] + return DeployConfig(**kwargs) + + +def _detect_platform() -> str | None: + """Return "npu", "rocm", "xpu", or None (CUDA default).""" + try: + from vllm.platforms import current_platform + + name = current_platform.device_name.lower() + if "npu" in name: + return "npu" + if "rocm" in name or "amd" in name: + return "rocm" + if "xpu" in name: + return "xpu" + except Exception as e: + logger.debug("Platform auto-detect failed, falling back to CUDA: %s", e) + return None + + +def _extract_platform_overrides(ps: dict[str, Any]) -> tuple[dict[str, Any], str | None]: + """Return ``(overrides, devices)`` from a platform stage entry. + + Handles both the nested layout (``engine_args:`` / ``runtime.devices``) and + the flat layout. ``devices`` is ``None`` when no override is set. 
+ """ + if "engine_args" in ps: + return dict(ps["engine_args"]), ps.get("runtime", {}).get("devices") + overrides = {k: v for k, v in ps.items() if k not in ("stage_id", "devices")} + return overrides, ps.get("devices") + + +def _apply_platform_overrides( + deploy: DeployConfig, + platform: str | None = None, +) -> DeployConfig: + """Merge platform-specific stage overrides into deploy config.""" + if platform is None: + platform = _detect_platform() + if platform is None or deploy.platforms is None: + return deploy + platform_section = deploy.platforms.get(platform) + if platform_section is None: + return deploy + + platform_stages = platform_section.get("stages", []) + base_by_id = {s.stage_id: s for s in deploy.stages} + + for ps in platform_stages: + base = base_by_id.get(ps["stage_id"]) + if base is None: + continue + overrides, devices = _extract_platform_overrides(ps) + if devices is not None: + base.devices = devices + for key, val in overrides.items(): + if hasattr(base, key): + setattr(base, key, val) + else: + base.engine_extras[key] = val + + return deploy + + +_EXECUTION_TYPE_TO_STAGE_WORKER: dict[StageExecutionType, tuple[StageType, str | None]] = { + StageExecutionType.LLM_AR: (StageType.LLM, "ar"), + StageExecutionType.LLM_GENERATION: (StageType.LLM, "generation"), + StageExecutionType.DIFFUSION: (StageType.DIFFUSION, None), +} + + +def _resolve_execution_mode( + execution_type: StageExecutionType, +) -> tuple[StageType, str | None]: + """Map ``execution_type`` → ``(stage_type, worker_type)`` legacy tuple.""" + return _EXECUTION_TYPE_TO_STAGE_WORKER.get(execution_type, (StageType.LLM, None)) + + +def _select_processor_funcs( + ps: StagePipelineConfig, + async_chunk: bool, +) -> tuple[str | None, str | None]: + """Pick ``(input_proc, next_stage_proc)`` based on the async_chunk mode.""" + next_stage_proc = ps.custom_process_next_stage_input_func + input_proc = ps.custom_process_input_func + if async_chunk and 
ps.async_chunk_process_next_stage_input_func: + next_stage_proc = ps.async_chunk_process_next_stage_input_func + elif not async_chunk and ps.sync_process_input_func: + input_proc = ps.sync_process_input_func + return input_proc, next_stage_proc + + +# Pipeline-wide DeployConfig fields that are propagated to every stage's +# engine args during merge. These live at top level of the deploy YAML. +_PIPELINE_WIDE_ENGINE_FIELDS: tuple[str, ...] = ( + "trust_remote_code", + "distributed_executor_backend", + "dtype", + "quantization", + "enable_prefix_caching", + "enable_chunked_prefill", + "data_parallel_size", + "pipeline_parallel_size", +) + + +def _build_engine_args( + ps: StagePipelineConfig, + ds: StageDeployConfig | None, + pipeline: PipelineConfig, + deploy: DeployConfig, + next_stage_proc: str | None, +) -> dict[str, Any]: + """Assemble the flat ``yaml_engine_args`` dict for one stage. + + Pipeline-wide DeployConfig fields are applied uniformly to every stage; + per-stage StageDeployConfig overrides take precedence when present (e.g. + ``engine_extras`` can still carry a stage-specific ``dtype``). + """ + engine_args: dict[str, Any] = {"model_arch": ps.model_arch or pipeline.model_arch} + if ps.engine_output_type: + engine_args["engine_output_type"] = ps.engine_output_type + if next_stage_proc: + engine_args["custom_process_next_stage_input_func"] = next_stage_proc + + # Pipeline-wide top-level DeployConfig settings, applied to every stage. + for name in _PIPELINE_WIDE_ENGINE_FIELDS: + value = getattr(deploy, name) + if value is not None: + engine_args[name] = value + + # Per-stage StageDeployConfig values override pipeline-wide settings. 
+ if ds is not None: + for k, v in asdict(ds).items(): + if k in _STAGE_NON_ENGINE_KEYS or v is None: + continue + engine_args[k] = v + engine_args.update(ds.engine_extras) + if deploy.async_chunk: + engine_args["async_chunk"] = True + return engine_args + + +def _build_extras( + ps: StagePipelineConfig, + ds: StageDeployConfig | None, +) -> dict[str, Any]: + """Assemble ``yaml_extras`` (sampling + connectors + pipeline extras).""" + extras: dict[str, Any] = {} + sampling: dict[str, Any] = {} + if ds is not None and ds.default_sampling_params: + sampling.update(ds.default_sampling_params) + sampling.update(ps.sampling_constraints) + if sampling: + extras["default_sampling_params"] = sampling + if ds is not None and ds.output_connectors: + extras["output_connectors"] = dict(ds.output_connectors) + if ds is not None and ds.input_connectors: + extras["input_connectors"] = dict(ds.input_connectors) + if ps.extras: + extras.update(ps.extras) + return extras + + +def merge_pipeline_deploy( + pipeline: PipelineConfig, + deploy: DeployConfig, + cli_overrides: dict[str, Any] | None = None, +) -> list[StageConfig]: + """Merge pipeline + deploy + platform overrides → list[StageConfig].""" + if cli_overrides is None: + cli_overrides = {} + + deploy = _apply_platform_overrides(deploy) + deploy_by_id = {s.stage_id: s for s in deploy.stages} + + # A pipeline supports async_chunk if any stage has either an explicit + # async-chunk-only processor slot OR a custom next-stage processor (some + # pipelines like qwen3_omni wire async-chunk processing directly through + # ``custom_process_next_stage_input_func``). Only raise when neither is + # present — that's the "user enabled async_chunk but pipeline has no + # inter-stage processing at all" case. 
+ if deploy.async_chunk and not any( + ps.async_chunk_process_next_stage_input_func or ps.custom_process_next_stage_input_func + for ps in pipeline.stages + ): + raise ValueError( + f"Pipeline {pipeline.model_type!r} has async_chunk=True in deploy but no stage " + "declares a next-stage input processor " + "(``async_chunk_process_next_stage_input_func`` or ``custom_process_next_stage_input_func``). " + "Either set async_chunk=False or implement an async-chunk processor on the pipeline." + ) + + result: list[StageConfig] = [] + for ps in pipeline.stages: + ds = deploy_by_id.get(ps.stage_id) + stage_type, worker_type = _resolve_execution_mode(ps.execution_type) + input_proc, next_stage_proc = _select_processor_funcs(ps, deploy.async_chunk) + engine_args = _build_engine_args(ps, ds, pipeline, deploy, next_stage_proc) + extras = _build_extras(ps, ds) + runtime: dict[str, Any] = {"process": True} + if ds is not None: + runtime["devices"] = ds.devices + + result.append( + StageConfig( + stage_id=ps.stage_id, + model_stage=ps.model_stage, + stage_type=stage_type, + input_sources=list(ps.input_sources), + custom_process_input_func=input_proc, + final_output=ps.final_output, + final_output_type=ps.final_output_type, + worker_type=worker_type, + scheduler_cls=_scheduler_path(_EXECUTION_TYPE_TO_SCHEDULER.get(ps.execution_type)), + hf_config_name=ps.hf_config_name, + is_comprehension=ps.owns_tokenizer, + yaml_engine_args=engine_args, + yaml_runtime=runtime, + yaml_extras=extras, + ) + ) + return result + + @dataclass class StageConfig: - """Per-stage configuration from pipeline YAML. + """Per-stage config (legacy path). Used by both new and legacy loaders. - Topology fields (stage_id, input_sources, etc.) define the DAG. - Engine and runtime defaults come from the YAML; CLI overrides take - precedence via ``runtime_overrides``. + TODO(@lishunyang12): replace with ResolvedStageConfig once all models are migrated. 
""" - # Identity stage_id: int model_stage: str - - # Stage type stage_type: StageType = StageType.LLM - input_sources: list[int] = field(default_factory=list) custom_process_input_func: str | None = None final_output: bool = False - final_output_type: str | None = None # "text", "audio", "image" - worker_type: str | None = None # "ar" or "generation" + final_output_type: str | None = None + worker_type: str | None = None scheduler_cls: str | None = None hf_config_name: str | None = None is_comprehension: bool = False - - # Per-stage engine args from pipeline YAML (defaults) yaml_engine_args: dict[str, Any] = field(default_factory=dict) - # Per-stage runtime config from pipeline YAML (devices, etc.) yaml_runtime: dict[str, Any] = field(default_factory=dict) - # Pass-through fields from pipeline YAML (default_sampling_params, - # output_connectors, input_connectors, tts_args, etc.) yaml_extras: dict[str, Any] = field(default_factory=dict) - - # Runtime overrides (populated from CLI, not from pipeline YAML) runtime_overrides: dict[str, Any] = field(default_factory=dict) def to_omegaconf(self) -> Any: - """Convert to OmegaConf for backward compatibility with OmniStage. - - Returns: - OmegaConf DictConfig with stage configuration in legacy format. - """ + """TODO(@lishunyang12): remove once engine consumes ResolvedStageConfig directly.""" # Start with YAML engine_args defaults engine_args: dict[str, Any] = dict(self.yaml_engine_args) @@ -152,9 +889,9 @@ def to_omegaconf(self) -> Any: @dataclass class ModelPipeline: - """Complete pipeline definition for a multi-stage model. + """Complete pipeline definition for a multi-stage model (legacy). - Defined by model developers, bundled with the model, not user-editable. + TODO(@lishunyang12): remove once all models migrate to PipelineConfig. """ model_type: str @@ -225,49 +962,55 @@ class StageConfigFactory: """Factory that loads pipeline YAML and merges CLI overrides. Handles both single-stage and multi-stage models. 
- """ - # Mapping of model types to directories under model_executor/models/. - PIPELINE_MODELS: dict[str, str] = { - "qwen3_omni_moe": "qwen3_omni", - "qwen2_5_omni": "qwen2_5_omni", - "bagel": "bagel", - "qwen3_tts": "qwen3_tts", - "voxtral_tts": "voxtral_tts", - "mimo_audio": "mimo_audio", - "glm-image": "glm_image", - "cosyvoice3": "cosyvoice3", - "mammothmoda2": "mammoth_moda2", - } - - # Fallback: map HF architecture class names to pipeline dirs. - # Used when model_type collides with another model (e.g. MiMo Audio - # reports model_type="qwen2" which matches plain Qwen2, not our pipeline). - _ARCHITECTURE_MODELS: dict[str, str] = { - "MiMoAudioForConditionalGeneration": "mimo_audio", - "HunyuanImage3ForCausalMM": "hunyuan_image3", - } + Pipelines are declared in ``vllm_omni/config/pipeline_registry.py`` and + loaded lazily via ``_PIPELINE_REGISTRY``; no hardcoded model-type → + directory mapping is maintained here. Models with generic HF + ``model_type`` collisions (e.g. MiMo Audio reports ``qwen2``) should + declare ``hf_architectures=(...)`` on their ``PipelineConfig`` so the + factory can disambiguate via ``hf_config.architectures``. + """ @classmethod def create_from_model( cls, model: str, cli_overrides: dict[str, Any] | None = None, + deploy_config_path: str | None = None, + cli_explicit_keys: set[str] | None = None, ) -> list[StageConfig] | None: - """Load pipeline YAML, merge with CLI overrides. + """Load pipeline + deploy config, merge with CLI overrides. - Args: - model: Model name or path. - cli_overrides: CLI overrides from VllmConfig/OmniDiffusionConfig. + Checks _PIPELINE_REGISTRY first (new path), falls back to legacy YAML. - Returns: - List of StageConfig objects with CLI overrides applied, - or None if no pipeline definition was found for this model. + ``cli_explicit_keys`` is the set of CLI keys the user actually typed + (captured at the parser layer in ``vllm serve``). 
When ``None`` — + which is the case for programmatic ``Omni()`` callers — every kwarg + in ``cli_overrides`` is treated as explicit. """ if cli_overrides is None: cli_overrides = {} trust_remote_code = cli_overrides.get("trust_remote_code", True) + + # --- New path: check pipeline registry by model_type first --- + model_type, hf_config = cls._auto_detect_model_type(model, trust_remote_code=trust_remote_code) + if model_type and model_type in _PIPELINE_REGISTRY: + return cls._create_from_registry(model_type, cli_overrides, deploy_config_path, cli_explicit_keys) + + # --- HF architecture fallback: some models report a generic + # model_type that collides with another model. Match by the + # hf_architectures declared on each registered PipelineConfig. + if hf_config is not None: + hf_archs = set(getattr(hf_config, "architectures", []) or []) + if hf_archs: + for registered in _PIPELINE_REGISTRY.values(): + if hf_archs.intersection(registered.hf_architectures): + return cls._create_from_registry( + registered.model_type, cli_overrides, deploy_config_path, cli_explicit_keys + ) + + # --- Legacy path: load from pipeline YAML --- pipeline = cls._load_pipeline(model, trust_remote_code=trust_remote_code) if pipeline is None: @@ -295,6 +1038,78 @@ def create_from_model( return result + @classmethod + def _create_from_registry( + cls, + model_type: str, + cli_overrides: dict[str, Any], + deploy_config_path: str | None = None, + cli_explicit_keys: set[str] | None = None, + ) -> list[StageConfig]: + """Create StageConfigs from pipeline registry + deploy YAML. + + Precedence (high → low): + explicit CLI args > deploy YAML > parser default CLI values + + ``cli_explicit_keys`` carries the set of long-option attribute names + the user actually typed (captured in ``OmniServeCommand.cmd``). Any + kwarg whose key is not in that set is treated as a parser default + and is only used to fill fields YAML doesn't already cover. 
When the + set is ``None`` (programmatic ``Omni()`` callers, which have no + argparse layer), every kwarg is treated as explicit. + """ + # Resolve deploy config path + if deploy_config_path is None: + deploy_path = _DEPLOY_DIR / f"{model_type}.yaml" + else: + deploy_path = Path(deploy_config_path) + + if not deploy_path.exists(): + logger.warning( + "Deploy config not found: %s — using pipeline defaults only", + deploy_path, + ) + deploy_cfg = DeployConfig() + else: + deploy_cfg = load_deploy_config(deploy_path) + + cli_async_chunk = cli_overrides.get("async_chunk") + if cli_async_chunk is not None and (cli_explicit_keys is None or "async_chunk" in cli_explicit_keys): + deploy_cfg.async_chunk = bool(cli_async_chunk) + + pipeline_key = deploy_cfg.pipeline or model_type + if pipeline_key not in _PIPELINE_REGISTRY: + raise KeyError( + f"Pipeline {pipeline_key!r} not in registry " + f"(resolved from {deploy_path.name!r}). Available: " + f"{sorted(_PIPELINE_REGISTRY.keys())}" + ) + pipeline_cfg = _PIPELINE_REGISTRY[pipeline_key] + + stages = merge_pipeline_deploy(pipeline_cfg, deploy_cfg, cli_overrides) + + # Precedence: explicit CLI > yaml > parser-default CLI. + # Per-stage (``stage_N_*``) keys are always treated as explicit. 
+ explicit_overrides: dict[str, Any] = {} + default_overrides: dict[str, Any] = {} + for key, value in cli_overrides.items(): + if value is None: + continue + is_per_stage = bool(re.match(r"stage_\d+_", key)) + is_explicit = cli_explicit_keys is None or key in cli_explicit_keys or is_per_stage + if is_explicit: + explicit_overrides[key] = value + else: + default_overrides[key] = value + + for stage in stages: + yaml_keys = set(stage.yaml_engine_args) + fallback = {k: v for k, v in default_overrides.items() if k not in yaml_keys} + merged = {**fallback, **explicit_overrides} + stage.runtime_overrides = cls._merge_cli_overrides(stage, merged) + + return stages + @classmethod def create_default_diffusion(cls, kwargs: dict[str, Any]) -> list[dict[str, Any]]: """Single-stage diffusion - no YAML needed. @@ -322,9 +1137,16 @@ def create_default_diffusion(cls, kwargs: dict[str, Any]) -> list[dict[str, Any] continue engine_args[key] = value - # Serialize parallel_config as dict for OmegaConf compatibility + # Serialize parallel_config as dict for OmegaConf. Test helpers + # sometimes pass SimpleNamespace rather than a dataclass instance. if "parallel_config" in kwargs: - engine_args["parallel_config"] = asdict(kwargs["parallel_config"]) + parallel_config = kwargs["parallel_config"] + if dataclasses.is_dataclass(parallel_config) and not isinstance(parallel_config, type): + engine_args["parallel_config"] = asdict(parallel_config) + elif hasattr(parallel_config, "__dict__"): + engine_args["parallel_config"] = dict(vars(parallel_config)) + else: + engine_args["parallel_config"] = parallel_config engine_args.setdefault("cache_backend", "none") engine_args["model_stage"] = "diffusion" @@ -351,40 +1173,49 @@ def create_default_diffusion(cls, kwargs: dict[str, Any]) -> list[dict[str, Any] @classmethod def _load_pipeline(cls, model: str, trust_remote_code: bool = True) -> ModelPipeline | None: - """Load pipeline YAML for the model. + """Load a legacy ``pipeline.yaml`` for the model. 
- Args: - model: Model name or path. - trust_remote_code: Whether to trust remote code for HF config loading. + Searches ``model_executor/models//pipeline.yaml`` by trying + (a) the raw ``model_type`` as the directory name, then + (b) ``model_type`` with hyphens replaced by underscores, + and finally (c) scanning every ``pipeline.yaml`` for one that + declares a matching ``model_type`` or ``hf_architectures``. - Returns: - ModelPipeline if found, None otherwise. + Returns None if no pipeline.yaml is found — caller handles the + ``resolve_model_config_path`` fallback via stage_configs/ YAMLs. """ model_type, hf_config = cls._auto_detect_model_type(model, trust_remote_code=trust_remote_code) if model_type is None: return None - pipeline_dir = cls.PIPELINE_MODELS.get(model_type) - - # Fallback: check HF architectures when model_type doesn't match - if pipeline_dir is None and hf_config is not None: - for arch in getattr(hf_config, "architectures", []) or []: - pipeline_dir = cls._ARCHITECTURE_MODELS.get(arch) - if pipeline_dir is not None: - model_type = pipeline_dir - break - - if pipeline_dir is None: - logger.debug(f"No pipeline mapping for model_type: {model_type}") - return None - - pipeline_path = get_pipeline_path(pipeline_dir, "pipeline.yaml") - - if not pipeline_path.exists(): - logger.debug(f"Pipeline file not found: {pipeline_path}") - return None + # Direct lookups by convention + candidates = [model_type, model_type.replace("-", "_")] + for dir_name in candidates: + pipeline_path = get_pipeline_path(dir_name, "pipeline.yaml") + if pipeline_path.exists(): + return cls._parse_pipeline_yaml(pipeline_path, model_type) + + # Scan fallback: read every pipeline.yaml and match on declared fields + hf_archs = set(getattr(hf_config, "architectures", []) or []) if hf_config else set() + if _MODELS_DIR.exists(): + for subdir in sorted(_MODELS_DIR.iterdir()): + if not subdir.is_dir(): + continue + pipeline_path = subdir / "pipeline.yaml" + if not 
pipeline_path.exists(): + continue + try: + cfg = load_yaml_config(pipeline_path) + except Exception as exc: + logger.debug("Skip %s: %s", pipeline_path, exc) + continue + declared_type = getattr(cfg, "model_type", None) + declared_archs = set(getattr(cfg, "hf_architectures", None) or []) + if declared_type == model_type or (hf_archs and hf_archs.intersection(declared_archs)): + return cls._parse_pipeline_yaml(pipeline_path, declared_type or model_type) - return cls._parse_pipeline_yaml(pipeline_path, model_type) + logger.debug("No pipeline.yaml found for model_type %s (archs=%s)", model_type, sorted(hf_archs)) + return None # Keys consumed as explicit StageConfig fields — everything else is # passed through via yaml_extras. @@ -542,66 +1373,17 @@ def _auto_detect_model_type(cls, model: str, trust_remote_code: bool = True) -> return None, None - # Keys that should never be forwarded as engine overrides (internal / - # orchestrator-only knobs, complex objects, etc.). - _INTERNAL_KEYS: set[str] = { - "model", - "stage_configs_path", - "stage_id", - "stage_init_timeout", - "init_timeout", - "shm_threshold_bytes", - "worker_backend", - "ray_address", - "batch_timeout", - "log_stats", - "tokenizer", - "parallel_config", - } - @classmethod def _merge_cli_overrides( cls, stage: StageConfig, cli_overrides: dict[str, Any], ) -> dict[str, Any]: - """Merge CLI overrides into stage runtime config. + """Merge global and per-stage (``stage_N_*``) CLI overrides. - All CLI arguments registered by engine config classes (e.g. - EngineArgs / OmniDiffusionConfig) are accepted as overrides - unless they appear in ``_INTERNAL_KEYS``. - - Handles: - - Global overrides (apply to all stages) - - Per-stage overrides (--stage-N-* format, take precedence) - - Args: - stage: The stage to merge overrides into. - cli_overrides: CLI arguments from VllmConfig/OmniDiffusionConfig. - - Returns: - Dict of runtime overrides for this stage. 
+ Orchestrator-owned keys are filtered by ``build_stage_runtime_overrides`` + using ``OrchestratorArgs`` as the single source of truth; unknown + server/uvicorn keys are dropped downstream by + ``filter_dataclass_kwargs(OmniEngineArgs, ...)``. """ - result: dict[str, Any] = {} - - # Apply global overrides – any key not in the internal blocklist - # is forwarded so that engine-registered params work out of the box. - for key, value in cli_overrides.items(): - if key in cls._INTERNAL_KEYS: - continue - if re.match(r"stage_\d+_", key): - # Per-stage keys handled below - continue - if value is not None: - result[key] = value - - # Apply per-stage overrides (--stage-N-* format, take precedence) - stage_prefix = f"stage_{stage.stage_id}_" - for key, value in cli_overrides.items(): - if key.startswith(stage_prefix) and value is not None: - param_name = key[len(stage_prefix) :] - if param_name in cls._INTERNAL_KEYS: - continue - result[param_name] = value - - return result + return build_stage_runtime_overrides(stage.stage_id, cli_overrides) diff --git a/vllm_omni/core/sched/omni_ar_scheduler.py b/vllm_omni/core/sched/omni_ar_scheduler.py index 2a0c92e9d0..9285df7b7d 100644 --- a/vllm_omni/core/sched/omni_ar_scheduler.py +++ b/vllm_omni/core/sched/omni_ar_scheduler.py @@ -16,9 +16,10 @@ from vllm.v1.engine import EngineCoreOutput, EngineCoreOutputs from vllm.v1.metrics.perf import PerfStats from vllm.v1.outputs import ModelRunnerOutput -from vllm.v1.request import Request, RequestStatus +from vllm.v1.request import Request, RequestStatus, StreamingUpdate from vllm.v1.spec_decode.metrics import SpecDecodingStats +from vllm_omni.core.sched.omni_scheduler_mixin import OmniSchedulerMixin from vllm_omni.core.sched.output import OmniSchedulerOutput from vllm_omni.distributed.omni_connectors.transfer_adapter.chunk_transfer_adapter import ( OmniChunkTransferAdapter, @@ -43,7 +44,7 @@ def to_dict(self) -> dict[str, Any]: return asdict(self) -class OmniARScheduler(VLLMScheduler): 
+class OmniARScheduler(OmniSchedulerMixin, VLLMScheduler): """ OmniARScheduler: Scheduler for vLLM-Omni multimodal processing. @@ -87,6 +88,8 @@ def __init__(self, *args, **kwargs): self.chunk_transfer_adapter = None if getattr(model_config, "async_chunk", False): self.chunk_transfer_adapter = OmniChunkTransferAdapter(self.vllm_config) + # Snapshot prompt length for each streaming input update + self._new_prompt_len_snapshot: dict[str, int] = {} def _get_kv_transfer_criteria(self) -> dict | None: # Note: vllm_config is available in Scheduler after super().__init__ @@ -351,6 +354,7 @@ def update_from_output( ) stopped = False + is_segment_finished = False new_logprobs = None new_token_ids = generated_token_ids pooler_output = pooler_outputs[req_index] if pooler_outputs else None @@ -379,6 +383,7 @@ def update_from_output( # Capture finish_reason BEFORE _handle_stopped_request, which may # reset the status to WAITING for streaming requests that continue. finish_reason = request.get_finished_reason() + is_segment_finished = request.is_finished() and request.resumable finished = self._handle_stopped_request(request) if finished: kv_transfer_params = self._free_request(request) @@ -431,6 +436,8 @@ def update_from_output( num_external_computed_tokens=request.num_external_computed_tokens, routed_experts=routed_experts, num_nans_in_logits=request.num_nans_in_logits, + is_segment_finished=is_segment_finished, + new_prompt_len_snapshot=self._new_prompt_len_snapshot.get(req_id, None), ) ) if self.chunk_transfer_adapter is not None: @@ -573,6 +580,21 @@ def finish_requests(self, request_ids: Any, finished_status: RequestStatus) -> l return super().finish_requests(request_ids, finished_status) + def _update_request_as_session(self, session: Request, update: StreamingUpdate) -> None: + """ + Override: Only extend prompt at stage 0, and replace + the existing session with the next streaming update at other stages. 
+ + Discards the last sampled output token from the prior input chunk at stage 0. + """ + req_id = session.request_id + self._new_prompt_len_snapshot[req_id] = len(update.prompt_token_ids) + if self.vllm_config.model_config.stage_id != 0: + self._replace_session_with_streaming_update(session, update) + + else: + super()._update_request_as_session(session, update) + def _free_request(self, request: Request, delay_free_blocks: bool = False) -> dict[str, Any] | None: # TODO(wzliu)! for offline mode, we should not end process until all data is transferred """Mark a request as finished and free its resources.""" @@ -586,6 +608,7 @@ def _free_request(self, request: Request, delay_free_blocks: bool = False) -> di self.encoder_cache_manager.free(request) request_id = request.request_id self.finished_req_ids.add(request_id) + self._new_prompt_len_snapshot.pop(request_id, None) if self.finished_req_ids_dict is not None: self.finished_req_ids_dict[request.client_index].add(request_id) diff --git a/vllm_omni/core/sched/omni_generation_scheduler.py b/vllm_omni/core/sched/omni_generation_scheduler.py index 9e1c3d5d4f..38750671a8 100644 --- a/vllm_omni/core/sched/omni_generation_scheduler.py +++ b/vllm_omni/core/sched/omni_generation_scheduler.py @@ -1,3 +1,5 @@ +from __future__ import annotations + import os import time from collections import defaultdict @@ -12,11 +14,16 @@ from vllm.v1.core.sched.request_queue import create_request_queue from vllm.v1.core.sched.scheduler import Scheduler as VLLMScheduler from vllm.v1.core.sched.utils import remove_all -from vllm.v1.engine import EngineCoreEventType, EngineCoreOutput, EngineCoreOutputs +from vllm.v1.engine import ( + EngineCoreEventType, + EngineCoreOutput, + EngineCoreOutputs, +) from vllm.v1.metrics.perf import PerfStats -from vllm.v1.request import Request, RequestStatus +from vllm.v1.request import Request, RequestStatus, StreamingUpdate from vllm.v1.spec_decode.metrics import SpecDecodingStats +from 
vllm_omni.core.sched.omni_scheduler_mixin import OmniSchedulerMixin from vllm_omni.core.sched.output import OmniCachedRequestData, OmniNewRequestData from vllm_omni.distributed.omni_connectors.transfer_adapter.chunk_transfer_adapter import ( OmniChunkTransferAdapter, @@ -28,7 +35,7 @@ VLLM_OMNI_USE_V2_RUNNER = bool(int(os.environ.get("VLLM_OMNI_USE_V2_RUNNER", "0"))) -class OmniGenerationScheduler(VLLMScheduler): +class OmniGenerationScheduler(OmniSchedulerMixin, VLLMScheduler): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) if VLLM_OMNI_USE_V2_RUNNER and not self.use_v2_model_runner: @@ -125,9 +132,16 @@ def schedule(self) -> SchedulerOutput: scheduled_running_reqs.append(request) req_index += 1 - # OMNI: Remove already finished requests from running queue + # Remove from running and propagate to finished_req_ids so the worker releases req_state slots. if already_finished_reqs: self.running = remove_all(self.running, already_finished_reqs) + for req in already_finished_reqs: + req_id = req.request_id + if req_id in self.finished_req_ids: + continue + self.finished_req_ids.add(req_id) + if self.finished_req_ids_dict is not None: + self.finished_req_ids_dict[req.client_index].add(req_id) # Fast path selection and scheduling (treat all as diffusion requests, # independent of pooling_params) @@ -606,3 +620,11 @@ def update_from_output( eco.scheduler_stats = stats return engine_core_outputs + + def _update_request_as_session(self, session: Request, update: StreamingUpdate) -> None: + """ + Override: Just replace the existing session with the next streaming update. + + Do not extend prompt id using update.
+ """ + self._replace_session_with_streaming_update(session, update) diff --git a/vllm_omni/core/sched/omni_scheduler_mixin.py b/vllm_omni/core/sched/omni_scheduler_mixin.py new file mode 100644 index 0000000000..36080e63ac --- /dev/null +++ b/vllm_omni/core/sched/omni_scheduler_mixin.py @@ -0,0 +1,33 @@ +from __future__ import annotations + +from vllm.v1.engine import EngineCoreEventType +from vllm.v1.request import Request, RequestStatus, StreamingUpdate + + +class OmniSchedulerMixin: + """Shared scheduler helpers for omni-specific request handling.""" + + def _replace_session_with_streaming_update( + self, + session: Request, + update: StreamingUpdate, + ) -> None: + """For streaming input: Replace an existing streaming session payload with the latest update.""" + session._output_token_ids.clear() + session._all_token_ids.clear() + new_prompt = update.prompt_token_ids or () + session._all_token_ids.extend(new_prompt) + session.num_computed_tokens = 0 + session.prompt_token_ids = update.prompt_token_ids or () + session.additional_information = update.additional_information or None + # Update block hashes for the new tokens. + session.update_block_hashes() + session.num_prompt_tokens = len(session.prompt_token_ids) + session.arrival_time = update.arrival_time + session.sampling_params = update.sampling_params + if session.status == RequestStatus.WAITING_FOR_STREAMING_REQ: + self.num_waiting_for_streaming_input -= 1 + session.status = RequestStatus.WAITING + + if self.log_stats: + session.record_event(EngineCoreEventType.QUEUED) diff --git a/vllm_omni/deploy/qwen2_5_omni.yaml b/vllm_omni/deploy/qwen2_5_omni.yaml new file mode 100644 index 0000000000..41aef0df6f --- /dev/null +++ b/vllm_omni/deploy/qwen2_5_omni.yaml @@ -0,0 +1,92 @@ +# Qwen2.5-Omni deploy: CUDA defaults + platform overrides, verified on 2x H100. +# Stage 2 disables flashinfer autotune because its DiT block never invokes +# flashinfer; the autotune dummy run OOMs the shared cuda:0 device otherwise. 
+# +# Fields omitted from a stage fall back to StageDeployConfig dataclass +# defaults (see vllm_omni/config/stage_config.py). For instance, every +# stage here uses vLLM's default max_num_batched_tokens=32768 because +# chat-sized prefill comfortably fits; only models with codec prefill +# (Qwen3-Omni, Qwen3-TTS) need to bump it above 32k. +# +# enforce_eager policy across the three deploy YAMLs: +# * code2wav / generation stages: always true (cudagraph incompatible with +# the custom generation loop — set explicitly everywhere). +# * AR stages (thinker, talker): model-dependent. Qwen2.5-Omni runs eager +# on CUDA (thinker uses custom ops that don't trace cleanly); NPU / XPU +# platform overrides flip back to false where cudagraph is verified. +# Qwen3-Omni / Qwen3-TTS AR stages use the default (false = cudagraph on). +async_chunk: false + +stages: + - stage_id: 0 + max_num_seqs: 1 + gpu_memory_utilization: 0.8 + enforce_eager: true + mm_processor_cache_gb: 0 + devices: "0" + default_sampling_params: + temperature: 0.0 + top_p: 1.0 + top_k: -1 + max_tokens: 2048 + seed: 42 + repetition_penalty: 1.1 + + - stage_id: 1 + max_num_seqs: 1 + gpu_memory_utilization: 0.8 + enforce_eager: true + devices: "1" + default_sampling_params: + temperature: 0.9 + top_p: 0.8 + top_k: 40 + max_tokens: 2048 + seed: 42 + repetition_penalty: 1.05 + + - stage_id: 2 + max_num_seqs: 1 + gpu_memory_utilization: 0.15 + enforce_eager: true + enable_flashinfer_autotune: false + async_scheduling: false + devices: "0" + default_sampling_params: + temperature: 0.0 + top_p: 1.0 + top_k: -1 + max_tokens: 2048 + seed: 42 + repetition_penalty: 1.1 + +platforms: + npu: + stages: + # NPU has cudagraph support for the thinker, unlike GPU which still + # only runs eager. + - stage_id: 0 + enforce_eager: false + - stage_id: 2 + # 3-NPU layout: stage 2 lives on its own card. + devices: "2" + + rocm: + stages: + - stage_id: 2 + # 3-GPU MI325 layout: stage 2 on a separate card. 
+ devices: "2" + + xpu: + stages: + # Verified on 2x Intel Arc Pro B60. Both AR stages use cudagraphs. + - stage_id: 0 + gpu_memory_utilization: 0.9 + enforce_eager: false + - stage_id: 1 + gpu_memory_utilization: 0.5 + enforce_eager: false + - stage_id: 2 + gpu_memory_utilization: 0.3 + # Stage 2 colocates with stage 1's device on XPU. + devices: "1" diff --git a/vllm_omni/deploy/qwen3_omni_moe.yaml b/vllm_omni/deploy/qwen3_omni_moe.yaml new file mode 100644 index 0000000000..defe218a96 --- /dev/null +++ b/vllm_omni/deploy/qwen3_omni_moe.yaml @@ -0,0 +1,102 @@ +# Qwen3-Omni-MoE production deploy, verified on 2x H100 (stage 0 on cuda:0, +# stages 1+2 on cuda:1). +# +# Fields omitted from a stage fall back to StageDeployConfig defaults (see +# vllm_omni/config/stage_config.py). Notable implicit defaults for this +# model: +# * Stages 0/1 (thinker, talker) do not set max_num_batched_tokens — +# chat-sized prefill fits in the 32768 default. +# * Stages 0/1 do not set enforce_eager — cudagraph runs by default +# (false). Stage 2 (code2wav) sets true because its generation loop +# is cudagraph-incompatible. +# * Platform sections flip enforce_eager per-stage where platform +# cudagraph support differs. 
+async_chunk: true + +connectors: + connector_of_shared_memory: + name: SharedMemoryConnector + extra: + codec_chunk_frames: 25 + codec_left_context_frames: 25 + +stages: + - stage_id: 0 + gpu_memory_utilization: 0.9 + devices: "0" + default_sampling_params: + temperature: 0.4 + top_p: 0.9 + top_k: 1 + max_tokens: 2048 + seed: 42 + repetition_penalty: 1.05 + + - stage_id: 1 + gpu_memory_utilization: 0.6 + devices: "1" + input_connectors: + from_stage_0: connector_of_shared_memory + default_sampling_params: + temperature: 0.9 + top_k: 50 + max_tokens: 4096 + seed: 42 + detokenize: false + repetition_penalty: 1.05 + stop_token_ids: [2150] + + - stage_id: 2 + gpu_memory_utilization: 0.1 + max_num_seqs: 1 + enforce_eager: true + async_scheduling: false + max_num_batched_tokens: 51200 + devices: "1" + input_connectors: + from_stage_1: connector_of_shared_memory + default_sampling_params: + temperature: 0.0 + top_p: 1.0 + top_k: -1 + max_tokens: 65536 + seed: 42 + detokenize: true + repetition_penalty: 1.1 + stop_token_ids: [0] + +platforms: + npu: + stages: + - stage_id: 0 + gpu_memory_utilization: 0.6 + tensor_parallel_size: 2 + devices: "0,1" + - stage_id: 1 + gpu_memory_utilization: 0.6 + enforce_eager: true + devices: "2" + - stage_id: 2 + gpu_memory_utilization: 0.3 + devices: "2" + + rocm: + stages: + - stage_id: 0 + enforce_eager: true + + xpu: + stages: + - stage_id: 0 + tensor_parallel_size: 4 + enforce_eager: true + max_cudagraph_capture_size: 0 + devices: "0,1,2,3" + - stage_id: 1 + enforce_eager: true + max_cudagraph_capture_size: 0 + devices: "4" + - stage_id: 2 + gpu_memory_utilization: 0.3 + max_cudagraph_capture_size: 0 + devices: "4" diff --git a/vllm_omni/deploy/qwen3_tts.yaml b/vllm_omni/deploy/qwen3_tts.yaml new file mode 100644 index 0000000000..bb0d44e379 --- /dev/null +++ b/vllm_omni/deploy/qwen3_tts.yaml @@ -0,0 +1,85 @@ +# Qwen3-TTS deploy: talker → code2wav via shared-memory chunk streaming. +# Verified on 1x H100. 
+# +# Fields omitted from a stage fall back to StageDeployConfig defaults (see +# vllm_omni/config/stage_config.py). Notable choices for this model: +# * Stage 0 (talker) sets max_num_batched_tokens=512 for async-chunk +# latency tuning (not correctness) — small per-step batches keep +# first-chunk latency low. +# * Stage 1 (code2wav) sets max_num_batched_tokens=65536 for correctness: +# codec prefill length (Q * num_frames) exceeds the 32k default. +# * Stage 0 does not set enforce_eager — talker runs cudagraph by default. +# Stage 1 sets true because its codec generation loop is not +# cudagraph-compatible. NPU platform flips stage 0 to true where +# cudagraph is not yet verified. +async_chunk: true + +connectors: + connector_of_shared_memory: + name: SharedMemoryConnector + extra: + shm_threshold_bytes: 65536 + codec_streaming: true + connector_get_sleep_s: 0.01 + connector_get_max_wait_first_chunk: 3000 + connector_get_max_wait: 300 + # Must match the decoder sliding attention window. + codec_chunk_frames: 25 + codec_left_context_frames: 72 + +stages: + - stage_id: 0 + max_num_seqs: 10 + gpu_memory_utilization: 0.3 + async_scheduling: true + max_num_batched_tokens: 512 + max_model_len: 4096 + devices: "0" + output_connectors: + to_stage_1: connector_of_shared_memory + default_sampling_params: + temperature: 0.9 + top_k: 50 + max_tokens: 4096 + seed: 42 + detokenize: false + repetition_penalty: 1.05 + stop_token_ids: [2150] + + - stage_id: 1 + max_num_seqs: 1 + gpu_memory_utilization: 0.3 + enforce_eager: true + async_scheduling: true + # Must be divisible by num_code_groups and cover (left_context + chunk). + # Prefill length is Q * num_frames (e.g. 16 * 2148 = 34368); keep + # headroom past 32k. + max_num_batched_tokens: 65536 + # async_chunk appends windows per step; max_model_len must cover the + # accumulated flat codec stream. 
+ max_model_len: 65536 + devices: "0" + input_connectors: + from_stage_0: connector_of_shared_memory + default_sampling_params: + temperature: 0.0 + top_p: 1.0 + top_k: -1 + max_tokens: 65536 + seed: 42 + detokenize: true + repetition_penalty: 1.0 + stop_token_ids: [0] + +platforms: + npu: + stages: + # NPU does not yet support async-scheduling for TTS, and the + # talker fits at max_num_seqs=1 only. + - stage_id: 0 + max_num_seqs: 1 + enforce_eager: true + async_scheduling: false + - stage_id: 1 + gpu_memory_utilization: 0.2 + async_scheduling: false diff --git a/vllm_omni/diffusion/cache/cache_dit_backend.py b/vllm_omni/diffusion/cache/cache_dit_backend.py index 3daf883e0d..d5397dd166 100644 --- a/vllm_omni/diffusion/cache/cache_dit_backend.py +++ b/vllm_omni/diffusion/cache/cache_dit_backend.py @@ -1170,41 +1170,85 @@ def refresh_cache_context(pipeline: Any, num_inference_steps: int, verbose: bool return refresh_cache_context -def enable_cache_for_glm_image(pipeline: Any, cache_config: Any) -> Callable[[int], None]: - """Enable cache-dit for GLM-Image pipeline. +def enable_cache_for_flux2(pipeline: Any, cache_config: Any) -> Callable[[int], None]: + """Enable cache-dit for Flux.2-dev pipeline. - GLM-Image processes prompt and image by calling the transformer before the - denoising loop. When an input image is provided (editing mode), the cache must - be force-refreshed after the preprocessing step so stale hidden states are - discarded. Set force_refresh_step_hint = 1 for editing, None for text-to-image. + Args: + pipeline: The Flux2 pipeline instance. + cache_config: DiffusionCacheConfig instance with cache configuration. + Returns: + A refresh function that can be called with a new ``num_inference_steps`` + to update the cache context for the pipeline. 
""" + # Build DBCacheConfig for transformer db_cache_config = _build_db_cache_config(cache_config) - calibrator_config = None + calibrator = None if cache_config.enable_taylorseer: - calibrator_config = TaylorSeerCalibratorConfig(taylorseer_order=cache_config.taylorseer_order) - logger.info(f"TaylorSeer enabled with order={cache_config.taylorseer_order}") + taylorseer_order = cache_config.taylorseer_order + calibrator = TaylorSeerCalibratorConfig(taylorseer_order=taylorseer_order) + logger.info(f"TaylorSeer enabled with order={taylorseer_order}") + + # Build ParamsModifier for transformer + modifier = ParamsModifier( + cache_config=db_cache_config, + calibrator_config=calibrator, + ) logger.info( - f"Enabling cache-dit on GLM-Image transformer: " + f"Enabling cache-dit on Flux transformer with BlockAdapter: " f"Fn={db_cache_config.Fn_compute_blocks}, " f"Bn={db_cache_config.Bn_compute_blocks}, " f"W={db_cache_config.max_warmup_steps}, " - f"force_refresh_step_hint={db_cache_config.force_refresh_step_hint}, " ) + # Enable cache-dit using BlockAdapter for transformer cache_dit.enable_cache( - pipeline.transformer, + ( + BlockAdapter( + transformer=pipeline.transformer, + blocks=[ + pipeline.transformer.transformer_blocks, + pipeline.transformer.single_transformer_blocks, + ], + forward_pattern=[ForwardPattern.Pattern_1, ForwardPattern.Pattern_2], + params_modifiers=[modifier], + ) + ), cache_config=db_cache_config, - calibrator_config=calibrator_config, ) + def refresh_cache_context(pipeline: Any, num_inference_steps: int, verbose: bool = True) -> None: + """Refresh cache context for the transformer with new num_inference_steps. -def enable_cache_for_flux2(pipeline: Any, cache_config: Any) -> Callable[[int], None]: - """Enable cache-dit for Flux.2-dev pipeline. + Args: + pipeline: The Flux2 pipeline instance. + num_inference_steps: New number of inference steps. 
+ """ + if cache_config.scm_steps_mask_policy is None: + cache_dit.refresh_context(pipeline.transformer, num_inference_steps=num_inference_steps, verbose=verbose) + else: + cache_dit.refresh_context( + pipeline.transformer, + cache_config=DBCacheConfig().reset( + num_inference_steps=num_inference_steps, + steps_computation_mask=cache_dit.steps_mask( + mask_policy=cache_config.scm_steps_mask_policy, + total_steps=num_inference_steps, + ), + steps_computation_policy=cache_config.scm_steps_policy, + ), + verbose=verbose, + ) + + return refresh_cache_context + + +def enable_cache_for_glm_image(pipeline: Any, cache_config: Any) -> Callable[[int], None]: + """Enable cache-dit for GlmImage pipeline. Args: - pipeline: The Flux2 pipeline instance. + pipeline: The GlmImage pipeline instance. cache_config: DiffusionCacheConfig instance with cache configuration. Returns: A refresh function that can be called with a new ``num_inference_steps`` @@ -1226,23 +1270,25 @@ def enable_cache_for_flux2(pipeline: Any, cache_config: Any) -> Callable[[int], ) logger.info( - f"Enabling cache-dit on Flux transformer with BlockAdapter: " + f"Enabling cache-dit on GlmImage transformer with BlockAdapter: " f"Fn={db_cache_config.Fn_compute_blocks}, " f"Bn={db_cache_config.Bn_compute_blocks}, " f"W={db_cache_config.max_warmup_steps}, " ) # Enable cache-dit using BlockAdapter for transformer + # Note: We don't use patch_functor here because it's designed for diffusers' GlmImage, + # and our vllm-omni implementation has a different forward signature. 
+ # We use ForwardPattern.Pattern_0 because our block returns (hidden_states, encoder_hidden_states) cache_dit.enable_cache( ( BlockAdapter( transformer=pipeline.transformer, - blocks=[ - pipeline.transformer.transformer_blocks, - pipeline.transformer.single_transformer_blocks, - ], - forward_pattern=[ForwardPattern.Pattern_1, ForwardPattern.Pattern_2], + blocks=pipeline.transformer.transformer_blocks, + forward_pattern=ForwardPattern.Pattern_0, params_modifiers=[modifier], + patch_functor=None, + has_separate_cfg=True, ) ), cache_config=db_cache_config, @@ -1252,7 +1298,7 @@ def refresh_cache_context(pipeline: Any, num_inference_steps: int, verbose: bool """Refresh cache context for the transformer with new num_inference_steps. Args: - pipeline: The Flux2 pipeline instance. + pipeline: The GlmImage pipeline instance. num_inference_steps: New number of inference steps. """ if cache_config.scm_steps_mask_policy is None: diff --git a/vllm_omni/diffusion/cache/teacache/coefficient_estimator.py b/vllm_omni/diffusion/cache/teacache/coefficient_estimator.py index baec21c276..38c805c28d 100644 --- a/vllm_omni/diffusion/cache/teacache/coefficient_estimator.py +++ b/vllm_omni/diffusion/cache/teacache/coefficient_estimator.py @@ -1,20 +1,18 @@ # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project +import os from typing import Any import numpy as np import torch from vllm.config import LoadConfig -from vllm.utils.torch_utils import set_default_torch_dtype +from vllm.transformers_utils.config import get_hf_file_to_dict from vllm_omni.diffusion.cache.teacache.extractors import get_extractor -from vllm_omni.diffusion.data import OmniDiffusionConfig +from vllm_omni.diffusion.data import OmniDiffusionConfig, TransformerConfig from vllm_omni.diffusion.hooks import HookRegistry, ModelHook from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader -from vllm_omni.diffusion.models.bagel.pipeline_bagel 
import BagelPipeline -from vllm_omni.diffusion.models.flux2.pipeline_flux2 import Flux2Pipeline -from vllm_omni.diffusion.models.stable_audio.pipeline_stable_audio import StableAudioPipeline from vllm_omni.diffusion.request import OmniDiffusionRequest from vllm_omni.inputs.data import OmniDiffusionSamplingParams @@ -36,6 +34,7 @@ def initialize_hook(self, module: torch.nn.Module) -> torch.nn.Module: def new_forward(self, module: torch.nn.Module, *args: Any, **kwargs: Any) -> Any: ctx = self.extractor_fn(module, *args, **kwargs) + # NOTE: We upcast to float32 to also handle bfloat16. modulated_input_cpu = ctx.modulated_input.detach().float().cpu().numpy() outputs = ctx.run_transformer_blocks() @@ -54,23 +53,39 @@ def stop_collection(self) -> list[tuple[np.ndarray, np.ndarray]]: return list(self.current_trajectory) -class BagelAdapter: - """Adapter for Bagel model.""" +class DefaultAdapter: + """Default adapter for standard diffusers pipelines.""" - @staticmethod - def load_pipeline(model_path: str, device: str = "cuda", dtype: torch.dtype = torch.bfloat16) -> BagelPipeline: - od_config = OmniDiffusionConfig.from_kwargs(model=model_path, dtype=dtype) - od_config.model_class_name = "BagelPipeline" + model_class_name = None + uses_tf_config = True + + @classmethod + def load_pipeline(cls, model_path: str, device: str, dtype: torch.dtype) -> Any: + if cls.model_class_name is None: + raise ValueError("Adapter doesn't have a set class name.") - pipeline = BagelPipeline(od_config=od_config) - loader = DiffusersPipelineLoader(LoadConfig()) - loader.load_weights(pipeline) - pipeline.to(device) - return pipeline + od_config = OmniDiffusionConfig.from_kwargs( + model_class_name=cls.model_class_name, + model=model_path, + dtype=dtype, + ) + + if cls.uses_tf_config: + # TODO (Alex): Refactor to handle tf_model_config in OmniDiffusionConfig + # instead of OmniDiffusion and remove the manual population here + tf_config_dict = get_hf_file_to_dict( + os.path.join("transformer", 
"config.json"), + od_config.model, + ) + od_config.tf_model_config = TransformerConfig.from_dict(tf_config_dict) + + loader = DiffusersPipelineLoader(LoadConfig(), od_config=od_config) + # load_model will handle dtypes / device placement, put in .eval() mode + return loader.load_model(od_config=od_config, load_device=device) @staticmethod def get_transformer(pipeline: Any) -> tuple[Any, str]: - return pipeline.bagel, "Bagel" + return pipeline.transformer, pipeline.transformer.__class__.__name__ @staticmethod def install_hook(transformer: Any, hook: DataCollectionHook) -> None: @@ -78,25 +93,17 @@ def install_hook(transformer: Any, hook: DataCollectionHook) -> None: registry.register_hook(hook._HOOK_NAME, hook) -class StableAudioAdapter: - """Adapter for Stable Audio Open 1.0 coefficient estimation.""" - - @staticmethod - def load_pipeline(model_path: str, device: str = "cuda", dtype: torch.dtype = torch.float16) -> Any: - od_config = OmniDiffusionConfig.from_kwargs(model=model_path, dtype=dtype) - - # Strictly necessary because we bypass loader.load_model() - with set_default_torch_dtype(dtype): - pipeline = StableAudioPipeline(od_config=od_config) +class BagelAdapter(DefaultAdapter): + """Adapter for Bagel model.""" - loader = DiffusersPipelineLoader(LoadConfig()) - loader.load_weights(pipeline) - pipeline.to(device) - return pipeline + model_class_name = "BagelPipeline" + # Skip the hack for loading the tf model config, + # because bagel doesn't use it. 
+ uses_tf_config = False @staticmethod def get_transformer(pipeline: Any) -> tuple[Any, str]: - return pipeline.transformer, "StableAudioDiTModel" + return pipeline.bagel, "Bagel" @staticmethod def install_hook(transformer: Any, hook: DataCollectionHook) -> None: @@ -104,52 +111,32 @@ def install_hook(transformer: Any, hook: DataCollectionHook) -> None: registry.register_hook(hook._HOOK_NAME, hook) -class Flux2Adapter: +class Flux2Adapter(DefaultAdapter): """Adapter for Flux2 model coefficient estimation.""" - @staticmethod - def load_pipeline(model_path: str, device: str = "cuda", dtype: torch.dtype = torch.bfloat16) -> Flux2Pipeline: - """Load Flux2 pipeline for coefficient estimation.""" - od_config = OmniDiffusionConfig.from_kwargs(model=model_path, dtype=dtype) - od_config.model_class_name = "Flux2Pipeline" - - pipeline = Flux2Pipeline(od_config=od_config) - loader = DiffusersPipelineLoader(LoadConfig()) - loader.load_weights(pipeline) - pipeline.to(device) - return pipeline + model_class_name = "Flux2Pipeline" - @staticmethod - def get_transformer(pipeline: Any) -> tuple[Any, str]: - return pipeline.transformer, pipeline.transformer.__class__.__name__ - @staticmethod - def install_hook(transformer: Any, hook: DataCollectionHook) -> None: - registry = HookRegistry.get_or_create(transformer) - registry.register_hook(hook._HOOK_NAME, hook) +class LongCatAdapter(DefaultAdapter): + """Adapter for LongCat Image - NOTE: currently this model needs the vLLM + context to be correctly configured to actually run the estimation, since it + uses vLLM norm layers etc. 
+ """ + model_class_name = "LongCatImagePipeline" -class DefaultAdapter: - """Default adapter for standard diffusers pipelines.""" - @staticmethod - def load_pipeline(model_path: str, device: str, dtype: torch.dtype) -> Any: - raise NotImplementedError("DefaultAdapter.load_pipeline not implemented") - - @staticmethod - def get_transformer(pipeline: Any) -> tuple[Any, str]: - return pipeline.transformer, pipeline.transformer.__class__.__name__ +class StableAudioAdapter(DefaultAdapter): + """Adapter for Stable Audio Open 1.0 coefficient estimation.""" - @staticmethod - def install_hook(transformer: Any, hook: DataCollectionHook) -> None: - registry = HookRegistry.get_or_create(transformer) - registry.register_hook(hook._HOOK_NAME, hook) + model_class_name = "StableAudioPipeline" _MODEL_ADAPTERS: dict[str, type] = { "Bagel": BagelAdapter, "StableAudio": StableAudioAdapter, "Flux2": Flux2Adapter, + "LongCat": LongCatAdapter, } _EPSILON = 1e-6 @@ -196,7 +183,6 @@ def __init__( device: str = "cuda", dtype: torch.dtype = torch.bfloat16, ): - # Add validation here ⬇️ if model_type not in _MODEL_ADAPTERS: available_types = list(_MODEL_ADAPTERS.keys()) raise ValueError( @@ -205,7 +191,7 @@ def __init__( f"To add support for a new model, add an entry to _MODEL_ADAPTERS." 
) - adapter = _MODEL_ADAPTERS.get(model_type, DefaultAdapter) + adapter = _MODEL_ADAPTERS[model_type] self.pipeline = adapter.load_pipeline(model_path, device, dtype) self.transformer, self.transformer_type = adapter.get_transformer(self.pipeline) self.hook = DataCollectionHook(self.transformer_type) diff --git a/vllm_omni/diffusion/cache/teacache/config.py b/vllm_omni/diffusion/cache/teacache/config.py index ecf3bfc1d3..7efdd418e1 100644 --- a/vllm_omni/diffusion/cache/teacache/config.py +++ b/vllm_omni/diffusion/cache/teacache/config.py @@ -73,6 +73,8 @@ 3.20000000e00, -2.00000000e-02, ], + # LongCat Image transformer coefficients + "LongCatImageTransformer2DModel": [652.5980, -424.1615, 84.5526, -4.5923, 0.1694], } diff --git a/vllm_omni/diffusion/cache/teacache/extractors.py b/vllm_omni/diffusion/cache/teacache/extractors.py index 84c237b60d..d0da0d9df3 100644 --- a/vllm_omni/diffusion/cache/teacache/extractors.py +++ b/vllm_omni/diffusion/cache/teacache/extractors.py @@ -19,10 +19,13 @@ import torch import torch.nn as nn +from vllm.logger import init_logger from vllm_omni.diffusion.forward_context import get_forward_context from vllm_omni.platforms import current_omni_platform +logger = init_logger(__name__) + @dataclass class CacheContext: @@ -723,6 +726,105 @@ def postprocess(h): ) +def extract_longcat_context( + module: nn.Module, # LongCatImageTransformer2DModel + hidden_states, + timestep, + guidance, + encoder_hidden_states, + txt_ids, + img_ids, + **kwargs, +) -> CacheContext: + """Extract the cache context for LongCat Image. + + Similar to other extractors, this is currently the only code needed + for TeaCache support for LongCat image, and encapsulates preprocessing, + modulated input extraction, transformer execution, and postprocessing + logic. + + Args & kwargs are identical to the inputs to LongCat Image's forward. 
+ + Returns: + CacheContext with all information needed for generic caching + """ + # TODO (Alex) - Refactor TeaCache extractors to more tightly integrate with .forward + from diffusers.models.modeling_outputs import Transformer2DModelOutput + + # 1. Model specific preprocessing + fwd_context = get_forward_context() + sp_size = module.parallel_config.sequence_parallel_size + if sp_size is not None and sp_size > 1: + # NOTE: For now, we set this to False on the forward context + # to be consistent with LongCat Image's current behavior when + # TeaCache is enabled. We do not need to reset it in post process + # since we should never split text embed in sp for this model. + fwd_context.split_text_embed_in_sp = False + + hidden_states = module.x_embedder(hidden_states) + + timestep = timestep.to(hidden_states.dtype) * 1000 + + temb = module.time_embed(timestep, hidden_states.dtype) + encoder_hidden_states = module.context_embedder(encoder_hidden_states) + + # Compute RoPE embeddings via rope_preparer module + # _sp_plan will automatically shard img_cos/img_sin (outputs 2, 3) + # txt_cos/txt_sin (outputs 0, 1) remain replicated for dual-stream attention + txt_cos, txt_sin, img_cos, img_sin = module.rope_preparer(txt_ids, img_ids) + + # Reconstruct image_rotary_emb with chunked values + # Final shape: (txt_seq_len + img_seq_len // SP, head_dim) + image_rotary_emb = ( + torch.cat([txt_cos, img_cos], dim=0), + torch.cat([txt_sin, img_sin], dim=0), + ) + + # 2. Extract the modulated output from the first mm-DiT block + first_block = module.transformer_blocks[0] + img_modulated = first_block.norm1(hidden_states, emb=temb)[0] + + # 3. 
Define the transformer execution + def run_transformer_blocks(): + """Execute all Longcat transformer blocks.""" + h = hidden_states + e = encoder_hidden_states + for block in module.transformer_blocks: + e, h = block( + hidden_states=h, + encoder_hidden_states=e, + temb=temb, + image_rotary_emb=image_rotary_emb, + ) + + for block in module.single_transformer_blocks: + e, h = block( + hidden_states=h, + encoder_hidden_states=e, + temb=temb, + image_rotary_emb=image_rotary_emb, + ) + # Hook expects hidden states to be first + return (h, e) + + # 4. Postprocessing + def postprocess(h): + """Apply Longcat-specific output postprocessing.""" + h = module.norm_out(h, temb) + output = module.proj_out(h) + return Transformer2DModelOutput(sample=output) + + # 5. Return the CacheContext + return CacheContext( + modulated_input=img_modulated, + hidden_states=hidden_states, + encoder_hidden_states=encoder_hidden_states, + temb=temb, + run_transformer_blocks=run_transformer_blocks, + postprocess=postprocess, + ) + + def extract_stable_audio_context( module: nn.Module, hidden_states: torch.Tensor, @@ -980,6 +1082,7 @@ def postprocess(h): "Flux2Klein": extract_flux2_klein_context, "StableAudioDiTModel": extract_stable_audio_context, "Flux2Transformer2DModel": extract_flux2_context, + "LongCatImageTransformer2DModel": extract_longcat_context, # Future models: # "FluxTransformer2DModel": extract_flux_context, # "CogVideoXTransformer3DModel": extract_cogvideox_context, diff --git a/vllm_omni/diffusion/models/bagel/pipeline_bagel.py b/vllm_omni/diffusion/models/bagel/pipeline_bagel.py index a3d2259e64..90baf5f676 100644 --- a/vllm_omni/diffusion/models/bagel/pipeline_bagel.py +++ b/vllm_omni/diffusion/models/bagel/pipeline_bagel.py @@ -397,11 +397,26 @@ def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput: cfg_text_context["ropes"] = cfg_text_metadata["ropes"] else: cfg_text_context["ropes"] = [cfg_text_seq_len] - - if cfg_img_kv is None and cfg_text_kv is not None: - 
cfg_img_kv = injected_kv - - if cfg_img_kv is not None: + else: + # No cfg_text companion received. For text2img this is the + # expected path: original BAGEL uses an empty KV cache (0 + # tokens) as the text-unconditional branch. Keep the default + # empty NaiveCache in cfg_text_context and preserve the + # original cfg_text_scale so CFG still applies. + pass + + if cfg_img_kv is None: + # text2img multi-stage: cfg_img reuses gen KV (positive prompt, + # no image), mirroring forward_cache_update_text on cfg_img_context + # in the single-stage path. + cfg_img_seq_len = injected_kv.key_cache[0].shape[0] + cfg_img_context["past_key_values"] = injected_kv + cfg_img_context["kv_lens"] = [cfg_img_seq_len] + if req.sampling_params.kv_metadata and "ropes" in req.sampling_params.kv_metadata: + cfg_img_context["ropes"] = req.sampling_params.kv_metadata["ropes"] + else: + cfg_img_context["ropes"] = [cfg_img_seq_len] + else: cfg_img_seq_len = cfg_img_kv.key_cache[0].shape[0] cfg_img_context["past_key_values"] = cfg_img_kv cfg_img_context["kv_lens"] = [cfg_img_seq_len] @@ -410,15 +425,6 @@ def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput: else: cfg_img_context["ropes"] = [cfg_img_seq_len] - if not cfg_parallel_contract: - logger.warning("CFG is disabled: only single KV cache available") - gen_params = BagelGenParams( - num_timesteps=gen_params.num_timesteps, - timestep_shift=gen_params.timestep_shift, - cfg_text_scale=1.0, - cfg_img_scale=1.0, - ) - else: image_input = ( None diff --git a/vllm_omni/diffusion/models/dreamid_omni/fusion.py b/vllm_omni/diffusion/models/dreamid_omni/fusion.py index a534f5a76f..abca4c9474 100644 --- a/vllm_omni/diffusion/models/dreamid_omni/fusion.py +++ b/vllm_omni/diffusion/models/dreamid_omni/fusion.py @@ -1,3 +1,5 @@ +import re + import torch import torch.nn as nn from vllm.logger import init_logger @@ -15,78 +17,26 @@ logger = init_logger(__name__) -class FusionModel(nn.Module): - def __init__(self, video_config=None, 
audio_config=None): - super().__init__() - has_video = True - has_audio = True - if video_config is not None: - self.video_model = WanModel(**video_config) - else: - has_video = False - self.video_model = None - logger.warning("No video model is provided!") - - if audio_config is not None: - self.audio_model = WanModel(**audio_config) - else: - has_audio = False - self.audio_model = None - logger.warning("No audio model is provided!") - - if has_video and has_audio: - assert len(self.video_model.blocks) == len(self.audio_model.blocks) - self.num_blocks = len(self.video_model.blocks) - - self.inject_cross_attention_kv_projections() - self.device = get_local_device() - - self.num_heads = self.video_model.num_heads - self.head_dim = self.video_model.dim // self.video_model.num_heads - self.attn = Attention( - num_heads=self.num_heads, - head_size=self.head_dim, - num_kv_heads=self.num_heads, - softmax_scale=1.0 / (self.head_dim**0.5), - causal=False, - ) - - def inject_cross_attention_kv_projections(self): - for vid_block in self.video_model.blocks: - vid_block.cross_attn.k_fusion = nn.Linear(vid_block.dim, vid_block.dim) - vid_block.cross_attn.v_fusion = nn.Linear(vid_block.dim, vid_block.dim) - vid_block.cross_attn.pre_attn_norm_fusion = WanLayerNorm(vid_block.dim, elementwise_affine=True) - vid_block.cross_attn.norm_k_fusion = ( - WanRMSNorm(vid_block.dim, eps=1e-6) if vid_block.qk_norm else nn.Identity() - ) +class FusedBlock(nn.Module): + """Wrapper pairing a video block and audio block for layerwise offloading. 
- for audio_block in self.audio_model.blocks: - audio_block.cross_attn.k_fusion = nn.Linear(audio_block.dim, audio_block.dim) - audio_block.cross_attn.v_fusion = nn.Linear(audio_block.dim, audio_block.dim) - audio_block.cross_attn.pre_attn_norm_fusion = WanLayerNorm(audio_block.dim, elementwise_affine=True) - audio_block.cross_attn.norm_k_fusion = ( - WanRMSNorm(audio_block.dim, eps=1e-6) if audio_block.qk_norm else nn.Identity() - ) + Registers both blocks as submodules so their parameters are visible to the offload hooks. + """ - def merge_kwargs(self, vid_kwargs, audio_kwargs): - """ - keys in each kwarg: - e - seq_lens - grid_sizes - freqs - context - context_lens - """ - merged_kwargs = {} - for key in vid_kwargs: - merged_kwargs[f"vid_{key}"] = vid_kwargs[key] - for key in audio_kwargs: - merged_kwargs[f"audio_{key}"] = audio_kwargs[key] - return merged_kwargs + def __init__( + self, + vid_block: nn.Module, + audio_block: nn.Module, + device: torch.device, + ): + super().__init__() + self.vid_block = vid_block + self.audio_block = audio_block + self.device = device - def single_fusion_cross_attention_forward( + def _cross_attention_forward( self, + attn: Attention, cross_attn_block, src_seq, src_grid_sizes, @@ -104,21 +54,17 @@ def single_fusion_cross_attention_forward( ): b, n, d = src_seq.size(0), cross_attn_block.num_heads, cross_attn_block.head_dim if hasattr(cross_attn_block, "k_img"): - ## means is i2v block q, k, v, k_img, v_img = cross_attn_block.qkv_fn(src_seq, context) else: - ## means is t2v block q, k, v = cross_attn_block.qkv_fn(src_seq, context) k_img = v_img = None - x = self.attn(q, k, v) + x = attn(q, k, v) if k_img is not None: - img_x = self.attn(q, k_img, v_img) + img_x = attn(q, k_img, v_img) x = x + img_x - # is_vid = src_grid_sizes.shape[1] > 1 - # compute target attention target_seq = cross_attn_block.pre_attn_norm_fusion(target_seq) k_target = cross_attn_block.norm_k_fusion(cross_attn_block.k_fusion(target_seq)).view(b, -1, n, d) 
v_target = cross_attn_block.v_fusion(target_seq).view(b, -1, n, d) @@ -132,17 +78,16 @@ def single_fusion_cross_attention_forward( freqs_scaling=target_freqs_scaling, ) - target_x = self.attn(q, k_target, v_target) + target_x = attn(q, k_target, v_target) x = x + target_x - - x = x.flatten(2) # [B, L/P, C] - + x = x.flatten(2) x = cross_attn_block.o(x) return x - def single_fusion_cross_attention_ffn_forward( + def _cross_attention_ffn_forward( self, + attn: Attention, attn_block, src_seq, src_grid_sizes, @@ -159,7 +104,8 @@ def single_fusion_cross_attention_ffn_forward( target_ref_lengths=None, target_freqs_scaling=None, ): - src_seq = src_seq + self.single_fusion_cross_attention_forward( + src_seq = src_seq + self._cross_attention_forward( + attn, attn_block.cross_attn, attn_block.norm3(src_seq), src_grid_sizes=src_grid_sizes, @@ -180,12 +126,11 @@ def single_fusion_cross_attention_ffn_forward( src_seq = src_seq + y * src_e[5].squeeze(2) return src_seq - def single_fusion_block_forward( + def forward( self, - vid_block, - audio_block, vid, audio, + attn: Attention, vid_e, vid_seq_lens, vid_grid_sizes, @@ -203,6 +148,9 @@ def single_fusion_block_forward( audio_ref_lengths, audio_freqs_scaling, ): + vid_block = self.vid_block + audio_block = self.audio_block + ## audio modulation assert audio_e.dtype == torch.bfloat16 assert len(audio_e.shape) == 4 and audio_e.size(2) == 6 and audio_e.shape[1] == audio.shape[1], ( @@ -246,7 +194,8 @@ def single_fusion_block_forward( og_audio = audio # audio cross-attention - audio = self.single_fusion_cross_attention_ffn_forward( + audio = self._cross_attention_ffn_forward( + attn, audio_block, audio, audio_grid_sizes, @@ -267,7 +216,8 @@ def single_fusion_block_forward( assert not torch.equal(og_audio, audio), "Audio should be changed after cross-attention!" 
# video cross-attention - vid = self.single_fusion_cross_attention_ffn_forward( + vid = self._cross_attention_ffn_forward( + attn, vid_block, vid, vid_grid_sizes, @@ -287,6 +237,128 @@ def single_fusion_block_forward( return vid, audio + +class FusionModel(nn.Module): + _layerwise_offload_blocks_attrs = ["fused_blocks"] + + def __init__(self, video_config=None, audio_config=None): + super().__init__() + has_video = True + has_audio = True + self.device = get_local_device() + if video_config is not None: + self.video_model = WanModel(**video_config) + else: + has_video = False + self.video_model = None + logger.warning("No video model is provided!") + + if audio_config is not None: + self.audio_model = WanModel(**audio_config) + else: + has_audio = False + self.audio_model = None + logger.warning("No audio model is provided!") + + if has_video and has_audio: + assert len(self.video_model.blocks) == len(self.audio_model.blocks) + self.num_blocks = len(self.video_model.blocks) + + self.inject_cross_attention_kv_projections() + + self.num_heads = self.video_model.num_heads + self.head_dim = self.video_model.dim // self.video_model.num_heads + # Make a single shared instance to pass in at forward time + self.attn = Attention( + num_heads=self.num_heads, + head_size=self.head_dim, + num_kv_heads=self.num_heads, + softmax_scale=1.0 / (self.head_dim**0.5), + causal=False, + ) + + if has_video and has_audio: + self.fused_blocks = nn.ModuleList( + [ + FusedBlock( + self.video_model.blocks[i], + self.audio_model.blocks[i], + self.device, + ) + for i in range(self.num_blocks) + ] + ) + + def load_state_dict(self, state_dict, strict=True, assign=False): + """Remap checkpoints where blocks are stored under + `video_model.blocks.N.*` / `audio_model.blocks.N.*` to the current + `fused_blocks.N.vid_block.*` / `fused_blocks.N.audio_block.*`. 
+ """ + needs_remap = any(re.match(r"^(video_model|audio_model)\.blocks\.\d+\.", k) for k in state_dict) + if needs_remap: + remapped = {} + for k, v in state_dict.items(): + new_k = re.sub(r"^video_model\.blocks\.(\d+)\.", r"fused_blocks.\1.vid_block.", k) + new_k = re.sub(r"^audio_model\.blocks\.(\d+)\.", r"fused_blocks.\1.audio_block.", new_k) + remapped[new_k] = v + state_dict = remapped + + self._detach_blocks_from_backbones() + + return super().load_state_dict(state_dict, strict=strict, assign=assign) + + def inject_cross_attention_kv_projections(self): + for vid_block in self.video_model.blocks: + vid_block.cross_attn.k_fusion = nn.Linear(vid_block.dim, vid_block.dim) + vid_block.cross_attn.v_fusion = nn.Linear(vid_block.dim, vid_block.dim) + vid_block.cross_attn.pre_attn_norm_fusion = WanLayerNorm(vid_block.dim, elementwise_affine=True) + vid_block.cross_attn.norm_k_fusion = ( + WanRMSNorm(vid_block.dim, eps=1e-6) if vid_block.qk_norm else nn.Identity() + ) + + for audio_block in self.audio_model.blocks: + audio_block.cross_attn.k_fusion = nn.Linear(audio_block.dim, audio_block.dim) + audio_block.cross_attn.v_fusion = nn.Linear(audio_block.dim, audio_block.dim) + audio_block.cross_attn.pre_attn_norm_fusion = WanLayerNorm(audio_block.dim, elementwise_affine=True) + audio_block.cross_attn.norm_k_fusion = ( + WanRMSNorm(audio_block.dim, eps=1e-6) if audio_block.qk_norm else nn.Identity() + ) + + def _detach_blocks_from_backbones(self) -> None: + """Keep offloadable blocks owned only by a single place. + + NOTE: This is a special workaround to support layerwise offloading. + The model registers the same Wan blocks under both the video/audio + backbones and `fused_blocks` which is a wrapper for unified blocks + walking through. 
However, layerwise offloading will only consider + `fused_blocks` as offloadable components and will materialize all + other modules onto device, including the same blocks owned by both + `fused_blocks` and `video_model` and `audio_model`. + """ + video_blocks = list(self.video_model.blocks) + audio_blocks = list(self.audio_model.blocks) + self.video_model._modules.pop("blocks", None) + self.audio_model._modules.pop("blocks", None) + self.video_model.blocks = tuple(video_blocks) + self.audio_model.blocks = tuple(audio_blocks) + + def merge_kwargs(self, vid_kwargs, audio_kwargs): + """ + keys in each kwarg: + e + seq_lens + grid_sizes + freqs + context + context_lens + """ + merged_kwargs = {} + for key in vid_kwargs: + merged_kwargs[f"vid_{key}"] = vid_kwargs[key] + for key in audio_kwargs: + merged_kwargs[f"audio_{key}"] = audio_kwargs[key] + return merged_kwargs + def forward( self, vid, @@ -316,17 +388,8 @@ def forward( kwargs = self.merge_kwargs(vid_kwargs, audio_kwargs) - for i in range(self.num_blocks): - """ - 1 fusion block refers to 1 audio block with 1 video block. 
- """ - - vid_block = self.video_model.blocks[i] - audio_block = self.audio_model.blocks[i] - - vid, audio = self.single_fusion_block_forward( - vid_block=vid_block, audio_block=audio_block, vid=vid, audio=audio, **kwargs - ) + for fused_block in self.fused_blocks: + vid, audio = fused_block(vid, audio, self.attn, **kwargs) vid = self.video_model.post_transformer_block_out(vid, vid_kwargs["grid_sizes"], vid_e) audio = self.audio_model.post_transformer_block_out(audio, audio_kwargs["grid_sizes"], audio_e) diff --git a/vllm_omni/diffusion/models/dreamid_omni/pipeline_dreamid_omni.py b/vllm_omni/diffusion/models/dreamid_omni/pipeline_dreamid_omni.py index c7ab4662d1..cc932f8c1f 100644 --- a/vllm_omni/diffusion/models/dreamid_omni/pipeline_dreamid_omni.py +++ b/vllm_omni/diffusion/models/dreamid_omni/pipeline_dreamid_omni.py @@ -4,6 +4,7 @@ import logging import math import os +from collections.abc import Iterable import torch import torch.distributed @@ -16,6 +17,7 @@ from vllm_omni.diffusion.data import DiffusionOutput, OmniDiffusionConfig from vllm_omni.diffusion.distributed.cfg_parallel import CFGParallelMixin from vllm_omni.diffusion.distributed.utils import get_local_device +from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader from vllm_omni.diffusion.models.interface import SupportAudioInput, SupportImageInput from vllm_omni.diffusion.request import OmniDiffusionRequest @@ -27,7 +29,6 @@ init_mmaudio_vae, init_text_model, init_wan_vae_2_2, - load_fusion_checkpoint, ) from dreamid_omni.utils.rearrange import Rearrange from dreamid_omni.utils.resize import NaResize @@ -122,16 +123,24 @@ def __init__( self.text_model = init_text_model(model, rank=self.device) self.text_encoder = self.text_model.model - # Fusion model - ## load audio/video model config - Fusion_model = FusionModel(VIDEO_CONFIG, AUDIO_CONFIG) - - checkpoint_path = self.od_config.tf_model_config.get("fusion", None) - assert checkpoint_path is not None, "fusion 
checkpoint path is None" - load_fusion_checkpoint(Fusion_model, checkpoint_path=os.path.join(model, checkpoint_path)) - self.model = Fusion_model + # Fusion model — weights are loaded later via load_weights() + self.model = FusionModel(VIDEO_CONFIG, AUDIO_CONFIG) self.transformer = self.model + fusion_path = self.od_config.tf_model_config.get("fusion", None) + assert fusion_path is not None, "fusion checkpoint path is None in transformer config" + fusion_subfolder = os.path.dirname(fusion_path) or None + fusion_filename = os.path.basename(fusion_path) + self.weights_sources = [ + DiffusersPipelineLoader.ComponentSource( + model_or_path=model, + subfolder=fusion_subfolder, + revision=None, + prefix="model.", + allow_patterns_overrides=[fusion_filename], + ) + ] + # Fixed attributes, non-configurable self.audio_latent_channel = AUDIO_CONFIG.get("in_dim") self.video_latent_channel = VIDEO_CONFIG.get("in_dim") @@ -226,8 +235,11 @@ def load_image_latent_ref_ip_video( return ref_vae_latents, ref_audio_lengths - def load_weights(self, weights): - pass + def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: + prefix = "model." 
+ state_dict = {name[len(prefix) :]: tensor for name, tensor in weights if name.startswith(prefix)} + self.model.load_state_dict(state_dict, strict=True) + return {prefix + k for k in state_dict} def get_scheduler_time_steps(self, sampling_steps, solver_name="unipc", device=0, shift=5.0): torch.manual_seed(4) diff --git a/vllm_omni/diffusion/models/helios/helios_transformer.py b/vllm_omni/diffusion/models/helios/helios_transformer.py index b3d2621ad8..5e7934c3ba 100644 --- a/vllm_omni/diffusion/models/helios/helios_transformer.py +++ b/vllm_omni/diffusion/models/helios/helios_transformer.py @@ -62,10 +62,16 @@ def apply_rotary_emb_helios( """ x_1, x_2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1) cos, sin = freqs_cis.unsqueeze(-2).chunk(2, dim=-1) - out = torch.empty_like(hidden_states) - out[..., 0::2] = x_1 * cos[..., 0::2] - x_2 * sin[..., 1::2] - out[..., 1::2] = x_1 * sin[..., 1::2] + x_2 * cos[..., 0::2] - return out.type_as(hidden_states) + # Use stack+flatten instead of strided slice assignment for contiguous + # memory layout and better performance on GPU/NPU (#2436, cf. PR #2393). 
+ rotated = torch.stack( + ( + x_1 * cos[..., 0::2] - x_2 * sin[..., 1::2], + x_1 * sin[..., 1::2] + x_2 * cos[..., 0::2], + ), + dim=-1, + ) + return rotated.flatten(-2, -1).type_as(hidden_states) class DistributedRMSNorm(nn.Module): diff --git a/vllm_omni/diffusion/models/ltx2/ltx2_transformer.py b/vllm_omni/diffusion/models/ltx2/ltx2_transformer.py index a1bf7f7809..95ef919c24 100644 --- a/vllm_omni/diffusion/models/ltx2/ltx2_transformer.py +++ b/vllm_omni/diffusion/models/ltx2/ltx2_transformer.py @@ -1264,6 +1264,7 @@ class LTX2VideoTransformer3DModel(nn.Module): _supports_gradient_checkpointing = True _skip_layerwise_casting_patterns = ["norm"] _repeated_blocks = ["LTX2VideoTransformerBlock"] + _layerwise_offload_blocks_attrs = ["transformer_blocks"] _sp_plan: dict[str, Any] | None = None @staticmethod diff --git a/vllm_omni/distributed/omni_connectors/utils/initialization.py b/vllm_omni/distributed/omni_connectors/utils/initialization.py index 0497bbb3a2..f012af3c9c 100644 --- a/vllm_omni/distributed/omni_connectors/utils/initialization.py +++ b/vllm_omni/distributed/omni_connectors/utils/initialization.py @@ -206,6 +206,19 @@ def load_omni_transfer_config( if config_dict is None: return None + # Normalize new-schema (top-level ``connectors`` + ``stages``) into the + # legacy ``runtime.connectors`` + ``stage_args`` shape the parser reads. 
+ if "stages" in config_dict and "stage_args" not in config_dict: + normalized: dict[str, Any] = dict(config_dict) + runtime = dict(normalized.get("runtime") or {}) + if "connectors" in normalized and "connectors" not in runtime: + runtime["connectors"] = normalized["connectors"] + if "edges" in normalized and "edges" not in runtime: + runtime["edges"] = normalized["edges"] + normalized["runtime"] = runtime + normalized["stage_args"] = normalized["stages"] + config_dict = normalized + # Parse connectors connectors = {} runtime_config = config_dict.get("runtime", {}) diff --git a/vllm_omni/engine/__init__.py b/vllm_omni/engine/__init__.py index c8a96e6d25..6c92d7952d 100644 --- a/vllm_omni/engine/__init__.py +++ b/vllm_omni/engine/__init__.py @@ -79,6 +79,10 @@ class OmniEngineCoreRequest(EngineCoreRequest): class OmniEngineCoreOutput(EngineCoreOutput): pooling_output: dict[str, torch.Tensor] | None = None + # Finished flag for streaming input segment + is_segment_finished: bool | None = False + # Streaming update prompt length + new_prompt_len_snapshot: int | None = None class OmniEngineCoreOutputs(EngineCoreOutputs): diff --git a/vllm_omni/engine/arg_utils.py b/vllm_omni/engine/arg_utils.py index 5b69d6b1f0..d98ce7d419 100644 --- a/vllm_omni/engine/arg_utils.py +++ b/vllm_omni/engine/arg_utils.py @@ -3,7 +3,7 @@ import json import os import tempfile -from dataclasses import dataclass, field +from dataclasses import dataclass, field, fields from typing import Any from vllm.engine.arg_utils import EngineArgs @@ -300,3 +300,254 @@ def create_model_config(self) -> OmniModelConfig: def output_modality(self) -> OutputModality: """Parse engine_output_type into a type-safe OutputModality flag.""" return OutputModality.from_string(self.engine_output_type) + + +# ============================================================================ +# CLI argument routing +# ============================================================================ +# +# vLLM-Omni's CLI flags live 
in three buckets: +# +# ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ +# │ OrchestratorArgs │ │ OmniEngineArgs │ │ (upstream vllm) │ +# │ │ │ │ │ server/api │ +# │ stage_timeout │ │ max_num_seqs │ │ host, port │ +# │ worker_backend │ │ gpu_mem_util │ │ ssl_keyfile │ +# │ deploy_config │ │ dtype, quant │ │ api_key │ +# │ ... │ │ ... │ │ ... │ +# └──────────────────┘ └──────────────────┘ └──────────────────┘ +# │ │ │ +# ▼ ▼ ▼ +# orchestrator each stage uvicorn / +# consumes engine FastAPI +# +# Fields in ``SHARED_FIELDS`` (e.g. ``model``, ``log_stats``) flow to BOTH +# orchestrator and engine by design. +# +# Invariants enforced by ``tests/test_arg_utils.py``: +# +# 1. ``OrchestratorArgs`` ∩ ``OmniEngineArgs`` ⊆ ``SHARED_FIELDS`` +# 2. Every CLI flag is classifiable into one of the three buckets +# 3. User-typed flags that match none of the above are logged as dropped +# +# Adding a new orchestrator-only flag → add a field to ``OrchestratorArgs``. +# Everything else is automatic. + + +@dataclass(frozen=True) +class OrchestratorArgs: + """CLI flags consumed by the orchestrator. + + Contract: every field here is either + (a) orchestrator-only (never needed by a stage engine), OR + (b) orchestrator-read-then-redistributed (e.g. ``async_chunk`` is read + from CLI, written to ``DeployConfig``, then propagated to every + stage via ``merge_pipeline_deploy`` — not via direct kwargs + forwarding). + + Fields that BOTH orchestrator and engine genuinely need (e.g. ``model``, + ``log_stats``) should be listed in ``SHARED_FIELDS`` below; ``split_kwargs`` + will copy them to both buckets. 
+ """ + + # === Lifecycle === + stage_init_timeout: int = 300 + init_timeout: int = 600 + + # === Cross-stage Communication === + shm_threshold_bytes: int = 65536 + batch_timeout: int = 10 + + # === Cluster / Backend === + worker_backend: str = "multi_process" + ray_address: str | None = None + + # === Config Files === + stage_configs_path: str | None = None + deploy_config: str | None = None + stage_overrides: str | None = None # raw JSON string; parsed downstream + + # === Mode Switches (orchestrator reads, DeployConfig redistributes) === + async_chunk: bool | None = None + + # === Observability === + log_stats: bool = False + + # === Headless Mode (also forwarded to engine — see SHARED_FIELDS) === + stage_id: int | None = None + + # === Pre-built Objects === + parallel_config: Any = None + + # === Multi-stage guards === + # --tokenizer is captured here so it does not propagate to every stage + # uniformly (different stages often need different tokenizers, e.g. + # qwen3_omni thinker vs talker). Users wanting a per-stage tokenizer + # should set it in the deploy YAML. + tokenizer: str | None = None + + +# Fields that live in BOTH OrchestratorArgs and OmniEngineArgs by design. +# Changes to this set are a review red flag — revisit the contract. +SHARED_FIELDS: frozenset[str] = frozenset( + { + "model", # orch: detect model_type; engine: load weights + "stage_id", # orch: route (headless); engine: identity + "log_stats", # both want the flag + "stage_configs_path", # orch: load legacy YAML; engine: may reference for validation + } +) + + +def orchestrator_field_names() -> frozenset[str]: + """Return the names of every field on OrchestratorArgs.""" + return frozenset(f.name for f in fields(OrchestratorArgs)) + + +def internal_blacklist_keys() -> frozenset[str]: + """Return the set of CLI keys that must never be forwarded as per-stage + engine overrides. 
+ + Derived from ``OrchestratorArgs`` fields minus ``SHARED_FIELDS``, so + adding a new orchestrator-owned flag is a one-line change to the + dataclass — this function updates automatically. + """ + return orchestrator_field_names() - SHARED_FIELDS + + +def split_kwargs( + kwargs: dict[str, Any], + *, + engine_cls: type | None = None, + user_typed: set[str] | None = None, + strict: bool = False, +) -> tuple[OrchestratorArgs, dict[str, Any]]: + """Partition CLI kwargs into (orchestrator, engine) buckets. + + Args: + kwargs: Raw dict, typically ``vars(args)``. + engine_cls: Engine dataclass used to whitelist-filter the engine + bucket. Defaults to ``OmniEngineArgs``. Pass a custom class + for testing. + user_typed: Keys the user actually typed on the command line. Used + to warn when a user-typed flag is unclassifiable. + strict: If True, raise ``ValueError`` on ambiguous (double-classified + but not in ``SHARED_FIELDS``) fields. Default False to keep the + rollout non-breaking; flip to True in tests and CI. + + Returns: + ``(orchestrator_args, engine_kwargs)``. ``engine_kwargs`` has already + been whitelist-filtered against ``engine_cls`` — safe to pass directly + to ``engine_cls(**engine_kwargs)``. + """ + if engine_cls is None: + engine_cls = OmniEngineArgs + + orch_fields = orchestrator_field_names() + engine_fields = {f.name for f in fields(engine_cls)} + + orch_kwargs: dict[str, Any] = {} + engine_candidate: dict[str, Any] = {} + shared_values: dict[str, Any] = {} + unclassified: dict[str, Any] = {} + + for key, value in kwargs.items(): + in_orch = key in orch_fields + in_engine = key in engine_fields + is_shared = key in SHARED_FIELDS + + if is_shared: + shared_values[key] = value + elif in_orch and in_engine: + # Declared in both but not marked shared → ambiguous. + msg = ( + f"Field {key!r} is defined on both OrchestratorArgs and " + f"{engine_cls.__name__} but is not in SHARED_FIELDS. " + f"This causes double-routing. 
Either remove the duplicate or " + f"add {key!r} to SHARED_FIELDS if the sharing is intentional." + ) + if strict: + raise ValueError(msg) + logger.error(msg) + # Default: treat as orchestrator-only to preserve existing behavior. + orch_kwargs[key] = value + elif in_orch: + orch_kwargs[key] = value + elif in_engine: + engine_candidate[key] = value + else: + unclassified[key] = value + + # Warn on user-typed but unclassifiable flags so we don't silently drop + # something the user cared about (fixes the class of bug that spawned #873). + if unclassified and user_typed: + user_typed_unknown = sorted(k for k in unclassified if k in user_typed) + if user_typed_unknown: + logger.warning( + "CLI flags not consumed by vllm-omni and dropped before " + "per-stage engine construction: %s. If these are vllm " + "frontend/uvicorn flags (host, port, ssl_*, api_key, …) this " + "is expected; otherwise check your spelling.", + user_typed_unknown, + ) + + # Engine bucket: shared + engine-only. We do NOT pass through unclassified + # fields — that's exactly the server/uvicorn noise we want to shed. + engine_kwargs = {**shared_values, **engine_candidate} + + # Construct the orchestrator dataclass. Shared fields that OrchestratorArgs + # also declares get copied into its constructor. + orch_init: dict[str, Any] = dict(orch_kwargs) + for key, value in shared_values.items(): + if key in orch_fields: + orch_init[key] = value + orch_args = OrchestratorArgs(**orch_init) + + return orch_args, engine_kwargs + + +def derive_server_dests_from_vllm_parser() -> frozenset[str]: + """Derive the set of argparse dests that belong to vllm's frontend/server. + + Returns every dest registered by ``make_arg_parser`` that is NOT a field + of ``OmniEngineArgs`` and NOT a field of ``OrchestratorArgs``. Useful for + CI tests to assert all CLI flags are classifiable without maintaining + a hardcoded server list. + + Returns empty frozenset if vllm's parser cannot be built (e.g. 
in a + minimal test environment). + """ + try: + from vllm.entrypoints.openai.cli_args import make_arg_parser + from vllm.utils.argparse_utils import FlexibleArgumentParser + except ImportError: + logger.debug("Cannot import vllm parser — server-dest derivation skipped") + return frozenset() + + try: + parser = make_arg_parser(FlexibleArgumentParser()) + all_dests = {a.dest for a in parser._actions if a.dest and a.dest != "help"} + except Exception as exc: + logger.debug("Failed to build vllm parser: %s", exc) + return frozenset() + + engine_fields = {f.name for f in fields(OmniEngineArgs)} + orch_fields = orchestrator_field_names() + + return frozenset(all_dests - engine_fields - orch_fields - SHARED_FIELDS) + + +def orchestrator_args_from_argparse(args: Any) -> OrchestratorArgs: + """Build an ``OrchestratorArgs`` from an ``argparse.Namespace``. + + Only copies attributes that exist on the namespace — missing fields fall + back to the dataclass default. Useful when the full parser is already + built and ``vars(args)`` would include noise. 
+ """ + kwargs: dict[str, Any] = {} + for f in fields(OrchestratorArgs): + if hasattr(args, f.name): + value = getattr(args, f.name) + if value is not None or f.default is None: + kwargs[f.name] = value + return OrchestratorArgs(**kwargs) diff --git a/vllm_omni/engine/async_omni_engine.py b/vllm_omni/engine/async_omni_engine.py index 474effc0e0..f8ac6822b1 100644 --- a/vllm_omni/engine/async_omni_engine.py +++ b/vllm_omni/engine/async_omni_engine.py @@ -33,6 +33,7 @@ from vllm.v1.engine import EngineCoreRequest from vllm.v1.engine.input_processor import InputProcessor +from vllm_omni.config.stage_config import strip_parent_engine_args from vllm_omni.diffusion.data import DiffusionParallelConfig from vllm_omni.diffusion.stage_diffusion_client import StageDiffusionClient from vllm_omni.diffusion.stage_diffusion_proc import ( @@ -98,6 +99,27 @@ logger = init_logger(__name__) +# ============================================================================ +# Parent-EngineArgs field-routing contracts (consumed by +# AsyncOmniEngine._strip_parent_engine_args when ``stage_configs_path`` is set). +# ============================================================================ + +# Fields that must survive the "equal to default → strip" filter because +# diffusion stages need them even when equal to vllm's default value +# (e.g. colocate worker setup relies on worker_extension_cls being forwarded). +_PARENT_ARGS_KEEP: frozenset[str] = frozenset({"worker_extension_cls"}) + +# Omni orchestrator-level fields consumed by ``_resolve_stage_configs`` that +# must never leak into per-stage EngineArgs (``stage_configs_path`` would +# trigger the ``create_model_config`` guard). +_PARENT_ARGS_STRIP: frozenset[str] = frozenset({"stage_configs_path"}) + +# Fields always populated by callers (via ``from_cli_args`` / ``asdict``) so +# their presence as an override is never a surprise — suppress the +# "override ignored" warning for these. 
+_PARENT_ARGS_NO_WARN: frozenset[str] = frozenset({"model"}) + + def _patch_generation_config_if_needed(model_config: Any) -> None: """Ensure try_get_generation_config won't crash for models whose HF config.json lacks model_type (e.g. CosyVoice3). We probe it once; @@ -1314,45 +1336,17 @@ def _strip_single_engine_args(kwargs: dict[str, Any]) -> dict[str, Any]: Logs a warning for any parent field whose value differs from the dataclass default, so users know their explicit overrides are ignored. + See the module-level ``_PARENT_ARGS_*`` constants for the routing + contracts this method enforces. """ - # worker_extension_cls is a parent field but must pass through to - # diffusion stages for colocate worker setup. - _keep = {"worker_extension_cls"} - # Orchestrator-level OmniEngineArgs fields that are consumed by - # _resolve_stage_configs and must not leak into per-stage configs - # (stage_configs_path would trigger the create_model_config guard). - _strip_omni = {"stage_configs_path"} - # Fields that are always set by callers (via from_cli_args / asdict) - # and would always appear as overridden — suppress from the warning - # so it only surfaces genuinely surprising overrides. - _no_warn = {"model"} - parent_fields: dict[str, dataclasses.Field] = {f.name: f for f in dataclasses.fields(EngineArgs)} - overridden: list[str] = [] - result: dict[str, Any] = {} - for k, v in kwargs.items(): - if k in _strip_omni: - continue - if k not in parent_fields or k in _keep: - result[k] = v - continue - # Detect explicitly-set values that differ from the default. - # Values may have been through asdict() which converts dataclass - # defaults to dicts, so normalise before comparing. 
- field = parent_fields[k] - if field.default is not dataclasses.MISSING: - default = field.default - elif field.default_factory is not dataclasses.MISSING: - default = field.default_factory() - else: - default = dataclasses.MISSING - if default is dataclasses.MISSING or v is None: - continue - # Normalise dataclass defaults to dicts for comparison - if dataclasses.is_dataclass(default) and not isinstance(default, type): - default = dataclasses.asdict(default) - if v != default and k not in _no_warn: - overridden.append(k) + result, overridden = strip_parent_engine_args( + kwargs, + parent_fields=parent_fields, + keep_keys=_PARENT_ARGS_KEEP, + strip_keys=_PARENT_ARGS_STRIP, + no_warn_keys=_PARENT_ARGS_NO_WARN, + ) if overridden: logger.warning( @@ -1367,6 +1361,12 @@ def _resolve_stage_configs(self, model: str, kwargs: dict[str, Any]) -> tuple[st """Resolve stage configs and inject defaults shared by orchestrator/headless.""" stage_configs_path = kwargs.get("stage_configs_path", None) + deploy_config_path = kwargs.pop("deploy_config", None) + stage_overrides_json = kwargs.pop("stage_overrides", None) + # Set of CLI keys the user actually typed; ``None`` means we have no + # parser-level info (e.g. programmatic Omni() call) and the lower + # layers should treat all kwargs as explicit. + cli_explicit_keys = kwargs.pop("_cli_explicit_keys", None) explicit_stage_configs = kwargs.pop("stage_configs", None) if explicit_stage_configs is not None: logger.warning( @@ -1379,13 +1379,27 @@ def _resolve_stage_configs(self, model: str, kwargs: dict[str, Any]) -> tuple[st else: base_kwargs = kwargs - # Use the legacy config loading path (load_and_resolve_stage_configs). - # StageConfigFactory wiring will be done in config refactor [2/N]. 
+ # Parse --stage-overrides JSON string if provided + stage_overrides = None + if stage_overrides_json: + if isinstance(stage_overrides_json, str): + try: + stage_overrides = json.loads(stage_overrides_json) + except json.JSONDecodeError as exc: + raise ValueError( + f"--stage-overrides is not valid JSON: {exc}. Got: {stage_overrides_json!r}" + ) from exc + else: + stage_overrides = stage_overrides_json + config_path, stage_configs = load_and_resolve_stage_configs( model, stage_configs_path, base_kwargs, default_stage_cfg_factory=lambda: self._create_default_diffusion_stage_cfg(kwargs), + deploy_config_path=deploy_config_path, + stage_overrides=stage_overrides, + cli_explicit_keys=cli_explicit_keys, ) # Inject diffusion LoRA-related knobs from kwargs if not present in the stage config. diff --git a/vllm_omni/engine/orchestrator.py b/vllm_omni/engine/orchestrator.py index 8204d70e68..a5d7ad032e 100644 --- a/vllm_omni/engine/orchestrator.py +++ b/vllm_omni/engine/orchestrator.py @@ -42,6 +42,7 @@ def build_engine_core_request_from_tokens( params: SamplingParams | PoolingParams, arrival_time: float | None = None, model_config: ModelConfig | None = None, + resumable: bool = False, mm_features: list | None = None, ) -> OmniEngineCoreRequest: """Build an OmniEngineCoreRequest directly from an OmniTokensPrompt. 
@@ -85,6 +86,7 @@ def build_engine_core_request_from_tokens( cache_salt=None, data_parallel_rank=None, prompt_embeds=prompt_embeds, + resumable=resumable, additional_information=additional_info_payload, ) @@ -108,6 +110,20 @@ class OrchestratorRequestState: mm_processor_kwargs: dict | None = None mm_features: list | None = None + streaming: StreamingInputState = field(default_factory=lambda: StreamingInputState()) + + +@dataclass +class StreamingInputState: + # Flag of streaming input request + enabled: bool = False + # Flag of segment of streaming input finished + segment_finished: bool = False + # Streaming update prompt length + new_prompt_len_snapshot: int | None = None + # Model/bridge-specific runtime states (e.g., thinker->talker) + bridge_states: dict[str, Any] = field(default_factory=dict) + class Orchestrator: """Runs inside a background thread's asyncio event loop. @@ -391,15 +407,36 @@ async def _route_output( ) if ( - finished + (finished or (req_state.streaming.enabled and req_state.streaming.segment_finished)) and stage_id < req_state.final_stage_id and not self.async_chunk - and not self._next_stage_already_submitted(stage_id, req_state) + and (not self._next_stage_already_submitted(stage_id, req_state) or req_state.streaming.enabled) ): - if self._cfg_tracker.has_companions(req_id) and not self._cfg_tracker.all_companions_done(req_id): + if ( + finished + and self._cfg_tracker.has_companions(req_id) + and not self._cfg_tracker.all_companions_done(req_id) + ): self._cfg_tracker.defer_parent(req_id, output, stage_id) else: - await self._forward_to_next_stage(req_id, stage_id, output, req_state) + await self._forward_to_next_stage( + req_id, + stage_id, + output, + req_state, + is_streaming_session=req_state.streaming.enabled, + is_final_update=False, + ) + if req_state.streaming.enabled and finished: + # For streaming sessions, send the terminal (resumable=False) update only on a finish + await self._forward_to_next_stage( + req_id, + stage_id, + 
output, + req_state, + is_streaming_session=True, + is_final_update=True, + ) if finished and stage_id == req_state.final_stage_id: # PD: clean up any lingering KV params for this request @@ -573,6 +610,9 @@ async def _forward_to_next_stage( stage_id: int, output: Any, req_state: OrchestratorRequestState, + *, + is_streaming_session: bool = False, + is_final_update: bool = False, ) -> None: """Forward output from current stage to the next stage. @@ -582,6 +622,7 @@ async def _forward_to_next_stage( next_stage_id = stage_id + 1 next_client = self.stage_clients[next_stage_id] params = req_state.sampling_params_list[next_stage_id] + next_stage_resumable = is_streaming_session and not is_final_update if next_client.stage_type == "diffusion": self.stage_clients[stage_id].set_engine_outputs([output]) @@ -671,6 +712,7 @@ async def _forward_to_next_stage( next_inputs = next_client.process_engine_inputs( stage_list=self.stage_clients, prompt=req_state.prompt, + streaming_context=(req_state.streaming if req_state.streaming.enabled else None), ) except Exception: logger.exception( @@ -692,6 +734,7 @@ async def _forward_to_next_stage( params=params, model_config=self.stage_vllm_configs[next_stage_id].model_config, mm_features=_mm_features, + resumable=next_stage_resumable, ) # TODO: Here we directly use the req id to assign. 
@@ -733,6 +776,13 @@ async def _process_stage_outputs(self, stage_id: int, raw_outputs: EngineCoreOut raw_outputs.timestamp, None, ) + for eco in raw_outputs.outputs: + if not hasattr(eco, "request_id"): + continue + req_state = self.request_states.get(eco.request_id) + if req_state: + req_state.streaming.segment_finished = eco.is_segment_finished + req_state.streaming.new_prompt_len_snapshot = eco.new_prompt_len_snapshot if processed.reqs_to_abort: await self.stage_clients[stage_id].abort_requests_async(processed.reqs_to_abort) @@ -768,6 +818,8 @@ async def _handle_add_request(self, msg: dict[str, Any]) -> None: # Track request state - use original_prompt so downstream stages # (e.g. thinker2talker) can access the raw dict with multi_modal_data. + request = prompt + is_streaming = bool(getattr(request, "resumable", False)) req_state = OrchestratorRequestState( request_id=request_id, prompt=original_prompt, @@ -775,13 +827,13 @@ async def _handle_add_request(self, msg: dict[str, Any]) -> None: final_stage_id=final_stage_id, mm_features=getattr(prompt, "mm_features", None), # Save mm_features for PD ) + req_state.streaming.enabled = is_streaming req_state.stage_submit_ts[stage_id] = _time.time() self.request_states[request_id] = req_state # Stage-0 prompt is already a fully-formed OmniEngineCoreRequest # (pre-processed by AsyncOmniEngine.add_request, output processor # already registered there) - submit directly. 
- request = prompt stage_client = self.stage_clients[stage_id] if stage_client.stage_type == "diffusion": if isinstance(prompt, list): @@ -817,6 +869,7 @@ async def _handle_streaming_update(self, msg: dict[str, Any]) -> None: if "sampling_params_list" in msg and msg["sampling_params_list"]: req_state.sampling_params_list = msg["sampling_params_list"] + req_state.streaming.enabled = True req_state.stage_submit_ts[stage_id] = _time.time() stage_client = self.stage_clients[stage_id] diff --git a/vllm_omni/engine/stage_engine_core_client.py b/vllm_omni/engine/stage_engine_core_client.py index ab2de757ba..66bae22a5d 100644 --- a/vllm_omni/engine/stage_engine_core_client.py +++ b/vllm_omni/engine/stage_engine_core_client.py @@ -302,11 +302,21 @@ def process_engine_inputs( self, stage_list: list[Any], prompt: OmniTokensPrompt | list[OmniTokensPrompt] | None = None, + streaming_context: Any | None = None, ) -> list[OmniTokensPrompt]: """Process inputs from upstream stages.""" from vllm_omni.inputs.data import OmniTokensPrompt if self.custom_process_input_func is not None: + # Keep legacy arg call for non-streaming processors. + if bool(getattr(streaming_context, "enabled", False)): + return self.custom_process_input_func( + stage_list, + self.engine_input_source, + prompt, + self.requires_multimodal_data, + streaming_context, + ) return self.custom_process_input_func( stage_list, self.engine_input_source, diff --git a/vllm_omni/engine/stage_init_utils.py b/vllm_omni/engine/stage_init_utils.py index c697e34bac..89dfdc163c 100644 --- a/vllm_omni/engine/stage_init_utils.py +++ b/vllm_omni/engine/stage_init_utils.py @@ -434,6 +434,20 @@ def build_vllm_config( filtered_engine_args_dict = filter_dataclass_kwargs(OmniEngineArgs, engine_args_dict) omni_engine_args = OmniEngineArgs(**filtered_engine_args_dict) + + # Multi-stage pipelines (qwen3_tts code2wav, etc.) set max_model_len + # larger than HF max_position_embeddings by design. 
vLLM's validator + # rejects that without the env flag. + if filtered_engine_args_dict.get("max_model_len") is not None and not os.environ.get( + "VLLM_ALLOW_LONG_MAX_MODEL_LEN" + ): + os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1" + logger.debug( + "Auto-set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 for stage %s (max_model_len=%s).", + stage_config.stage_id, + filtered_engine_args_dict["max_model_len"], + ) + vllm_config = omni_engine_args.create_engine_config( usage_context=UsageContext.LLM_CLASS, headless=headless, diff --git a/vllm_omni/entrypoints/async_omni.py b/vllm_omni/entrypoints/async_omni.py index 5823aa4ab0..9606cc80d0 100644 --- a/vllm_omni/entrypoints/async_omni.py +++ b/vllm_omni/entrypoints/async_omni.py @@ -309,7 +309,6 @@ async def _add_streaming_input_request( if not stage0_params.skip_clone: stage0_params = stage0_params.clone() stage0_params.skip_clone = True - stage0_params.output_kind = RequestOutputKind.DELTA has_submitted_first_chunk = False diff --git a/vllm_omni/entrypoints/cli/serve.py b/vllm_omni/entrypoints/cli/serve.py index 6e9adc2461..8bccfbb591 100644 --- a/vllm_omni/entrypoints/cli/serve.py +++ b/vllm_omni/entrypoints/cli/serve.py @@ -9,6 +9,7 @@ import json import os import signal +import sys from types import FrameType from typing import Any @@ -21,6 +22,7 @@ from vllm_omni.entrypoints.cli.logo import log_logo from vllm_omni.entrypoints.openai.api_server import omni_run_server +from vllm_omni.entrypoints.utils import detect_explicit_cli_keys logger = init_logger(__name__) @@ -79,6 +81,9 @@ class OmniServeCommand(CLISubcommand): """The `serve` subcommand for the vLLM CLI.""" name = "serve" + # Parser stashed at subparser_init so ``cmd`` can resolve each user-typed + # flag to its real ``dest`` via the parser's action table. 
+ _parser: FlexibleArgumentParser | None = None @staticmethod def cmd(args: argparse.Namespace) -> None: @@ -90,6 +95,10 @@ def cmd(args: argparse.Namespace) -> None: if hasattr(args, "model_tag") and args.model_tag is not None: args.model = args.model_tag + # Stash the set of long-option keys the user actually typed so the + # stage-config factory can give YAML precedence over argparse defaults. + args._cli_explicit_keys = detect_explicit_cli_keys(sys.argv[1:], OmniServeCommand._parser) + if args.headless: run_headless(args) else: @@ -138,11 +147,33 @@ def subparser_init(self, subparsers: argparse._SubParsersAction) -> FlexibleArgu help="Default task type for TTS models (CustomVoice, VoiceDesign, or Base). " "If not specified, will be inferred from model path.", ) + # TODO(@lishunyang12): deprecate once all models migrate to --deploy-config omni_config_group.add_argument( "--stage-configs-path", type=str, default=None, - help="Path to the stage configs file. If not specified, the stage configs will be loaded from the model.", + help="[Deprecated — will be removed in a future release] Path to a legacy " + "stage configs YAML (stage_args format). Prefer --deploy-config for new-format deploy YAMLs.", + ) + omni_config_group.add_argument( + "--deploy-config", + type=str, + default=None, + help="Path to a deploy config YAML (new format with stages/engine_args). " + "Mutually exclusive with --stage-configs-path.", + ) + omni_config_group.add_argument( + "--stage-overrides", + type=str, + default=None, + help="Per-stage JSON overrides. Example: " + '\'{"0": {"gpu_memory_utilization": 0.8}, "2": {"enforce_eager": true}}\'', + ) + omni_config_group.add_argument( + "--async-chunk", + action=argparse.BooleanOptionalAction, + default=None, + help="Override the deploy YAML's ``async_chunk:`` bool. 
Unset leaves the YAML value in force.", ) omni_config_group.add_argument( "--stage-id", @@ -406,6 +437,9 @@ def subparser_init(self, subparsers: argparse._SubParsersAction) -> FlexibleArgu action="store_true", help="Enable diffusion pipeline profiler to display stage durations.", ) + # Stash via type(self) so the docs hook (which execs this function in a + # sandboxed globals dict via ``DummySelf``) doesn't fail on a NameError. + type(self)._parser = serve_parser return serve_parser @@ -461,10 +495,15 @@ def run_headless(args: argparse.Namespace) -> None: raise ValueError("headless mode requires worker_backend=multi_process") args_dict = vars(args).copy() + # Preserve the explicit-keys set captured at parse time so per-stage yaml + # values (e.g. stage 1's ``gpu_memory_utilization: 0.5``) are not + # overwritten by argparse defaults for flags the user didn't type. + cli_explicit_keys = args_dict.pop("_cli_explicit_keys", None) config_path, stage_configs = load_and_resolve_stage_configs( model, args_dict.get("stage_configs_path"), args_dict, + cli_explicit_keys=cli_explicit_keys, ) # Locate the stage config that matches stage_id. 
diff --git a/vllm_omni/entrypoints/omni_base.py b/vllm_omni/entrypoints/omni_base.py index 2291481341..d83709d528 100644 --- a/vllm_omni/entrypoints/omni_base.py +++ b/vllm_omni/entrypoints/omni_base.py @@ -1,6 +1,8 @@ from __future__ import annotations +import argparse import os +import sys import time import types import weakref @@ -16,7 +18,7 @@ from vllm_omni.engine.async_omni_engine import AsyncOmniEngine from vllm_omni.entrypoints.client_request_state import ClientRequestState from vllm_omni.entrypoints.pd_utils import PDDisaggregationMixin -from vllm_omni.entrypoints.utils import get_final_stage_id_for_e2e +from vllm_omni.entrypoints.utils import detect_explicit_cli_keys, get_final_stage_id_for_e2e from vllm_omni.metrics.stats import OrchestratorAggregator as OrchestratorMetrics from vllm_omni.model_executor.model_loader.weight_utils import download_weights_from_hf_specific from vllm_omni.outputs import OmniRequestOutput @@ -73,6 +75,48 @@ def omni_snapshot_download(model_id: str) -> str: class OmniBase(PDDisaggregationMixin): """Shared runtime foundation for AsyncOmni and Omni.""" + @classmethod + def from_cli_args( + cls, + args: argparse.Namespace, + *, + parser: argparse.ArgumentParser | None = None, + **overrides: Any, + ) -> OmniBase: + """Construct an ``Omni`` / ``AsyncOmni`` from an ``argparse.Namespace``. + + Mirrors the ``EngineArgs.from_cli_args`` pattern used upstream and in + ``OmniEngineArgs.from_cli_args``. This is the recommended entry point + for any argparse-based caller (offline scripts, tests, CI): it + expands ``vars(args)`` into kwargs and automatically captures which + flags the user typed on the command line so that argparse defaults + do not silently override deploy YAML values. + + Passing ``parser`` is strongly recommended: without it, flag-to-dest + resolution falls back to a name-based heuristic that misidentifies + flags with ``dest=`` overrides, alias flags, and ``--disable-X`` / + ``store_false`` pairs. 
See :func:`detect_explicit_cli_keys`. + + Args: + args: Parsed argparse namespace from ``parser.parse_args()``. + parser: The argparse parser used to produce ``args``. When + provided, each user-typed flag is resolved to its real + ``dest`` via the parser's action table. + **overrides: Extra keyword arguments that take precedence over + attributes on ``args``. + + Example:: + + parser = FlexibleArgumentParser() + parser.add_argument("--model", required=True) + args = parser.parse_args() + omni = Omni.from_cli_args(args, parser=parser) # preferred + omni = Omni.from_cli_args(args, parser=parser, model="other") + """ + kwargs: dict[str, Any] = {**vars(args), **overrides} + kwargs["_cli_explicit_keys"] = detect_explicit_cli_keys(sys.argv[1:], parser) + return cls(**kwargs) + def __init__( self, model: str, diff --git a/vllm_omni/entrypoints/openai/api_server.py b/vllm_omni/entrypoints/openai/api_server.py index b45c096d0a..745b719d5b 100644 --- a/vllm_omni/entrypoints/openai/api_server.py +++ b/vllm_omni/entrypoints/openai/api_server.py @@ -53,7 +53,6 @@ from vllm.entrypoints.openai.models.protocol import BaseModelPath from vllm.entrypoints.openai.models.serving import OpenAIServingModels from vllm.entrypoints.openai.orca_metrics import metrics_header -from vllm.entrypoints.openai.realtime.connection import RealtimeConnection from vllm.entrypoints.openai.realtime.serving import OpenAIServingRealtime from vllm.entrypoints.openai.responses.serving import OpenAIServingResponses from vllm.entrypoints.openai.server_utils import get_uvicorn_log_config @@ -108,6 +107,7 @@ VideoListResponse, VideoResponse, ) +from vllm_omni.entrypoints.openai.realtime_connection import RealtimeConnection from vllm_omni.entrypoints.openai.serving_chat import OmniOpenAIServingChat from vllm_omni.entrypoints.openai.serving_speech import OmniOpenAIServingSpeech from vllm_omni.entrypoints.openai.serving_speech_stream import OmniStreamingSpeechHandler @@ -1204,6 +1204,22 @@ async def 
streaming_speech(websocket: WebSocket): @router.websocket("/v1/realtime") async def realtime_websocket(websocket: WebSocket): """WebSocket endpoint for OpenAI-style realtime interactions.""" + engine_client = getattr(websocket.app.state, "engine_client", None) + if engine_client is not None and getattr(engine_client, "async_chunk", False): + await websocket.accept() + await websocket.send_json( + { + "type": "error", + "error": ( + "The /v1/realtime API is not supported when async_chunk is enabled on the server. " + "Use a stage configuration with async_chunk disabled and restart the server before using " + "this endpoint." + ), + "code": "unsupported", + } + ) + await websocket.close() + return serving = getattr(websocket.app.state, "openai_serving_realtime", None) if serving is None: await websocket.accept() @@ -1305,11 +1321,10 @@ async def generate_images(request: ImageGenerationRequest, raw_request: Request) # Get engine client (AsyncOmni) from app state engine_client, model_name, stage_configs = _get_engine_and_model(raw_request) - # Validate model field (warn if mismatch, don't error) if request.model is not None and request.model != model_name: - logger.warning( - f"Model mismatch: request specifies '{request.model}' but " - f"server is running '{model_name}'. Using server model." + raise HTTPException( + status_code=HTTPStatus.BAD_REQUEST.value, + detail=(f"Model mismatch: request specifies '{request.model}' but server is running '{model_name}'."), ) try: @@ -1501,8 +1516,9 @@ async def edit_images( # 1. get engine and model engine_client, model_name, stage_configs = _get_engine_and_model(raw_request) if model is not None and model != model_name: - logger.warning( - f"Model mismatch: request specifies '{model}' but server is running '{model_name}'. Using server model." + raise HTTPException( + status_code=HTTPStatus.BAD_REQUEST.value, + detail=(f"Model mismatch: request specifies '{model}' but server is running '{model_name}'."), ) # 2. 
get output format & compression output_format = _choose_output_format(output_format, background) @@ -2218,10 +2234,12 @@ async def _parse_video_form( app_model_name, app_stage_configs = _resolve_video_runtime_context(raw_request) effective_model_name = handler.model_name or app_model_name or request.model or "unknown" if request.model is not None and effective_model_name is not None and request.model != effective_model_name: - logger.warning( - "Model mismatch: request specifies '%s' but server is running '%s'. Using server model.", - request.model, - effective_model_name, + raise HTTPException( + status_code=HTTPStatus.BAD_REQUEST.value, + detail=( + f"Model mismatch: request specifies '{request.model}' but server is running " + f"'{effective_model_name}'." + ), ) handler.set_stage_configs_if_missing(app_stage_configs) except HTTPException: diff --git a/vllm_omni/entrypoints/openai/realtime_connection.py b/vllm_omni/entrypoints/openai/realtime_connection.py new file mode 100644 index 0000000000..1d5470f569 --- /dev/null +++ b/vllm_omni/entrypoints/openai/realtime_connection.py @@ -0,0 +1,193 @@ +from __future__ import annotations + +import asyncio +import base64 +import json +from collections.abc import AsyncGenerator +from uuid import uuid4 + +import numpy as np +from vllm.entrypoints.openai.engine.protocol import UsageInfo +from vllm.entrypoints.openai.realtime.connection import RealtimeConnection as VllmRealtimeConnection +from vllm.entrypoints.openai.realtime.protocol import TranscriptionDelta, TranscriptionDone +from vllm.logger import init_logger + +logger = init_logger(__name__) + + +class RealtimeConnection(VllmRealtimeConnection): + """Omni realtime connection with audio-only server events. + + Reuses upstream vLLM websocket/session lifecycle and only customizes + generation output handling to emit audio deltas. 
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        # Last audio buffer seen for this realtime generation (cumulative or concatenation
+        # of increments); used to turn server cumulative PCM into true deltas.
+        self._realtime_audio_ref: np.ndarray | None = None
+
+    async def start_generation(self):
+        await super().start_generation()
+
+    @staticmethod
+    def _tensor_to_numpy(value) -> np.ndarray | None:
+        if value is None:
+            return None
+        if isinstance(value, np.ndarray):
+            arr = value
+        elif hasattr(value, "detach"):
+            arr = value.detach().float().cpu().numpy()
+        else:
+            try:
+                arr = np.asarray(value)
+            except Exception:
+                return None
+        if arr.ndim > 1:
+            arr = arr.reshape(-1)
+        return arr.astype(np.float32, copy=False)
+
+    @staticmethod
+    def _numpy_audio_prefix_match(prev: np.ndarray, curr: np.ndarray) -> bool:
+        n = prev.shape[0]
+        if n == 0:
+            return True
+        if curr.shape[0] < n:
+            return False
+        return bool(np.allclose(curr[:n], prev, rtol=1e-3, atol=2e-4))
+
+    def _raw_waveform_to_deltas(self, arr: np.ndarray) -> list[np.ndarray]:
+        """Convert one streaming PCM f32 chunk into incremental piece(s) for the client.
+
+        Some engine paths emit a growing cumulative waveform each step; others emit
+        true per-step deltas. We support both without duplicating audio on the client.
+        """
+        if arr.size == 0:
+            return []
+        ref = self._realtime_audio_ref
+        if ref is None:
+            self._realtime_audio_ref = arr.copy()
+            return [arr]
+        if self._numpy_audio_prefix_match(ref, arr):
+            delta = arr[ref.shape[0] :]
+            self._realtime_audio_ref = arr.copy()
+            return [delta] if delta.size > 0 else []
+        # True per-step delta (not a prefix extension of what we have seen).
+        self._realtime_audio_ref = np.concatenate([ref, arr])
+        return [arr]
+
+    def _extract_audio_chunks(self, output) -> tuple[list[np.ndarray], int]:
+        mm = getattr(output, "multimodal_output", None)
+        if not isinstance(mm, dict):
+            return [], 24000
+
+        sr = mm.get("sr") or mm.get("sample_rate") or mm.get("audio_sample_rate") or 24000
+        key = "audio" if "audio" in mm else ("model_outputs" if "model_outputs" in mm else None)
+        if key is None:
+            return [], int(sr)
+
+        raw_audio = mm.get(key)
+        chunks: list[np.ndarray] = []
+        if isinstance(raw_audio, (list, tuple)):
+            if len(raw_audio) > 0:
+                arr = self._tensor_to_numpy(raw_audio[-1])
+                if arr is not None and arr.size > 0:
+                    chunks.extend(self._raw_waveform_to_deltas(arr))
+        else:
+            arr = self._tensor_to_numpy(raw_audio)
+            if arr is not None and arr.size > 0:
+                chunks.extend(self._raw_waveform_to_deltas(arr))
+        return chunks, int(sr)
+
+    @staticmethod
+    def _pcm16_b64(audio_f32: np.ndarray) -> str:
+        clipped = np.clip(audio_f32, -1.0, 1.0)
+        pcm16 = (clipped * 32767.0).astype(np.int16)
+        return base64.b64encode(pcm16.tobytes()).decode("utf-8")
+
+    async def _run_generation(
+        self,
+        streaming_input_gen: AsyncGenerator,
+        input_stream: asyncio.Queue[list[int]],
+    ):
+        request_id = f"rt-{self.connection_id}-{uuid4()}"
+        sent_audio = False
+        audio_done_sent = False
+        full_text = ""
+        sent_text_len = 0
+        prompt_token_ids_len = 0
+        completion_tokens_len = 0
+        self._realtime_audio_ref = None
+
+        try:
+            result_gen = self.serving.engine_client.generate(
+                prompt=streaming_input_gen,
+                request_id=request_id,
+            )
+
+            async for output in result_gen:
+                if output.outputs and len(output.outputs) > 0:
+                    output0 = output.outputs[0]
+                    token_ids = list(output0.token_ids)
+                    if token_ids:
+                        input_stream.put_nowait(token_ids)
+                        # token_ids are cumulative per request
+                        completion_tokens_len = len(token_ids)
+                    if not prompt_token_ids_len and output.prompt_token_ids:
+                        prompt_token_ids_len = len(output.prompt_token_ids)
+                    cumulative_text = output0.text or ""
+                    if cumulative_text:
+                        if len(cumulative_text) >= sent_text_len:
+                            delta_text = cumulative_text[sent_text_len:]
+                        else:
+                            delta_text = cumulative_text
+                        sent_text_len = len(cumulative_text)
+                        full_text = cumulative_text
+                    else:
+                        delta_text = ""
+
+                    if delta_text:
+                        await self.send(TranscriptionDelta(delta=delta_text))
+
+                audio_chunks, sample_rate = self._extract_audio_chunks(output)
+
+                for chunk in audio_chunks:
+                    sent_audio = True
+                    await self.send_json(
+                        {
+                            "type": "response.audio.delta",
+                            "audio": self._pcm16_b64(chunk),
+                            "format": "pcm16",
+                            "sample_rate_hz": sample_rate,
+                        }
+                    )
+
+                if not self._is_connected:
+                    break
+
+            usage = UsageInfo(
+                prompt_tokens=prompt_token_ids_len,
+                completion_tokens=completion_tokens_len,
+                total_tokens=prompt_token_ids_len + completion_tokens_len,
+            )
+            await self.send(TranscriptionDone(text=full_text, usage=usage))
+
+            if sent_audio:
+                await self.send_json({"type": "response.audio.done", "has_audio": True})
+                audio_done_sent = True
+        except Exception as e:
+            logger.exception("Error in generation: %s", e)
+            await self.send_error(str(e), "processing_error")
+        finally:
+            # Always send terminal event so clients don't hang forever.
+ if self._is_connected and not audio_done_sent: + try: + await self.send_json({"type": "response.audio.done", "has_audio": sent_audio}) + except Exception: + logger.exception("Failed to send response.audio.done") + while not self.audio_queue.empty(): + self.audio_queue.get_nowait() + + async def send_json(self, payload: dict): + await self.websocket.send_text(json.dumps(payload)) diff --git a/vllm_omni/entrypoints/openai/serving_speech.py b/vllm_omni/entrypoints/openai/serving_speech.py index 3eaf18111c..ba8292f0c2 100644 --- a/vllm_omni/entrypoints/openai/serving_speech.py +++ b/vllm_omni/entrypoints/openai/serving_speech.py @@ -457,6 +457,25 @@ def _estimate_fish_prompt_len(self, text: str, ref_text: str, ref_audio: object) logger.warning("Failed to estimate Fish Speech prompt length, using fallback 2048: %s", e) return 2048 + async def _build_voxcpm2_prompt(self, request: OpenAICreateSpeechRequest) -> dict[str, Any]: + """Build prefill prompt for VoxCPM2 TTS (`prompt_token_ids` padded to full prefill length).""" + from vllm_omni.model_executor.models.voxcpm2.voxcpm2_talker import build_voxcpm2_prompt + + self._voxcpm2_encode("") # lazy-init tokenizer + split_map + ref_audio = None + ref_sr = None + if request.ref_audio is not None: + ref_audio, ref_sr = await self._resolve_ref_audio(request.ref_audio) + return build_voxcpm2_prompt( + hf_config=self.engine_client.model_config.hf_config, + tokenizer=self._voxcpm2_tokenizer, + split_map=self._voxcpm2_split_map, + text=request.input, + ref_audio=ref_audio, + ref_sr=ref_sr, + ref_text=request.ref_text, + ) + def _get_uploaded_audio_data(self, voice_name: str) -> str | None: """Get base64 encoded audio data for uploaded voice.""" voice_name_lower = voice_name.lower() @@ -1524,16 +1543,8 @@ async def _prepare_speech_generation( if request.instructions: prompt["instruct"] = request.instructions elif self._tts_model_type == "voxcpm2": + prompt = await self._build_voxcpm2_prompt(request) tts_params = {} - additional: 
dict[str, Any] = {} - if request.ref_audio is not None: - wav_list, sr = await self._resolve_ref_audio(request.ref_audio) - additional["reference_audio"] = [[wav_list, sr]] - # Pre-split multichar Chinese tokens (VoxCPM2 was trained with single-char CJK IDs). - token_ids = self._voxcpm2_encode(request.input) - prompt: dict[str, Any] = {"prompt_token_ids": token_ids} - if additional: - prompt["additional_information"] = additional elif self._is_tts: validation_error = self._validate_tts_request(request) if validation_error: diff --git a/vllm_omni/entrypoints/utils.py b/vllm_omni/entrypoints/utils.py index 84391c2ea8..5757d38990 100644 --- a/vllm_omni/entrypoints/utils.py +++ b/vllm_omni/entrypoints/utils.py @@ -1,3 +1,4 @@ +import argparse import os import types from collections import Counter @@ -5,10 +6,12 @@ from pathlib import Path from typing import Any, get_args, get_origin +import yaml from vllm.logger import init_logger from vllm.transformers_utils.config import get_config, get_hf_file_to_dict from vllm.transformers_utils.repo_utils import file_or_path_exists +from vllm_omni.config.stage_config import StageConfigFactory from vllm_omni.config.yaml_util import create_config, load_yaml_config, merge_configs from vllm_omni.entrypoints.stage_utils import _to_dict from vllm_omni.platforms import current_omni_platform @@ -23,6 +26,65 @@ } +def detect_explicit_cli_keys( + argv: list[str], + parser: argparse.ArgumentParser | None = None, +) -> set[str]: + """Walk ``argv`` and return the set of ``dest`` attribute names the user + explicitly provided (e.g. ``--max-num-seqs 64`` → ``max_num_seqs``). + + Used to distinguish user-typed CLI args from argparse default values so + deploy YAMLs are not silently overridden by parser defaults. 
Shared + across online (``vllm serve``) and offline (scripts, examples, tests, + CI) entry points — offline callers that parse CLI args via argparse + should invoke this on ``sys.argv[1:]`` and pass the result through to + ``AsyncOmni`` / ``Omni`` via the ``_cli_explicit_keys`` kwarg. + + When ``parser`` is provided, each token is looked up in the parser's + action table to find its real ``dest``. This correctly handles flags + with ``dest=`` overrides, alias flags (e.g. ``--usp`` / + ``--ulysses-degree`` both mapping to ``ulysses_degree``), and + ``--disable-foo`` / ``store_false`` patterns that map to a differently + named dest. Callers with access to an ``argparse.ArgumentParser`` should + always pass it. + + When ``parser`` is ``None``, a name-based heuristic is used as a + fallback (hyphens → underscores, plus a ``no_`` prefix strip for + ``argparse.BooleanOptionalAction``). This is correct for simple flags + but silently misidentifies ``--disable-X``-style flags and explicit + ``dest=`` overrides, so prefer the parser-aware form. + """ + if parser is not None: + dest_map: dict[str, str] = {} + for action in parser._actions: + for opt in action.option_strings: + dest_map[opt] = action.dest + explicit: set[str] = set() + for tok in argv: + if not tok.startswith("--"): + continue + flag = tok.split("=", 1)[0] + dest = dest_map.get(flag) + if dest is not None: + explicit.add(dest) + return explicit + + # Fallback: name-based heuristic (legacy path for callers without a parser). + explicit = set() + for tok in argv: + if not tok.startswith("--"): + continue + name = tok[2:].split("=", 1)[0] + if not name: + continue + attr = name.replace("-", "_") + explicit.add(attr) + # BooleanOptionalAction: --no-foo records as dest `foo`, not `no_foo`. 
+        if attr.startswith("no_"):
+            explicit.add(attr[3:])
+    return explicit
+
+
 def inject_omni_kv_config(stage: Any, omni_conn_cfg: dict[str, Any], omni_from: str, omni_to: str) -> None:
     """Inject connector configuration into stage engine arguments."""
     # Prepare omni_kv_config dict
@@ -273,29 +335,59 @@ def resolve_model_config_path(model: str) -> str:
     return str(stage_config_path)


-def load_stage_configs_from_model(model: str, base_engine_args: dict | None = None) -> list:
+def load_stage_configs_from_model(
+    model: str,
+    base_engine_args: dict | None = None,
+    deploy_config_path: str | None = None,
+    stage_overrides: dict[str, dict[str, Any]] | None = None,
+    cli_explicit_keys: set[str] | None = None,
+) -> list:
     """Load stage configurations from model's default config file.

-    .. deprecated::
-        This is the legacy OmegaConf-based loading path. New code should use
-        ``StageConfigFactory.create_from_model()`` instead. This function will
-        be removed once all callers are migrated (see PR series [2/N]).
+    For models registered in the pipeline registry (new path), uses
+    ``StageConfigFactory.create_from_model()`` which merges
+    PipelineConfig + DeployConfig + CLI overrides.

-    Loads stage configurations based on the model type and device type.
-    First tries to load a device-specific YAML file from stage_configs/{device_type}/
-    directory. If not found, falls back to the default config file.
+    For other models (legacy path), loads stage configs from YAML.

     Args:
         model: Model name or path (used to determine model_type)
+        base_engine_args: Base engine args to merge as CLI overrides.
+        deploy_config_path: Optional explicit deploy config path.
+        stage_overrides: Per-stage overrides from --stage-overrides.
+        cli_explicit_keys: Set of CLI keys the user actually typed. When
+            provided, only these keys override deploy YAML; argparse defaults
+            stay subordinate to YAML. ``None`` means treat every kwarg as
+            explicit (programmatic ``Omni()`` calls).
     Returns:
         List of stage configuration dictionaries
-
-    Raises:
-        FileNotFoundError: If no stage config file exists for the model type
     """
     if base_engine_args is None:
         base_engine_args = {}
+
+    cli_overrides = _convert_dataclasses_to_dict(dict(base_engine_args))
+    # Per-stage JSON overrides are always explicit (the user typed --stage-overrides).
+    explicit = set(cli_explicit_keys) if cli_explicit_keys is not None else None
+    if stage_overrides:
+        for stage_id_str, overrides in stage_overrides.items():
+            for key, val in overrides.items():
+                stage_key = f"stage_{stage_id_str}_{key}"
+                cli_overrides[stage_key] = val
+                if explicit is not None:
+                    explicit.add(stage_key)
+
+    stages = StageConfigFactory.create_from_model(
+        model,
+        cli_overrides=cli_overrides,
+        deploy_config_path=deploy_config_path,
+        cli_explicit_keys=explicit,
+    )
+    if stages is not None:
+        # Convert StageConfig objects to OmegaConf for backward compat
+        return [stage.to_omegaconf() for stage in stages]
+
+    # Legacy fallback: load from YAML
     stage_config_path = resolve_model_config_path(model)
     if stage_config_path is None:
         return []
@@ -312,10 +404,9 @@ def load_stage_configs_from_yaml(
     base_engine_args: dict | None = None,
     prefer_stage_engine_args: bool = True,
 ) -> list:
-    """Load stage configurations from a YAML file.
+    """Load stage configurations from a YAML file (legacy OmegaConf path).

-    .. deprecated::
-        Legacy OmegaConf-based loader. Will be removed in PR series [2/N].
+    TODO(@lishunyang12): remove once all models use PipelineConfig + DeployConfig.

     Args:
         config_path: Path to the YAML configuration file
@@ -449,22 +540,75 @@ def load_and_resolve_stage_configs(
     stage_configs_path: str | None,
     kwargs: dict | None,
     default_stage_cfg_factory: Any = None,
+    deploy_config_path: str | None = None,
+    stage_overrides: dict[str, dict[str, Any]] | None = None,
+    cli_explicit_keys: set[str] | None = None,
 ) -> tuple[str, list]:
     """Load stage configurations from model or YAML file with fallback to defaults.
     Args:
         model: Model name or path
-        stage_configs_path: Optional path to YAML file containing stage configurations
+        stage_configs_path: Optional path to legacy YAML (stage_args format)
         kwargs: Engine arguments to merge with stage configs
         default_stage_cfg_factory: Optional callable that takes no args and returns
             default stage config list when no configs are found
+        deploy_config_path: Optional path to deploy YAML (new format).
+            Mutually exclusive with ``stage_configs_path``.
+        stage_overrides: Per-stage overrides from ``--stage-overrides`` JSON.
+            Keys are stage_id strings, values are dicts of overrides.

     Returns:
         Tuple of (config_path, stage_configs)
     """
-    if stage_configs_path is None:
+    if stage_configs_path is not None and deploy_config_path is not None:
+        raise ValueError(
+            "--stage-configs-path and --deploy-config are mutually exclusive: "
+            "they use different path resolution rules and loading paths. "
+            "Use --deploy-config for new-format YAMLs (preferred); "
+            "--stage-configs-path is kept only for the legacy `stage_args` format "
+            "and will be removed in a future release."
+        )
+    if stage_configs_path is not None and deploy_config_path is None:
+        if not os.path.exists(stage_configs_path):
+            raise FileNotFoundError(
+                f"--stage-configs-path {stage_configs_path!r} does not exist. "
+                "Legacy `stage_configs/` yamls were replaced by `vllm_omni/deploy/.yaml`; "
+                "use --deploy-config. See docs/configuration/stage_configs.md."
+            )
+        with open(stage_configs_path, encoding="utf-8") as f:
+            _peek = yaml.safe_load(f) or {}
+        if "stages" in _peek and "stage_args" not in _peek:
+            deploy_config_path = stage_configs_path
+            stage_configs_path = None
+        else:
+            logger.warning(
+                "--stage-configs-path is deprecated; migrate %r and use --deploy-config.",
+                stage_configs_path,
+            )
+
+    if deploy_config_path is not None:
+        config_path = deploy_config_path
+        stage_configs = load_stage_configs_from_model(
+            model,
+            base_engine_args=kwargs,
+            deploy_config_path=deploy_config_path,
+            stage_overrides=stage_overrides,
+            cli_explicit_keys=cli_explicit_keys,
+        )
+        if not stage_configs:
+            if default_stage_cfg_factory is not None:
+                default_stage_cfg = default_stage_cfg_factory()
+                stage_configs = create_config(default_stage_cfg)
+            else:
+                stage_configs = []
+    elif stage_configs_path is None:
         config_path = resolve_model_config_path(model)
-        stage_configs = load_stage_configs_from_model(model, base_engine_args=kwargs)
+        stage_configs = load_stage_configs_from_model(
+            model,
+            base_engine_args=kwargs,
+            stage_overrides=stage_overrides,
+            cli_explicit_keys=cli_explicit_keys,
+        )
         if not stage_configs:
             if default_stage_cfg_factory is not None:
                 default_stage_cfg = default_stage_cfg_factory()
diff --git a/vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py b/vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py
new file mode 100644
index 0000000000..b44d08eb32
--- /dev/null
+++ b/vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""Qwen2.5-Omni pipeline topology (frozen).
+
+Stage 0: Thinker — multimodal understanding + text generation
+Stage 1: Talker — text embeddings → speech tokens
+Stage 2: Code2Wav — speech tokens → audio waveform
+"""
+
+from vllm_omni.config.stage_config import (
+    PipelineConfig,
+    StageExecutionType,
+    StagePipelineConfig,
+)
+
+_PROC = "vllm_omni.model_executor.stage_input_processors.qwen2_5_omni"
+
+QWEN2_5_OMNI_PIPELINE = PipelineConfig(
+    model_type="qwen2_5_omni",
+    model_arch="Qwen2_5OmniForConditionalGeneration",
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="thinker",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            engine_output_type="latent",
+            sampling_constraints={"detokenize": True},
+        ),
+        StagePipelineConfig(
+            stage_id=1,
+            model_stage="talker",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(0,),
+            engine_output_type="latent",
+            custom_process_input_func=f"{_PROC}.thinker2talker",
+            sampling_constraints={
+                "detokenize": True,
+                "stop_token_ids": [8294],
+            },
+        ),
+        StagePipelineConfig(
+            stage_id=2,
+            model_stage="code2wav",
+            execution_type=StageExecutionType.LLM_GENERATION,
+            input_sources=(1,),
+            final_output=True,
+            final_output_type="audio",
+            engine_output_type="audio",
+            sampling_constraints={"detokenize": True},
+        ),
+    ),
+)
+
+
+# Single-stage thinker-only variant for the abort test.
+QWEN2_5_OMNI_THINKER_ONLY_PIPELINE = PipelineConfig(
+    model_type="qwen2_5_omni_thinker_only",
+    model_arch="Qwen2_5OmniForConditionalGeneration",
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="thinker",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            engine_output_type="latent",
+            sampling_constraints={"detokenize": True},
+        ),
+    ),
+)
diff --git a/vllm_omni/model_executor/models/qwen3_omni/pipeline.py b/vllm_omni/model_executor/models/qwen3_omni/pipeline.py
new file mode 100644
index 0000000000..1c69ec7957
--- /dev/null
+++ b/vllm_omni/model_executor/models/qwen3_omni/pipeline.py
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""Qwen3-Omni-MoE pipeline topology (frozen).
+
+Stage 0: Thinker — multimodal understanding + text generation
+Stage 1: Talker — text embeddings → RVQ codec codes
+Stage 2: Code2Wav — RVQ codes → audio waveform
+"""
+
+from vllm_omni.config.stage_config import (
+    PipelineConfig,
+    StageExecutionType,
+    StagePipelineConfig,
+)
+
+_PROC = "vllm_omni.model_executor.stage_input_processors.qwen3_omni"
+
+QWEN3_OMNI_PIPELINE = PipelineConfig(
+    model_type="qwen3_omni_moe",
+    model_arch="Qwen3OmniMoeForConditionalGeneration",
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="thinker",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            hf_config_name="thinker_config",
+            engine_output_type="latent",
+            custom_process_next_stage_input_func=(f"{_PROC}.thinker2talker_async_chunk"),
+            sampling_constraints={"detokenize": True},
+        ),
+        StagePipelineConfig(
+            stage_id=1,
+            model_stage="talker",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(0,),
+            hf_config_name="talker_config",
+            engine_output_type="latent",
+            custom_process_input_func=f"{_PROC}.thinker2talker",
+            custom_process_next_stage_input_func=(f"{_PROC}.talker2code2wav_async_chunk"),
+            sampling_constraints={
+                "detokenize": False,
+                "stop_token_ids": [2150],
+            },
+        ),
+        StagePipelineConfig(
+            stage_id=2,
+            model_stage="code2wav",
+            execution_type=StageExecutionType.LLM_GENERATION,
+            input_sources=(1,),
+            final_output=True,
+            final_output_type="audio",
+            hf_config_name="thinker_config",
+            engine_output_type="audio",
+            custom_process_input_func=f"{_PROC}.talker2code2wav",
+            sampling_constraints={"detokenize": True},
+        ),
+    ),
+)
diff --git a/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py b/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py
index 6508feda1a..133bc2fe32 100644
--- a/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py
+++ b/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py
@@ -171,6 +171,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
                 "trailing_text_hidden",
                 "tts_pad_embed_projected",
             }
+            # Keys that need to be accumulated across streaming inputs
+            self.streaming_accumulated_keys: set[str] = {
+                "thinker_prefill_embeddings",
+                "thinker_hidden_states",
+            }
         elif self.model_stage == "code2wav":
             self.enable_update_additional_information = True

diff --git a/vllm_omni/model_executor/models/qwen3_tts/pipeline.py b/vllm_omni/model_executor/models/qwen3_tts/pipeline.py
new file mode 100644
index 0000000000..5051715cea
--- /dev/null
+++ b/vllm_omni/model_executor/models/qwen3_tts/pipeline.py
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""Qwen3-TTS pipeline: Talker (text → RVQ codec) → Code2Wav (codec → audio).
+
+Chunked vs end-to-end mode is dispatched from ``deploy.async_chunk``.
+""" + +from vllm_omni.config.stage_config import ( + PipelineConfig, + StageExecutionType, + StagePipelineConfig, +) + +_PROC = "vllm_omni.model_executor.stage_input_processors.qwen3_tts" + +QWEN3_TTS_PIPELINE = PipelineConfig( + model_type="qwen3_tts", + # Pipeline-level default; the code2wav stage overrides per-stage below. + model_arch="Qwen3TTSTalkerForConditionalGeneration", + stages=( + StagePipelineConfig( + stage_id=0, + model_stage="qwen3_tts", + execution_type=StageExecutionType.LLM_AR, + input_sources=(), + owns_tokenizer=True, + engine_output_type="latent", + async_chunk_process_next_stage_input_func=(f"{_PROC}.talker2code2wav_async_chunk"), + sampling_constraints={ + "detokenize": False, + "stop_token_ids": [2150], + }, + ), + StagePipelineConfig( + stage_id=1, + model_stage="code2wav", + execution_type=StageExecutionType.LLM_GENERATION, + input_sources=(0,), + final_output=True, + final_output_type="audio", + engine_output_type="audio", + model_arch="Qwen3TTSCode2Wav", + sync_process_input_func=f"{_PROC}.talker2code2wav", + sampling_constraints={"detokenize": True}, + extras={"tts_args": {"max_instructions_length": 500}}, + ), + ), +) diff --git a/vllm_omni/model_executor/models/qwen3_tts/pipeline.yaml b/vllm_omni/model_executor/models/qwen3_tts/pipeline.yaml deleted file mode 100644 index fd8ea3a3f4..0000000000 --- a/vllm_omni/model_executor/models/qwen3_tts/pipeline.yaml +++ /dev/null @@ -1,93 +0,0 @@ -model_type: qwen3_tts -async_chunk: true - -stages: - - stage_id: 0 - model_stage: qwen3_tts - stage_type: llm - is_comprehension: true - input_sources: [] - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - runtime: - devices: "0" - engine_args: - max_num_seqs: 10 - model_arch: Qwen3TTSTalkerForConditionalGeneration - hf_overrides: - architectures: [Qwen3TTSTalkerForConditionalGeneration] - enforce_eager: false - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - 
engine_output_type: latent - gpu_memory_utilization: 0.08 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - model_stage: code2wav - stage_type: llm - input_sources: [0] - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - final_output: true - final_output_type: audio - runtime: - devices: "0" - engine_args: - max_num_seqs: 1 - model_arch: Qwen3TTSCode2Wav - hf_overrides: - architectures: [Qwen3TTSCode2Wav] - enforce_eager: true - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.08 - distributed_executor_backend: "mp" - max_num_batched_tokens: 65536 - max_model_len: 65536 - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - -connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - codec_streaming: true - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - codec_chunk_frames: 25 - # Match the decoder sliding attention window to avoid chunk-boundary noise. 
- codec_left_context_frames: 72 - -edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py b/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py index 6b7b688f15..d9cbcf7d4e 100644 --- a/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py +++ b/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py @@ -23,6 +23,7 @@ from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead from vllm.model_executor.models.qwen3 import Qwen3Model from vllm.model_executor.models.utils import AutoWeightsLoader, PPMissingLayer, WeightsMapper, maybe_prefix +from vllm.multimodal.audio import AudioResampler from vllm.sequence import IntermediateTensors from vllm_omni.model_executor.models.output_templates import OmniOutput @@ -1094,9 +1095,8 @@ def _extract_speaker_embedding(self, wav: np.ndarray, sr: int) -> torch.Tensor: # Resample to 24kHz for speaker encoder. target_sr = int(getattr(self.config.speaker_encoder_config, "sample_rate", 24000)) if sr != target_sr: - from vllm.multimodal.audio import resample_audio_resampy - - wav = resample_audio_resampy(wav.astype(np.float32), orig_sr=int(sr), target_sr=target_sr) + resampler = AudioResampler(target_sr=target_sr) + wav = resampler.resample(wav.astype(np.float32), orig_sr=int(sr)) sr = target_sr # Follow official implementation: mel_spectrogram expects 24kHz. 
diff --git a/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py b/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py index 3db5cfd1b8..14bfbc5eed 100644 --- a/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py +++ b/vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py @@ -22,7 +22,7 @@ import torch from torch.nn.utils.rnn import pad_sequence from transformers import AutoConfig, AutoFeatureExtractor, AutoModel -from vllm.multimodal.audio import resample_audio_resampy +from vllm.multimodal.audio import AudioResampler from vllm.multimodal.media.audio import load_audio as _load_audio_file from .tokenizer_12hz.configuration_qwen3_tts_tokenizer_v2 import Qwen3TTSTokenizerV2Config @@ -161,7 +161,8 @@ def load_audio( audio = np.mean(audio, axis=-1) if sr != target_sr: - audio = resample_audio_resampy(audio, orig_sr=sr, target_sr=target_sr) + resampler = AudioResampler(target_sr=target_sr) + audio = resampler.resample(audio, orig_sr=sr) return audio.astype(np.float32) @@ -209,7 +210,8 @@ def _normalize_audio_inputs( if a.ndim > 1: a = np.mean(a, axis=-1) if int(sr) != target_sr: - a = resample_audio_resampy(a.astype(np.float32), orig_sr=int(sr), target_sr=target_sr) + resampler = AudioResampler(target_sr=target_sr) + a = resampler.resample(a.astype(np.float32), orig_sr=int(sr)) out.append(a.astype(np.float32)) return out diff --git a/vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py b/vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py index b666e41ebc..3724528898 100644 --- a/vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py +++ b/vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py @@ -13,6 +13,7 @@ import copy import dataclasses import logging +import math import os import time from collections.abc import Iterable @@ -40,6 +41,11 @@ _ENABLE_PROFILING = os.environ.get("VOXCPM2_PROFILE", "0") == "1" +# Lower bound for the _active_states leak-warn threshold. 
The effective +# threshold is max(_ACTIVE_STATE_LEAK_WARN_MIN, 4 * max_batch_size) so small +# deployments still get a usable floor instead of a tiny noisy one. +_ACTIVE_STATE_LEAK_WARN_MIN = 512 + def is_cjk_char(c: str) -> bool: """Check if a character is a CJK ideograph.""" @@ -80,6 +86,44 @@ def split_multichar_chinese(token_ids: list[int], split_map: dict[int, list[int] return result +def build_voxcpm2_prompt( + hf_config: Any, + tokenizer: Any, + split_map: dict[int, list[int]], + text: str, + ref_audio: Any | None = None, + ref_sr: int | None = None, + ref_text: str | None = None, +) -> dict[str, Any]: + """Build a VoxCPM2 prefill prompt whose ``prompt_token_ids`` length matches + the talker-side prefill length. + + Used by both online serving (``serving_speech._build_voxcpm2_prompt``) and + the offline example, so the talker-side length assertion never fires. + """ + ids = split_multichar_chinese(tokenizer.encode(text, add_special_tokens=True), split_map) + bos = tokenizer.bos_token_id + if ids and ids[0] == bos: + ids = ids[1:] + prefill_len = len(ids) + 1 # + audio_start + additional: dict[str, Any] = {"text_token_ids": [ids]} + if ref_audio is not None: + vae = hf_config.audio_vae_config + patch_samples = hf_config.patch_size * math.prod(vae["encoder_rates"]) + ref_len = math.ceil(math.ceil(len(ref_audio) * vae["sample_rate"] / ref_sr) / patch_samples) + if ref_text is not None: + additional["prompt_audio"] = [[ref_audio, ref_sr]] + additional["prompt_text"] = [ref_text] + ref_ids = split_multichar_chinese(tokenizer.encode(ref_text, add_special_tokens=True), split_map) + if ref_ids and ref_ids[0] == bos: + ref_ids = ref_ids[1:] + prefill_len += ref_len + len(ref_ids) + else: + additional["reference_audio"] = [[ref_audio, ref_sr]] + prefill_len += ref_len + 2 # ref_start / ref_end + return {"prompt_token_ids": [1] * prefill_len, "additional_information": additional} + + def _encode_raw_audio( tts: nn.Module, samples: list[float] | torch.Tensor, @@ -401,6 
+445,9 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): self._results_queue: list[tuple[str, torch.Tensor | None]] = [] self._audio_queue: list[tuple[str, Any]] = [] self._deferred_cleanup_ids: set[str] = set() + self._active_state_warn_threshold = max(_ACTIVE_STATE_LEAK_WARN_MIN, 4 * self._max_batch_size) + # one-shot by design: fires at most once per process to avoid log spam. + self._active_state_warned = False @property def tts(self) -> nn.Module: @@ -410,9 +457,20 @@ def tts(self) -> nn.Module: # -------------------- request state management -------------------- def _get_or_create_state(self, request_id: str) -> _RequestState: - if request_id not in self._active_states: - self._active_states[request_id] = _RequestState(request_id=request_id) - return self._active_states[request_id] + state = self._active_states.get(request_id) + if state is None: + state = _RequestState(request_id=request_id) + self._active_states[request_id] = state + if len(self._active_states) > self._active_state_warn_threshold and not self._active_state_warned: + logger.warning( + "VoxCPM2: _active_states size=%d exceeds threshold %d " + "(max_batch_size=%d); possible cleanup path leak", + len(self._active_states), + self._active_state_warn_threshold, + self._max_batch_size, + ) + self._active_state_warned = True + return state def _switch_to_request(self, request_id: str) -> _RequestState: if request_id != self._current_request_id: @@ -793,19 +851,12 @@ def _prepare_residual_prefill(self, state: _RequestState, base_lm_out: torch.Ten tts_len = text_mask.shape[1] scaffold_len = base_lm_out.shape[0] - - if scaffold_len < tts_len: - # Voice clone / continuation: scaffold only processed vllm tokens. - # Pad to match TTS sequence length (extra positions are masked out). 
- pad = torch.zeros( - tts_len - scaffold_len, - base_lm_out.shape[-1], - device=base_lm_out.device, - dtype=base_lm_out.dtype, - ) - enc_out = torch.cat([base_lm_out, pad], dim=0).unsqueeze(0) - else: - enc_out = base_lm_out.unsqueeze(0) + assert scaffold_len == tts_len, ( + f"voxcpm2 prefill length mismatch: scaffold_len={scaffold_len} tts_len={tts_len}; " + "caller must pad prompt_token_ids to the full prefill length " + "(see serving_speech._build_voxcpm2_prompt or the offline example)." + ) + enc_out = base_lm_out.unsqueeze(0) prefix_feat_cond = ( feat[:, -1, ...] @@ -1055,15 +1106,12 @@ def preprocess( is_prefill = span_len > 1 if is_prefill: - # Evict stale states - pending_ids = {rid for rid, *_ in self._pending_requests} - pending_ids.add(req_id) - if self._current_request_id: - pending_ids.add(self._current_request_id) - for rid in [r for r, s in self._active_states.items() if r not in pending_ids and s.prefill_completed]: - self._cleanup_request(rid) - - token_ids = input_ids.tolist() + # Do not evict state here: _pending_requests is a per-step prefix, + # not the full batch. Cleanup is driven by on_requests_finished -> + # _flush_deferred_cleanup (fed by vLLM scheduler._free_request via + # gpu_ar_model_runner.py). + real = info_dict.get("text_token_ids") + token_ids = input_ids.tolist() if real is None else real[0] # Fail-fast: unsplit multichar Chinese IDs in input_ids means the # serving layer didn't pre-split. Silent fixup here would cause # input_ids/embeds length mismatch (scheduler slot count is fixed). 
diff --git a/vllm_omni/model_executor/stage_configs/bagel.yaml b/vllm_omni/model_executor/stage_configs/bagel.yaml index dfe9da1c26..75f7c8a063 100644 --- a/vllm_omni/model_executor/stage_configs/bagel.yaml +++ b/vllm_omni/model_executor/stage_configs/bagel.yaml @@ -71,10 +71,6 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 - # Distributed connectors configuration (optional) # More connectors will be supported in the future. connectors: @@ -104,4 +100,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml b/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml index af038f59fb..7a0d851f0f 100644 --- a/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml +++ b/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml @@ -64,10 +64,6 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 - # Distributed connectors configuration (optional) # More connectors will be supported in the future. 
connectors: @@ -104,4 +100,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/bagel_single_stage.yaml b/vllm_omni/model_executor/stage_configs/bagel_single_stage.yaml index bb24763f90..b2d4b07b13 100644 --- a/vllm_omni/model_executor/stage_configs/bagel_single_stage.yaml +++ b/vllm_omni/model_executor/stage_configs/bagel_single_stage.yaml @@ -22,6 +22,3 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/model_executor/stage_configs/bagel_think.yaml b/vllm_omni/model_executor/stage_configs/bagel_think.yaml index 0d2098a203..2575e6736d 100644 --- a/vllm_omni/model_executor/stage_configs/bagel_think.yaml +++ b/vllm_omni/model_executor/stage_configs/bagel_think.yaml @@ -65,9 +65,6 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: shared_memory_connector: @@ -78,4 +75,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/bagel_usp2.yaml b/vllm_omni/model_executor/stage_configs/bagel_usp2.yaml index 33002b9aa5..4599f8b059 100644 --- a/vllm_omni/model_executor/stage_configs/bagel_usp2.yaml +++ b/vllm_omni/model_executor/stage_configs/bagel_usp2.yaml @@ -62,9 +62,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: shared_memory_connector: name: SharedMemoryConnector @@ -73,4 +70,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/cosyvoice3_async_chunk.yaml b/vllm_omni/model_executor/stage_configs/cosyvoice3_async_chunk.yaml index ca7e9850ae..13419ef107 100644 --- a/vllm_omni/model_executor/stage_configs/cosyvoice3_async_chunk.yaml +++ b/vllm_omni/model_executor/stage_configs/cosyvoice3_async_chunk.yaml @@ -63,9 +63,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: 
connector_of_shared_memory: @@ -82,4 +79,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/dynin_omni.yaml b/vllm_omni/model_executor/stage_configs/dynin_omni.yaml index 0724146aa7..131a0d1cd7 100644 --- a/vllm_omni/model_executor/stage_configs/dynin_omni.yaml +++ b/vllm_omni/model_executor/stage_configs/dynin_omni.yaml @@ -67,14 +67,9 @@ stage_args: # Top-level runtime config (concise): default windows and stage edges runtime: enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage edges: - from: 0 to: 1 - window_size: -1 - from: 1 to: 2 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/dynin_omni_multiconnector.yaml b/vllm_omni/model_executor/stage_configs/dynin_omni_multiconnector.yaml index 7259daa9ea..4a54f8188a 100644 --- a/vllm_omni/model_executor/stage_configs/dynin_omni_multiconnector.yaml +++ b/vllm_omni/model_executor/stage_configs/dynin_omni_multiconnector.yaml @@ -71,9 +71,6 @@ stage_args: # Top-level runtime config (concise): default windows and stage edges runtime: enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage #### # same as Qwen2.5_omni version # Distributed connectors configuration (optional) @@ -108,7 +105,5 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 - from: 1 to: 2 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml b/vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml index 0b0b278592..90f80c22d7 100644 --- a/vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml +++ b/vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml @@ -71,10 +71,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 16 - connectors: 
connector_of_shared_memory: name: SharedMemoryConnector @@ -93,4 +89,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/glm_image.yaml b/vllm_omni/model_executor/stage_configs/glm_image.yaml index 3cc23e1e25..05ac84a7a0 100644 --- a/vllm_omni/model_executor/stage_configs/glm_image.yaml +++ b/vllm_omni/model_executor/stage_configs/glm_image.yaml @@ -70,11 +70,6 @@ stage_args: # Top-level runtime config runtime: enabled: true - defaults: - window_size: -1 # Trigger downstream only after full upstream completion - max_inflight: 1 # Process serially within each stage - edges: - from: 0 # AR → Diffusion: trigger after AR completes to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/glm_image_muilticonnector.yaml b/vllm_omni/model_executor/stage_configs/glm_image_muilticonnector.yaml index 719c73a9fc..7bd66c403f 100644 --- a/vllm_omni/model_executor/stage_configs/glm_image_muilticonnector.yaml +++ b/vllm_omni/model_executor/stage_configs/glm_image_muilticonnector.yaml @@ -70,14 +70,9 @@ stage_args: # Top-level runtime config with MultiConnector support runtime: enabled: true - defaults: - window_size: -1 # Trigger downstream only after full upstream completion - max_inflight: 1 # Process serially within each stage - edges: - from: 0 # AR → Diffusion to: 1 - window_size: -1 # OmniConnector configuration for efficient inter-stage tensor transfer connectors: diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml index 203b54f257..b68b184ec3 100644 --- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml @@ -39,6 +39,3 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml 
b/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml index 9f6adece0f..413e0f09cb 100644 --- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml @@ -69,10 +69,6 @@ stage_args: # Top-level runtime config runtime: enabled: true - defaults: - window_size: -1 # Trigger downstream only after full upstream completion - max_inflight: 1 # Process serially within each stage edges: - from: 0 # AR → Diffusion to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml index aeef27a974..586b601bc5 100644 --- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml @@ -30,6 +30,3 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml index a60fe9a5b5..1d8c7f4812 100644 --- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml @@ -29,6 +29,3 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml index e029c38362..41ed74ba62 100644 --- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml @@ -39,6 +39,3 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml 
b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml index 60da8e0bc7..a0a1a0dc1c 100644 --- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml @@ -40,6 +40,3 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/model_executor/stage_configs/mimo_audio_async_chunk.yaml b/vllm_omni/model_executor/stage_configs/mimo_audio_async_chunk.yaml index b3c6bbbaf0..2fa1b982af 100644 --- a/vllm_omni/model_executor/stage_configs/mimo_audio_async_chunk.yaml +++ b/vllm_omni/model_executor/stage_configs/mimo_audio_async_chunk.yaml @@ -74,10 +74,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 - connectors: connector_of_shared_memory: name: SharedMemoryConnector @@ -93,4 +89,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/qwen2_5_omni.yaml b/vllm_omni/model_executor/stage_configs/qwen2_5_omni.yaml deleted file mode 100644 index 0a307b4477..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen2_5_omni.yaml +++ /dev/null @@ -1,107 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# The following config has been verified on 2x H100-80G GPU. 
-stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - max_num_batched_tokens: 32768 - mm_processor_cache_gb: 0 - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - max_num_batched_tokens: 32768 - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - 
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.15 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - max_num_batched_tokens: 32768 - async_scheduling: false - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/qwen2_5_omni_multiconnector.yaml b/vllm_omni/model_executor/stage_configs/qwen2_5_omni_multiconnector.yaml deleted file mode 100644 index 6e4f871e38..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen2_5_omni_multiconnector.yaml +++ /dev/null @@ -1,141 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# The following config has been verified on 1x H100-80G GPU. 
-stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - mm_processor_cache_gb: 0 - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - # Distributed connector configuration (optional) - output_connectors: - to_stage_1: mooncake_connector - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - # Distributed connector configuration (optional) - input_connectors: - from_stage_0: mooncake_connector - output_connectors: - to_stage_2: mooncake_connector - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "2" # Example: use a 
different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.3 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - max_num_batched_tokens: 32768 - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - # Distributed connector configuration (optional) - input_connectors: - from_stage_1: mooncake_connector - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - - # Distributed connectors configuration (optional) - # More connectors will be supported in the future. 
- connectors: - # Mooncake connector for cross-node/intra-node communication - mooncake_connector: - name: MooncakeStoreConnector - extra: - host: "127.0.0.1" - metadata_server: "http://10.90.67.86:8080/metadata" - master: "10.90.67.86:50051" - segment: 512000000 # 512MB - localbuf: 64000000 # 64MB - proto: "tcp" - - # Yuanrong connector for cross-node/intra-node communication - yuanrong_connector: - name: YuanrongConnector - extra: - host: "127.0.0.1" - port: "35000" - - # SharedMemory connector for intra-node communication - # Alternative SHM connector with different threshold - shared_memory_connector: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 # 64KB threshold - - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml b/vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml deleted file mode 100644 index 6c0f17f4f4..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml +++ /dev/null @@ -1,102 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. 
-async_chunk: false -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 1 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 32 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - 
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - stop_token_ids: [0] diff --git a/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml b/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml deleted file mode 100644 index d365c089da..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml +++ /dev/null @@ -1,118 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 16-layer RVQ codec codes) -# Stage 2: Code2Wav (16-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. 
-async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 1 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk - final_output: true - final_output_type: text - is_comprehension: true - # Use named connector to apply runtime.connectors.extra. - output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk - engine_input_source: [0] - # final_output: true - # final_output_type: text - # Distributed connector configuration - input_connectors: - from_stage_0: connector_of_shared_memory - default_sampling_params: - 
temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 51200 # [TODO] if max_num_batch_tokens < max_num_seqs * 800, there will be precision problem. - hf_config_name: thinker_config - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - stop_token_ids: [0] - -runtime: - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - # Align with Omni: small chunks with sufficient context overlap. 
- codec_chunk_frames: 25 # code2wav decode chunk size - codec_left_context_frames: 25 # code2wav left context size diff --git a/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_multiconnector.yaml b/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_multiconnector.yaml deleted file mode 100644 index 6c2d2a7669..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen3_omni_moe_multiconnector.yaml +++ /dev/null @@ -1,143 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings -> 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes -> audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - hf_config_name: thinker_config - tensor_parallel_size: 1 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - # Distributed connector configuration - output_connectors: - to_stage_1: mooncake_connector - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: true - 
trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - # tensor_parallel_size: 2 - enable_prefix_caching: false - distributed_executor_backend: "mp" - hf_config_name: talker_config - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - # Distributed connector configuration - input_connectors: - from_stage_0: mooncake_connector - output_connectors: - to_stage_2: mooncake_connector - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 64 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - # Distributed connector configuration - input_connectors: - from_stage_1: mooncake_connector - -# Top-level runtime config: default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 1 - - # Distributed connectors configuration - connectors: - # Mooncake connector for cross-node/intra-node communication - mooncake_connector: - 
name: MooncakeStoreConnector - extra: - host: "127.0.0.1" - metadata_server: "http://10.90.67.86:8080/metadata" - master: "10.90.67.86:50051" - segment: 512000000 # 512MB - localbuf: 64000000 # 64MB - proto: "tcp" - - # SharedMemory connector for intra-node communication - shared_memory_connector: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 # 64KB threshold - - edges: - - from: 0 - to: 1 - window_size: -1 - - from: 1 - to: 2 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml b/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml deleted file mode 100644 index 8e1a23f8eb..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml +++ /dev/null @@ -1,100 +0,0 @@ -async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - model_stage: qwen3_tts - max_num_seqs: 10 - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - # Use named connector to apply runtime.connectors.extra. 
- output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - # Must be divisible by num_code_groups and cover (left_context + chunk). - # Prefill length is Q * num_frames (e.g. 16 * 2148 = 34368); keep headroom past 32k. - max_num_batched_tokens: 65536 - # async_chunk appends windows per step; max_model_len must cover accumulated flat codec stream. - max_model_len: 65536 - engine_input_source: [0] - final_output: true - final_output_type: audio - # Distributed connector configuration - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - stop_token_ids: [0] - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 1 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - # Frame-aligned codec streaming transport. - codec_streaming: true - # Connector polling / timeout (unit: loop count, sleep interval in seconds). - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - # Match the decoder sliding attention window to avoid chunk-boundary noise. 
- codec_chunk_frames: 25 - codec_left_context_frames: 72 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml b/vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml deleted file mode 100644 index ea6a37f551..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml +++ /dev/null @@ -1,101 +0,0 @@ -# Same as qwen3_tts.yaml with batched talker and code2wav. -# Stage 0: max_num_seqs 4, stage 1: max_num_seqs 4. -# max_num_seqs must be a power of two to align with CUDA graph capture sizes -# (stage 0) and must match --batch-size in end2end.py / benchmark scripts. -async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - model_stage: qwen3_tts - max_num_seqs: 4 - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - # Use named connector to apply runtime.connectors.extra. 
- output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - model_stage: code2wav - max_num_seqs: 4 - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: true - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.2 - distributed_executor_backend: "mp" - # Must be divisible by num_code_groups and cover (left_context + chunk). - max_num_batched_tokens: 65536 - # Flat codec prompt can exceed 32k tokens (Q * frames); align with max_tokens below. - max_model_len: 65536 - engine_input_source: [0] - final_output: true - final_output_type: audio - # Distributed connector configuration - input_connectors: - from_stage_0: connector_of_shared_memory - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - stop_token_ids: [0] - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 1 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - # Frame-aligned codec streaming transport. - codec_streaming: true - # Connector polling / timeout (unit: loop count, sleep interval in seconds). - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - # Match the decoder sliding attention window to avoid chunk-boundary noise. 
- codec_chunk_frames: 25 - codec_left_context_frames: 72 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/qwen3_tts_no_async_chunk.yaml b/vllm_omni/model_executor/stage_configs/qwen3_tts_no_async_chunk.yaml deleted file mode 100644 index 61eb9c9f50..0000000000 --- a/vllm_omni/model_executor/stage_configs/qwen3_tts_no_async_chunk.yaml +++ /dev/null @@ -1,65 +0,0 @@ -async_chunk: false -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - model_stage: qwen3_tts - max_num_seqs: 1 - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: false - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.2 - distributed_executor_backend: "mp" - max_num_batched_tokens: 65536 - max_model_len: 65536 - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav - final_output: true - final_output_type: audio - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - 
detokenize: true - repetition_penalty: 1.0 - stop_token_ids: [0] diff --git a/vllm_omni/model_executor/stage_configs/qwen3_tts_uniproc.yaml b/vllm_omni/model_executor/stage_configs/qwen3_tts_uniproc.yaml index d2e920806d..4ca8d11ad7 100644 --- a/vllm_omni/model_executor/stage_configs/qwen3_tts_uniproc.yaml +++ b/vllm_omni/model_executor/stage_configs/qwen3_tts_uniproc.yaml @@ -72,9 +72,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: connector_of_shared_memory: @@ -94,4 +91,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/voxcpm_async_chunk.yaml b/vllm_omni/model_executor/stage_configs/voxcpm_async_chunk.yaml index cf78d4e438..c6fd177a35 100644 --- a/vllm_omni/model_executor/stage_configs/voxcpm_async_chunk.yaml +++ b/vllm_omni/model_executor/stage_configs/voxcpm_async_chunk.yaml @@ -77,9 +77,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: voxcpm_shm: @@ -99,4 +96,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml b/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml index 31cccb9ccf..b0d9a81e78 100644 --- a/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml +++ b/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml @@ -82,10 +82,6 @@ stage_args: # Top-level runtime config (concise): default windows and stage edges runtime: enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - connectors: connector_of_shared_memory: name: SharedMemoryConnector @@ -102,4 +98,3 @@ runtime: edges: - from: 0 # language_model → acoustic_transformer: trigger only after receiving full input (-1) to: 1 - window_size: -1 diff --git a/vllm_omni/model_executor/stage_input_processors/bagel.py 
b/vllm_omni/model_executor/stage_input_processors/bagel.py index bfcff0ea0f..52cc14d3aa 100644 --- a/vllm_omni/model_executor/stage_input_processors/bagel.py +++ b/vllm_omni/model_executor/stage_input_processors/bagel.py @@ -82,6 +82,8 @@ def expand_cfg_prompts( neg_prompt = _get_negative_prompt(prompt, sampling_params) if "image" in modalities: + if not neg_prompt: + return [] neg_prompt_dict = { "prompt": neg_prompt, "modalities": prompt.get("modalities", []), @@ -166,6 +168,8 @@ def expand_cfg_prompts_think( companion_params = {"max_tokens": 1} if "image" in modalities: + if not neg_prompt: + return [] neg_prompt_dict = { "prompt": neg_prompt, "modalities": prompt.get("modalities", []), @@ -287,9 +291,10 @@ def _get_negative_prompt( ) -> str: """Resolve the negative prompt for CFG from prompt or sampling params. - An empty string is treated the same as absent (falls through to - the Bagel default token pair), because an empty negative prompt is - not meaningful for CFG guidance. + Returns the negative prompt string when one is supplied, otherwise an + empty string. Callers decide how to treat the empty case: text2img + skips the cfg_text companion entirely, while img2img substitutes it + into the cfg_text prompt template. 
""" neg = prompt.get("negative_prompt") if neg: @@ -300,4 +305,4 @@ def _get_negative_prompt( if neg: return neg - return "<|im_start|><|im_end|>" + return "" diff --git a/vllm_omni/model_executor/stage_input_processors/qwen3_omni.py b/vllm_omni/model_executor/stage_input_processors/qwen3_omni.py index c502041fe2..699e4b194a 100644 --- a/vllm_omni/model_executor/stage_input_processors/qwen3_omni.py +++ b/vllm_omni/model_executor/stage_input_processors/qwen3_omni.py @@ -4,6 +4,7 @@ """Stage input processor for Qwen3 Omni MoE: Thinker → Talker transition.""" import logging +from dataclasses import dataclass, field from typing import Any import torch @@ -180,6 +181,102 @@ def _resolve_tts_token_embedding( return val.detach().to(device=device, dtype=torch.float) if val is not None else None +# ========================= +# Streaming input helpers +# ========================= + + +@dataclass +class _Thinker2TalkerStreamingState: + last_prompt_len: int = 0 + last_output_len: int = 0 + merged_sequences: list[int] = field(default_factory=list) + + +@dataclass +class _Qwen3OmniStreamingState: + thinker2talker: _Thinker2TalkerStreamingState = field(default_factory=_Thinker2TalkerStreamingState) + talker2code2wav_last_seq_len: int = 0 + + +def _get_qwen3_streaming_state( + request_id: str, + streaming_context: Any | None, +) -> _Qwen3OmniStreamingState: + bridge_states = getattr(streaming_context, "bridge_states", None) + per_model_state = bridge_states.setdefault("qwen3_omni", {}) + state = per_model_state.get(request_id) + if state is None: + state = _Qwen3OmniStreamingState() + per_model_state[request_id] = state + return state + + +def _get_streaming_talker_tokens( + request_id: str, + prompt_token_ids: list[int], + output_token_ids: list[int], + new_prompt_len_snapshot: int | None = None, + streaming_context: Any | None = None, + *, + clear_state: bool = False, +) -> tuple[list[int], list[int], list[int], list[int]]: + """Return streaming token slices and merged token 
views for thinker->talker. + e.g. For the second streaming input request: + merged_sequences: [input_prompt 1, output_tokens 1[:-1], input_prompt 2, output_tokens 2] + thinker_input_ids: [input_prompt 1, output_tokens 1[:-1], input_prompt 2] + Returns: + inc_prompt: prompt token delta for this segment. + inc_output: output token delta for this segment. + merged_sequences: full thinker_sequences to send downstream. + thinker_input_ids: full thinker_input_ids paired with merged_sequences. + """ + state = _get_qwen3_streaming_state(request_id, streaming_context).thinker2talker + if new_prompt_len_snapshot: + prompt_token_ids = prompt_token_ids[:-new_prompt_len_snapshot] + cur_prompt_len = len(prompt_token_ids) + cur_output_len = len(output_token_ids) + + inc_prompt = prompt_token_ids[state.last_prompt_len :] + inc_output = output_token_ids[state.last_output_len :] + delta_sequences = inc_prompt + inc_output + cached_sequences = state.merged_sequences + + merged_sequences = cached_sequences + delta_sequences + thinker_input_ids = cached_sequences + inc_prompt + + # Persist history for next segment. Drop the latest sampled token to keep + # thinker_input_ids / thinker_sequences alignment with next-step append. 
+ cached_sequences.extend(delta_sequences[:-1]) + + state.last_prompt_len = cur_prompt_len + state.last_output_len = cur_output_len + + if clear_state: + state.last_prompt_len = 0 + state.last_output_len = 0 + state.merged_sequences.clear() + + return inc_prompt, inc_output, merged_sequences, thinker_input_ids + + +def _get_streaming_codec_delta_len( + cur_seq_len: int, + request_id: str, + talker_output: Any, + streaming_context: Any | None = None, +) -> int: + """Return newly added seq_len for talker->code2wav in streaming mode.""" + state = _get_qwen3_streaming_state(request_id, streaming_context) + prev_seq_len = state.talker2code2wav_last_seq_len + seq_len = cur_seq_len - prev_seq_len + state.talker2code2wav_last_seq_len = cur_seq_len + 1 + if bool(getattr(talker_output, "finished", False)): + # Final segment: clear history to avoid cross-session carry-over. + state.talker2code2wav_last_seq_len = 0 + return seq_len + + # ========================= # Thinker -> Talker # ========================= @@ -272,6 +369,7 @@ def thinker2talker( engine_input_source: list[int], prompt: OmniTokensPrompt | TextPrompt | None = None, requires_multimodal_data: bool = False, + streaming_context: Any | None = None, ) -> list[OmniTokensPrompt]: """ Process thinker outputs to create talker inputs. 
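Review note on the streaming helpers added above: the bookkeeping in `_get_streaming_talker_tokens` is easier to follow in isolation. The sketch below is a simplified, standalone re-implementation for illustration only, not the code from this patch. `StreamState` and `next_segment` are hypothetical names, the `new_prompt_len_snapshot` trimming and `clear_state` handling are left out, and it assumes each call receives the cumulative prompt and output token lists for the request.

```python
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Hypothetical stand-in for the per-request thinker->talker state."""

    last_prompt_len: int = 0
    last_output_len: int = 0
    merged: list[int] = field(default_factory=list)


def next_segment(
    state: StreamState,
    prompt_ids: list[int],
    output_ids: list[int],
) -> tuple[list[int], list[int]]:
    """Return (merged_sequences, thinker_input_ids) for one streaming segment.

    Assumes prompt_ids and output_ids are cumulative across segments.
    """
    # Only the tokens that arrived since the previous segment.
    inc_prompt = prompt_ids[state.last_prompt_len:]
    inc_output = output_ids[state.last_output_len:]
    delta = inc_prompt + inc_output

    # Views sent downstream: cached history plus this segment's delta.
    merged = state.merged + delta
    input_ids = state.merged + inc_prompt

    # Persist history, dropping the newest sampled token, as the patch does.
    state.merged.extend(delta[:-1])
    state.last_prompt_len = len(prompt_ids)
    state.last_output_len = len(output_ids)
    return merged, input_ids


if __name__ == "__main__":
    state = StreamState()
    print(next_segment(state, [1, 2], [10, 11]))
    print(next_segment(state, [1, 2, 3, 4], [10, 11, 20, 21]))
```

Under that assumption, the second call reproduces the layout described in the helper's docstring: the merged view is `[prompt 1, output 1[:-1], prompt 2, output 2]` and the input-id view stops after `prompt 2`.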
@@ -305,18 +403,37 @@ def thinker2talker( # Process each thinker output for i, thinker_output in enumerate(thinker_outputs): output = thinker_output.outputs[0] + req_id = str(getattr(thinker_output, "request_id", f"idx-{i}")) + prompt_token_ids = _ensure_list(thinker_output.prompt_token_ids) + output_ids = _ensure_list(output.token_ids) + is_streaming_session = bool(getattr(streaming_context, "enabled", False)) + if is_streaming_session: + prompt_token_ids, output_ids, thinker_sequences, thinker_input_ids = _get_streaming_talker_tokens( + req_id, + prompt_token_ids, + output_ids, + getattr(streaming_context, "new_prompt_len_snapshot", None), + streaming_context, + clear_state=bool(getattr(thinker_output, "finished", False)), + ) + else: + thinker_sequences = prompt_token_ids + output_ids + thinker_input_ids = prompt_token_ids + # For streaming input, just send incremental prefill and hidden states tensor to talker + # Equally applicable to non-streaming cases. + new_seq_length = len(prompt_token_ids + output_ids) - 1 thinker_mm = output.multimodal_output # Full thinker embedding sequence for the talker: single thinker engine in the # non-PD path; after optional merge with prefill-side tensors in PD mode. 
- thinker_emb = thinker_mm[_EMBED_LAYER_KEY].detach().to(device=device, dtype=torch.float) - thinker_hid = thinker_mm[_HIDDEN_LAYER_KEY].detach().to(device=device, dtype=torch.float) + thinker_emb = thinker_mm[_EMBED_LAYER_KEY].detach().to(device=device, dtype=torch.float)[-new_seq_length:] + thinker_hid = thinker_mm[_HIDDEN_LAYER_KEY].detach().to(device=device, dtype=torch.float)[-new_seq_length:] prefill_mm: dict[str, Any] | None = None if prefill_stage is not None: prefill_mm = _get_prefill_multimodal_output(prefill_stage, i) if prefill_mm is not None: - expected_total = len(thinker_output.prompt_token_ids) + len(output.token_ids) + expected_total = len(prompt_token_ids) + len(output_ids) try: thinker_emb, thinker_hid = _merge_pd_embeddings( thinker_emb, thinker_hid, prefill_mm, device, expected_total=expected_total @@ -327,10 +444,8 @@ def thinker2talker( info = { "thinker_prefill_embeddings": thinker_emb, "thinker_hidden_states": thinker_hid, - "thinker_sequences": ( - thinker_output.prompt_token_ids + output.token_ids - ), # the thinker_sequences is the whole ids - "thinker_input_ids": thinker_output.prompt_token_ids, + "thinker_sequences": thinker_sequences, # the thinker_sequences is the whole ids + "thinker_input_ids": thinker_input_ids, # Provide thinker-side TTS token embeddings for talker projection "tts_bos_embed": _resolve_tts_token_embedding( "tts_bos_embed", thinker_mm=thinker_mm, prefill_mm=prefill_mm, device=device @@ -441,6 +556,7 @@ def talker2code2wav( engine_input_source: list[int], prompt: OmniTokensPrompt | TextPrompt | None = None, requires_multimodal_data: bool = False, + streaming_context: Any | None = None, ) -> list[OmniTokensPrompt]: """ Process talker outputs to create code2wav inputs. 
@@ -462,9 +578,14 @@ def talker2code2wav( talker_outputs = _validate_stage_inputs(stage_list, engine_input_source) code2wav_inputs: list[OmniTokensPrompt] = [] # Process each talker output - for talker_output in talker_outputs: + for i, talker_output in enumerate(talker_outputs): output = talker_output.outputs[0] - seq_len = len(output.token_ids) - 1 + req_id = str(getattr(talker_output, "request_id", f"idx-{i}")) + cur_seq_len = len(output.token_ids) - 1 + seq_len = cur_seq_len + is_streaming_session = bool(getattr(streaming_context, "enabled", False)) + if is_streaming_session: + seq_len = _get_streaming_codec_delta_len(cur_seq_len, req_id, talker_output, streaming_context) # Extract codec codes from talker output # Expected shape: [8, seq_len] (8-layer RVQ codes) codec_codes = ( diff --git a/vllm_omni/patch.py b/vllm_omni/patch.py index d4ab78f13a..f6c483a92f 100644 --- a/vllm_omni/patch.py +++ b/vllm_omni/patch.py @@ -12,12 +12,13 @@ from vllm.v1.engine import EngineCoreRequest as _OriginalEngineCoreRequest from vllm.v1.request import Request as _OriginalRequest from vllm.v1.request import RequestStatus +from vllm.v1.request import StreamingUpdate as _OriginalStreamingUpdate import vllm_omni.logger # noqa: F401 from vllm_omni.engine import OmniEngineCoreOutput, OmniEngineCoreOutputs, OmniEngineCoreRequest from vllm_omni.inputs.data import OmniTokensPrompt from vllm_omni.model_executor.layers.rotary_embedding import OmniMRotaryEmbedding -from vllm_omni.request import OmniRequest +from vllm_omni.request import OmniRequest, OmniStreamingUpdate # ============================================================================= # Patch ModelConfig.is_mm_prefix_lm to support omni-specific models @@ -115,5 +116,7 @@ def _patched_glm_image_text_config_init(self, *args, **kwargs): module.MRotaryEmbedding = OmniMRotaryEmbedding if hasattr(module, "Request") and module.Request == _OriginalRequest: module.Request = OmniRequest + if hasattr(module, "StreamingUpdate") and 
module.StreamingUpdate == _OriginalStreamingUpdate: + module.StreamingUpdate = OmniStreamingUpdate if hasattr(module, "EngineCoreRequest") and module.EngineCoreRequest == _OriginalEngineCoreRequest: module.EngineCoreRequest = OmniEngineCoreRequest diff --git a/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml index 053e8a8cca..0fd03949d1 100644 --- a/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml +++ b/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml @@ -33,6 +33,3 @@ stage_args: # Runtime defaults runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 diff --git a/vllm_omni/platforms/npu/stage_configs/qwen2_5_omni.yaml b/vllm_omni/platforms/npu/stage_configs/qwen2_5_omni.yaml deleted file mode 100644 index 8f7af161d6..0000000000 --- a/vllm_omni/platforms/npu/stage_configs/qwen2_5_omni.yaml +++ /dev/null @@ -1,97 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. 
-stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true # haven't supported talker ACL graph on NPU - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "2" # Example: use a different NPU than the previous stage; use "0" if single NPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.15 - enforce_eager: 
true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/vllm_omni/platforms/npu/stage_configs/qwen3_omni_moe.yaml b/vllm_omni/platforms/npu/stage_configs/qwen3_omni_moe.yaml deleted file mode 100644 index 2638c99cd4..0000000000 --- a/vllm_omni/platforms/npu/stage_configs/qwen3_omni_moe.yaml +++ /dev/null @@ -1,99 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 5x A2/A3-64G NPUs. 
-stage_args: - - stage_id: 0 - runtime: - devices: "0,1" - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - hf_config_name: thinker_config - tensor_parallel_size: 2 - # profiler_config: - # profiler: torch - # torch_profiler_dir: ./perf - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - runtime: - devices: "2" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: true # haven't supported talker ACL graph on NPU - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - # tensor_parallel_size: 2 - enable_prefix_caching: false - distributed_executor_backend: "mp" - hf_config_name: talker_config - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - runtime: - devices: "2" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - 
trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/vllm_omni/platforms/npu/stage_configs/qwen3_omni_moe_async_chunk.yaml b/vllm_omni/platforms/npu/stage_configs/qwen3_omni_moe_async_chunk.yaml deleted file mode 100644 index 9aa20baecf..0000000000 --- a/vllm_omni/platforms/npu/stage_configs/qwen3_omni_moe_async_chunk.yaml +++ /dev/null @@ -1,101 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 16-layer RVQ codec codes) -# Stage 2: Code2Wav (16-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. 
-async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0,1" - engine_args: - max_num_seqs: 10 - model_stage: thinker - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: false - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 2 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "2" - engine_args: - max_num_seqs: 10 - model_stage: talker - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk - engine_input_source: [0] - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.0 - stop_token_ids: [2150] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "2" 
- engine_args: - max_num_seqs: 10 - model_stage: code2wav - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 51200 # [TODO] if max_num_batched_tokens < max_num_seqs * 800, there will be precision problem. - hf_config_name: thinker_config - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/vllm_omni/platforms/npu/stage_configs/qwen3_tts.yaml b/vllm_omni/platforms/npu/stage_configs/qwen3_tts.yaml deleted file mode 100644 index cd82d91b71..0000000000 --- a/vllm_omni/platforms/npu/stage_configs/qwen3_tts.yaml +++ /dev/null @@ -1,96 +0,0 @@ -async_chunk: true -stage_args: - - stage_id: 0 - stage_type: llm - is_comprehension: true - runtime: - devices: "0" - engine_args: - model_stage: qwen3_tts - max_num_seqs: 1 - model_arch: Qwen3TTSTalkerForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: latent - gpu_memory_utilization: 0.3 - distributed_executor_backend: "mp" - max_num_batched_tokens: 512 - max_model_len: 4096 - custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk - # Use named connector to apply runtime.connectors.extra. 
- output_connectors: - to_stage_1: connector_of_shared_memory - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: false - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 1 - stage_type: llm - runtime: - devices: "0" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3TTSCode2Wav - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - async_scheduling: false - enable_prefix_caching: false - engine_output_type: audio - gpu_memory_utilization: 0.2 - distributed_executor_backend: "mp" - max_num_batched_tokens: 65536 - max_model_len: 65536 - engine_input_source: [0] - final_output: true - final_output_type: audio - # Distributed connector configuration - input_connectors: - from_stage_0: connector_of_shared_memory - tts_args: - max_instructions_length: 500 - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: true - repetition_penalty: 1.0 - -runtime: - enabled: true - defaults: - window_size: -1 - max_inflight: 1 - - connectors: - connector_of_shared_memory: - name: SharedMemoryConnector - extra: - shm_threshold_bytes: 65536 - # Frame-aligned codec streaming transport. - codec_streaming: true - # Connector polling / timeout (unit: loop count, sleep interval in seconds). - connector_get_sleep_s: 0.01 - connector_get_max_wait_first_chunk: 3000 - connector_get_max_wait: 300 - # Align with Omni: small chunks with sufficient context overlap. 
- codec_chunk_frames: 25 - codec_left_context_frames: 72 - - edges: - - from: 0 - to: 1 - window_size: -1 diff --git a/vllm_omni/platforms/npu/stage_configs/voxcpm_async_chunk.yaml b/vllm_omni/platforms/npu/stage_configs/voxcpm_async_chunk.yaml index 0a4ed7497d..87843634cb 100644 --- a/vllm_omni/platforms/npu/stage_configs/voxcpm_async_chunk.yaml +++ b/vllm_omni/platforms/npu/stage_configs/voxcpm_async_chunk.yaml @@ -73,9 +73,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: connector_of_shared_memory: @@ -90,4 +87,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/platforms/rocm/stage_configs/qwen2_5_omni.yaml b/vllm_omni/platforms/rocm/stage_configs/qwen2_5_omni.yaml deleted file mode 100644 index 35e8193545..0000000000 --- a/vllm_omni/platforms/rocm/stage_configs/qwen2_5_omni.yaml +++ /dev/null @@ -1,102 +0,0 @@ -# stage config for running qwen2.5-omni for multi-stage omni runtime. - -# The following config has been verified on 2x H100-80G GPU. 
-stage_args: - - stage_id: 0 - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage (CUDA_VISIBLE_DEVICES/torch.cuda.set_device) - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true # Now we only support eager mode - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - max_num_batched_tokens: 32768 - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - - stage_id: 1 - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.8 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - max_num_batched_tokens: 32768 - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - - stage_id: 2 - runtime: - process: true - devices: "2" # Example: use a different GPU than the previous stage; use "0" if single GPU - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.15 - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: 
false - max_num_batched_tokens: 32768 - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/vllm_omni/platforms/rocm/stage_configs/qwen3_omni_moe.yaml b/vllm_omni/platforms/rocm/stage_configs/qwen3_omni_moe.yaml deleted file mode 100644 index 0ca150bee6..0000000000 --- a/vllm_omni/platforms/rocm/stage_configs/qwen3_omni_moe.yaml +++ /dev/null @@ -1,97 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config has been verified on 2x H100-80G GPUs. 
-stage_args: - - stage_id: 0 - runtime: - devices: "0" - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 1 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - runtime: - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - # tensor_parallel_size: 2 - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - runtime: - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - 
engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.1 - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/vllm_omni/platforms/xpu/stage_configs/bagel.yaml b/vllm_omni/platforms/xpu/stage_configs/bagel.yaml index 0fc8a25ea5..7b27f6a443 100644 --- a/vllm_omni/platforms/xpu/stage_configs/bagel.yaml +++ b/vllm_omni/platforms/xpu/stage_configs/bagel.yaml @@ -67,10 +67,6 @@ stage_args: # Runtime edges runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 - # Distributed connectors configuration (optional) # More connectors will be supported in the future. connectors: @@ -83,4 +79,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml index 8f969ced5f..4e0005f82a 100644 --- a/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml +++ b/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml @@ -78,6 +78,3 @@ stage_args: # Top-level runtime config (concise): default windows and stage edges runtime: enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage diff --git a/vllm_omni/platforms/xpu/stage_configs/qwen2_5_omni.yaml b/vllm_omni/platforms/xpu/stage_configs/qwen2_5_omni.yaml deleted file mode 100644 index 7dbedb29a5..0000000000 --- a/vllm_omni/platforms/xpu/stage_configs/qwen2_5_omni.yaml +++ /dev/null @@ -1,101 +0,0 @@ -# stage config for running qwen2.5-omni for 
multi-stage omni runtime. - -# The following config is verified with 2 * Intel Arc Pro B60 XPU. -stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true # Run this stage in a separate process - devices: "0" # Visible devices for this stage - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 # thinker weight is around 16.74GB for Qwen2.5-Omni-7B - enforce_eager: false - trust_remote_code: true - engine_output_type: latent - enable_prefix_caching: false - is_comprehension: true - final_output: true - final_output_type: text - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "1" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.5 # talker weight is 6.03GB for Qwen2.5-Omni-7B - enforce_eager: false - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: latent - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker - default_sampling_params: - temperature: 0.9 - top_p: 0.8 - top_k: 40 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - stop_token_ids: [8294] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR stages - runtime: - process: true - devices: "1" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen2_5OmniForConditionalGeneration - worker_type: generation - scheduler_cls: 
vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - gpu_memory_utilization: 0.3 # code2wav weight is around 1.46GB for Qwen2.5-Omni-7B - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio - engine_input_source: [1] - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.1 - -# Top-level runtime config (concise): default windows and stage edges -runtime: - enabled: true - defaults: - window_size: -1 # Simplified: trigger downstream only after full upstream completion - max_inflight: 1 # Simplified: process serially within each stage - - edges: - - from: 0 # thinker → talker: trigger only after receiving full input (-1) - to: 1 - window_size: -1 - - from: 1 # talker → code2wav: trigger only after receiving full input (-1) - to: 2 - window_size: -1 diff --git a/vllm_omni/platforms/xpu/stage_configs/qwen3_omni_moe.yaml b/vllm_omni/platforms/xpu/stage_configs/qwen3_omni_moe.yaml deleted file mode 100644 index 49914bebc4..0000000000 --- a/vllm_omni/platforms/xpu/stage_configs/qwen3_omni_moe.yaml +++ /dev/null @@ -1,102 +0,0 @@ -# Stage config for running Qwen3-Omni-MoE with 3-stage architecture -# Stage 0: Thinker (multimodal understanding + text generation) -# Stage 1: Talker (text embeddings → 8-layer RVQ codec codes) -# Stage 2: Code2Wav (8-layer RVQ codes → audio waveform) - -# The following config is verified with 8 * Intel Arc Pro B60 XPU. 
-stage_args: - - stage_id: 0 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "0,1,2,3" - engine_args: - model_stage: thinker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.9 # thinker weight is around 61.08GB for Qwen3-Omni-30B-A3B-Instruct - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output hidden states for talker - distributed_executor_backend: "mp" - enable_prefix_caching: false - max_num_batched_tokens: 32768 - hf_config_name: thinker_config - tensor_parallel_size: 4 - max_cudagraph_capture_size: 0 - final_output: true - final_output_type: text - is_comprehension: true - default_sampling_params: - temperature: 0.4 - top_p: 0.9 - top_k: 1 - max_tokens: 2048 - seed: 42 - detokenize: True - repetition_penalty: 1.05 - - - stage_id: 1 - stage_type: llm # Use llm stage type for AR stages - runtime: - devices: "4" - engine_args: - model_stage: talker - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: ar - scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler - gpu_memory_utilization: 0.6 # talker weight is around 8.5GB for Qwen3-Omni-30B-A3B-Instruct - enforce_eager: true - trust_remote_code: true - engine_output_type: latent # Output codec codes for code2wav - enable_prefix_caching: false - max_num_batched_tokens: 32768 - distributed_executor_backend: "mp" - hf_config_name: talker_config - max_cudagraph_capture_size: 0 - engine_input_source: [0] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker - # final_output: true - # final_output_type: text - default_sampling_params: - temperature: 0.9 - top_k: 50 - max_tokens: 4096 - seed: 42 - detokenize: False - repetition_penalty: 1.05 - stop_token_ids: [2150] - - - stage_id: 2 - stage_type: llm # Use llm stage type for AR 
stages - runtime: - devices: "4" - engine_args: - model_stage: code2wav - max_num_seqs: 1 - model_arch: Qwen3OmniMoeForConditionalGeneration - worker_type: generation - scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler - enforce_eager: true - trust_remote_code: true - enable_prefix_caching: false - engine_output_type: audio # Final output: audio waveform - gpu_memory_utilization: 0.3 # code2wav weight is around 0.4GB for Qwen3-Omni-30B-A3B-Instruct - distributed_executor_backend: "mp" - max_num_batched_tokens: 1000000 - hf_config_name: thinker_config - max_cudagraph_capture_size: 0 - engine_input_source: [1] - custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav - final_output: true - final_output_type: audio - default_sampling_params: - temperature: 0.0 - top_p: 1.0 - top_k: -1 - max_tokens: 65536 - seed: 42 - detokenize: True - repetition_penalty: 1.1 diff --git a/vllm_omni/platforms/xpu/stage_configs/voxtral_tts.yaml b/vllm_omni/platforms/xpu/stage_configs/voxtral_tts.yaml index 10051c1eda..0820ab6320 100644 --- a/vllm_omni/platforms/xpu/stage_configs/voxtral_tts.yaml +++ b/vllm_omni/platforms/xpu/stage_configs/voxtral_tts.yaml @@ -88,9 +88,6 @@ stage_args: runtime: enabled: true - defaults: - window_size: -1 - max_inflight: 1 connectors: connector_of_shared_memory: @@ -108,4 +105,3 @@ runtime: edges: - from: 0 to: 1 - window_size: -1 diff --git a/vllm_omni/request.py b/vllm_omni/request.py index 3ec325316f..48cbf9b31d 100644 --- a/vllm_omni/request.py +++ b/vllm_omni/request.py @@ -1,8 +1,11 @@ from collections.abc import Callable +from dataclasses import dataclass from typing import TYPE_CHECKING import numpy as np import torch +from vllm.multimodal.inputs import MultiModalFeatureSpec +from vllm.sampling_params import SamplingParams from vllm.v1.request import Request if TYPE_CHECKING: @@ -92,3 +95,34 @@ def from_engine_core_request( resumable=request.resumable, 
reasoning_ended=request.reasoning_ended, ) + + +@dataclass +class OmniStreamingUpdate: + """ + Override: add additional information + Lightweight data for streaming session continuation. + + Contains only the fields needed to update an existing streaming session + with new input data. + """ + + mm_features: list[MultiModalFeatureSpec] | None + prompt_token_ids: list[int] | None + max_tokens: int + arrival_time: float + sampling_params: SamplingParams | None + additional_information: AdditionalInformationPayload | None = None + + @classmethod + def from_request(cls, request: "Request") -> "OmniStreamingUpdate | None": + if not request.resumable: + return None + return cls( + mm_features=request.mm_features, + prompt_token_ids=request.prompt_token_ids, + max_tokens=request.max_tokens, + arrival_time=request.arrival_time, + sampling_params=request.sampling_params, + additional_information=request.additional_information, + ) diff --git a/vllm_omni/worker/gpu_model_runner.py b/vllm_omni/worker/gpu_model_runner.py index de78011c75..d1c15eac64 100644 --- a/vllm_omni/worker/gpu_model_runner.py +++ b/vllm_omni/worker/gpu_model_runner.py @@ -308,6 +308,7 @@ def _update_states(self, scheduler_output: "SchedulerOutput"): for new_req_data in scheduler_output.scheduled_new_reqs: req_id = new_req_data.req_id if req_id in self.requests: + self._update_streaming_input_additional_info(new_req_data, req_id) req_state = self._update_streaming_request(req_id, new_req_data) reqs_to_add.append(req_state) continue @@ -1414,3 +1415,30 @@ def _update_intermediate_buffer(self, req_id: str, upd: dict) -> None: def _merge_additional_information_update(self, req_id, upd): logger.warning_once("_merge_additional_information_update is deprecated, use _update_intermediate_buffer") return self._update_intermediate_buffer(req_id, upd) + + def _update_streaming_input_additional_info(self, new_req_data, req_id): + # For streaming input prefill case only. 
Update buffer from last segment input + cached_additional_info = self.model_intermediate_buffer.get(req_id, {}) + if cached_additional_info: + payload_info = getattr(new_req_data, "additional_information", None) + inc_info = deserialize_additional_information(payload_info) + if isinstance(inc_info, dict) and inc_info: + merged_info = dict(cached_additional_info) + for key, value in inc_info.items(): + accumulated_keys: set[str] = set() + if hasattr(self, "model") and hasattr(self.model, "streaming_accumulated_keys"): + accumulated_keys = self.model.streaming_accumulated_keys + if key in accumulated_keys and isinstance(value, torch.Tensor): + inc_tensor = value.detach().to("cpu").contiguous() + old_tensor = merged_info.get(key) + if old_tensor is None: + merged_info[key] = inc_tensor + else: + merged_info[key] = torch.cat((old_tensor, inc_tensor), dim=0) + continue + + # Default for other keys: latest value. + merged_info[key] = value + merged_info["num_processed_tokens"] = 0 + self.model_intermediate_buffer[req_id] = merged_info + setattr(self.requests[req_id], "additional_information_cpu", merged_info) diff --git a/vllm_omni/worker_v2/model_states/omni_model_state.py b/vllm_omni/worker_v2/model_states/omni_model_state.py index 35e0b93d94..c1bd940d94 100644 --- a/vllm_omni/worker_v2/model_states/omni_model_state.py +++ b/vllm_omni/worker_v2/model_states/omni_model_state.py @@ -155,39 +155,44 @@ def _default_mrope_positions( self.have_multimodal_outputs: bool = getattr(model, "have_multimodal_outputs", False) self.plugins: list[OmniModelStatePlugin] = [] - # Static inputs_embeds buffer for FULL CUDA graph compatibility. - # Models with preprocess modify inputs_embeds each step. By - # allocating a static GPU buffer and returning it from both - # prepare_dummy_inputs (capture) and prepare_inputs (runtime), - # FULL graph captures the tensor address once and preprocess - # fills it in-place before each replay. 
+ # Talker's codec_embedding dim may differ from hf_text_config.hidden_size; probe real dim. + self._embed_dim = self._get_embed_dim(model, device) if self.has_preprocess else 0 + + # Static inputs_embeds buffer for FULL CUDA graph — preprocess fills it in-place each step. self._static_inputs_embeds: torch.Tensor | None = None - if self.has_preprocess and not self.supports_mm_inputs: + if self._embed_dim > 0: self._static_inputs_embeds = torch.zeros( - (self.max_num_tokens, self.inputs_embeds_size), + (self.max_num_tokens, self._embed_dim), dtype=self.dtype, device=device, ) - # Static MTP buffers — avoid per-step torch.cat allocations. - # Pre-allocated for max batch size so _run_batched_mtp can - # use .copy_() instead of torch.cat(). + # Static MTP buffers so _run_batched_mtp uses .copy_() instead of torch.cat(). self._mtp_input_ids: torch.Tensor | None = None self._mtp_input_embeds: torch.Tensor | None = None self._mtp_hidden: torch.Tensor | None = None self._mtp_text_step: torch.Tensor | None = None - if self.has_preprocess and hasattr(model, "talker_mtp"): + if self._embed_dim > 0 and hasattr(model, "talker_mtp"): max_bs = max_num_reqs - hidden_size = self.inputs_embeds_size self._mtp_input_ids = torch.zeros(max_bs, dtype=torch.long, device=device) - self._mtp_input_embeds = torch.zeros((max_bs, hidden_size), dtype=self.dtype, device=device) - self._mtp_hidden = torch.zeros((max_bs, hidden_size), dtype=self.dtype, device=device) - self._mtp_text_step = torch.zeros((max_bs, hidden_size), dtype=self.dtype, device=device) + self._mtp_input_embeds = torch.zeros((max_bs, self._embed_dim), dtype=self.dtype, device=device) + self._mtp_hidden = torch.zeros((max_bs, self._embed_dim), dtype=self.dtype, device=device) + self._mtp_text_step = torch.zeros((max_bs, self._embed_dim), dtype=self.dtype, device=device) if hasattr(model, "get_omni_plugins"): for plugin in model.get_omni_plugins(): self.register_plugin(plugin) + @staticmethod + def _get_embed_dim(model: 
nn.Module, device: torch.device) -> int: + """Return the embedding dim that ``embed_input_ids`` produces (may differ from hf_text_config).""" + if hasattr(model, "embed_input_ids"): + dummy = torch.zeros(1, dtype=torch.long, device=device) + with torch.no_grad(): + out = model.embed_input_ids(dummy) + return out.shape[-1] + return 0 + # ------------------------------------------------------------------ # Attention metadata: use actual max_seq_len, not max_model_len # ------------------------------------------------------------------
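Reviewer note on the `_update_streaming_input_additional_info` change in `gpu_model_runner.py` earlier in this diff: the merge rule it applies can be sketched in isolation as below. This is a minimal, hedged illustration, not the patched code itself. The function name `merge_streaming_info` and the `concat` parameter are hypothetical, plain lists stand in for the CPU tensors that the real code joins with `torch.cat(..., dim=0)`, and the sketch omits the `isinstance(value, torch.Tensor)` check and the "only when the cached buffer is non-empty" guard that the diff applies.

```python
from typing import Any, Callable


def merge_streaming_info(
    cached: dict[str, Any],
    incoming: dict[str, Any],
    accumulated_keys: set[str],
    concat: Callable[[Any, Any], Any] = lambda old, new: old + new,
) -> dict[str, Any]:
    """Merge one streaming segment's info into the cached info.

    Hypothetical helper: keys listed in ``accumulated_keys`` are appended to
    the cached value, every other key takes the latest value.
    """
    merged = dict(cached)  # shallow copy, so the cached dict is not mutated
    for key, value in incoming.items():
        if key in accumulated_keys:
            old = merged.get(key)
            # First segment for this key: store it; later segments: append.
            merged[key] = value if old is None else concat(old, value)
        else:
            # Default for other keys: latest value wins.
            merged[key] = value
    # The diff resets this counter after merging a new segment.
    merged["num_processed_tokens"] = 0
    return merged


# Usage with lists standing in for tensors (assumed example data).
cached = {"codes": [1, 2], "speaker": "a", "num_processed_tokens": 7}
incoming = {"codes": [3], "speaker": "b"}
merged = merge_streaming_info(cached, incoming, accumulated_keys={"codes"})
```

Passing `concat` as a parameter is only a convenience for the sketch: with real tensors one would presumably pass something like `lambda old, new: torch.cat((old, new), dim=0)`, which matches the call used in the diff.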