
[Config Refactor][2/N] Pipeline + Deploy Config Schema#2383

Merged
hsliuustc0106 merged 112 commits into vllm-project:main from
lishunyang12:config-refactor-2a
Apr 19, 2026
Conversation


@lishunyang12 lishunyang12 commented Mar 31, 2026

RFC: #2072

Motivation

Before this refactor, vLLM-Omni's multi-stage pipelines mixed topology (which stages exist, how they wire, what functions they call) and deployment parameters (TP size, memory budgets, device placement, connectors) into a single stage_configs/<model>.yaml per platform. Adding a new platform meant editing N files; changing a max_num_seqs meant forking the whole YAML; model developers and deployment engineers edited the same file with different concerns in mind.

This PR implements RFC #2072 — splitting the legacy YAML into two layers connected by a runtime merge with documented precedence.

Design

Before vs. after

```mermaid
flowchart LR
    subgraph Before["Before — single YAML per platform"]
        direction TB
        Y1["stage_configs/qwen3_omni_moe.yaml<br/>(topology + params + connectors)"]
        Y2["platforms/npu/stage_configs/qwen3_omni_moe.yaml<br/>(full copy for NPU)"]
        Y3["platforms/rocm/stage_configs/qwen3_omni_moe.yaml<br/>(full copy for ROCm)"]
    end
    subgraph After["After — split by concern"]
        direction TB
        P["models/qwen3_omni/pipeline.py<br/>(frozen topology — developer-owned)"]
        D["deploy/qwen3_omni_moe.yaml<br/>(CUDA defaults — deployer-owned)"]
        DN["deploy/qwen3_omni_moe.yaml<br/>:platforms.npu.stages<br/>(platform deltas inline)"]
    end
    Before --> After
```
  • Pipeline (models/<name>/pipeline.py) calls register_pipeline(PipelineConfig(...)) at import time. Frozen — deploy cannot reshape the graph.
  • Deploy (deploy/<model>.yaml) carries per-stage TP size, GPU memory, device placement, connectors, and platform deltas.
  • A single CUDA default yaml with platforms: { npu, rocm, xpu } sections replaces three parallel files.

Two-level config objects

| Aspect | PipelineConfig | DeployConfig |
| --- | --- | --- |
| Mutability | `frozen=True` | mutable (platform/CLI override) |
| Source | Python (`pipeline.py`) | YAML (`deploy/*.yaml`) |
| Owner | model developers | deployment engineers |
| Contains | stages, edges, processor funcs, sampling constraints | TP/mem/devices, connectors, platform deltas |

Precedence chain

```mermaid
flowchart LR
    D1["Parser defaults"] -->|weakest| D2["Base deploy YAML"]
    D2 --> D3["Overlay YAML<br/>via base_config:"]
    D3 --> D4["Platform section<br/>platforms.npu.stages"]
    D4 --> D5["Global CLI<br/>--gpu-memory-utilization"]
    D5 -->|strongest| D6["Per-stage CLI<br/>--stage-overrides JSON"]
```
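The chain is an ordinary layered merge where later (stronger) layers win and unset (`None`) values never override; a minimal sketch:

```python
from functools import reduce

def merge_layers(*layers: dict) -> dict:
    """Merge weakest-to-strongest; None means 'not set here' and never overrides."""
    def merge(acc: dict, layer: dict) -> dict:
        out = dict(acc)
        out.update({k: v for k, v in layer.items() if v is not None})
        return out
    return reduce(merge, layers, {})

merged = merge_layers(
    {"gpu_memory_utilization": 0.9, "max_num_seqs": 256},   # parser defaults
    {"gpu_memory_utilization": 0.8},                        # base deploy YAML
    {},                                                     # overlay YAML
    {"gpu_memory_utilization": 0.7},                        # platforms.npu.stages
    {"gpu_memory_utilization": 0.6, "max_num_seqs": None},  # global CLI (flag not typed)
    {"gpu_memory_utilization": 0.5},                        # per-stage CLI
)
# gpu_memory_utilization resolves to the per-stage CLI value;
# max_num_seqs keeps the parser default because no stronger layer set it.
```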

User-typed keys are tracked via `_cli_explicit_keys` (parser-aware: it walks `parser._actions`, so `--disable-X` → `dest=enable_X` mappings and alias flags resolve correctly), so argparse defaults do not silently overwrite YAML values.
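A simplified sketch of that parser-aware walk (the real `detect_explicit_cli_keys` handles more argparse corner cases):

```python
import argparse

def detect_explicit_cli_keys(argv: list[str], parser: argparse.ArgumentParser) -> set[str]:
    # Map every option string — aliases and the --no-X negations that
    # BooleanOptionalAction generates — back to its canonical dest.
    opt_to_dest = {
        opt: action.dest
        for action in parser._actions
        for opt in action.option_strings
    }
    explicit = set()
    for token in argv:
        flag = token.split("=", 1)[0]  # handle --flag=value
        if flag in opt_to_dest:
            explicit.add(opt_to_dest[flag])
    return explicit

parser = argparse.ArgumentParser()
parser.add_argument("--enable-prefix-caching", action=argparse.BooleanOptionalAction)
keys = detect_explicit_cli_keys(["--no-enable-prefix-caching"], parser)
```

Only keys in this set are allowed to overwrite YAML values; argparse defaults the user never typed are not.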

CLI flag routing — OrchestratorArgs

make_arg_parser flattens uvicorn / FastAPI / engine / orchestrator flags into a single namespace. The old code maintained two hardcoded frozensets (~49 strings total) as denylists — fragile. OrchestratorArgs replaces them with a dataclass; split_kwargs classifies each flag by dataclass membership; CI invariants in tests/test_arg_utils.py catch unclassified flags at test time.
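The classification idea can be sketched like this (the field names are an illustrative subset of the real dataclass):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class OrchestratorArgs:
    # Illustrative subset; the real dataclass lists every orchestrator flag,
    # so adding a field automatically updates the routing below.
    ray_address: Optional[str] = None
    batch_timeout: Optional[float] = None
    log_stats: bool = False

_ORCH_FIELDS = {f.name for f in fields(OrchestratorArgs)}

def split_kwargs(kwargs: dict) -> tuple[dict, dict]:
    """Route each flag by dataclass membership instead of a hardcoded frozenset."""
    orch = {k: v for k, v in kwargs.items() if k in _ORCH_FIELDS}
    engine = {k: v for k, v in kwargs.items() if k not in _ORCH_FIELDS}
    return orch, engine

orch, engine = split_kwargs({"ray_address": "auto", "max_num_seqs": 4})
```

Because membership is derived from the dataclass itself, an unclassified flag is a test failure rather than a silent leak.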

Auto-discovery

No more hardcoded PIPELINE_MODELS / _ARCHITECTURE_MODELS dicts. _discover_all_pipelines scans model_executor/models/*/pipeline.py and registers them; a contributor adding a new model just drops a pipeline.py.
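A minimal sketch of that discovery walk (helper and package names here are hypothetical; the real scan lives in the config factory):

```python
import importlib
from pathlib import Path

def discover_all_pipelines(models_root: Path, package: str) -> list[str]:
    """Import every <models_root>/<name>/pipeline.py so its register_pipeline() call runs."""
    discovered = []
    for pipeline_py in sorted(models_root.glob("*/pipeline.py")):
        module = f"{package}.{pipeline_py.parent.name}.pipeline"
        importlib.import_module(module)  # side effect: the module registers itself
        discovered.append(module)
    return discovered
```

Importing the module is enough: the `register_pipeline(...)` call at the bottom of each `pipeline.py` performs the registration as an import side effect, so a new model needs no central edit.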

Summary of Changes

| Area | Change |
| --- | --- |
| New modules | `vllm_omni/engine/arg_utils.py` (`OrchestratorArgs` + `SHARED_FIELDS` + `split_kwargs`); `vllm_omni/deploy/*.yaml` (3 default deploy configs + CI overlays) |
| New pipelines | `models/qwen2_5_omni/pipeline.py`, `qwen3_omni/pipeline.py`, `qwen3_tts/pipeline.py` |
| New CLI flags | `--deploy-config`, `--stage-overrides`, `--async-chunk` / `--no-async-chunk` |
| Removed hardcoded state | `INTERNAL_STAGE_OVERRIDE_KEYS`, `SERVER_ONLY_KEYS`, `PIPELINE_MODELS`, `_ARCHITECTURE_MODELS` — replaced by dataclass-derived invariants and auto-discovery |
| Refactorings | `merge_pipeline_deploy` split into 4 single-responsibility helpers (SLAP); `_apply_platform_overrides` deduplicated; `execution_type` → `(stage_type, worker_type)` lookup table; parser-aware `detect_explicit_cli_keys` |
| Schema clean-up | 8 pipeline-wide engine settings (`trust_remote_code`, `distributed_executor_backend`, `dtype`, `quantization`, `enable_prefix_caching`, `enable_chunked_prefill`, `data_parallel_size`, `pipeline_parallel_size`) moved from per-stage to top-level `DeployConfig` |
| Validation | `merge_pipeline_deploy` raises if `async_chunk=True` but no stage declares an async handler; `get_scheduler_cls` raises on invalid `stage_id` / unmapped `execution_type`; `_deep_merge_stage` warns on type-mismatch clobber; `--stage-configs-path` and `--deploy-config` are mutually exclusive; the scheduler map stores class refs (a rename fails at import) |
| Legacy paths retained | `--stage-configs-path` (deprecated in help text, to be removed in 2c); `ModelPipeline` / `StageConfig` / `_parse_pipeline_yaml` preserved for not-yet-migrated models |
| Tests | `tests/test_arg_utils.py` (15 invariants incl. BVA); expanded `tests/test_config_factory.py` (+644 lines) |
| YAML style | Deploy YAMLs carry only non-default values; every field falls back to the `StageDeployConfig` dataclass default (single source of truth at `vllm_omni/config/stage_config.py`) |
| CLI entry | `Omni.from_cli_args(args, parser=parser)` / `AsyncOmni.from_cli_args(args, parser=parser)` mirror `OmniEngineArgs.from_cli_args`; optional `parser=` enables accurate `_cli_explicit_keys` resolution |
| Docs | `docs/configuration/stage_configs.md` rewritten with unified schema tables, connector schema, and a worked override-precedence example; `examples/online_serving/qwen3_tts/README.md` gains a "Sync vs async-chunk mode" section |
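The `_deep_merge_stage` behavior noted in the Validation row — a recursive merge that warns when an override clobbers a value of a different type — can be sketched as:

```python
import warnings

def deep_merge_stage(base: dict, override: dict) -> dict:
    """Recursively merge override into base; dicts merge, scalars replace.

    Warns when an override replaces an existing value with one of a
    different type (a likely sign of a mis-typed YAML or CLI override).
    """
    out = dict(base)
    for key, val in override.items():
        if key in out and isinstance(out[key], dict) and isinstance(val, dict):
            out[key] = deep_merge_stage(out[key], val)
        else:
            if key in out and out[key] is not None and val is not None \
                    and type(out[key]) is not type(val):
                warnings.warn(
                    f"{key}: overriding {type(out[key]).__name__} "
                    f"with {type(val).__name__}"
                )
            out[key] = val
    return out
```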

Test Plan

1. Unit tests + smoke scripts (CPU-only)

```shell
pytest tests/test_config_factory.py -v
pytest tests/test_arg_utils.py -v
python tools/smoke_config_loading.py        # 19 checks
python tools/smoke_cli_explicit_keys.py     # 13 checks
```

2. E2E launch matrices (GPU box)

qwen2_5_omni

```shell
BASE="vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091"

# Default — auto-loads vllm_omni/deploy/qwen2_5_omni.yaml
$BASE

# Global CLI flag (applies to every stage; explicit beats yaml)
$BASE --max-model-len 16384

# Per-stage override via JSON
$BASE --stage-overrides '{"1":{"gpu_memory_utilization":0.5},"2":{"max_num_batched_tokens":16384}}'

# Explicit precedence: per-stage beats global
$BASE --max-num-seqs 4 --stage-overrides '{"1":{"max_num_seqs":8}}'

# BooleanOptionalAction flags
$BASE --enable-prefix-caching
$BASE --no-enable-prefix-caching
$BASE --no-async-chunk
```

qwen3_omni_moe

```shell
BASE="vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091"

# Default — auto-loads vllm_omni/deploy/qwen3_omni_moe.yaml
$BASE
$BASE --stage-overrides '{"1":{"gpu_memory_utilization":0.5}}'

# Multi-node overlay (README has a worked example)
$BASE --deploy-config /path/to/my_qwen3_omni_multinode.yaml
```

qwen3_tts (async vs sync codec, both from one yaml)

```shell
BASE_TTS="vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091"

# Async chunk ON (yaml default) — chunked streaming
$BASE_TTS

# Async chunk OFF — same pipeline, alternate processor function selected
# automatically by merge_pipeline_deploy based on deploy.async_chunk.
$BASE_TTS --no-async-chunk

# Batched throughput via per-stage overrides (replaces the deleted qwen3_tts_batch.yaml)
$BASE_TTS --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},"1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}'
```

3. Server flag isolation (regression check for #873-class bugs)

```shell
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 \
    --host 0.0.0.0 \
    --served-model-name my-omni \
    --api-key secret123 \
    --allowed-local-media-path /tmp/
```

Expected: clean startup. A `TypeError: unexpected keyword argument 'host'` from `OmniEngineArgs.__init__` would indicate server flags leaking into the per-stage engine path.

Review feedback addressed

19 threads from @alex-jw-brooks resolved — correctness fixes (mutex validation for --stage-configs-path / --deploy-config, async_chunk handler check, get_scheduler_cls error paths, deep-merge clobber warning, parser-aware flag detection), cleanups (removed redundant qwen3_tts_no_async_chunk alias, dead get_stage_config wrappers in 4 test files), and doc clarifications (logical device IDs, engine_extras rationale). See #2887 for deferred follow-ups (hardware auto-sizing, model-instance-driven config values, override type validation, central pipeline registry).

What ships in follow-ups

  • 2c: remove --stage-configs-path and legacy ModelPipeline / _parse_pipeline_yaml; migrate remaining legacy models (fish_speech, cosyvoice3, mimo_audio, voxtral_tts) to the registry; split stage_config.py (~1200 LOC) into focused modules once the legacy surface is gone.
  • Tooling PR: carve tools/smoke_*.py into its own PR.
  • #2887 ("[Follow-up] Deploy/pipeline config follow-ups from #2383") tracks the other review follow-ups.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 929007a841

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +245 to +246
```python
if value is not None:
    result[key] = value
```


P1 Badge Whitelist stage override keys before copying CLI kwargs

This loop forwards every non-None CLI kwarg that is not in a small denylist, so non-engine server flags can leak into per-stage engine_args. In the OpenAI server path, AsyncOmni is built from vars(args), which includes API/uvicorn options; once forwarded here, they eventually hit AsyncOmniEngineArgs(model, **engine_args) and can fail with unexpected-keyword errors for migrated models. Please filter by an allowlist of engine/runtime keys (or a parser-backed schema) instead of forwarding arbitrary kwargs.
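The allowlist the reviewer suggests could be sketched as follows (the key set here is illustrative, not the real schema):

```python
# Illustrative allowlist of engine/runtime keys; in practice this would be
# derived from a parser-backed schema rather than written by hand.
ENGINE_KEYS = {"max_num_seqs", "gpu_memory_utilization", "max_model_len"}

def filter_engine_kwargs(kwargs: dict) -> dict:
    # Allowlist instead of denylist: server/uvicorn flags such as `host`
    # or `api_key` can never reach per-stage engine_args.
    return {k: v for k, v in kwargs.items() if k in ENGINE_KEYS and v is not None}
```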


Comment thread vllm_omni/config/resolved_config.py Outdated
```python
ea["hf_config_name"] = self.hf_config_name
if self.engine_output_type:
    ea["engine_output_type"] = self.engine_output_type
ea["async_chunk"] = self.async_chunk
```


P2 Badge Preserve CLI async_chunk override in resolved config

CLI overrides can set async_chunk in engine_args, but this assignment overwrites that value with self.async_chunk, which is currently sourced from deploy YAML (deploy.async_chunk) in _resolve. As a result, --async-chunk cannot actually override deploy defaults, breaking the documented precedence and preventing users from toggling async mode at runtime.


```python
"ray_address",
"batch_timeout",
"log_stats",
"tokenizer",
```


P2 Badge Forward tokenizer override for registered-model path

Marking tokenizer as an internal key causes it to be dropped from all stage overrides in the new factory path. The legacy YAML path explicitly injected tokenizer into stage engine args, so this is a behavior regression: users passing --tokenizer for migrated models (like qwen3_omni_moe) will silently run with the default tokenizer instead of the requested one.


@lishunyang12 lishunyang12 marked this pull request as draft March 31, 2026 14:41
@lishunyang12 lishunyang12 force-pushed the config-refactor-2a branch 6 times, most recently from d19c35b to 198e296 on March 31, 2026 15:18
@lishunyang12 lishunyang12 marked this pull request as ready for review April 6, 2026 09:34
@lishunyang12
Collaborator Author

@hsliuustc0106 @david6666666 @wuhang2014 My P0 priority will be on this pr this week.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e1c756f8b


Comment thread vllm_omni/config/stage_config.py Outdated
Comment on lines +693 to +695
```python
per_stage_only = {
    k: v for k, v in cli_overrides.items()
    if re.match(r"stage_\d+_", k) and v is not None
```


P2 Badge Preserve explicit global CLI overrides in registry path

Filtering cli_overrides down to only stage_<id>_* keys means explicit global engine flags are silently ignored for migrated models (for example --max-model-len, --max-num-seqs, or --gpu-memory-utilization). As a result, users can no longer tune all stages via normal CLI arguments and must rewrite everything into per-stage JSON overrides, which is a behavior regression versus the legacy path.
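A fix that preserves both kinds of overrides might split rather than filter (sketch; the helper name is hypothetical):

```python
import re

_STAGE_KEY = re.compile(r"stage_(\d+)_(.+)")

def split_cli_overrides(cli_overrides: dict) -> tuple[dict, dict]:
    """Route stage_<id>_* keys to that stage; keep the rest as global overrides."""
    per_stage: dict[int, dict] = {}
    global_overrides: dict = {}
    for key, val in cli_overrides.items():
        if val is None:
            continue  # flag not typed by the user
        m = _STAGE_KEY.fullmatch(key)
        if m:
            per_stage.setdefault(int(m.group(1)), {})[m.group(2)] = val
        else:
            global_overrides[key] = val
    return per_stage, global_overrides
```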


Comment on lines +447 to +451
```python
config_path = deploy_config_path
stage_configs = load_stage_configs_from_model(
    model,
    base_engine_args=kwargs,
    deploy_config_path=deploy_config_path,
```


P1 Badge Parse deploy connector schema before using deploy path as config

This new branch forwards deploy YAML paths as config_path, but stage initialization still loads transfer connectors through load_omni_transfer_config_for_model, whose parser expects legacy runtime.connectors + stage_args structure. New deploy files use connectors + stages, so connector specs/extras are dropped; in async-chunk or distributed setups, custom connector backends and connector tuning in deploy configs will not take effect.


@lishunyang12 lishunyang12 marked this pull request as draft April 6, 2026 13:01
@lishunyang12 lishunyang12 reopened this Apr 9, 2026
@lishunyang12 lishunyang12 changed the title from "[Config Refactor][2/N] Pipeline + Deploy Config Schema (qwen3_omni)" to "[Config Refactor][2/N] Pipeline + Deploy Config Schema (qwen3_omni+hunyuan image)" Apr 9, 2026
Comment thread tests/dfx/perf/deploy/qwen3_omni.yaml Outdated
Comment thread tests/dfx/stability/deploy/qwen3_omni.yaml Outdated
Comment thread examples/offline_inference/qwen3_omni/end2end_async_chunk.py
Comment thread tests/e2e/deploy/qwen3_omni_ci.yaml Outdated
Comment thread test_online_repro.py Outdated
@lishunyang12 lishunyang12 changed the title from "[Config Refactor][2/N] Pipeline + Deploy Config Schema (qwen3_omni+hunyuan image)" to "[Config Refactor][2/N] Pipeline + Deploy Config Schema" Apr 10, 2026
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 10, 2026
tools/e2e_serve_smoke.sh starts the qwen3_omni vllm-omni server with the
new --deploy-config, waits for ready, optionally asserts a pattern in the
server log (Layer 5 precedence verification), sends a chat completion
request, verifies the response shape, and tears down cleanly.

Single script covers Layer 4 (bare serve) and Layer 5 (precedence
verification) via the E2E_LOG_GREP env var and forwarded extra args.

Signed-off-by: lishunyang <lishunyang12@163.com>
Comment thread tests/e2e/online_serving/test_qwen3_omni_expansion.py Outdated
Comment thread vllm_omni/deploy/qwen3_omni_moe.yaml
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 19, 2026
…oject#2383

vllm-project#2383 replaced the per-model stage_configs/*.yaml layout with auto-loaded
vllm_omni/deploy/<model>.yaml (Pipeline in Python, Deploy in YAML) and
switched the DFX runner's config-loading dir from stage_configs/ to
deploy/.  This PR's test matrix and bench CLI still carried the old
references:

- test_tts.json: drop `stage_config_name` from the Qwen3-TTS entries;
  vllm-omni now auto-loads vllm_omni/deploy/qwen3_tts.yaml for both
  Base and CustomVoice checkpoints.
- model_configs.yaml: drop the `stage_config` field — the bench CLI
  does not reference it and auto-discovery handles pipeline lookup.
- bench_tts.py: remove the dead `--stage-configs-dir` flag and the
  `_DEFAULT_STAGE_CONFIGS_DIR` constant; both were unused and pointed
  at a directory vllm-project#2383 deleted.
- Delete tests/dfx/perf/stage_configs/voxcpm2.yaml — the directory no
  longer exists post-vllm-project#2383.

VoxCPM2 is not yet migrated to the Pipeline + Deploy schema in vllm-project#2383
(only qwen2_5_omni / qwen3_omni / qwen3_tts ship pipeline.py + deploy
YAML) and still loads via the legacy `ModelPipeline` path.  Drop the
test_voxcpm2 entry from test_tts.json to unblock DFX nightly; will
re-add as a follow-up once VoxCPM2 gets its deploy YAML.

The latency / throughput / quality baselines remain unchanged — they
come from H20 sweeps on stable checkpoints and should still hold under
the new deploy YAML (stage 0 now sets max_num_seqs=10 and
async_scheduling=true, which can only improve throughput numbers).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
… in pipeline.py

Moves pipeline declarations to vllm_omni/config/pipeline_registry.py (one
dict per category, keyed by model_type -> (module, var)), mirroring vLLM's
models/registry.py. _PIPELINE_REGISTRY is now a lazy proxy that imports the
module on first lookup, so a missed registration is impossible to hide in a
per-model pipeline.py.

- New: vllm_omni/config/pipeline_registry.py (_OMNI_PIPELINES,
  _DIFFUSION_PIPELINES, union _VLLM_OMNI_PIPELINES)
- stage_config: replace dict _PIPELINE_REGISTRY with _LazyPipelineRegistry;
  drop the now-unnecessary _discover_all_pipelines walk.
- qwen2_5_omni / qwen3_omni / qwen3_tts pipeline.py: remove
  register_pipeline() self-calls; pipelines are declared centrally now.
- register_pipeline() kept public for plugins/tests; dynamic registrations
  override the central entry.

Addresses vllm-project#2887 item 4 and vllm-project#2383 (comment).
Preparatory work for #3/N (17 single-stage diffusion models).

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…inheritance test with overlay

build_stage_runtime_overrides: ``model``, ``stage_id``, ``log_stats`` and
``stage_configs_path`` are all in SHARED_FIELDS — they are set uniformly
by the orchestrator, not per-stage. Previously internal_blacklist_keys()
subtracted SHARED_FIELDS from orchestrator_field_names(), so these keys
leaked into a stage's runtime_overrides dict (e.g. a user passing
--model foo made every stage see {"model": "foo"} as a per-stage override).
Fix: default internal_keys to `internal_blacklist_keys() | SHARED_FIELDS`.
Fixes tests/test_config_factory.py ::test_cli_override_excludes_internal_keys,
::test_per_stage_override_excludes_internal_keys,
::test_build_stage_runtime_overrides_ignores_other_stage_and_internal_keys.

test_ci_inherits_from_main: CI overlay
(tests/utils._CI_OVERLAYS["qwen3_omni_moe"]) now explicitly sets
async_chunk: False (added in vllm-project#2383 fix #53) to override the base yaml.
Update the assertion to match current behaviour and document why.

Signed-off-by: lishunyang <lishunyang12@163.com>
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 20, 2026
Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four
scripts + a YAML registry with zero docs, leaving users to reverse-engineer
the CLI from `--help` output.  Add a single-page README covering:

- quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS)
- plot_results.py usage
- the three task types and which checkpoints support each (notes that
  -CustomVoice lacks speaker_encoder so voice_clone is Base-only)
- model_configs.yaml extension recipe for new TTS models
- dataset matrix (bundled seed_tts_design / seed_tts_smoke, external
  seed-tts-eval with link to the download guide)
- DFX nightly integration: latency / throughput / quality regimes,
  median-vs-mean baseline choice, quality-entry gating rationale
- observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel)
- file layout + cross-references to vllm-project#2558 and vllm-project#2383

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
…2383)

Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: xiaohajiayou <75477391+xiaohajiayou@users.noreply.github.com>
Co-authored-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 20, 2026
Migrate VoxCPM2, CosyVoice3, MiMo Audio, Voxtral TTS, and Fish Speech
S2 Pro to the Pipeline + Deploy config schema introduced in vllm-project#2383.

Each model now declares:
  * vllm_omni/model_executor/models/<model>/pipeline.py — frozen
    topology (model_type, stages, execution_type, input processors,
    sampling constraints).
  * vllm_omni/deploy/<model>.yaml — runtime tunables (max_num_seqs,
    gpu_memory_utilization, devices, sampling params).
  * pipeline_registry.py entry so the lazy loader resolves model_type
    → pipeline.

Legacy vllm_omni/model_executor/stage_configs/<model>.yaml files are
removed. Users can now launch with `vllm serve <model> --omni`; the
deploy config auto-loads from vllm_omni/deploy/<model>.yaml. Async-chunk
variants for CosyVoice3 and MiMo Audio live in separate deploy files
(<model>_async_chunk.yaml) selected with --deploy-config.

Notes:
  * MiMo Audio declares hf_architectures=("MiMoAudioForConditionalGeneration",)
    because MiMoAudioConfig inherits Qwen2Config and reports
    model_type="qwen2" — the factory falls back to architectures for
    disambiguation.
  * Fish Speech's registry key is "fish_qwen3_omni" matching the HF
    top-level model_type (FishSpeechConfig.model_type); the source
    directory stays as fish_speech for readability.
  * Voxtral TTS declares tokenizer_mode/config_format/load_format
    per-stage since they are not pipeline-wide DeployConfig fields yet.

Doc/example sweep:
  * examples/online_serving/voxcpm2/{README.md,openai_speech_client.py,
    gradio_demo.py}: replace stale `python -m ...api_server` invocation
    with `vllm serve openbmb/VoxCPM2 --omni`.
  * examples/online_serving/{fish_speech,mimo_audio}/README.md,
    examples/online_serving/fish_speech/run_{server,gradio_demo}.sh:
    drop --stage-configs-path; auto-load applies.
  * examples/offline_inference/{mimo_audio,voxtral_tts,cosyvoice3}:
    rename --stage-configs-path CLI arg to --deploy-config (default
    None) and forward as deploy_config= kwarg to Omni/AsyncOmni.
  * docs/serving/speech_api.md and docs/user_guide/examples/**: same
    sweep for docs.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 21, 2026
…r model

Follow @lishunyang12's review feedback on vllm-project#2958: match the qwen3_omni_moe
pattern from vllm-project#2383 where a single deploy yaml covers both sync and
async-chunk modes. Users toggle via the ``--async-chunk`` /
``--no-async-chunk`` CLI flag (both already wired via
``argparse.BooleanOptionalAction`` in ``cli/serve.py``).

Changes:
  * ``deploy/cosyvoice3.yaml`` now ships ``async_chunk: true`` with
    ``SharedMemoryConnector`` + ``output/input_connectors`` declared
    unconditionally. The ``sync_process_input_func`` (``text2flow``)
    declared on stage 1 in ``cosyvoice3/pipeline.py`` is picked up
    automatically when ``--no-async-chunk`` flips the mode.
  * ``deploy/mimo_audio.yaml`` now ships ``async_chunk: true`` with both
    stages on ``devices: "0"``. The legacy 2-GPU sync topology is
    reachable via ``--no-async-chunk --stage-1-devices 1
    --stage-1-max-model-len 18192 --stage-1-max-num-batched-tokens 18192``
    (see the header comment in the yaml).
  * Drop ``deploy/cosyvoice3_async_chunk.yaml`` and
    ``deploy/mimo_audio_async_chunk.yaml``.

Test + doc updates:
  * ``tests/e2e/offline_inference/test_cosyvoice3.py`` parametrizes on
    an ``async_chunk: bool`` flag (instead of yaml path) and passes it
    through ``OmniRunner(async_chunk=...)``. Drops the obsolete
    ``_patched_stage_config`` that only applied to the legacy
    ``stage_args`` schema.
  * ``tests/e2e/online_serving/test_cosyvoice3_tts.py`` keeps both
    sync / async parameter blocks pointed at the same deploy yaml
    and distinguishes them with ``--no-async-chunk`` in ``server_args``.
  * ``tests/e2e/online_serving/test_mimo_audio.py`` points at
    ``mimo_audio.yaml`` (the consolidated one).
  * Offline cosyvoice3 README/docs: rephrase the "two yamls" note to
    "one yaml, toggle with ``--no-async-chunk``".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@JaredforReal JaredforReal mentioned this pull request Apr 21, 2026
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
…2383)

Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: xiaohajiayou <75477391+xiaohajiayou@users.noreply.github.com>
Co-authored-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

Labels

`high priority` — high priority issue, needs to be done asap
`tts-test` — label to trigger buildkite tts models test in nightly CI
