[Config Refactor][2/N] Pipeline + Deploy Config Schema #2383
hsliuustc0106 merged 112 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 929007a841
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if value is not None:
    result[key] = value
```
Whitelist stage override keys before copying CLI kwargs
This loop forwards every non-None CLI kwarg that is not in a small denylist, so non-engine server flags can leak into per-stage engine_args. In the OpenAI server path, AsyncOmni is built from vars(args), which includes API/uvicorn options; once forwarded here, they eventually hit AsyncOmniEngineArgs(model, **engine_args) and can fail with unexpected-keyword errors for migrated models. Please filter by an allowlist of engine/runtime keys (or a parser-backed schema) instead of forwarding arbitrary kwargs.
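A minimal sketch of the suggested fix — filtering forwarded CLI kwargs through an explicit allowlist instead of a small denylist. `ENGINE_ARG_KEYS` and `forward_cli_kwargs` are hypothetical names for illustration; the real set would be derived from the engine-args parser or dataclass fields:

```python
# Hypothetical allowlist of engine/runtime keys (illustrative subset).
ENGINE_ARG_KEYS = {
    "max_model_len", "max_num_seqs", "gpu_memory_utilization",
    "tensor_parallel_size", "dtype", "seed",
}

def forward_cli_kwargs(cli_kwargs: dict) -> dict:
    """Copy only known engine keys, so server flags such as
    --host/--port/--api-key never reach per-stage engine_args."""
    return {
        key: value
        for key, value in cli_kwargs.items()
        if key in ENGINE_ARG_KEYS and value is not None
    }
```

With this shape, `forward_cli_kwargs({"host": "0.0.0.0", "max_model_len": 8192})` keeps only `max_model_len`, and an unrecognized server flag can no longer produce an unexpected-keyword error downstream.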
```python
ea["hf_config_name"] = self.hf_config_name
if self.engine_output_type:
    ea["engine_output_type"] = self.engine_output_type
ea["async_chunk"] = self.async_chunk
```
Preserve CLI async_chunk override in resolved config
CLI overrides can set async_chunk in engine_args, but this assignment overwrites that value with self.async_chunk, which is currently sourced from deploy YAML (deploy.async_chunk) in _resolve. As a result, --async-chunk cannot actually override deploy defaults, breaking the documented precedence and preventing users from toggling async mode at runtime.
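One way to preserve the documented precedence is to apply the deploy-sourced value only when the CLI has not already set the key. A sketch (function name illustrative, not the PR's code):

```python
def apply_async_chunk(ea: dict, deploy_async_chunk: bool) -> dict:
    # setdefault keeps a CLI-provided engine_args["async_chunk"] intact
    # and only falls back to the deploy-YAML value when the key is absent.
    ea.setdefault("async_chunk", deploy_async_chunk)
    return ea
```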
```python
"ray_address",
"batch_timeout",
"log_stats",
"tokenizer",
```
Forward tokenizer override for registered-model path
Marking tokenizer as an internal key causes it to be dropped from all stage overrides in the new factory path. The legacy YAML path explicitly injected tokenizer into stage engine args, so this is a behavior regression: users passing --tokenizer for migrated models (like qwen3_omni_moe) will silently run with the default tokenizer instead of the requested one.
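The behavioural difference reduces to whether `tokenizer` sits in the internal-keys set: if it does, it is stripped from every stage override. A sketch of the fix, with illustrative key names:

```python
# "tokenizer" deliberately NOT in the internal set, so a user-supplied
# --tokenizer flows through to stage engine args.
INTERNAL_KEYS = {"ray_address", "batch_timeout", "log_stats"}

def stage_overrides(cli_kwargs: dict) -> dict:
    # Internal keys are orchestrator-only and never reach stages;
    # everything else is forwarded.
    return {k: v for k, v in cli_kwargs.items() if k not in INTERNAL_KEYS}
```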
@hsliuustc0106 @david6666666 @wuhang2014 My P0 priority will be on this PR this week.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1e1c756f8b
```python
per_stage_only = {
    k: v for k, v in cli_overrides.items()
    if re.match(r"stage_\d+_", k) and v is not None
}
```
Preserve explicit global CLI overrides in registry path
Filtering cli_overrides down to only stage_<id>_* keys means explicit global engine flags are silently ignored for migrated models (for example --max-model-len, --max-num-seqs, or --gpu-memory-utilization). As a result, users can no longer tune all stages via normal CLI arguments and must rewrite everything into per-stage JSON overrides, which is a behavior regression versus the legacy path.
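The suggested behaviour is a partition rather than a filter: route `stage_<id>_*` keys to their stage and still apply the remaining engine flags globally. A sketch (helper name illustrative):

```python
import re

def split_overrides(cli_overrides: dict) -> tuple:
    """Partition CLI overrides into (global, per-stage) instead of
    dropping everything that lacks a stage_<id>_ prefix."""
    per_stage, global_overrides = {}, {}
    for k, v in cli_overrides.items():
        if v is None:
            continue  # unset argparse defaults carry no user intent
        target = per_stage if re.match(r"stage_\d+_", k) else global_overrides
        target[k] = v
    return global_overrides, per_stage
```

Globals like `--max-model-len` then reach every stage, while `stage_0_devices`-style keys stay scoped to stage 0.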
```python
config_path = deploy_config_path
stage_configs = load_stage_configs_from_model(
    model,
    base_engine_args=kwargs,
    deploy_config_path=deploy_config_path,
)
```
Parse deploy connector schema before using deploy path as config
This new branch forwards deploy YAML paths as config_path, but stage initialization still loads transfer connectors through load_omni_transfer_config_for_model, whose parser expects legacy runtime.connectors + stage_args structure. New deploy files use connectors + stages, so connector specs/extras are dropped; in async-chunk or distributed setups, custom connector backends and connector tuning in deploy configs will not take effect.
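A sketch of schema-aware loading — branch on whether the YAML uses the legacy `runtime.connectors` layout or the new top-level `connectors` key before building connector specs (function name hypothetical; the real parser lives in `load_omni_transfer_config_for_model`):

```python
def extract_connectors(cfg: dict) -> list:
    """Return connector specs from either deploy-config generation."""
    # New deploy schema: top-level `connectors` + `stages`.
    if "connectors" in cfg:
        return cfg["connectors"]
    # Legacy schema: connectors nested under `runtime`.
    return cfg.get("runtime", {}).get("connectors", [])
```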
tools/e2e_serve_smoke.sh starts the qwen3_omni vllm-omni server with the new --deploy-config, waits for readiness, optionally asserts a pattern in the server log (Layer 5 precedence verification), sends a chat completion request, verifies the response shape, and tears down cleanly. A single script covers Layer 4 (bare serve) and Layer 5 (precedence verification) via the E2E_LOG_GREP env var and forwarded extra args. Signed-off-by: lishunyang <lishunyang12@163.com>
vllm-project#2383 replaced the per-model stage_configs/*.yaml layout with auto-loaded vllm_omni/deploy/<model>.yaml (Pipeline in Python, Deploy in YAML) and switched the DFX runner's config-loading dir from stage_configs/ to deploy/. This PR's test matrix and bench CLI still carried the old references:
- test_tts.json: drop `stage_config_name` from the Qwen3-TTS entries; vllm-omni now auto-loads vllm_omni/deploy/qwen3_tts.yaml for both Base and CustomVoice checkpoints.
- model_configs.yaml: drop the `stage_config` field — the bench CLI does not reference it and auto-discovery handles pipeline lookup.
- bench_tts.py: remove the dead `--stage-configs-dir` flag and the `_DEFAULT_STAGE_CONFIGS_DIR` constant; both were unused and pointed at a directory vllm-project#2383 deleted.
- Delete tests/dfx/perf/stage_configs/voxcpm2.yaml — the directory no longer exists post-vllm-project#2383.

VoxCPM2 is not yet migrated to the Pipeline + Deploy schema in vllm-project#2383 (only qwen2_5_omni / qwen3_omni / qwen3_tts ship pipeline.py + deploy YAML) and still loads via the legacy `ModelPipeline` path. Drop the test_voxcpm2 entry from test_tts.json to unblock DFX nightly; will re-add as a follow-up once VoxCPM2 gets its deploy YAML.

The latency / throughput / quality baselines remain unchanged — they come from H20 sweeps on stable checkpoints and should still hold under the new deploy YAML (stage 0 now sets max_num_seqs=10 and async_scheduling=true, which can only improve throughput numbers).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… in pipeline.py

Moves pipeline declarations to vllm_omni/config/pipeline_registry.py (one dict per category, keyed by model_type -> (module, var)), mirroring vLLM's models/registry.py. _PIPELINE_REGISTRY is now a lazy proxy that imports the module on first lookup, so a missed registration is impossible to hide in a per-model pipeline.py.
- New: vllm_omni/config/pipeline_registry.py (_OMNI_PIPELINES, _DIFFUSION_PIPELINES, union _VLLM_OMNI_PIPELINES)
- stage_config: replace the dict _PIPELINE_REGISTRY with _LazyPipelineRegistry; drop the now-unnecessary _discover_all_pipelines walk.
- qwen2_5_omni / qwen3_omni / qwen3_tts pipeline.py: remove register_pipeline() self-calls; pipelines are declared centrally now.
- register_pipeline() kept public for plugins/tests; dynamic registrations override the central entry.

Addresses vllm-project#2887 item 4 and vllm-project#2383 (comment). Preparatory work for #3/N (17 single-stage diffusion models).

Signed-off-by: lishunyang <lishunyang12@163.com>
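The lazy-proxy idea can be sketched as a mapping that imports the declaring module only on first lookup, so an unregistered `model_type` fails immediately with a `KeyError`. Module paths and names here are illustrative, not the PR's exact code:

```python
import importlib

# model_type -> (module path, attribute name), mirroring the central registry.
_OMNI_PIPELINES = {
    "qwen3_omni_moe": ("vllm_omni.model_executor.models.qwen3_omni.pipeline", "PIPELINE"),
}

class LazyPipelineRegistry:
    """Dict-like proxy: defers the module import until first lookup."""

    def __init__(self, table):
        self._table = table
        self._cache = {}

    def __getitem__(self, model_type):
        if model_type not in self._cache:
            # KeyError here means the model was never registered centrally.
            module_path, attr = self._table[model_type]
            module = importlib.import_module(module_path)
            self._cache[model_type] = getattr(module, attr)
        return self._cache[model_type]
```

A registry built over a stdlib module, e.g. `LazyPipelineRegistry({"demo": ("math", "pi")})`, resolves `registry["demo"]` on first access and caches the result.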
…inheritance test with overlay
build_stage_runtime_overrides: ``model``, ``stage_id``, ``log_stats`` and
``stage_configs_path`` are all in SHARED_FIELDS — they are set uniformly
by the orchestrator, not per-stage. Previously internal_blacklist_keys()
subtracted SHARED_FIELDS from orchestrator_field_names(), so these keys
leaked into a stage's runtime_overrides dict (e.g. a user passing
--model foo made every stage see {"model": "foo"} as a per-stage override).
Fix: default internal_keys to `internal_blacklist_keys() | SHARED_FIELDS`.
Fixes tests/test_config_factory.py ::test_cli_override_excludes_internal_keys,
::test_per_stage_override_excludes_internal_keys,
::test_build_stage_runtime_overrides_ignores_other_stage_and_internal_keys.
test_ci_inherits_from_main: CI overlay
(tests/utils._CI_OVERLAYS["qwen3_omni_moe"]) now explicitly sets
async_chunk: False (added in vllm-project#2383 fix #53) to override the base yaml.
Update the assertion to match current behaviour and document why.
Signed-off-by: lishunyang <lishunyang12@163.com>
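The fix described above reduces to widening the default exclusion set with the shared fields. A minimal sketch — the field names come from the commit message, the rest is illustrative:

```python
# Fields set uniformly by the orchestrator, never per-stage.
SHARED_FIELDS = {"model", "stage_id", "log_stats", "stage_configs_path"}

def internal_blacklist_keys():
    return {"ray_address", "batch_timeout"}  # illustrative orchestrator-only keys

def build_stage_runtime_overrides(cli_kwargs, internal_keys=None):
    # Default now unions SHARED_FIELDS, so an orchestrator-set key like
    # --model no longer leaks into every stage's runtime_overrides dict.
    if internal_keys is None:
        internal_keys = internal_blacklist_keys() | SHARED_FIELDS
    return {
        k: v for k, v in cli_kwargs.items()
        if k not in internal_keys and v is not None
    }
```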
Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four scripts + a YAML registry with zero docs, leaving users to reverse-engineer the CLI from `--help` output. Add a single-page README covering:
- quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS)
- plot_results.py usage
- the three task types and which checkpoints support each (notes that -CustomVoice lacks speaker_encoder, so voice_clone is Base-only)
- model_configs.yaml extension recipe for new TTS models
- dataset matrix (bundled seed_tts_design / seed_tts_smoke, external seed-tts-eval with link to the download guide)
- DFX nightly integration: latency / throughput / quality regimes, median-vs-mean baseline choice, quality-entry gating rationale
- observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel)
- file layout + cross-references to vllm-project#2558 and vllm-project#2383

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…2383) Signed-off-by: lishunyang <lishunyang12@163.com> Signed-off-by: reidliu41 <reid201711@gmail.com> Signed-off-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: reidliu41 <reid201711@gmail.com> Co-authored-by: xiaohajiayou <75477391+xiaohajiayou@users.noreply.github.com> Co-authored-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Migrate VoxCPM2, CosyVoice3, MiMo Audio, Voxtral TTS, and Fish Speech S2 Pro to the Pipeline + Deploy config schema introduced in vllm-project#2383. Each model now declares:
* vllm_omni/model_executor/models/<model>/pipeline.py — frozen topology (model_type, stages, execution_type, input processors, sampling constraints).
* vllm_omni/deploy/<model>.yaml — runtime tunables (max_num_seqs, gpu_memory_utilization, devices, sampling params).
* a pipeline_registry.py entry so the lazy loader resolves model_type → pipeline.

Legacy vllm_omni/model_executor/stage_configs/<model>.yaml files are removed. Users can now launch with `vllm serve <model> --omni`; the deploy config auto-loads from vllm_omni/deploy/<model>.yaml. Async-chunk variants for CosyVoice3 and MiMo Audio live in separate deploy files (<model>_async_chunk.yaml) selected with --deploy-config.

Notes:
* MiMo Audio declares hf_architectures=("MiMoAudioForConditionalGeneration",) because MiMoAudioConfig inherits Qwen2Config and reports model_type="qwen2" — the factory falls back to architectures for disambiguation.
* Fish Speech's registry key is "fish_qwen3_omni", matching the HF top-level model_type (FishSpeechConfig.model_type); the source directory stays as fish_speech for readability.
* Voxtral TTS declares tokenizer_mode/config_format/load_format per-stage since they are not pipeline-wide DeployConfig fields yet.

Doc/example sweep:
* examples/online_serving/voxcpm2/{README.md,openai_speech_client.py,gradio_demo.py}: replace the stale `python -m ...api_server` invocation with `vllm serve openbmb/VoxCPM2 --omni`.
* examples/online_serving/{fish_speech,mimo_audio}/README.md and examples/online_serving/fish_speech/run_{server,gradio_demo}.sh: drop --stage-configs-path; auto-load applies.
* examples/offline_inference/{mimo_audio,voxtral_tts,cosyvoice3}: rename the --stage-configs-path CLI arg to --deploy-config (default None) and forward it as a deploy_config= kwarg to Omni/AsyncOmni.
* docs/serving/speech_api.md and docs/user_guide/examples/**: same sweep for docs.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
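A hedged sketch of what such a deploy YAML might look like under the schema this series introduces. All keys below are inferred from descriptions in this thread, not copied from the repo — consult the shipped `vllm_omni/deploy/*.yaml` files for the authoritative shape:

```yaml
# Illustrative only — field names inferred from the PR description.
async_chunk: true
stages:
  - stage_id: 0
    devices: "0"
    max_num_seqs: 10
    gpu_memory_utilization: 0.8
  - stage_id: 1
    devices: "0"
connectors:
  - name: SharedMemoryConnector
platforms:          # platform deltas inline, instead of parallel files
  npu:
    stages:
      - stage_id: 0
        gpu_memory_utilization: 0.7
```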
…r model

Follow @lishunyang12's review feedback on vllm-project#2958: match the qwen3_omni_moe pattern from vllm-project#2383, where a single deploy yaml covers both sync and async-chunk modes. Users toggle via the ``--async-chunk`` / ``--no-async-chunk`` CLI flag (both already wired via ``argparse.BooleanOptionalAction`` in ``cli/serve.py``).

Changes:
* ``deploy/cosyvoice3.yaml`` now ships ``async_chunk: true`` with ``SharedMemoryConnector`` + ``output/input_connectors`` declared unconditionally. The ``sync_process_input_func`` (``text2flow``) declared on stage 1 in ``cosyvoice3/pipeline.py`` is picked up automatically when ``--no-async-chunk`` flips the mode.
* ``deploy/mimo_audio.yaml`` now ships ``async_chunk: true`` with both stages on ``devices: "0"``. The legacy 2-GPU sync topology is reachable via ``--no-async-chunk --stage-1-devices 1 --stage-1-max-model-len 18192 --stage-1-max-num-batched-tokens 18192`` (see the header comment in the yaml).
* Drop ``deploy/cosyvoice3_async_chunk.yaml`` and ``deploy/mimo_audio_async_chunk.yaml``.

Test + doc updates:
* ``tests/e2e/offline_inference/test_cosyvoice3.py`` parametrizes on an ``async_chunk: bool`` flag (instead of a yaml path) and passes it through ``OmniRunner(async_chunk=...)``. Drops the obsolete ``_patched_stage_config`` that only applied to the legacy ``stage_args`` schema.
* ``tests/e2e/online_serving/test_cosyvoice3_tts.py`` keeps both sync / async parameter blocks pointed at the same deploy yaml and distinguishes them with ``--no-async-chunk`` in ``server_args``.
* ``tests/e2e/online_serving/test_mimo_audio.py`` points at ``mimo_audio.yaml`` (the consolidated one).
* Offline cosyvoice3 README/docs: rephrase the "two yamls" note to "one yaml, toggle with ``--no-async-chunk``".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
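The ``--async-chunk`` / ``--no-async-chunk`` toggle mentioned above relies on the stdlib ``argparse.BooleanOptionalAction`` (Python 3.9+). A minimal standalone sketch of the pattern:

```python
import argparse

parser = argparse.ArgumentParser()
# One declaration yields both --async-chunk and --no-async-chunk.
# default=None lets the deploy-YAML value win when the user types neither flag,
# which is how "explicit CLI beats YAML, YAML beats nothing" stays detectable.
parser.add_argument(
    "--async-chunk",
    action=argparse.BooleanOptionalAction,
    default=None,
)
```

`parser.parse_args(["--no-async-chunk"]).async_chunk` is `False`, `["--async-chunk"]` gives `True`, and an empty argv leaves it `None` so the YAML default applies.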
RFC: #2072
Motivation
Before this refactor, vLLM-Omni's multi-stage pipelines mixed topology (which stages exist, how they wire, what functions they call) and deployment parameters (TP size, memory budgets, device placement, connectors) into a single `stage_configs/<model>.yaml` per platform. Adding a new platform meant editing N files; changing a `max_num_seqs` meant forking the whole YAML; model developers and deployment engineers edited the same file with different concerns in mind.

This PR implements RFC #2072 — splitting the legacy YAML into two layers connected by a runtime merge with documented precedence.
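The layered merge described above can be sketched as successive dict merges, weakest layer first (a simplified flat-key sketch; the real merge is per-stage and deep):

```python
def resolve(*layers):
    """Merge config layers; later (stronger) layers win key-by-key."""
    out = {}
    for layer in layers:
        out.update(layer)
    return out

# parser defaults < base deploy YAML < overlay < platform < global CLI < per-stage CLI
resolved = resolve(
    {"max_num_seqs": 8, "dtype": "auto"},   # parser defaults (weakest)
    {"max_num_seqs": 10},                    # base deploy YAML
    {"gpu_memory_utilization": 0.8},         # platform section
    {"max_num_seqs": 16},                    # per-stage CLI (strongest)
)
```

Here the per-stage CLI value wins for `max_num_seqs`, while untouched keys (`dtype`, `gpu_memory_utilization`) fall through from weaker layers.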
Design
Before vs. after
```mermaid
flowchart LR
  subgraph Before["Before — single YAML per platform"]
    direction TB
    Y1["stage_configs/qwen3_omni_moe.yaml<br/>(topology + params + connectors)"]
    Y2["platforms/npu/stage_configs/qwen3_omni_moe.yaml<br/>(full copy for NPU)"]
    Y3["platforms/rocm/stage_configs/qwen3_omni_moe.yaml<br/>(full copy for ROCm)"]
  end
  subgraph After["After — split by concern"]
    direction TB
    P["models/qwen3_omni/pipeline.py<br/>(frozen topology — developer-owned)"]
    D["deploy/qwen3_omni_moe.yaml<br/>(CUDA defaults — deployer-owned)"]
    DN["deploy/qwen3_omni_moe.yaml<br/>:platforms.npu.stages<br/>(platform deltas inline)"]
  end
  Before --> After
```

- The pipeline layer (`models/<name>/pipeline.py`) calls `register_pipeline(PipelineConfig(...))` at import time. Frozen — deploy cannot reshape the graph.
- The deploy layer (`deploy/<model>.yaml`) carries per-stage TP size, GPU memory, device placement, connectors, and platform deltas.
- A single `platforms: { npu, rocm, xpu }` section replaces three parallel files.

Two-level config objects
`PipelineConfig` (`frozen=True`, declared in `pipeline.py`) holds the topology; `DeployConfig` (loaded from `deploy/*.yaml`) holds the runtime tunables.

Precedence chain
```mermaid
flowchart LR
  D1["Parser defaults"] -->|weakest| D2["Base deploy YAML"]
  D2 --> D3["Overlay YAML<br/>via base_config:"]
  D3 --> D4["Platform section<br/>platforms.npu.stages"]
  D4 --> D5["Global CLI<br/>--gpu-memory-utilization"]
  D5 -->|strongest| D6["Per-stage CLI<br/>--stage-overrides JSON"]
```

User-typed keys are tracked via
`_cli_explicit_keys` (parser-aware: it walks `parser._actions` so `--disable-X` → `dest=enable_X` and alias flags resolve correctly), so argparse defaults do not silently overwrite YAML values.

CLI flag routing — OrchestratorArgs
`make_arg_parser` flattens uvicorn / FastAPI / engine / orchestrator flags into a single namespace. The old code maintained two hardcoded frozensets (~49 strings total) as denylists — fragile. `OrchestratorArgs` replaces them with a dataclass; `split_kwargs` classifies each flag by dataclass membership; CI invariants in `tests/test_arg_utils.py` catch unclassified flags at test time.

Auto-discovery
No more hardcoded `PIPELINE_MODELS` / `_ARCHITECTURE_MODELS` dicts. `_discover_all_pipelines` scans `model_executor/models/*/pipeline.py` and registers them; a contributor adding a new model just drops a `pipeline.py`.

Summary of Changes
- New: `vllm_omni/engine/arg_utils.py` (`OrchestratorArgs` + `SHARED_FIELDS` + `split_kwargs`), `vllm_omni/deploy/*.yaml` (3 default deploy configs + CI overlays)
- New: `models/qwen2_5_omni/pipeline.py`, `qwen3_omni/pipeline.py`, `qwen3_tts/pipeline.py`
- New CLI flags: `--deploy-config`, `--stage-overrides`, `--async-chunk` / `--no-async-chunk`
- Removed: `INTERNAL_STAGE_OVERRIDE_KEYS`, `SERVER_ONLY_KEYS`, `PIPELINE_MODELS`, `_ARCHITECTURE_MODELS` — replaced by dataclass-derived invariants and auto-discovery
- Refactor: `merge_pipeline_deploy` split into 4 single-responsibility helpers (SLAP); `_apply_platform_overrides` deduplicated; execution_type → (stage_type, worker_type) lookup table; parser-aware `detect_explicit_cli_keys`
- Pipeline-wide fields (`trust_remote_code`, `distributed_executor_backend`, `dtype`, `quantization`, `enable_prefix_caching`, `enable_chunked_prefill`, `data_parallel_size`, `pipeline_parallel_size`) moved from per-stage to top-level `DeployConfig`
- Hardening: `merge_pipeline_deploy` raises if `async_chunk=True` but no stage declares an async handler; `get_scheduler_cls` raises on invalid `stage_id` / unmapped execution_type; `_deep_merge_stage` warns on type-mismatch clobber; `--stage-configs-path` and `--deploy-config` are mutex; scheduler map stores class refs (rename fails at import)
- Back-compat: `--stage-configs-path` (deprecated in help text, to be removed in 2c); `ModelPipeline` / `StageConfig` / `_parse_pipeline_yaml` preserved for not-yet-migrated models
- Tests: `tests/test_arg_utils.py` (15 invariants incl. BVA), expanded `tests/test_config_factory.py` (+644 lines)
- `StageDeployConfig` dataclass default (single source of truth at `vllm_omni/config/stage_config.py`)
- `Omni.from_cli_args(args, parser=parser)` / `AsyncOmni.from_cli_args(args, parser=parser)` mirror `OmniEngineArgs.from_cli_args`; optional `parser=` enables accurate `_cli_explicit_keys` resolution
- Docs: `docs/configuration/stage_configs.md` rewritten with unified schema tables, connector schema, and a worked override precedence example; `examples/online_serving/qwen3_tts/README.md` gains a Sync vs async-chunk mode section

Test Plan
1. Unit tests + smoke scripts (CPU-only)
2. E2E launch matrices (GPU box)
qwen2_5_omni
qwen3_omni_moe
qwen3_tts (async vs sync codec, both from one yaml)
3. Server flag isolation (regression check for #873-class bugs)
```shell
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 \
  --host 0.0.0.0 \
  --served-model-name my-omni \
  --api-key secret123 \
  --allowed-local-media-path /tmp/
```

Expected: clean startup. A `TypeError: unexpected keyword argument 'host'` from `OmniEngineArgs.__init__` would indicate server flags leaking into the per-stage engine path.

Review feedback addressed
19 threads from @alex-jw-brooks resolved — correctness fixes (mutex validation for `--stage-configs-path` / `--deploy-config`, the `async_chunk` handler check, `get_scheduler_cls` error paths, the deep-merge clobber warning, parser-aware flag detection), cleanups (removed the redundant `qwen3_tts_no_async_chunk` alias and dead `get_stage_config` wrappers in 4 test files), and doc clarifications (logical device IDs, `engine_extras` rationale). See #2887 for deferred follow-ups (hardware auto-sizing, model-instance-driven config values, override type validation, central pipeline registry).

What ships in follow-ups
- Drop `--stage-configs-path` and legacy `ModelPipeline` / `_parse_pipeline_yaml`; migrate remaining legacy models (fish_speech, cosyvoice3, mimo_audio, voxtral_tts) to the registry; split `stage_config.py` (~1200 LOC) into focused modules once the legacy surface is gone.
- Move `tools/smoke_*.py` into its own PR.