[New Model] Add support for tencent/Covo-Audio-Chat #2293
Dnoob wants to merge 25 commits into vllm-project:main
Conversation
Force-pushed from a142631 to e44e30e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b064383be
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Force-pushed from e44e30e to cf389cb
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf389cbb49
```yaml
engine_args:
  model_stage: fused_thinker_talker
  max_num_seqs: 1
  model_arch: CovoAudioForConditionalGeneration
```
Use a registered Covo model_arch in stage config
The new stage config sets `model_arch: CovoAudioForConditionalGeneration`, but this architecture key is not registered in vllm_omni/model_executor/models/registry.py (the commit only registers `CovoAudioForCausalLM` and `CovoAudioModel` as keys that map to the Covo class). Because `OmniEngineArgs.create_model_config()` injects `model_arch` directly into `architectures`, serving with vllm_omni/model_executor/stage_configs/covo_audio.yaml can fail model resolution at startup instead of loading Covo. Either the config should use a registered key, or the registry should add `CovoAudioForConditionalGeneration` explicitly.
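The failure mode can be illustrated with a toy registry. This is only a sketch of the lookup pattern; vllm_omni's actual registry API may differ, and the dict and function names here are illustrative:

```python
# Toy illustration of architecture-key resolution, NOT vllm_omni's real API.
# The commit registers only these two keys:
MODEL_REGISTRY = {
    "CovoAudioForCausalLM": "Covo",
    "CovoAudioModel": "Covo",
}

def resolve(arch: str) -> str:
    # Mirrors what model resolution does at startup: an unregistered
    # model_arch raises instead of loading the Covo class.
    if arch not in MODEL_REGISTRY:
        raise ValueError(f"Model architecture {arch!r} is not registered")
    return MODEL_REGISTRY[arch]

# The stage config's key fails until it is registered explicitly:
MODEL_REGISTRY["CovoAudioForConditionalGeneration"] = "Covo"
print(resolve("CovoAudioForConditionalGeneration"))  # Covo
```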
```python
_QWEN3_TTS_MODEL_STAGES = {"qwen3_tts"}
_FISH_TTS_MODEL_STAGES = {"fish_speech_slow_ar"}
_TTS_MODEL_STAGES: set[str] = _VOXTRAL_TTS_MODEL_STAGES | _QWEN3_TTS_MODEL_STAGES | _FISH_TTS_MODEL_STAGES
_COVO_AUDIO_MODEL_STAGES = {"fused_thinker_talker"}
```
Restrict Covo TTS stage matching to Covo models
Marking fused_thinker_talker as a Covo TTS stage is too broad: the same model_stage is used by non-Covo pipelines (for example, MiMoAudio stage configs), so _find_tts_stage/_detect_tts_model_type will classify those models as covo_audio and route /v1/audio/speech through Covo-specific prompt construction. This can produce incorrect prompts or runtime failures for other models that share the stage name; model detection should include an architecture/model check instead of relying on stage name alone.
Force-pushed from 1232884 to b372618
linyueqian left a comment:
Tested locally on A100-80G. A few findings:
Bug: model path resolution breaks with HF repo names
covo_audio_code2wav.py:26 uses `vllm_config.model_config.model` directly as a filesystem path. When the model is specified as a HF repo name (e.g. `tencent/Covo-Audio-Chat`), `os.path.join(model_path, "token2wav", ...)` resolves to a relative path that doesn't exist:

```
FileNotFoundError: [Errno 2] No such file or directory: 'tencent/Covo-Audio-Chat/token2wav/global_mean_var.npy'
```

Need to resolve to the local cache path first, e.g.:

```python
if os.path.isdir(model_name):
    model_path = model_name
else:
    from huggingface_hub import snapshot_download
    model_path = snapshot_download(model_name)
```

Missing dependency: torchdiffeq
The vendored token2wav flow matching module requires `torchdiffeq`, but it's not declared in pyproject.toml. Stage 1 fails with `ModuleNotFoundError: No module named 'torchdiffeq'`.
Audio output cuts off abruptly
With the CI config (max_tokens: 512, ignore_eos: true), the text-to-audio test produces ~9.25s of audio that ends mid-sentence. The interleaving ratio (5 text + 15 audio tokens) means only ~384 of the 512 tokens are audio codes. Consider either increasing max_tokens or removing ignore_eos: true so the model can stop naturally.
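The token budget arithmetic behind this observation, assuming the 5-text + 15-audio interleaving described above:

```python
# Interleaving ratio: each 20-token block is 5 text tokens + 15 audio codes,
# so roughly 3/4 of the generation budget becomes audio.
max_tokens = 512
audio_tokens = max_tokens * 15 // 20
print(audio_tokens)  # 384

# At the observed ~9.25 s of output for ~384 codes, that is roughly
# 41.5 codes per second (inferred from this comment's numbers, not a
# documented codec rate).
codes_per_second = audio_tokens / 9.25
```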
Minor
- PR is 42 commits behind main, needs rebase
- `test_audio_to_audio` requires the `espeak-ng` system package (via `pyttsx3` for synthetic audio generation). Worth noting in the test plan or adding a skip condition.
test_text_to_audio passes after fixing the path issue. Model produces reasonable text + audio output.
For the

@amy-why-3459 PTAL
Thanks for the fixes on path resolution and the `torchdiffeq` guard:

```python
try:
    from torchdiffeq import odeint
except ImportError:
    raise ImportError(
        "Covo-Audio code2wav requires `torchdiffeq`. "
        "Install it with: pip install torchdiffeq"
    )
```

Still outstanding:
Thank you very much for your contribution. Could you please add a README file for the model?
```python
from .token2latent import Token2latentFlowMatchingWithEmbed


class Token2WavDecoder(BaseModel):
```
Can we put these function definitions in the model file to avoid creating too many folders?
Sorry, I didn't consider this earlier and directly copied the structure from the model repo, which made it bloated. I've analyzed it and plan to consolidate the entire token2wav/ folder into a single token2wav.py file and remove unnecessary code. Will update in the next push.
Force-pushed from 8840f97 to 7d6696c
hsliuustc0106 left a comment:
It seems there is a lot of dead code in this PR.
Already added
lishunyang12 left a comment:
Left a few comments on the core model files. The vendored token2wav code is fine to skip lint-wise, but the non-vendored parts have a couple of issues.
```python
inputs_embeds: torch.Tensor | None = None,
generate_audio: bool = True,
codec: torch.Tensor | None = None,
sampling_metadata: SamplingMetadata | None = None,
```
This materializes the entire safetensors weight iterator into a Python list just to filter by prefix. For a 7B model that's ~14 GB of extra peak memory. Use a generator instead:
```python
llm_weights = ((k, v) for k, v in weights if k.startswith(("llm", "encoder", "audio_adapter")))
```
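A toy illustration of why the generator matters (stand-in names; the real loader iterates safetensors shards): a list comprehension holds every tensor while filtering, whereas a generator expression keeps only one (name, tensor) pair alive at a time.

```python
# Stand-in for the safetensors weight iterator; yields lazily.
def iter_weights():
    for name in ["llm.embed", "encoder.conv", "vision.proj", "audio_adapter.fc"]:
        yield name, f"<tensor for {name}>"

# Generator expression: filtering is interleaved with iteration, so the
# full weight set is never materialized at once.
llm_weights = (
    (k, v) for k, v in iter_weights()
    if k.startswith(("llm", "encoder", "audio_adapter"))
)
print([k for k, _ in llm_weights])
# ['llm.embed', 'encoder.conv', 'audio_adapter.fc']
```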
```python
    **kwargs,
)

return OmniOutput(
```
Several of these forward params (generate_audio, codec, logits_index, sampler, additional_information) are never used. Remove them — dead parameters in the forward signature are confusing for anyone reading the dispatch logic.
```python
logger.info(
    "Request %s: total_tokens=%d, text_tokens=%d, audio_tokens=%d",
    request_id,
    len(token_ids),
```
This is a lot of per-request logging at INFO level (token counts, full interleaving pattern). In production with concurrent requests this will flood the logs. Drop the pattern log to DEBUG, or remove it.
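A minimal sketch of the suggested change using the stdlib logging module (the logger name and helper are illustrative, not the PR's actual code): per-request details go to DEBUG, so production INFO logs stay quiet while the data remains available when debugging.

```python
import logging

logger = logging.getLogger("covo_audio.stage_input")

def log_request(request_id: str, token_ids: list, text_tokens: int) -> None:
    # DEBUG, not INFO: emitted only when someone opts in via log level.
    logger.debug(
        "Request %s: total_tokens=%d, text_tokens=%d, audio_tokens=%d",
        request_id, len(token_ids), text_tokens, len(token_ids) - text_tokens,
    )

logging.basicConfig(level=logging.INFO)
log_request("req-1", list(range(20)), 5)  # silent at the default INFO level
```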
Adapt Covo-Audio-Chat (Tencent, 7B end-to-end audio language model) to vllm-omni with a 2-stage pipeline:
- Stage 0 (fused_thinker_talker): Whisper encoder + AudioAdapter + Qwen2.5-7B LLM → interleaved text + audio tokens
- Stage 1 (code2wav): BigVGAN vocoder → 24kHz audio waveform

Closes vllm-project#2004

Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
… for Covo-Audio Signed-off-by: Dnoob <dxpouo@gmail.com>
…Audio-Chat
- Add prompt_utils.py with shared system prompt and prompt builders, removing 3 duplicates
- Add offline inference example (end2end.py + README)
- Fix online client: add system prompt, stop_token_ids, ignore_eos, detokenize=false
- Fix stage config detokenize setting for code2wav
- Use local sample audio for online example instead of S3 download
- Add --port 18091 to online README to match client config

Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
- Use generator instead of list comprehension in load_weights to reduce peak memory
- Remove unused forward parameters (generate_audio, codec, logits_index, sampler, additional_information)
- Remove debug logging from stage input processor

Signed-off-by: Dnoob <dxpouo@gmail.com>
Force-pushed from b0f956b to 9600536
- Fix MultiModalDataDict import path for vllm 0.19.0 compatibility
- Remove unsupported text-only test (model requires audio input)
- Use sample_audio.wav instead of pyttsx3 synthetic audio in test
- Remove espeak-ng dependency from test

Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
@lishunyang12 @amy-why-3459 Addressed all review feedback, rebased on main, and updated the PR description. PTAL, thanks!
```python
    return "请回答这段音频里的问题。"  # "Please answer the question in this audio."


@pytest.mark.core_model
```
If this is to be added to nightly, please:
1. Change the label to @pytest.mark.advanced_model
2. Rename this script to test_covo_audio_expansion.py
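The requested relabeling is a one-line decorator change. A sketch (the test body is elided; the marker must also be registered in the project's pytest config to avoid warnings):

```python
import pytest

# advanced_model instead of core_model, per the nightly-test convention
# requested in review.
@pytest.mark.advanced_model
def test_covo_audio_expansion():
    ...

# pytest attaches markers via the function's `pytestmark` attribute:
marks = [m.name for m in test_covo_audio_expansion.pytestmark]
print(marks)  # ['advanced_model']
```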
```python
    }
]

outputs = omni_runner.generate(omni_inputs)
```
Maybe you can use `omni_runner_handler.send_request(request_config)` from conftest.py, like tests/e2e/offline_inference/test_qwen3_omni.py.
```yaml
if: build.env("NIGHTLY") == "1" || build.pull_request.labels includes "nightly-test"
commands:
  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
  - pytest -s -v tests/e2e/offline_inference/test_covo_audio.py -m "core_model" --run-level "core_model"
```
I think you can modify the commands in "🌕 Omni Model Test with H100" instead of adding a separate new job.
Signed-off-by: Dnoob <dxpouo@gmail.com>
@yenuo26 Sorry, I forgot to add the test file under the expected directory. I checked how other omni models organize their tests in this repo; the convention seems to be offline tests under tests/e2e/offline_inference/. Adjusted accordingly.
Signed-off-by: Dnoob <dxpouo@gmail.com>
Head branch was pushed to by a user without write access
Signed-off-by: Dnoob <dxpouo@gmail.com>
Fix CI, please.
Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
@linyueqian @hsliuustc0106 CI failures are fixed and all checks pass. PTAL, thanks!
| Architecture | Model | Example HF Models | NVIDIA GPU | AMD GPU | Ascend NPU | Intel GPU |
|---|---|---|---|---|---|---|
| `HunyuanVideo15Pipeline` | HunyuanVideo-1.5-T2V | `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v`, `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v` | ✅︎ | ✅︎ | | |
| `HunyuanVideo15ImageToVideoPipeline` | HunyuanVideo-1.5-I2V | `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v`, `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_i2v` | ✅︎ | ✅︎ | | |
| `VoxtralTTSForConditionalGeneration` | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | ✅︎ | ✅︎ | | |
| `CovoAudioForConditionalGeneration` | Covo-Audio-Chat | `tencent/Covo-Audio-Chat` | ✅︎ | | | |
nit: This PR includes online serving support (OpenAI-compatible client example + test_covo_audio_expansion.py), so the Online column should be ✅︎ instead of empty, to match the other models in this table.
I don't see an "Online" column in this table, just NVIDIA GPU, AMD GPU, Ascend NPU, and Intel GPU. Which one did you mean?
```python
self.o_dropout = Dropout(dropout)

self.cpu_config = AttentionConfig(True, True, True)
device_properties = torch.cuda.get_device_properties(torch.device("cuda"))
```
nit (vendored code): torch.cuda.get_device_properties(torch.device("cuda")) hardcodes device index 0. In a multi-GPU setup where this vocoder runs on a non-default CUDA device, this may select the wrong flash-attention config. I understand this is vendored from upstream, so just flagging for awareness — no action required for this PR.
hsliuustc0106 left a comment:
Blocker Scan
| Category | Result |
|---|---|
| Correctness | PASS |
| Reliability/Safety | PASS |
| Breaking Changes | PASS |
| Test Coverage | PASS (offline test in PR desc, CI green, nightly test added) |
| Documentation | PASS (supported_models.md, offline + online examples + README) |
| Security | PASS |
Merge Gate
- DCO: PASS
- pre-commit: PASS
- Build: PASS
- Buildkite CI (amd/intel): PASS
- Mergeable: FAIL — CONFLICTING, needs rebase onto latest main
Summary
The code is well-structured after multiple review rounds. All previously flagged issues have been addressed:
- HF repo path resolution fixed (snapshot_download fallback)
- `torchdiffeq` lazy import with clear error message
- Dead parameters removed from forward signatures
- INFO-level log flooding fixed
- `llm_weights` uses a generator instead of list materialization
- Vendored `token2wav/` consolidated into a single `token2wav.py`
- `CovoAudioForConditionalGeneration` registered in the model registry
- CI test config updated with `max_tokens: 2048`

Left 2 nits as inline comments (supported_models table + vendored code awareness).
The only remaining blocker is the merge conflict. Please rebase onto latest main and I'll approve.
```
# Conflicts:
#   vllm_omni/entrypoints/openai/serving_speech.py
```
@hsliuustc0106 PTAL, thanks!
```toml
[tool.ruff.lint.per-file-ignores]
"examples/**" = ["E501"]  # Allow long lines in examples
"tests/**" = ["E501"]  # Allow long lines in tests
"**/token2wav/**" = ["E501", "E721", "E741", "F401", "F403", "F405", "F841", "UP028"]  # Vendored third-party code
```
Please follow the new pipeline, now that #2383 is merged.
Also, please add a new recipe.
Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
Signed-off-by: Dnoob <dxpouo@gmail.com>
@hsliuustc0106 All addressed and CI is green. PTAL, thanks!
Resolve conflict.
Signed-off-by: Dnoob <dxpouo@gmail.com>
…itespace Signed-off-by: Dnoob <dxpouo@gmail.com>
The model to consider
Model Weights: https://huggingface.co/tencent/Covo-Audio-Chat
Model Code: https://github.com/Tencent/Covo-Audio
Model description
This PR adds support for Covo-Audio-Chat (Tencent, 7B end-to-end audio language model) with a 2-stage pipeline:
Stage 0 (fused_thinker_talker)
Stage 1 (code2wav)
Changes
- covo_audio.py (dual-stage router), covo_audio_llm.py (Stage 0), covo_audio_code2wav.py (Stage 1)
- config_covo_audio.py, stage YAML configs
- prompt_utils.py (centralized prompt templates and construction helpers)
- token2wav.py (consolidated from the upstream model repo into a single module, import paths adapted)
- .npy files
- supported_models.md

Fixes #2004
Test plan
E2E test (requires GPU with ≥20 GiB VRAM):
Online serving:
Test result
Environment