[New Model] Add support for tencent/Covo-Audio-Chat #2293
Merged
37 commits
- eca14de [New Model] Add support for tencent/Covo-Audio-Chat (Dnoob)
- 3c25d66 fix model path resolution and increase CI max_tokens for Covo-Audio-Chat (Dnoob)
- f1b1b3f [Bugfix] Add lazy import for torchdiffeq and espeak-ng skip condition… (Dnoob)
- e1fb27f [Feat] Add offline inference example and centralize prompts for Covo-… (Dnoob)
- bbb8ef6 [Refactor] Consolidate token2wav package into single module (Dnoob)
- 9600536 [Refactor] Address review feedback for Covo-Audio-Chat (Dnoob)
- 5baae5d [Fix] Update import path and simplify test for Covo-Audio-Chat (Dnoob)
- 56289a3 [Style] Fix ruff lint and formatting issues (Dnoob)
- 8345763 ci(covo-audio): add Covo-Audio-Chat e2e test to nightly buildkite (linyueqian)
- bed6057 add online expansion test and consolidate nightly job (Dnoob)
- 5b79c4f Merge branch 'main' into feat/covo-audio-chat (hsliuustc0106)
- 2a3169e Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 89c004b fix: resolve merge conflict syntax error in registry.py (Dnoob)
- 2978892 [CI] Fix pytest collection failure caused by covo_audio test (Dnoob)
- ea21213 Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- f8ba351 Merge remote-tracking branch 'origin/main' into feat/covo-audio-chat (Dnoob)
- 35dc318 remove token2wav ruff ignore rule (Dnoob)
- b68220a Merge remote-tracking branch 'origin/main' into feat/covo-audio-chat (Dnoob)
- 8f35f14 migrate covo_audio to new pipeline/deploy config schema (Dnoob)
- d78b6ff add recipe for covo_audio (Dnoob)
- e15cef1 Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- e018aec fix ci import errors (Dnoob)
- 584034c Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 2e86a7d Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 275bde7 fix: replace librosa with vllm.multimodal helper and drop trailing wh… (Dnoob)
- 182bac9 Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 65ef89a Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 4566b39 Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- eaf9aed Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 6460963 Fix trailing whitespace in serving speech (Dnoob)
- b4ac81d Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- 6f739be Merge branch 'main' into feat/covo-audio-chat (Dnoob)
- ef04d81 examples: use bundled audio asset for Covo-Audio (Dnoob)
- 76261fe ci: add Covo-Audio Buildkite coverage (Dnoob)
- 73c72c9 fix: update Covo-Audio deploy pipeline config (Dnoob)
- f309c97 ci: mark Covo-Audio e2e tests as nightly coverage (Dnoob)
- 744fdff Merge branch 'main' into feat/covo-audio-chat (Dnoob)
@@ -0,0 +1,76 @@
+# Covo-Audio-Chat (Offline Inference)
+
+## Setup
+
+Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
+
+> **Note**
+> Covo-Audio code2wav requires `torchdiffeq`. Install it with: `pip install torchdiffeq`
+
+## Run examples
+
+Get into the example folder:
+```bash
+cd examples/offline_inference/covo_audio
+```
+
+### Audio input chat
+
+Using the default audio asset:
+```bash
+python end2end.py
+```
+
+Using a custom audio file:
+```bash
+python end2end.py --audio-path /path/to/audio.wav
+```
+
+Using a local model:
+```bash
+python end2end.py -m /path/to/Covo-Audio-Chat --output-dir ./my_output
+```
+
+### Command-line Arguments
+
+| Argument | Short | Default | Description |
+|----------|-------|---------|-------------|
+| `--model-name` | `-m` | `tencent/Covo-Audio-Chat` | Model path or HuggingFace model ID |
+| `--text` | `-t` | `请回答这段音频里的问题。` | Text prompt / question for the audio |
+| `--audio-path` | `-a` | default audio asset | Path to local audio file |
+| `--sampling-rate` | | `16000` | Sampling rate for audio loading (Hz) |
+| `--output-dir` | | `./output_audio` | Output directory for generated files |
+| `--num-prompts` | | `1` | Number of prompts to generate |
+| `--stage-configs-path` | | (auto) | Path to stage configs YAML file |
+| `--log-stats` | | `false` | Enable detailed statistics logging |
+| `--stage-init-timeout` | | `300` | Stage initialization timeout (seconds) |
+| `--batch-timeout` | | `5` | Batching timeout (seconds) |
+| `--init-timeout` | | `300` | Overall initialization timeout (seconds) |
+| `--shm-threshold-bytes` | | `65536` | Shared memory threshold (bytes) |
+
+## Pipeline
+
+Covo-Audio-Chat uses a 2-stage pipeline:
+
+- **Stage 0 (fused_thinker_talker):** The 7B LLM generates interleaved text and audio tokens in a single autoregressive pass.
+- **Stage 1 (code2wav):** A BigVGAN-based vocoder converts the extracted audio codes into a 24kHz WAV waveform.
+
+## Output
+
+The script generates two files per request in the output directory:
+
+- `{request_id}.txt` -- prompt and generated text
+- `{request_id}.wav` -- generated audio (24kHz WAV)
+
+## FAQ
+
+If you encounter `ModuleNotFoundError: No module named 'librosa'`, install it with:
+```bash
+pip install librosa
+```
+
+## Environment
+
+- GPU: 1x A100 (80 GiB)
+- Stage 0 (7B LLM): ~16 GiB VRAM
+- Stage 1 (BigVGAN vocoder): ~2 GiB VRAM
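The README in this diff says each request produces a 24 kHz WAV file. A quick way to sanity-check that after a run is to read the WAV header. The sketch below is not part of the PR: `wav_info` and `write_test_tone` are hypothetical helper names, it uses only the Python standard library, and it assumes the file is plain PCM (to my knowledge the default when `soundfile` writes WAV, but worth verifying). A synthetic tone stands in for real model output so the check can run without a GPU.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        frames = wf.getnframes()
    return rate, channels, frames / rate


def write_test_tone(path, rate=24000, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone (hypothetical stand-in for model output)."""
    n = int(rate * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)
    )
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.wav"
        write_test_tone(demo)
        rate, channels, duration = wav_info(demo)
        print(f"{rate} Hz, {channels} channel(s), {duration:.2f} s")
```

To check real output, point `wav_info` at a file such as `./output_audio/<request_id>.wav` and compare the reported rate against 24000.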
@@ -0,0 +1,222 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+This example shows how to use vLLM-Omni for running offline inference
+with the correct prompt format on Covo-Audio-Chat.
+
+Usage:
+    python end2end.py --audio-path /path/to/audio.wav
+"""
+
+import os
+
+import soundfile as sf
+from vllm.assets.audio import AudioAsset
+from vllm.multimodal.media.audio import load_audio
+from vllm.sampling_params import SamplingParams
+from vllm.utils.argparse_utils import FlexibleArgumentParser
+
+from vllm_omni.entrypoints.omni import Omni
+from vllm_omni.model_executor.models.covo_audio.prompt_utils import (
+    COVO_AUDIO_INPUT_PREFIX,
+    build_covo_audio_chat_prompt,
+)
+
+SEED = 42
+
+
+def get_audio_query(
+    question: str | None = None,
+    audio_path: str | None = None,
+    sampling_rate: int = 16000,
+) -> dict:
+    if question is None:
+        question = "请回答这段音频里的问题。"
+    user_content = COVO_AUDIO_INPUT_PREFIX + question
+    prompt = build_covo_audio_chat_prompt(user_content)
+
+    if audio_path is None:
+        audio_data = AudioAsset("mary_had_lamb").audio_and_sample_rate
+    else:
+        import numpy as np
+
+        audio_signal, sr = load_audio(audio_path, sr=sampling_rate)
+        audio_data = (audio_signal.astype(np.float32), sr)
+
+    return {
+        "prompt": prompt,
+        "multi_modal_data": {"audio": audio_data},
+        "modalities": ["audio"],
+    }
+
+
+def main(args):
+    query_result = get_audio_query(
+        question=args.text,
+        audio_path=args.audio_path,
+        sampling_rate=args.sampling_rate,
+    )
+
+    omni = Omni(
+        model=args.model_name,
+        stage_configs_path=args.stage_configs_path,
+        log_stats=args.log_stats,
+        stage_init_timeout=args.stage_init_timeout,
+        batch_timeout=args.batch_timeout,
+        init_timeout=args.init_timeout,
+        shm_threshold_bytes=args.shm_threshold_bytes,
+    )
+
+    # Stage 0: fused_thinker_talker
+    # stop_token_ids=[151645] (<|im_end|>) and ignore_eos=True are required
+    # so the model generates interleaved text+audio tokens before stopping.
+    thinker_sampling_params = SamplingParams(
+        temperature=0.0,
+        top_p=1.0,
+        top_k=-1,
+        max_tokens=2048,
+        seed=SEED,
+        detokenize=True,
+        repetition_penalty=1.05,
+        stop_token_ids=[151645],
+        ignore_eos=True,
+    )
+    # Stage 1: code2wav (audio codes, not real token IDs — skip detokenize)
+    code2wav_sampling_params = SamplingParams(
+        temperature=0.0,
+        top_p=1.0,
+        top_k=-1,
+        max_tokens=2048,
+        seed=SEED,
+        detokenize=False,
+        repetition_penalty=1.1,
+    )
+
+    sampling_params_list = [
+        thinker_sampling_params,
+        code2wav_sampling_params,
+    ]
+
+    prompts = [query_result for _ in range(args.num_prompts)]
+
+    omni_outputs = omni.generate(prompts, sampling_params_list)
+
+    output_dir = args.output_dir
+    os.makedirs(output_dir, exist_ok=True)
+
+    for stage_outputs in omni_outputs:
+        output = stage_outputs.request_output
+        if stage_outputs.final_output_type == "text":
+            request_id = output.request_id
+            text_output = output.outputs[0].text
+            prompt_text = output.prompt
+            out_txt = os.path.join(output_dir, f"{request_id}.txt")
+            lines = [
+                "Prompt:\n",
+                str(prompt_text) + "\n",
+                "vllm_text_output:\n",
+                str(text_output).strip() + "\n",
+            ]
+            try:
+                with open(out_txt, "w", encoding="utf-8") as f:
+                    f.writelines(lines)
+            except Exception as e:
+                print(f"[Warn] Failed writing text file {out_txt}: {e}")
+            print(f"Request ID: {request_id}, Text saved to {out_txt}")
+        elif stage_outputs.final_output_type == "audio":
+            request_id = output.request_id
+            audio_tensor = output.outputs[0].multimodal_output.get("audio")
+            if audio_tensor is None:
+                continue
+            output_wav = os.path.join(output_dir, f"{request_id}.wav")
+            audio_numpy = audio_tensor.float().detach().cpu().numpy()
+            if audio_numpy.ndim > 1:
+                audio_numpy = audio_numpy.flatten()
+            sf.write(output_wav, audio_numpy, samplerate=24000, format="WAV")
+            print(f"Request ID: {request_id}, Audio saved to {output_wav}")
+
+    omni.close()
+
+
+def parse_args():
+    parser = FlexibleArgumentParser(description="Offline inference demo for Covo-Audio-Chat")
+    parser.add_argument(
+        "--model-name",
+        "-m",
+        type=str,
+        default="tencent/Covo-Audio-Chat",
+        help="Model path or HuggingFace model ID.",
+    )
+    parser.add_argument(
+        "--text",
+        "-t",
+        type=str,
+        default=None,
+        help="Text prompt / question for the audio.",
+    )
+    parser.add_argument(
+        "--audio-path",
+        "-a",
+        type=str,
+        default=None,
+        help="Path to local audio file. Uses default asset if not provided.",
+    )
+    parser.add_argument(
+        "--sampling-rate",
+        type=int,
+        default=16000,
+        help="Sampling rate for audio loading (default: 16000).",
+    )
+    parser.add_argument(
+        "--stage-configs-path",
+        type=str,
+        default=None,
+        help="Path to stage configs YAML file.",
+    )
+    parser.add_argument(
+        "--log-stats",
+        action="store_true",
+        default=False,
+        help="Enable writing detailed statistics.",
+    )
+    parser.add_argument(
+        "--stage-init-timeout",
+        type=int,
+        default=300,
+        help="Timeout for initializing a single stage in seconds.",
+    )
+    parser.add_argument(
+        "--batch-timeout",
+        type=int,
+        default=5,
+        help="Timeout for batching in seconds.",
+    )
+    parser.add_argument(
+        "--init-timeout",
+        type=int,
+        default=300,
+        help="Timeout for initializing stages in seconds.",
+    )
+    parser.add_argument(
+        "--shm-threshold-bytes",
+        type=int,
+        default=65536,
+        help="Threshold for using shared memory in bytes.",
+    )
+    parser.add_argument(
+        "--output-dir",
+        default="./output_audio",
+        help="Output directory for generated files.",
+    )
+    parser.add_argument(
+        "--num-prompts",
+        type=int,
+        default=1,
+        help="Number of prompts to generate.",
+    )
+    return parser.parse_args()
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    main(args)
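The Stage 0 comment in `end2end.py` says `stop_token_ids=[151645]` together with `ignore_eos=True` is needed so the interleaved text and audio tokens are produced before generation stops. One way to read that is sketched below as a toy stop check. This is an illustration only, not vLLM's actual stop logic: `generate_until_stop` is a hypothetical function, and both the `EOS` value and the token stream are made up for the demo. Only the `<|im_end|>` ID (151645) comes from the diff.

```python
IM_END = 151645  # <|im_end|>, taken from the example's stop_token_ids
EOS = 151643     # hypothetical EOS ID, assumed only for this illustration


def generate_until_stop(token_stream, stop_token_ids, eos_token_id, ignore_eos):
    """Toy stop check: return the tokens kept before generation would end.

    The terminating token itself is not included in the result.
    """
    kept = []
    for tok in token_stream:
        if tok in stop_token_ids:
            break  # explicit stop tokens always end generation
        if tok == eos_token_id and not ignore_eos:
            break  # EOS ends generation only when it is not ignored
        kept.append(tok)
    return kept


# Made-up stream: two text tokens, an early EOS, two "audio" tokens, then <|im_end|>.
stream = [10, 11, EOS, 900001, 900002, IM_END, 12]

print(generate_until_stop(stream, {IM_END}, EOS, ignore_eos=True))
# -> [10, 11, 151643, 900001, 900002]
print(generate_until_stop(stream, {IM_END}, EOS, ignore_eos=False))
# -> [10, 11]
```

Under this reading, honoring EOS would cut the sequence before the audio tokens appear, while ignoring EOS and stopping only on `<|im_end|>` lets the full interleaved sequence through.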
nit: This PR includes online serving support (OpenAI-compatible client example + test_covo_audio_expansion.py), so the Online column should be ✅︎ instead of empty, to match the other models in this table.
I don't see an "Online" column in this table, just NVIDIA GPU, AMD GPU, Ascend NPU, and Intel GPU. Which one did you mean?