Merged
Commits
37 commits
eca14de
[New Model] Add support for tencent/Covo-Audio-Chat
Dnoob Mar 28, 2026
3c25d66
fix model path resolution and increase CI max_tokens for Covo-Audio-Chat
Dnoob Mar 30, 2026
f1b1b3f
[Bugfix] Add lazy import for torchdiffeq and espeak-ng skip condition…
Dnoob Apr 2, 2026
e1fb27f
[Feat] Add offline inference example and centralize prompts for Covo-…
Dnoob Apr 2, 2026
bbb8ef6
[Refactor] Consolidate token2wav package into single module
Dnoob Apr 4, 2026
9600536
[Refactor] Address review feedback for Covo-Audio-Chat
Dnoob Apr 4, 2026
5baae5d
[Fix] Update import path and simplify test for Covo-Audio-Chat
Dnoob Apr 4, 2026
56289a3
[Style] Fix ruff lint and formatting issues
Dnoob Apr 4, 2026
8345763
ci(covo-audio): add Covo-Audio-Chat e2e test to nightly buildkite
linyueqian Apr 5, 2026
bed6057
add online expansion test and consolidate nightly job
Dnoob Apr 8, 2026
5b79c4f
Merge branch 'main' into feat/covo-audio-chat
hsliuustc0106 Apr 9, 2026
2a3169e
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 13, 2026
89c004b
fix: resolve merge conflict syntax error in registry.py
Dnoob Apr 13, 2026
2978892
[CI] Fix pytest collection failure caused by covo_audio test
Dnoob Apr 14, 2026
ea21213
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 14, 2026
f8ba351
Merge remote-tracking branch 'origin/main' into feat/covo-audio-chat
Dnoob Apr 18, 2026
35dc318
remove token2wav ruff ignore rule
Dnoob Apr 20, 2026
b68220a
Merge remote-tracking branch 'origin/main' into feat/covo-audio-chat
Dnoob Apr 20, 2026
8f35f14
migrate covo_audio to new pipeline/deploy config schema
Dnoob Apr 20, 2026
d78b6ff
add recipe for covo_audio
Dnoob Apr 20, 2026
e15cef1
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 21, 2026
e018aec
fix ci import errors
Dnoob Apr 21, 2026
584034c
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 23, 2026
2e86a7d
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 23, 2026
275bde7
fix: replace librosa with vllm.multimodal helper and drop trailing wh…
Dnoob Apr 23, 2026
182bac9
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 25, 2026
65ef89a
Merge branch 'main' into feat/covo-audio-chat
Dnoob Apr 29, 2026
4566b39
Merge branch 'main' into feat/covo-audio-chat
Dnoob May 4, 2026
eaf9aed
Merge branch 'main' into feat/covo-audio-chat
Dnoob May 6, 2026
6460963
Fix trailing whitespace in serving speech
Dnoob May 6, 2026
b4ac81d
Merge branch 'main' into feat/covo-audio-chat
Dnoob May 11, 2026
6f739be
Merge branch 'main' into feat/covo-audio-chat
Dnoob May 11, 2026
ef04d81
examples: use bundled audio asset for Covo-Audio
Dnoob May 11, 2026
76261fe
ci: add Covo-Audio Buildkite coverage
Dnoob May 11, 2026
73c72c9
fix: update Covo-Audio deploy pipeline config
Dnoob May 11, 2026
f309c97
ci: mark Covo-Audio e2e tests as nightly coverage
Dnoob May 11, 2026
744fdff
Merge branch 'main' into feat/covo-audio-chat
Dnoob May 12, 2026
1 change: 1 addition & 0 deletions docs/models/supported_models.md
@@ -68,6 +68,7 @@ th {
| `HunyuanVideo15Pipeline` | HunyuanVideo-1.5-T2V | `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v`, `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v` | ✅︎ | ✅︎ | | |
| `HunyuanVideo15ImageToVideoPipeline` | HunyuanVideo-1.5-I2V | `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v`, `hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_i2v` | ✅︎ | ✅︎ | | |
| `VoxtralTTSForConditionalGeneration` | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | ✅︎ | ✅︎ | | |
| `CovoAudioForConditionalGeneration` | Covo-Audio-Chat | `tencent/Covo-Audio-Chat` | ✅︎ | | | |
Collaborator

nit: This PR includes online serving support (OpenAI-compatible client example + test_covo_audio_expansion.py), so the Online column should be ✅︎ instead of empty, to match the other models in this table.

Contributor Author

I don't see an "Online" column in this table, just NVIDIA GPU, AMD GPU, Ascend NPU, and Intel GPU. Which one did you mean?

|`DyninOmniForConditionalGeneration` | Dynin-Omni | `snu-aidas/Dynin-Omni` | ✅︎ | | | |
| `ErnieImagePipeline` | ERNIE-Image | `baidu/ERNIE-Image`, `baidu/ERNIE-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
✅︎ indicates the model is supported on that backend. Empty cells mean not listed as supported on that backend.
76 changes: 76 additions & 0 deletions examples/offline_inference/covo_audio/README.md
@@ -0,0 +1,76 @@
# Covo-Audio-Chat (Offline Inference)

## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

> **Note**
> Covo-Audio code2wav requires `torchdiffeq`. Install it with: `pip install torchdiffeq`

## Run examples

Change into the example directory:
```bash
cd examples/offline_inference/covo_audio
```

### Audio input chat

Using the default audio asset:
```bash
python end2end.py
```

Using a custom audio file:
```bash
python end2end.py --audio-path /path/to/audio.wav
```

Using a local model:
```bash
python end2end.py -m /path/to/Covo-Audio-Chat --output-dir ./my_output
```

### Command-line Arguments

| Argument | Short | Default | Description |
|----------|-------|---------|-------------|
| `--model-name` | `-m` | `tencent/Covo-Audio-Chat` | Model path or HuggingFace model ID |
| `--text` | `-t` | `请回答这段音频里的问题。` | Text prompt / question for the audio. The default translates to "Please answer the question in this audio." |
| `--audio-path` | `-a` | default audio asset | Path to local audio file |
| `--sampling-rate` | | `16000` | Sampling rate for audio loading (Hz) |
| `--output-dir` | | `./output_audio` | Output directory for generated files |
| `--num-prompts` | | `1` | Number of prompts to generate |
| `--stage-configs-path` | | (auto) | Path to stage configs YAML file |
| `--log-stats` | | `false` | Enable detailed statistics logging |
| `--stage-init-timeout` | | `300` | Stage initialization timeout (seconds) |
| `--batch-timeout` | | `5` | Batching timeout (seconds) |
| `--init-timeout` | | `300` | Overall initialization timeout (seconds) |
| `--shm-threshold-bytes` | | `65536` | Shared memory threshold (bytes) |
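
To run the example over several recordings, one option is to build one command per file from the flags in the table above. The sketch below is illustrative only: `build_commands` is a hypothetical helper, not part of this example, and it assumes the commands are executed from the example folder so that `end2end.py` resolves.

```python
import sys
from pathlib import Path


def build_commands(audio_dir, output_root):
    """Build one end2end.py command per .wav file found in audio_dir.

    Hypothetical helper: only the --audio-path and --output-dir flags
    documented in the table above are used.
    """
    commands = []
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        commands.append(
            [
                sys.executable,
                "end2end.py",
                "--audio-path",
                str(wav_path),
                # One output sub-directory per input file, named after the file.
                "--output-dir",
                str(Path(output_root) / wav_path.stem),
            ]
        )
    return commands
```

Each command can then be passed to `subprocess.run(cmd, check=True)`. Every invocation starts a new process and therefore reloads the model, so this is only practical for small batches.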

## Pipeline

Covo-Audio-Chat uses a 2-stage pipeline:

- **Stage 0 (fused_thinker_talker):** The 7B LLM generates interleaved text and audio tokens in a single autoregressive pass.
- **Stage 1 (code2wav):** A BigVGAN-based vocoder converts the extracted audio codes into a 24 kHz WAV waveform.

## Output

The script generates two files per request in the output directory:

- `{request_id}.txt` -- prompt and generated text
- `{request_id}.wav` -- generated audio (24 kHz WAV)
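
A quick way to sanity-check a generated file is to read its header with Python's standard `wave` module. This is a minimal sketch, assuming the file was written as integer PCM (the `wave` module cannot read floating-point WAV data); `describe_wav` is a hypothetical helper, not part of the example script.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
    return sample_rate, n_frames / sample_rate
```

For a file produced by this example, the reported sample rate should be 24000; replace the path with an actual `{request_id}.wav` from the output directory.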

## FAQ

If you encounter `ModuleNotFoundError: No module named 'librosa'`, install it with:
```bash
pip install librosa
```

## Environment

- GPU: 1x A100 (80 GiB)
- Stage 0 (7B LLM): ~16 GiB VRAM
- Stage 1 (BigVGAN vocoder): ~2 GiB VRAM
222 changes: 222 additions & 0 deletions examples/offline_inference/covo_audio/end2end.py
@@ -0,0 +1,222 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
This example shows how to use vLLM-Omni for running offline inference
with the correct prompt format on Covo-Audio-Chat.

Usage:
    python end2end.py --audio-path /path/to/audio.wav
"""

import os

import soundfile as sf
from vllm.assets.audio import AudioAsset
from vllm.multimodal.media.audio import load_audio
from vllm.sampling_params import SamplingParams
from vllm.utils.argparse_utils import FlexibleArgumentParser

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.model_executor.models.covo_audio.prompt_utils import (
    COVO_AUDIO_INPUT_PREFIX,
    build_covo_audio_chat_prompt,
)

SEED = 42


def get_audio_query(
    question: str | None = None,
    audio_path: str | None = None,
    sampling_rate: int = 16000,
) -> dict:
    if question is None:
        # Default prompt: "Please answer the question in this audio."
        question = "请回答这段音频里的问题。"
    user_content = COVO_AUDIO_INPUT_PREFIX + question
    prompt = build_covo_audio_chat_prompt(user_content)

    if audio_path is None:
        audio_data = AudioAsset("mary_had_lamb").audio_and_sample_rate
    else:
        import numpy as np

        audio_signal, sr = load_audio(audio_path, sr=sampling_rate)
        audio_data = (audio_signal.astype(np.float32), sr)

    return {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio_data},
        "modalities": ["audio"],
    }


def main(args):
    query_result = get_audio_query(
        question=args.text,
        audio_path=args.audio_path,
        sampling_rate=args.sampling_rate,
    )

    omni = Omni(
        model=args.model_name,
        stage_configs_path=args.stage_configs_path,
        log_stats=args.log_stats,
        stage_init_timeout=args.stage_init_timeout,
        batch_timeout=args.batch_timeout,
        init_timeout=args.init_timeout,
        shm_threshold_bytes=args.shm_threshold_bytes,
    )

    # Stage 0: fused_thinker_talker
    # stop_token_ids=[151645] (<|im_end|>) and ignore_eos=True are required
    # so the model generates interleaved text+audio tokens before stopping.
    thinker_sampling_params = SamplingParams(
        temperature=0.0,
        top_p=1.0,
        top_k=-1,
        max_tokens=2048,
        seed=SEED,
        detokenize=True,
        repetition_penalty=1.05,
        stop_token_ids=[151645],
        ignore_eos=True,
    )
    # Stage 1: code2wav (audio codes, not real token IDs, so skip detokenize)
    code2wav_sampling_params = SamplingParams(
        temperature=0.0,
        top_p=1.0,
        top_k=-1,
        max_tokens=2048,
        seed=SEED,
        detokenize=False,
        repetition_penalty=1.1,
    )

    sampling_params_list = [
        thinker_sampling_params,
        code2wav_sampling_params,
    ]

    prompts = [query_result for _ in range(args.num_prompts)]

    omni_outputs = omni.generate(prompts, sampling_params_list)

    output_dir = args.output_dir
    os.makedirs(output_dir, exist_ok=True)

    for stage_outputs in omni_outputs:
        output = stage_outputs.request_output
        if stage_outputs.final_output_type == "text":
            request_id = output.request_id
            text_output = output.outputs[0].text
            prompt_text = output.prompt
            out_txt = os.path.join(output_dir, f"{request_id}.txt")
            lines = [
                "Prompt:\n",
                str(prompt_text) + "\n",
                "vllm_text_output:\n",
                str(text_output).strip() + "\n",
            ]
            try:
                with open(out_txt, "w", encoding="utf-8") as f:
                    f.writelines(lines)
            except Exception as e:
                print(f"[Warn] Failed writing text file {out_txt}: {e}")
            print(f"Request ID: {request_id}, Text saved to {out_txt}")
        elif stage_outputs.final_output_type == "audio":
            request_id = output.request_id
            audio_tensor = output.outputs[0].multimodal_output.get("audio")
            if audio_tensor is None:
                continue
            output_wav = os.path.join(output_dir, f"{request_id}.wav")
            audio_numpy = audio_tensor.float().detach().cpu().numpy()
            if audio_numpy.ndim > 1:
                audio_numpy = audio_numpy.flatten()
            sf.write(output_wav, audio_numpy, samplerate=24000, format="WAV")
            print(f"Request ID: {request_id}, Audio saved to {output_wav}")

    omni.close()


def parse_args():
    parser = FlexibleArgumentParser(description="Offline inference demo for Covo-Audio-Chat")
    parser.add_argument(
        "--model-name",
        "-m",
        type=str,
        default="tencent/Covo-Audio-Chat",
        help="Model path or HuggingFace model ID.",
    )
    parser.add_argument(
        "--text",
        "-t",
        type=str,
        default=None,
        help="Text prompt / question for the audio.",
    )
    parser.add_argument(
        "--audio-path",
        "-a",
        type=str,
        default=None,
        help="Path to local audio file. Uses default asset if not provided.",
    )
    parser.add_argument(
        "--sampling-rate",
        type=int,
        default=16000,
        help="Sampling rate for audio loading (default: 16000).",
    )
    parser.add_argument(
        "--stage-configs-path",
        type=str,
        default=None,
        help="Path to stage configs YAML file.",
    )
    parser.add_argument(
        "--log-stats",
        action="store_true",
        default=False,
        help="Enable writing detailed statistics.",
    )
    parser.add_argument(
        "--stage-init-timeout",
        type=int,
        default=300,
        help="Timeout for initializing a single stage in seconds.",
    )
    parser.add_argument(
        "--batch-timeout",
        type=int,
        default=5,
        help="Timeout for batching in seconds.",
    )
    parser.add_argument(
        "--init-timeout",
        type=int,
        default=300,
        help="Timeout for initializing stages in seconds.",
    )
    parser.add_argument(
        "--shm-threshold-bytes",
        type=int,
        default=65536,
        help="Threshold for using shared memory in bytes.",
    )
    parser.add_argument(
        "--output-dir",
        default="./output_audio",
        help="Output directory for generated files.",
    )
    parser.add_argument(
        "--num-prompts",
        type=int,
        default=1,
        help="Number of prompts to generate.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args)