[Feature] : Support disaggregated inference pipeline for Qwen3_TTS#1161

Merged
hsliuustc0106 merged 30 commits into
vllm-project:mainfrom
Sy0307:dev/tts_disaggregation
Feb 20, 2026

Conversation

@Sy0307
Contributor

@Sy0307 Sy0307 commented Feb 2, 2026

Support disaggregated inference pipeline for Talker and SpeechTokenizer.

qwen3_tts_talker_ar.py: Stage-0 main model. Implements vLLM-native AR Talker that autoregressively generates layer-0 codec tokens step by step. At each step, it uses the embedded code_predictor to complete residual codebooks (1..Q-1) to form complete audio_codes. Supports preprocess/postprocess/talker_mtp, covering three task types: CustomVoice, VoiceDesign, and Base.

qwen3_tts_code_predictor_vllm.py: Residual code prediction sub-model. A lightweight Transformer that maintains its own _LocalPredictorKVCache (independent of the vLLM engine KV cache) and exposes a two-stage interface, prefill_logits and decode_logits, which talker_mtp calls at each step.

qwen3_tts_code2wav.py: Stage-1 model (standard mode). Receives frame-aligned codec tokens and decodes them into waveforms through SpeechTokenizer. Supports both streaming (with left context concatenation) and non-streaming modes. Pure generation phase with no logits/sampling.

qwen3_tts_disaggregated.py: Stage-1 model (async_chunk streaming mode). A phase-aware disaggregated wrapper that completes codec-to-waveform decoding in the preprocess stage, adapting to vLLM's async_chunk streaming scheduler.
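
A rough sketch of how one Talker step might assemble a full frame, as described for qwen3_tts_talker_ar.py above. It assumes greedy selection and two hypothetical callables standing in for the predictor's prefill_logits and decode_logits; the real signatures, sampling strategy, and tensor types are not shown in this description.

```python
from typing import Callable, List, Sequence


def argmax(logits: Sequence[float]) -> int:
    """Index of the largest logit (greedy choice, an assumption of this sketch)."""
    return max(range(len(logits)), key=logits.__getitem__)


def complete_frame(
    layer0_code: int,
    prefill_logits: Callable[[int], Sequence[float]],      # hypothetical signature
    decode_logits: Callable[[int, int], Sequence[float]],  # hypothetical signature
    num_code_groups: int,
) -> List[int]:
    """Build one frame of Q codes from the Talker's layer-0 code."""
    if num_code_groups < 2:
        raise ValueError("need at least one residual codebook")
    codes = [layer0_code]
    # The prefill call yields logits for residual codebook 1.
    codes.append(argmax(prefill_logits(layer0_code)))
    # Each decode step yields the next residual codebook (2..Q-1).
    for step in range(1, num_code_groups - 1):
        codes.append(argmax(decode_logits(codes[-1], step)))
    return codes
```

With 16 code groups this performs 1 prefill call and 14 decode calls per Talker step.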

Related issues:
#938
#976

Thanks for advice from @JuanPZuluaga and @gcanlin .

CC @gcanlin for preliminary code review.

Purpose

Test Plan

e2e tests in test_qwen3_tts.py & self e2e tests.

Test Result

.venv/bin/python -m pytest tests/e2e/online_serving/test_qwen3_tts.py -v -p no:skip
============================ test session starts ============================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /home/sy03/bpftime_sy03/vllm-omni/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/sy03/bpftime_sy03/vllm-omni
configfile: pyproject.toml
plugins: cov-7.0.0, anyio-4.12.1
collected 6 items                                                           

tests/e2e/online_serving/test_qwen3_tts.py::TestQwen3TTSCustomVoice::test_speech_english_basic PASSED [ 16%]
tests/e2e/online_serving/test_qwen3_tts.py::TestQwen3TTSCustomVoice::test_speech_chinese_basic PASSED [ 33%]
tests/e2e/online_serving/test_qwen3_tts.py::TestQwen3TTSCustomVoice::test_speech_different_voices PASSED [ 50%]
tests/e2e/online_serving/test_qwen3_tts.py::TestQwen3TTSCustomVoice::test_speech_binary_response_not_utf8_error PASSED [ 66%]
tests/e2e/online_serving/test_qwen3_tts.py::TestQwen3TTSAPIEndpoints::test_list_voices_endpoint PASSED [ 83%]
tests/e2e/online_serving/test_qwen3_tts.py::TestQwen3TTSAPIEndpoints::test_models_endpoint PASSED [100%]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner February 2, 2026 19:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b17195c6c

Comment on lines +122 to +123
else:
break

P2 Badge Avoid stalling entire waiting queue on missing first chunk

When omni_connector is enabled and the head waiting request has no initial chunk, the scheduler executes break and exits the waiting loop entirely. That means subsequent waiting requests (including non-connector or already-ready ones) are not scheduled in this cycle, creating head‑of‑line blocking and potential throughput collapse if the first request’s upstream chunk is delayed. This is a behavior change from placeholder scheduling and can stall unrelated traffic until the first chunk arrives. Consider skipping only that request (e.g., move it to skipped_waiting_requests) instead of breaking the whole waiting loop.
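
A minimal sketch of the suggested behavior, assuming a deque-based waiting queue and a hypothetical has_first_chunk predicate (neither is taken from the actual scheduler): a request whose first chunk is missing is parked, and the loop keeps scanning instead of breaking.

```python
from collections import deque


def schedule_waiting(waiting, has_first_chunk):
    """Run one scheduling cycle; return (scheduled, still_waiting)."""
    scheduled = []
    skipped = deque()
    while waiting:
        request = waiting.popleft()
        if not has_first_chunk(request):
            # Park only this request; later requests still get a chance.
            skipped.append(request)
            continue
        scheduled.append(request)
    # Put parked requests back at the front, preserving their order.
    waiting.extendleft(reversed(skipped))
    return scheduled, waiting
```

Requests behind a stalled head are scheduled in the same cycle, and the stalled request is retried on the next one.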

Comment on lines +50 to +54
if finished:
return {
"code_predictor_codes": [],
"codec_streaming": codec_streaming,
"finished": torch.tensor(True, dtype=torch.bool),

P2 Badge Don't drop final audio frame when request finishes

The async-chunk adapter returns early when request.is_finished() is true, emitting an empty code_predictor_codes payload and skipping any audio_codes present in the same step. If the request becomes "finished" on the step that produces the last codec frame (common in AR decoding), the final frame is discarded, which will truncate the downstream SpeechTokenizer input. Consider checking/forwarding audio_codes even when finished, or only treating the stop step as finished when no frame is produced.
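
One possible shape for the fix, sketched with plain Python values: the frame is extracted before the finished flag is considered, so a frame produced on the stop step is still forwarded. The function name is hypothetical, and the real adapter returns a torch tensor for "finished" rather than a bool.

```python
def build_chunk_payload(audio_codes, finished, codec_streaming=True):
    """Build the per-step payload without dropping a final frame."""
    # Extract the frame first; an empty or missing frame becomes [].
    codes = list(audio_codes) if audio_codes else []
    return {
        "code_predictor_codes": codes,
        "codec_streaming": codec_streaming,
        # Plain bool here; the adapter in the PR uses a torch bool tensor.
        "finished": bool(finished),
    }
```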

@gcanlin
Collaborator

gcanlin commented Feb 3, 2026

Good job! How about performance for the current implementation?

@gcanlin
Collaborator

gcanlin commented Feb 3, 2026

@tsdocode Could you please help review? The performance of your vLLM implementation is impressive, so I think we can work together on this PR. Alternatively, once we have an initial disaggregated implementation, you could integrate your optimizations on top of it in a follow-up PR.

Comment thread vllm_omni/config/model.py
except Exception:
fps = None
if fps is not None and fps > 0:
self.codec_frame_rate_hz = fps
Collaborator

Here config seems to be too model-specific. How about moving it into configuration_qwen3_tts.py like:

    @property
    def codec_frame_rate_hz(self):
        ...

And in model.py, maybe consider the following generic config:

        if self.codec_frame_rate_hz is None:
            self.codec_frame_rate_hz = getattr(self.hf_config, "codec_frame_rate_hz", None)
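
A small sketch of that split, using stand-in classes rather than the real config types (the class names and the _frame_rate field are hypothetical): the model-specific config exposes the property, and the generic config only reads it via getattr.

```python
class Qwen3TTSConfigSketch:
    """Stand-in for the model-specific HF config."""

    def __init__(self, frame_rate=None):
        self._frame_rate = frame_rate  # hypothetical source of the frame rate

    @property
    def codec_frame_rate_hz(self):
        fps = self._frame_rate
        # Only positive values count as a usable frame rate.
        return float(fps) if fps is not None and fps > 0 else None


class ModelConfigSketch:
    """Stand-in for the generic model config in model.py."""

    def __init__(self, hf_config, codec_frame_rate_hz=None):
        self.hf_config = hf_config
        self.codec_frame_rate_hz = codec_frame_rate_hz
        if self.codec_frame_rate_hz is None:
            # Generic lookup: works for any hf_config, with or without the property.
            self.codec_frame_rate_hz = getattr(self.hf_config, "codec_frame_rate_hz", None)
```

An explicitly supplied value wins, and configs that lack the property simply leave the field as None.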

# per-request additional_information.
self.talker_mtp_output_key = "audio_codes"

self.model = Qwen3Model(vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model"))

One small thing I noticed: the Talker model uses a multimodal RoPE embedding, so using the original Qwen3Model RoPE will cause a slight difference in the model output, which can accumulate as error in the code predictor model.

Contributor Author

I will verify it later. Thanks :D

@tsdocode

tsdocode commented Feb 4, 2026

@Sy0307 @gcanlin

This is my first time looking at vllm-omni so please correct me if I'm wrong, after reading the PR my understanding is:

  • This PR aims to separate the AR part (Talker + CodePredictor) from the code2vec part
  • The Talker model is set up to use the Qwen3 model running on the existing vLLM setup (KV cache, etc.), while the predictor has an independent KV cache

My question is:

  • Does the current Talker implementation support continuous batching or PagedAttention (PagedKVCache)?
  • Can we fully isolate the Talker and Predictor? Since the Talker generates 1 step per cycle while the Predictor runs 1 prefill + 14 decode steps, the current unified approach seems to bottleneck at the Predictor.
  • The implementation references specific hooks required by vllm-omni. Is there documentation available listing all required hooks?

@Sy0307
Contributor Author

Sy0307 commented Feb 4, 2026

@Sy0307 @gcanlin

This is my first time looking at vllm-omni so please correct me if I'm wrong, after reading the PR my understanding is:

  • This PR aims to separate the AR part (Talker + CodePredictor) from the code2vec part
  • The Talker model is set up to use the Qwen3 model running on the existing vLLM setup (KV cache, etc.), while the predictor has an independent KV cache

My question is:

  • Does the current Talker implementation support continuous batching or PagedAttention (PagedKVCache)?
  • Can we fully isolate the Talker and Predictor? Since the Talker generates 1 step per cycle while the Predictor runs 1 prefill + 14 decode steps, the current unified approach seems to bottleneck at the Predictor.
  • The implementation references specific hooks required by vllm-omni. Is there documentation available listing all required hooks?

1. PagedAttention / PagedKVCache is now supported. However, I still need to test continuous batching. I'll test it once my new machine arrives, but you are welcome to test it yourself in the meantime by modifying the settings in the stage config.
2. The Talker and Predictor are not fully isolated yet. Please refer to the comment here for details: https://github.com/vllm-project/vllm-omni/issues/976#issuecomment-3833386852
3. Please review stage_configs.md and adding_omni_model.md.

@Sy0307
Contributor Author

Sy0307 commented Feb 4, 2026

Good job! How about performance for the current implementation?

The preliminary test results on 5090 are as follows:

2-stage (disaggregated), CCU=1: RTF ≈ 0.136
Legacy single-stage, CCU=1: RTF ≈ 0.497

Are there any other metrics worth paying attention to? I can run further tests when I am free. More comments on the implementation are welcome. I will add more design notes tomorrow if possible.

@tsdocode

tsdocode commented Feb 4, 2026

Good job! How about performance for the current implementation?

The preliminary test results on 5090 are as follows:

2-stage (disaggregated), CCU=1: RTF ≈ 0.136
Legacy single-stage, CCU=1: RTF ≈ 0.497

Are there any other metrics worth paying attention to? I can run further tests when I am free. More comments on the implementation are welcome. I will add more design notes tomorrow if possible.

Good job! I think we just need one or more correctness tests, such as fixing the random state and checking whether the output differs between the original implementation and the new one.

@Sy0307 Sy0307 force-pushed the dev/tts_disaggregation branch 2 times, most recently from aaee00b to 38bad84 Compare February 9, 2026 14:01
@tzhouam
Collaborator

tzhouam commented Feb 10, 2026

Please fix the precommit, thanks

@Gaohan123 Gaohan123 added this to the v0.16.0 milestone Feb 10, 2026
@Gaohan123
Collaborator

Please modify your PR title.

@Sy0307 Sy0307 changed the title [WIP] [Feat] : Support disaggregated inference pipeline for Qwen3_TTS [Feature] : Support disaggregated inference pipeline for Qwen3_TTS Feb 10, 2026
@Sy0307 Sy0307 mentioned this pull request Feb 10, 2026
5 tasks
@Sy0307
Contributor Author

Sy0307 commented Feb 10, 2026

I've fixed some bugs and cleaned up the code. It's now ready for review. Feel free to provide any feedback or raise any questions, and I'll respond and make fixes when I have time.

Additionally, I apologize for the previous RTF statistics, which seem to have some errors. I introduced a large number of samples, but some of the generated audio was incorrect, causing the results to deviate from normal values. My preliminary measurements show that the 2-stage RTF should be around 0.30. I will supplement the testing later.

cc @gcanlin @tsdocode @linyueqian

Collaborator

@gcanlin gcanlin left a comment

Could we remove modeling_qwen3_tts.py now?

Comment thread vllm_omni/model_executor/models/registry.py Outdated
Comment thread vllm_omni/model_executor/models/registry.py Outdated
@Sy0307
Contributor Author

Sy0307 commented Feb 10, 2026

Could we remove modeling_qwen3_tts.py now?

Yes, I have removed legacy code using one single stage for Qwen3_TTS. I also removed qwen3_tts.py as it is out-dated.

Contributor

Copilot AI left a comment

Pull request overview

Adds a vLLM-native, disaggregated (2-stage) inference pipeline for Qwen3-TTS, enabling Stage-0 AR “Talker” codec generation and Stage-1 “Code2Wav” waveform decoding (including async_chunk streaming integration).

Changes:

  • Introduces new Qwen3-TTS stage-specific model implementations: AR Talker + local-KV-cache residual code predictor + Code2Wav decoder.
  • Adds/updates stage configs and stage-input processor to stream codec frames across stages via connectors (async_chunk).
  • Updates runner/scheduler/connector + OpenAI speech endpoint plumbing to support codec-token prompting and per-chunk metadata propagation.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 17 comments.

| File | Description |
| --- | --- |
| vllm_omni/worker/gpu_model_runner.py | Generalizes GPU-side talker_mtp handling for TTS and adjusts preprocess/embed plumbing for stage execution. |
| vllm_omni/model_executor/stage_input_processors/qwen3_tts.py | New async_chunk adapter that windows codec frames and packs them for Stage-1 decoding. |
| vllm_omni/model_executor/stage_configs/qwen3_tts.yaml | Updates default Qwen3-TTS stage config to a 2-stage async_chunk pipeline with shared-memory connector bindings. |
| vllm_omni/model_executor/stage_configs/qwen3_tts_talker_speech_tokenizer_async_chunk.yaml | Adds a dedicated async_chunk stage config variant for Talker→SpeechTokenizer decoding. |
| vllm_omni/model_executor/models/registry.py | Switches registry entries to stage-specific Qwen3-TTS architectures (Talker / Code2Wav). |
| vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_tokenizer.py | Adds optional feature-extractor loading and relaxes encode typing for stage-specific usage. |
| vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py | New vLLM-native AR Talker model with preprocess/postprocess and GPU-side talker_mtp residual prediction. |
| vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code_predictor_vllm.py | New lightweight residual code predictor with an independent local KV cache and prefill/decode logits APIs. |
| vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code2wav.py | New Stage-1 “decode-only” model that turns codec windows into waveform via SpeechTokenizer. |
| vllm_omni/model_executor/models/qwen3_tts/qwen3_tts.py | Removes the previous HF-style monolithic generation wrapper. |
| vllm_omni/model_executor/models/qwen3_tts/processing_qwen3_tts.py | Removes HF processor wrapper (no longer used by the new vLLM-native pipeline). |
| vllm_omni/model_executor/models/qwen3_tts/modeling_qwen3_tts.py | Removes the large HF reference modeling implementation. |
| vllm_omni/model_executor/models/qwen3_tts/configuration_qwen3_tts.py | Adjusts dummy vision token IDs and stage-specific config behavior (mrope stripping for Code2Wav). |
| vllm_omni/entrypoints/openai/serving_speech.py | Switches TTS requests to prompt_token_ids placeholders + prompt-length estimation; adds streaming chunk concat. |
| vllm_omni/entrypoints/async_omni.py | Ensures next-stage placeholder prompt length is at least 1 token. |
| vllm_omni/distributed/omni_connectors/adapter.py | Updates chunk ingestion to support Code2Wav header + codec code windows and forwards per-chunk metadata. |
| vllm_omni/core/sched/omni_generation_scheduler.py | async_chunk-aware scheduling: only pull new chunks once prior tokens are consumed; propagate per-step additional_information. |
| vllm_omni/config/model.py | Adds codec_frame_rate_hz plumbing from HF config into Omni model config. |

"Qwen3TTSModelForGeneration",
"qwen3_tts_code2wav",
"Qwen3TTSCode2Wav",
),
Copilot AI Feb 10, 2026

The registry drops Qwen3TTSForConditionalGeneration in favor of stage-specific Qwen3TTSTalkerForConditionalGeneration/Qwen3TTSCode2Wav, but there are still repo configs referencing the old architecture (e.g. vllm_omni/platforms/npu/stage_configs/qwen3_tts.yaml uses model_arch: Qwen3TTSForConditionalGeneration). Those configs will no longer resolve; please update them (or provide an alias) to avoid broken non-GPU stage configs.

Suggested change
),
),
# Backward-compatible alias for configs still using the generic Qwen3 TTS arch name.
"Qwen3TTSForConditionalGeneration": (
"qwen3_tts",
"qwen3_tts_talker",
"Qwen3TTSTalkerForConditionalGeneration",
),

Copilot uses AI. Check for mistakes.
Comment on lines +403 to +408
"""Decode one new token for residual group `generation_step` (1..Q-1)."""
self._maybe_init_kv_cache(input_ids.device)
assert self._kv_cache is not None
bsz = int(input_ids.shape[0])
if generation_step <= 0:
raise ValueError("generation_step must be >= 1 for decode_logits")
Copilot AI Feb 10, 2026

decode_logits is documented as accepting generation_step in the range (1..Q-1), but the implementation indexes self.lm_head[generation_step] and embed_idx = generation_step - 1. With the current prefill+decode calling pattern, passing generation_step == len(self.lm_head) would raise an IndexError. Please either tighten the docstring to the actually supported range and/or add an explicit upper-bound check to fail with a clear error message.

Suggested change
"""Decode one new token for residual group `generation_step` (1..Q-1)."""
self._maybe_init_kv_cache(input_ids.device)
assert self._kv_cache is not None
bsz = int(input_ids.shape[0])
if generation_step <= 0:
raise ValueError("generation_step must be >= 1 for decode_logits")
"""Decode one new token for residual group `generation_step` (1..len(self.lm_head)-1).
Note:
- generation_step == 0 is reserved for prefill_logits (uses lm_head[0]).
- Valid decode steps are 1 through len(self.lm_head) - 1, inclusive.
"""
self._maybe_init_kv_cache(input_ids.device)
assert self._kv_cache is not None
bsz = int(input_ids.shape[0])
max_step = len(self.lm_head) - 1
if generation_step < 1 or generation_step > max_step:
raise ValueError(
f"generation_step must be in [1, {max_step}] for decode_logits; "
f"got {generation_step}"
)

Comment on lines +817 to +825
def _load_audio_to_np(self, x: str) -> tuple[np.ndarray, int]:
import librosa

if self._is_url(x):
with urlopen(x) as resp:
audio_bytes = resp.read()
with io.BytesIO(audio_bytes) as f:
audio, sr = sf.read(f, dtype="float32", always_2d=False)
elif self._is_probably_base64(x):
Copilot AI Feb 10, 2026

_load_audio_to_np fetches arbitrary URLs via urlopen with no timeout, size limit, or allowlisting. Since ref_audio can come from API inputs (Base task), this is a concrete SSRF / DoS risk (e.g., internal network access or huge downloads). Consider disabling URL fetching by default, adding an explicit allowlist, enforcing https, and setting timeouts + maximum download size.
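
A hedged sketch of the kind of guard this suggests, using only the standard library. The helper names, the allowlist, and both limits are hypothetical and not part of the PR; a real deployment would likely also need to consider redirects and DNS resolution.

```python
import io
from urllib.parse import urlparse

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB cap
FETCH_TIMEOUT_S = 5.0               # hypothetical timeout for urlopen(..., timeout=...)


def check_ref_audio_url(url, allowed_hosts):
    """Reject URLs that are not https or whose host is not allowlisted."""
    parts = urlparse(url)
    if parts.scheme != "https":
        raise ValueError(f"only https URLs are accepted, got {parts.scheme!r}")
    if parts.hostname not in allowed_hosts:
        raise ValueError(f"host is not allowlisted: {parts.hostname!r}")
    return url


def read_limited(stream, max_bytes=MAX_AUDIO_BYTES):
    """Read at most max_bytes; fail if the stream holds more."""
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"download exceeds {max_bytes} bytes")
    return data


# Demo on an in-memory stream so the sketch runs without network access.
demo_bytes = read_limited(io.BytesIO(b"RIFF....WAVE"), max_bytes=64)
```

In the loader, the fetch would then look roughly like `with urlopen(check_ref_audio_url(x, hosts), timeout=FETCH_TIMEOUT_S) as resp: audio_bytes = read_limited(resp)`.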

Comment on lines +251 to 260
# Must use prompt_token_ids (not text prompt): the AR Talker
# operates on codec tokens; text token IDs exceed codec vocab.
# model.preprocess replaces all embeddings, so placeholder value
# is irrelevant -- but length must match to avoid excess padding.
tts_params = self._build_tts_params(request)
prompt_text = self._build_tts_prompt(request.input)
ph_len = self._estimate_prompt_len(tts_params)
prompt = {
"prompt": prompt_text,
"prompt_token_ids": [1] * ph_len,
"additional_information": tts_params,
}
Copilot AI Feb 10, 2026

tests/entrypoints/openai_api/test_serving_speech.py currently calls OmniOpenAIServingSpeech._build_tts_prompt(), but this method was removed and TTS prompting is now done via prompt_token_ids. This will break unit tests (and any downstream callers relying on the old helper); either restore a compatibility wrapper or update the tests/callers to the new prompt_token_ids-based flow.

Comment on lines +20 to +63
gpu_memory_utilization: 0.3
distributed_executor_backend: "mp"
max_num_batched_tokens: 512
max_model_len: 4096
# Stage-0 emits flattened codec codes via async_chunk connector.
custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk
default_sampling_params:
temperature: 0.9
top_k: 50
max_tokens: 4096
seed: 42
detokenize: false
repetition_penalty: 1.05
stop_token_ids: [2150]

- stage_id: 1
stage_type: llm
runtime:
devices: "0"
max_batch_size: 1
engine_args:
model_stage: code2wav
model_arch: Qwen3TTSCode2Wav
hf_overrides:
architectures: [Qwen3TTSCode2Wav]
# Stage-1 has no main checkpoint weights (SpeechTokenizer is loaded from
# `speech_tokenizer/` lazily). Avoid probing for model.safetensors.
load_format: dummy
worker_type: generation
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
enforce_eager: true
trust_remote_code: true
async_scheduling: false
enable_prefix_caching: false
engine_output_type: audio
gpu_memory_utilization: 0.2
distributed_executor_backend: "mp"
# Must be >= num_code_groups * (codec_left_context_frames + codec_chunk_frames).
max_num_batched_tokens: 8192
# async_chunk appends windows per step; max_model_len must cover accumulated stream.
max_model_len: 32768
engine_input_source: [0]
final_output: true
final_output_type: audio
Copilot AI Feb 10, 2026

This async_chunk stage config defines runtime.connectors.connector_of_shared_memory, but neither stage declares output_connectors / input_connectors to bind that named connector to the edge (unlike model_executor/stage_configs/qwen3_tts.yaml, which explicitly maps to_stage_1 / from_stage_0). If the runtime expects explicit connector mapping, the extra codec streaming settings may not be applied; consider adding the connector mappings for stage 0/1 for consistency.

head,
)
except Exception:
pass
Copilot AI Feb 10, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
pass
# Codec statistics logging is best-effort; failures must not break decoding.
logger.debug("Failed to compute or log Code2Wav codec statistics.", exc_info=True)

if arr.size > 0:
wav_candidates.append(arr)
return
except Exception:
Copilot AI Feb 10, 2026

'except' clause does nothing but pass and there is no explanatory comment.

if arr.size > 0:
wav_candidates.append(arr)
return
except Exception:
Copilot AI Feb 10, 2026

'except' clause does nothing but pass and there is no explanatory comment.

return int(x.numel())
if isinstance(x, list):
return int(len(x))
except Exception:
Copilot AI Feb 10, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
except Exception:
# If length computation fails for any reason, treat it as zero-length.

info_update["talker_prefill_offset"] = int(offset + span_len)
else:
# Subsequent prefill chunk: slice from our own running offset.
if not isinstance(prompt_embeds_cpu, torch.Tensor) or prompt_embeds_cpu.ndim != 2:
Copilot AI Feb 10, 2026

This statement is unreachable.

@linyueqian
Collaborator

I ran this end-to-end on H200s and the online serving path works well! A few things need fixing before merge though.

Critical:

  1. Offline inference is broken. examples/offline_inference/qwen3_tts/end2end.py crashes immediately with Model architectures ['Qwen3TTSForConditionalGeneration'] are not supported. The HF config still lists the old arch name but it was removed from the registry. Online serving works because stage configs apply hf_overrides, but the Omni() constructor validates architecture before per-stage overrides kick in. Easiest fix: add a compat alias in the registry, or defer arch validation for staged models.

  2. SSRF in _load_audio_to_np (qwen3_tts_talker.py:820). urlopen(x) on user-supplied ref_audio URLs with zero validation. Can hit cloud metadata endpoints, port-scan, or DoS via large downloads. Need host allowlisting + timeout + size limit, or move URL resolution to the serving layer.

  3. NPU stage config still references Qwen3TTSForConditionalGeneration (platforms/npu/stage_configs/qwen3_tts.yaml). Will break NPU deployment. Needs the same 2-stage update as the GPU config.

  4. **req_infos crashes on None (gpu_model_runner.py). req_infos is assigned at lines 840 and 1012 via getattr(req_state, "additional_information_cpu", None) which can return None. Then it's unpacked as **req_infos at line 848 (postprocess) and line 1021 (preprocess), and **None raises TypeError. The postprocess path catches this with except Exception but that silently fails the entire batch. Fix: req_infos = getattr(req_state, "additional_information_cpu", None) or {}.
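
The `or {}` fix from item 4 can be illustrated in isolation. The helper name and the stand-in state objects below are hypothetical; only the getattr expression comes from the review.

```python
def merge_request_info(req_state, **extra):
    """Merge per-request info with extra fields, tolerating None."""
    # `or {}` covers both a missing attribute and an attribute set to None.
    req_infos = getattr(req_state, "additional_information_cpu", None) or {}
    return {**req_infos, **extra}


class _StateWithNone:
    additional_information_cpu = None


# Without the fallback, unpacking None fails before the model hook is reached.
try:
    {**_StateWithNone.additional_information_cpu}
    unpack_failed = False
except TypeError:
    unpack_failed = True

merged = merge_request_info(_StateWithNone(), step=1)
```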

Other suggestions to consider:

  1. RoPE mismatch (qwen3_tts_talker.py:326). As @tsdocode flagged, the talker uses Qwen3Model (standard RoPE) but the model may have been trained with multimodal RoPE. The dummy vision IDs avoid mrope scanning but if the weights expect mrope, quality will degrade over long sequences. Needs verification against the official HF impl.

  2. Broken test (test_serving_speech.py). test_build_tts_prompt calls _build_tts_prompt() which was removed. Will fail with AttributeError.

  3. Final audio frame dropped at EOS (stage_input_processors/qwen3_tts.py:45). The if not finished guard skips frame extraction when finished=True. If the last codec frame arrives in the same step as EOS, it's silently lost. Extract the frame first, then check finished.

  4. decode_logits missing upper-bound check (qwen3_tts_code_predictor_vllm.py:402). generation_step is validated >= 1 but not bounded above. Out-of-range values will IndexError on self.lm_head[generation_step].

  5. Vocab mask allocated every step (qwen3_tts_talker.py:401). compute_logits() creates a vocab-sized bool mask on GPU per call. It's constant, register as a buffer in init.

  6. Dead code (qwen3_tts_talker.py:537). prompt_embeds is always None so the else branch is unreachable.

  7. 3s blocking wait in scheduler (adapter.py:254). get_through_connector does time.sleep(0.01) x 300 synchronously. With N stalled requests this blocks the scheduler for Nx3s. Wait params are hardcoded rather than from connector config.

  8. OmniOutput.text_hidden_states typed as Tensor but Code2Wav passes None.

  9. No torch.compile for code predictor (Qwen-3 Omni has this).

Runtime testing

Comparison against the old monolithic single-stage pipeline (same hardware, same model):

| Test | Old (monolithic) | PR (disaggregated) | Speedup |
| --- | --- | --- | --- |
| English, Vivian | 4.82s / 4.80s audio / RTF 1.00 | 3.84s / 4.80s audio / RTF 0.80 | 1.25x |
| Chinese, Ryan | 4.62s / 4.64s audio / RTF 1.00 | 2.17s / 3.52s audio / RTF 0.62 | 1.62x |
| Long text (~50w), Serena | 16.81s / 17.04s audio / RTF 0.99 | 8.67s / 15.36s audio / RTF 0.56 | 1.77x |

RTF = latency / audio duration, below 1.0 is faster than real-time. Audio durations vary slightly between runs due to stochastic generation (temperature 0.9). The old pipeline sits at ~1.0x RTF (barely real-time). The disaggregated pipeline runs 0.56-0.80x RTF, with larger speedups on longer text thanks to the async chunk streaming between stages.
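
For reference, the RTF figures can be recomputed from the latency and audio-duration pairs reported above. The ratio of old to new RTF is used here as one plausible reading of the Speedup column, which is an assumption.

```python
def rtf(latency_s, audio_s):
    """Real-time factor: latency divided by audio duration."""
    return latency_s / audio_s


# (latency_s, audio_s) pairs as reported: (old pipeline, this PR).
runs = {
    "English, Vivian": ((4.82, 4.80), (3.84, 4.80)),
    "Chinese, Ryan": ((4.62, 4.64), (2.17, 3.52)),
    "Long text, Serena": ((16.81, 17.04), (8.67, 15.36)),
}

for name, (old, new) in runs.items():
    old_rtf, new_rtf = rtf(*old), rtf(*new)
    print(f"{name}: RTF {old_rtf:.2f} -> {new_rtf:.2f}, ratio {old_rtf / new_rtf:.2f}x")
```

The recomputed RTFs match the reported ones to two decimals; the ratios land within about 0.03 of the Speedup column, with the small differences attributable to rounding.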

The PR also adds instruction/emotion control which the old pipeline doesn't support (returns 400 on the instructions field).

One recurring warning: Error concatenating tensor for key sr fires every request because sample rate is a scalar that can't be torch.cat across streaming chunks. Non-fatal but noisy.
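
One way to avoid the warning, sketched with plain Python lists standing in for tensors: concatenate only sequence-valued keys and carry scalar metadata such as sr through unchanged. The function name, the chunk layout, and the sample-rate value are hypothetical.

```python
def merge_chunk_outputs(chunks):
    """Merge streaming chunk dicts; lists are concatenated, scalars kept once."""
    merged = {}
    for chunk in chunks:
        for key, value in chunk.items():
            if isinstance(value, list):
                # Sequence data (e.g. audio samples) is concatenated in order.
                merged.setdefault(key, []).extend(value)
            else:
                # Scalar metadata (e.g. sample rate): keep the first value seen.
                merged.setdefault(key, value)
    return merged


example = merge_chunk_outputs([
    {"audio": [0.1, 0.2], "sr": 24000},  # 24000 is an illustrative value
    {"audio": [0.3], "sr": 24000},
])
```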


  num_computed_tokens = request.num_computed_tokens
- required_tokens = max(len(request.prompt_token_ids) - num_computed_tokens, 1)
+ required_tokens = len(request.prompt_token_ids) - num_computed_tokens
Contributor

Why do we need this judgment? Or can we abstract it into a function?

Contributor Author

In async_chunk mode (the Code2Wav stage of qwen3-tts and qwen3-omni), prompt_token_ids is dynamically appended chunk by chunk by the upstream component. When the upstream has not produced a new chunk yet, len(prompt_token_ids) == num_computed_tokens, so required_tokens should be 0, meaning “no new data to process yet.” In this case, the scheduler should skip this request (i.e., continue) and check again in the next iteration.

If we keep max(..., 1), then when there is no new chunk it will still force scheduling one placeholder token, causing the model to run inference repeatedly on already-processed data and resulting in incorrect audio output.
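The skip behavior described above can be sketched in isolation. This is a simplified illustration, not the scheduler's actual code: `ChunkedRequest` and `select_runnable` are hypothetical names, and only `prompt_token_ids`, `num_computed_tokens`, and `required_tokens` come from the diff under discussion.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ChunkedRequest:  # hypothetical stand-in for the scheduler's request object
    request_id: str
    prompt_token_ids: List[int] = field(default_factory=list)
    num_computed_tokens: int = 0


def select_runnable(requests: List[ChunkedRequest]) -> List[Tuple[str, int]]:
    """Return (request_id, tokens_to_schedule) for requests with new upstream data."""
    scheduled = []
    for request in requests:
        required_tokens = len(request.prompt_token_ids) - request.num_computed_tokens
        if required_tokens <= 0:
            # Upstream has not appended a new chunk yet: skip and retry next iteration.
            continue
        scheduled.append((request.request_id, required_tokens))
    return scheduled


# "a" has consumed everything it received; "b" has two new tokens from upstream.
stalled = ChunkedRequest("a", [1, 2, 3], num_computed_tokens=3)
fresh = ChunkedRequest("b", [1, 2, 3, 4, 5], num_computed_tokens=3)
runnable = select_runnable([stalled, fresh])
```

With `max(..., 1)` in place, request `a` would instead be scheduled for one token even though nothing new has arrived.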

Contributor

I think this is unnecessary. In our design, if no new block is produced upstream, the request should not execute this round of scheduling.

Contributor

On the other hand, when enabling or disabling async_chunk, we need to minimize intrusive modifications to the scheduler.

Contributor Author

Maybe we should move this to adapter.py or somewhere else and handle it a different way? If max(..., 1) stays in effect, Qwen3_TTS will hang when it receives a request. What do you think?

Contributor Author

After talking with @amy-why-3459, this part of TTS will follow #951. WIP.

Comment thread vllm_omni/distributed/omni_connectors/adapter.py Outdated
Signed-off-by: Sy03 <1370724210@qq.com>
@linyueqian
Collaborator

linyueqian commented Feb 16, 2026

Tested E2E offline inference on the latest revision (b6c1928 + 8fbf458). The disaggregated pipeline works, both stages initialize fine, codec streaming via SharedMemoryConnector is functional, and Code2Wav produces intelligible audio at RTF=0.76. Nice work!

I did run into a few issues with the offline example (examples/offline_inference/qwen3_tts/end2end.py) though:

First, the example only passes 1 SamplingParams but the 2-stage pipeline expects 2 (one per stage), so it throws ValueError: Expected 2 sampling params, got 1.

Second, the SamplingParams in the example is missing stop_token_ids=[2150]. The YAML default config has this, but when you pass your own SamplingParams it fully overrides the default. Without the stop token, the Talker never stops and just runs until max_tokens, producing minutes of garbled audio after the actual speech finishes. Once I added stop_token_ids=[2150], the output was a clean 6-second audio for the Chinese test sentence.

Third, the example uses text prompts ("prompt": "<|im_start|>assistant\n..."), but the Talker operates on codec tokens so tokenized text IDs (e.g. 151644 for <|im_start|>) exceed the codec vocabulary range (~4200), which causes a CUDA device-side assert in apply_penalties when CUDA graphs are enabled. CUDA graphs do work fine when using prompt_token_ids (which is what the online serving path does). So the example should probably switch to prompt_token_ids: [1] * estimated_len to be consistent with the online path and avoid the crash.

The first two are quick fixes. The third one just needs the example to use prompt_token_ids instead of text prompts.
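The three points can be condensed into a small sketch. Plain dicts stand in for SamplingParams objects so the snippet runs without vLLM installed; `check_sampling_params` and `placeholder_prompt` are hypothetical helpers, and the `max_tokens` values are illustrative. Only the stop token ID 2150, the error message, and the `[1] * estimated_len` prompt shape come from the comment above.

```python
from typing import Any, Dict, List

TALKER_STOP_TOKEN_ID = 2150  # stop token quoted in the review comment


def check_sampling_params(params: List[Dict[str, Any]], num_stages: int) -> None:
    """Mirror the reported check: one sampling-params entry per pipeline stage."""
    if len(params) != num_stages:
        raise ValueError(f"Expected {num_stages} sampling params, got {len(params)}")


def placeholder_prompt(estimated_len: int) -> Dict[str, List[int]]:
    """Build a token-ID prompt so no text token IDs reach the codec vocabulary."""
    return {"prompt_token_ids": [1] * estimated_len}


# Stage 0 (Talker) needs the stop token; otherwise it runs until max_tokens.
talker_params = {"stop_token_ids": [TALKER_STOP_TOKEN_ID], "max_tokens": 2048}  # max_tokens illustrative
# Stage 1 (Code2Wav) still needs its own entry, even with default-like values.
code2wav_params = {"max_tokens": 2048}  # illustrative

stage_params = [talker_params, code2wav_params]
check_sampling_params(stage_params, num_stages=2)
prompt = placeholder_prompt(estimated_len=8)
```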

…put handling in offline test

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307
Contributor Author

Sy0307 commented Feb 16, 2026

Fixed the offline tests following your advice. Thanks a lot.

@linyueqian
Collaborator

Re-tested with 5530aaa. The updated end2end.py works correctly now, got clean 4.72s audio with CUDA graphs enabled. LGTM!

Collaborator

@gcanlin gcanlin left a comment

Nice work!

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 18, 2026
…wen3_tts

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307
Contributor Author

Sy0307 commented Feb 20, 2026

Issue Analysis

Due to a previous modification (required_tokens<=0 -> continue to avoid tts generation hanging), we now have a higher probability of triggering the following issue:

vLLM v0.16 enables async_scheduling by default, and EngineCore will go through step_with_batch_queue (batch_queue_size=2). This means:

  • The output of batch N may not have been processed yet when batch N+1 has already started to schedule/execute.
  • Previously, when constructing OmniModelRunnerOutput in the generation runner (GPU/NPU), we used req_ids = self.input_batch.req_ids.
  • However, the next scheduling round will enter GPUModelRunner._update_states(), which modifies self.input_batch in-place based on the set of scheduled/unscheduled requests in the current round (removing unscheduled/finished requests and rebuilding the index).

As a result, model_output_N can hold a req_ids list that the next scheduling round has already modified, and the following error occurs: a req_id that scheduler_output considers scheduled cannot be found in model_output_N.req_id_to_index, which triggers a KeyError.

Solution

To avoid the output holding mutable references and prevent the held objects from being corrupted, we make the following modification in b6e6972:

req_ids_output_copy = self.input_batch.req_ids.copy()
req_id_to_index_output_copy = self.input_batch.req_id_to_index.copy()
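The aliasing problem and the effect of copying can be reproduced with a toy batch object. This is a minimal sketch: `ToyInputBatch` and `update_states` are simplified, hypothetical stand-ins for the runner's input batch and `_update_states()`, kept only to show why a held reference goes stale while a copy does not.

```python
from typing import Dict, List, Set


class ToyInputBatch:  # hypothetical stand-in for the runner's input batch
    def __init__(self, req_ids: List[str]) -> None:
        self.req_ids = list(req_ids)
        self.req_id_to_index: Dict[str, int] = {r: i for i, r in enumerate(self.req_ids)}

    def update_states(self, keep: Set[str]) -> None:
        """Drop unscheduled requests and rebuild the index in place."""
        self.req_ids[:] = [r for r in self.req_ids if r in keep]
        self.req_id_to_index.clear()
        self.req_id_to_index.update({r: i for i, r in enumerate(self.req_ids)})


def demo():
    batch = ToyInputBatch(["req-0", "req-1"])
    aliased = batch.req_id_to_index        # output of batch N holds a live reference
    copied = batch.req_id_to_index.copy()  # output of batch N holds a snapshot
    # Batch N+1 is scheduled before batch N's output has been processed.
    batch.update_states(keep={"req-1"})
    return aliased, copied


aliased_index, copied_index = demo()
# aliased_index["req-0"] would now raise KeyError; copied_index["req-0"] is still valid.
```

The copies are shallow, which is enough here because the list holds strings and the dict maps strings to ints.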

cc @hsliuustc0106

Comment thread vllm_omni/worker/gpu_generation_model_runner.py
enforce_eager: true
trust_remote_code: true
async_scheduling: false
enable_prefix_caching: false
Collaborator

@gcanlin gcanlin Feb 20, 2026

BTW, you could disable async scheduling here. I have hit this issue before, and I found that Qwen3-Omni disables it in stage-2, so we didn't add the copy in the model runner. But copying may be more general.
cc @amy-why-3459 @tzhouam

Contributor Author

Yes, copying here follows what vllm does: vllm code

Collaborator

Oh, sorry. I missed async_scheduling: false. So even with async_scheduling disabled, the KeyError still happened?

Contributor Author

This error happened in the Qwen2.5-Omni CI test, and qwen2_5_omni_ci.yaml does not set async_scheduling: false. (BTW, I found that qwen2_5_omni.yaml does have this setting.)

Signed-off-by: Sy03 <1370724210@qq.com>
@@ -0,0 +1,92 @@
async_chunk: true
Collaborator

@gcanlin gcanlin Feb 20, 2026

What's the difference between qwen3_tts_talker_speech_tokenizer_async_chunk.yaml and qwen3_tts.yaml? Also, this PR turns async chunk on by default, but as far as I know we don't support offline async chunk yet. Could we? Or does this PR also add that feature?

Contributor Author

qwen3_tts_talker_speech_tokenizer_async_chunk.yaml was the 2-stage config file used for preliminary testing, named to distinguish it from qwen3_tts.yaml. I should delete it now.

Offline async chunk is not supported yet, but I am working on the related issue #1193. I think it will be implemented in another PR.

…speech tokenizer async chunk.

Signed-off-by: Sy03 <1370724210@qq.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

lgtm

@hsliuustc0106 hsliuustc0106 merged commit efbe411 into vllm-project:main Feb 20, 2026
6 of 7 checks passed
with1015 added a commit to with1015/vllm-omni that referenced this pull request Apr 6, 2026
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Hunter Liu <hunter@liu.sh>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "[misc] Feature/pr reviewer auto trigger&update model" (#1432)

* [Doc] Update GPU installation commands (#1434)

* [ROCM] [CI] fix dockerfile.rocm to support nightly build and also fix amd ci v0.16.0rc1 (#1380)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Feature][BAGEL] Combine multi-branch cfg into a single batch to accelerate inference. (#1429)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Feat]: add ASCII art logo for vLLM-Omni (#1430)

* [Bug] [Bagel] Fix kv transfer bug (#1437)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: Wang Zhipeng: princepride <wangzhipeng628@gmail.com>

* [CI] Set L2 & L3 tests running conditions. (#1344)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Feature] vLLM-Omni RDMA connector (#1019)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* [Minor][Refactor] Pass seq_token_counts explicitly (#1425)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Misc] Extend Diffusion Benchmark script to other backends (#875)

Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Feature] Support Stage Based Deployment CLI (#939)

Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Doc] Optimize vLLM-Omni metrics documentation (#1311)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Forward all vllm-omni serve command parameters to model (#985)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc]: Add bagel single/multi node usage with mooncake document (#1450)

* [Qwen3TTS][Feat] Code2Wav batched decoding (#1426)

Signed-off-by: pablo <pablo@agigo.ai>
Co-authored-by: pablo <pablo@agigo.ai>

* [CI] Remove overwhelming debug log (#1463)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Misc] update wechat image (#1464)

Signed-off-by: David Chen <530634352@qq.com>

* [Doc] Refine Diffusion Tutorial Documents (#1305)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

* [Bugfix] Robust Audio Data Handling in _create_audio_choice (#1222)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

* [Bugfix]: Fix merging updated additional information to ensure dict type (#1296)

Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>

* [Model]Add new nextstep_1(Diffusion) model(only T2I) (#612)

Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: sniper35 <dongw2019@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Add TTS configuration options (#1177)

Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch>

* [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video (#1433)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration (#1455)

Signed-off-by: Sangchun Ha <seomk9896@gmail.com>

* [BugFix] process request.num_cached_tokens if it equals to the initial value (#1468)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Bugfix] Fix SDPA attention mask dtype and shape (Fix #857) (#1349)

Signed-off-by: jader <yjader@foxmail.com>

* [Test] Reduce Perf test case and fix modify stage config (#1449)

Signed-off-by: yenuo26 <410167048@qq.com>

* [NPU] Upgrade to v0.16.0 (#1375)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep… (#1491)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements (#1482)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

* [Bugfix] Fix tuple/list KV cache extraction crash (#1405)

Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] format lora related docs for the user's end (#1009)

Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk>
Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>

* [Feature] Support Wan2.2 output with irregular shapes (#1279)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Misc] Migrate L1 tests to use pytest-mock (#1315)

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

* [Bugfix] Fix LoRA Scaling on Active Adapters (#1421)

Signed-off-by: Alex Brooks <albrooks@redhat.com>

* [Bugfix] fix record audio generated frame in offline infer (#1312)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>

* [Model] Support OmniGen2 (#513)

Signed-off-by: Yupu <feng.yu.pu0330@gmail.com>

* [Bugfix][Qwen3TTS] (#1289)

Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Use pull through cache image for H100 pool (#1518)

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

* [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version (#1500)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix offline text_to_image error from #1009 (#1515)

Signed-off-by: David Chen <530634352@qq.com>

* [XPU] Enable FLASH_ATTN on XPU (#1332)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* Revert gpu_1 job to use regular image (#1521)

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

* [Chore] remove unused logger in omni_diffusion (#531) (#1509)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Qwen3TTS][Feat] Streaming output (#1438)

Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: pablo <pablo@agigo.ai>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Race condition in MultiprocExecutor on concurrent access to Scheduler (#1448)

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning (#1435)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [Performance]Qwen3-Omni performance optimization (#1378)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [Feature] Support HSDP for diffusion models (#1339)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [CI] fixed CI timeout (#1460)

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>

* [Bugfix] Use uds for zmq address if not set --stage-id (#1522)

Signed-off-by: wuhang <wuhang6@huawei.com>

* [BugFix] Restore talker's config (#1524)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Canlin Guo <961750412@qq.com>

* [XPU] fix qwen_omni after rebase to v0.16.0 (#1416)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Platform] Enable layerwise offload on all hardware (#1492)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* diffusion: enable VAE patch parallel for SD3.5 (#1428)

Signed-off-by: dongbo910220 <1275604947@qq.com>

* [Perf] GLM Image (#920)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [skip ci][Doc] add design docs for async chunk in qwen3-omni (#962)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder (#1205)

Signed-off-by: xulusjb <fdukeshik@gmail.com>
Co-authored-by: xulusjb <fdukeshik@gmail.com>

* [New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support (#750)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning<dingning7@xiaomi.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Feature]: Native GGUF Quantization Support for DiT (#1285)

Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Add benchmark for `v1/audio/speech` non-streaming (#1408)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Version] Auto generate version using `setuptools_scm` (#1224)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Feat] : Support Async chunk cleanup (#1087)

Signed-off-by: Sy03 <1370724210@qq.com>

* [Profiler] Support online profiling (#1136)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

* [Bugfix] Fix redundant finished req status updating on OmniGenerationScheduler (#1510)

Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: 齐保元 <qibaoyuan@xiaomi.com>

* [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda (#1488)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>

* [Chore] Cleanup dead code in GGUF DiT code path (#1533)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Doc] Update installation instructions for vllm 0.16.0 (#1505)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Doc] [skip ci]Sync. (#1363)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

* [CI][skip ci]Update H100 image link based on #1518 (#1538)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* Fix no embed text spk tokens (#1540)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

* [Debug] Merge vllm pull 35368 (#1534)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Docs] update async chunk docs diagram [skip ci] (#1530)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio (#1554)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [NPU][Bugfix] Align GPU side and recover qwen3-tts (#1564)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [BugFix] Fix unexpected crash when init OmniDiffusion (#1562)

Signed-off-by: Semmer2 <semmer@live.cn>

* [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. (#1543)

Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

* [BugFix]: fix a lot of bugs (#1565)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* feat: add HyperCLOVAX-SEED-Omni-8B support

Model files:
- vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline
  (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer
- vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline
  (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from
  LLM output; truncate/pad vision codes to 729 (27x27) for decoder
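
The truncate/pad step for vision codes can be pictured with a minimal sketch. The function name `fit_vision_codes` and the `pad_id` value are hypothetical; only the target length of 729 (27x27) comes from the commit message, and the real stage input processor may pad differently.

```python
from typing import List

VISION_CODE_LEN = 729  # 27 x 27 grid, as stated in the commit message


def fit_vision_codes(codes: List[int], pad_id: int = 0) -> List[int]:
    """Truncate or right-pad `codes` to exactly VISION_CODE_LEN entries.

    `pad_id` is an assumed placeholder, not a value taken from the source.
    """
    if len(codes) >= VISION_CODE_LEN:
        return codes[:VISION_CODE_LEN]
    return codes + [pad_id] * (VISION_CODE_LEN - len(codes))
```

Right-padding is a design assumption here; it keeps the decoded tokens in their original grid positions.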

Registry:
- vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and
  HyperCLOVAXAudioPipeline with post-process functions

Stage config:
- vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config
  Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4),
  Stage 2: audio decoder (GPU 5)

Bug fixes for HyperCLOVAX compatibility:
- diffusion/request.py: add extra dict field to OmniDiffusionRequest so
  vision_tokens/audio_tokens from stage input processors reach the pipeline
- entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra before creating request
- entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where
  thinker2vision_decoder/thinker2audio_decoder return [])
- entrypoints/async_omni.py: handle skipped sentinel in _process_single_result
  so text-only requests complete without crashing on Stage 1/2

* fix: correct decoder params and HCX porting fixes

- hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50
  (matches OmniServe production defaults; 3.5 caused over-amplified
  autoguidance → shrunken/degraded output images)
- omni_stage.py: skip empty engine inputs for text-only requests
- async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra (audio_tokens/vision_tokens)
- registry.py: HCX Omni diffusion model registration fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: HyperCLOVAX-SEED-Omni-8B stage pipeline and entrypoint fixes

* fix: change guidance_scale from 9.0 to 0.75 (autoguidance scale, OmniServe default)

* feat: add audio decoder Stage 2 to hcx_omni pipeline

- Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml
- GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder)
- Add runtime edge 0->2 (thinker -> audio decoder)
- Implement post-generation PCM chunk streaming for audio output
  (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded)
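
As a rough illustration of the chunk format described above (4800 samples per event, int16, base64), the sketch below encodes float samples into per-event payloads. The helper names and the little-endian byte order are assumptions; the SSE framing itself is omitted.

```python
import base64
import struct
from typing import Iterator, Sequence

CHUNK_SAMPLES = 4_800  # 200 ms at 24 kHz, per the commit message


def float_to_int16(sample: float) -> int:
    """Clip a float sample to [-1.0, 1.0] and scale it to the int16 range."""
    clipped = max(-1.0, min(1.0, sample))
    return int(round(clipped * 32767))


def iter_pcm_chunks(samples: Sequence[float]) -> Iterator[str]:
    """Yield base64 strings, each holding up to CHUNK_SAMPLES int16 samples."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        block = samples[start:start + CHUNK_SAMPLES]
        ints = [float_to_int16(s) for s in block]
        raw = struct.pack(f"<{len(ints)}h", *ints)  # little-endian int16 (assumed)
        yield base64.b64encode(raw).decode("ascii")
```

A full chunk decodes to 9600 bytes (4800 samples x 2 bytes); the final chunk may be shorter.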

Refs: github.com/vllm-project/vllm-omni/pull/869 (already incorporated)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: vllm version compatibility for HyperCLOVAX audio decoder startup

- config/model.py: try/except fallback for AttentionBackendEnum import
  (vllm.v1.attention.backends.registry absent in older vllm builds)
- pipeline_hyperclovax_audio.py: return actual named_parameters() from
  load_weights() when using MAR checkpoint so diffusers_loader strict
  check passes (weights loaded eagerly in __init__ via MAR extraction)
- qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs
  for check_interleaved_audio_video and merge_interleaved_embeddings
  which are absent in older vllm qwen2_5_omni_thinker; these symbols
  are only exercised by Qwen models, not HyperCLOVAX
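
The try/except fallback pattern mentioned above might look roughly like this. `optional_import` is a hypothetical helper written for illustration, not a function from vllm or vllm-omni.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return `attr` from `module_name`, or None if either is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


# Possible usage with the symbol named in the commit message:
# AttentionBackendEnum = optional_import(
#     "vllm.v1.attention.backends.registry", "AttentionBackendEnum"
# )
```

Callers then need to handle the `None` case, which is the trade-off for tolerating older builds.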

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add edge 1→2 and correct model key in hcx_omni.yaml Stage 2

- Add runtime edge from:1 to:2 (required for Stage-2 connector init;
  without it AsyncOrchestrator cannot route to audio decoder at runtime)
- Change model_subdir to model for Stage-2 engine_args to match
  total-poc working reference config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: audio S2S output - handle diffusion outputs in _create_audio_choice

HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output
directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output
like LLM pipelines. Fixes:

1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output
   when final_res.outputs is empty (diffusion path).
2. Streaming audio path: same fix for _final_res.outputs[0].
3. Both loops (for output in final_res.outputs): fall back to single
   synthetic choice at index 0 when outputs list is empty.
4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process
   (returns WAV bytes, not tensors like Qwen3-Omni).
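
The fallback described in items 1 to 3 can be sketched as follows. `FakeCompletion` and `FakeResult` are stand-in classes invented for this example; they only mimic the two attributes the commit message names (`outputs` and `multimodal_output`).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeCompletion:
    """Stand-in for one LLM completion output."""
    multimodal_output: Dict[str, Any]


@dataclass
class FakeResult:
    """Stand-in for a request output; diffusion results leave `outputs` empty."""
    outputs: List[FakeCompletion] = field(default_factory=list)
    multimodal_output: Optional[Dict[str, Any]] = None


def pick_multimodal_output(res: FakeResult) -> Dict[str, Any]:
    """Prefer the LLM-style location; fall back to the top-level field."""
    if res.outputs:
        return res.outputs[0].multimodal_output
    return res.multimodal_output or {}
```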

Also fixes audio input (A2T) regression: skip diffusion prompt extraction
when mm_data has audio content (added in previous session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: parse WAV bytes with soundfile for uniform PCM chunk streaming

HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header.
The previous byte-offset splitting included the header in the first
chunk, corrupting it. Fix: parse with soundfile to get float32 PCM,
then convert to int16 chunks uniformly regardless of source type
(bytes or tensor).

Verified: 27.04s of audio streamed correctly as 136 chunks of up to 200ms each.
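
A minimal sketch of why parsing matters: decoding the WAV container yields only sample data, so the header never ends up in a chunk. The commit uses soundfile; this example swaps in the standard-library `wave` module to stay dependency-free, and it assumes 16-bit mono input. All function names are hypothetical.

```python
import io
import math
import struct
import wave
from typing import List

CHUNK_SAMPLES = 4_800  # 200 ms at 24 kHz


def wav_bytes_to_int16(wav_bytes: bytes) -> List[int]:
    """Decode 16-bit mono WAV bytes into int16 samples, header excluded."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("sketch only handles 16-bit mono WAV")
        frames = wf.readframes(wf.getnframes())
    return list(struct.unpack(f"<{len(frames) // 2}h", frames))


def chunk_count(num_samples: int) -> int:
    """Number of chunks when the last chunk is allowed to be partial."""
    return math.ceil(num_samples / CHUNK_SAMPLES)


def make_silent_wav(num_samples: int, sample_rate: int = 24_000) -> bytes:
    """Build an in-memory silent WAV, used here only to exercise the decoder."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * num_samples)
    return buf.getvalue()
```

Under these assumptions, 27.04 s at 24 kHz is 648960 samples, which gives 135 full chunks plus one 40 ms remainder, matching the 136 chunks reported.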

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: zero-shot TTS with speaker embedding from input audio

- serving_chat.py: extract last input_audio base64 from request messages
  and inject as ref_audio_b64 into engine_prompt dict
- thinker2audio_decoder: read ref_audio_b64 from prompt and pass as
  ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline)
- hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot)
  which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup

Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis
matching the voice characteristics of the original speaker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: wire audio decoder Stage 2 to hcx_omni pipeline and fix S2S flow

- Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml
  with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker
- Fix thinker2audio_decoder: correct audio token range (128606-135167),
  remap to [0, 6561) for BigVGAN input, handle empty token case gracefully
- Fix pipeline_hyperclovax_audio.py post_process_func signature and
  incorporate PR#869 BUG FIX patches for stable audio generation
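
The token remapping can be illustrated with a short sketch. Since 135167 - 128606 = 6561, the example reads the stated range as half-open (upper bound excluded), which is an assumption chosen to agree with the target interval [0, 6561). The function and constant names are hypothetical.

```python
from typing import Iterable, List

AUDIO_TOKEN_START = 128_606   # first audio token id, from the commit message
AUDIO_CODEBOOK_SIZE = 6_561   # size of the target range [0, 6561)


def remap_audio_tokens(token_ids: Iterable[int]) -> List[int]:
    """Keep only audio tokens and shift them into [0, AUDIO_CODEBOOK_SIZE).

    Returns an empty list when no audio tokens are present, so callers
    can treat text-only outputs as a no-op.
    """
    remapped = []
    for token in token_ids:
        offset = token - AUDIO_TOKEN_START
        if 0 <= offset < AUDIO_CODEBOOK_SIZE:
            remapped.append(offset)
    return remapped
```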

* fix: use finetuned audio decoder and fix transformers_modules deserialization

- hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot,
  ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id).
  Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable
  for text-only requests and incompatible with finetuned decoder path.

- pipeline_hyperclovax_audio.py: guard ref_audio processing with
  'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder,
  so passing ref_audio bytes would crash with 'expected 100 channels'.

- omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules)
  to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles
  outputs containing custom classes from transformers_modules (trust_remote_code),
  but the API server process doesn't have this path, causing deserialization
  failures that silently drop Stage-0 outputs.
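
The sys.path adjustment described for omni_stage.py could be approximated as below. `ensure_hf_modules_on_path` is a hypothetical helper; the default path is the one named in the commit message, and appending (rather than prepending) is an assumption made to avoid shadowing other packages.

```python
import os
import sys
from typing import Optional

DEFAULT_HF_MODULES_DIR = "~/.cache/huggingface/modules"


def ensure_hf_modules_on_path(cache_dir: Optional[str] = None) -> str:
    """Add the HuggingFace modules cache to sys.path once and return it."""
    path = os.path.expanduser(cache_dir or DEFAULT_HF_MODULES_DIR)
    if path not in sys.path:
        sys.path.append(path)
    return path
```

Calling this before unpickling lets `transformers_modules` classes resolve in a process that did not load the model itself.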

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: restore zero-shot speaker cloning with fallback for text-only requests

- hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN)
  for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed
  integer speaker_id and lost the input speaker's voice.

- pipeline_hyperclovax_audio.py: add zero-mel fallback branch for
  finetune=False + ref_audio=None case. When a text-only request arrives
  (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor
  [1, num_mels, 64] instead of crashing with 'expected 100 channels'.
  S2S requests always have ref_audio so the zero-shot cloning path is
  unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add stage config yaml for HCX audio decoder

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* feat: add HyperCLOVAX-SEED-Omni 8B model as vllm-omni executor

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* feat: add HCX audio decoder pipeline

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* fix: modify exception for HCX audio decoder (GAN)

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* fix: default temperature set to 0, and pipeline model evaluation mode

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

---------

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: dengyunyang <584797741@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Kyle Huang <yellowsea@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli0116@gmail.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: jzz <e1583181@u.nus.edu>
Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: Pierre Le Guen <26087574+PierreLeGuen@users.noreply.github.com>
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: ram16g <anlianfengjie@163.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: pablo <juanz9312@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: anna <lee.anna@navercorp.com>
Signed-off-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: xiedeyantu <czjourney@163.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: Semmer2 <semmer@live.cn>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Signed-off-by: Ziming Huang <1520787127@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Sy03 <1370724210@qq.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Hunter Liu <hunter@liu.sh>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: sniper35 <dongw2019@gmail.com>
Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch>
Signed-off-by: Sangchun Ha <seomk9896@gmail.com>
Signed-off-by: jader <yjader@foxmail.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk>
Signed-off-by: Yupu <feng.yu.pu0330@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: xulusjb <fdukeshik@gmail.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning<dingning7@xiaomi.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Co-authored-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: kYLe <yellowsea@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: NATURE <wzliu@connect.hku.hk>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Rein Yang <ruiruyang2@gmail.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: dsinghvi <divyanshsinghvi@gmail.com>
Co-authored-by: Fanli Lin <fanli.lin@intel.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>
Co-authored-by: Pierre LE GUEN <26087574+PierreLeGuen@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: ram16g <anlianfengjie@163.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Co-authored-by: Juan Pablo Zuluaga <46724788+JuanPZuluaga@users.noreply.github.com>
Co-authored-by: muziyuhui666 <111362884+muziyuhui666@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: ceanna93 <fairyanna@naver.com>
Co-authored-by: anna <lee.anna@navercorp.com>
Co-authored-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: erfgss <97771661+erfgss@users.noreply.github.com>
Co-authored-by: Jensen <czjourney@163.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: weichen <calvin_zhu0210@outlook.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: Chenguang Zheng <645327136@qq.com>
Co-authored-by: Jiaping Wu <53215702+ElleElleWu@users.noreply.github.com>
Co-authored-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: rein yang <73573651+R2-Y@users.noreply.github.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: ChenWenjing <54166744+Shirley125@users.noreply.github.com>
Co-authored-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: yenuo26 <410167048@qq.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: ZJY0516 <zhu.jiangyun@foxmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: Sy03 <1370724210@qq.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: UsamaKenway <56207634+UsamaKenway@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: wuhang <wuhang6@huawei.com>
Co-authored-by: pablo <pablo@agigo.ai>
Co-authored-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com>
Co-authored-by: Yanick Schraner <yanick.schraner@gmail.com>
Co-authored-by: Sangchun Ha <seomk9896@naver.com>
Co-authored-by: 亦瑾 <76905040+yJader@users.noreply.github.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Yupu <feng.yu.pu0330@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: zhumingjue138 <zhumingjue@huawei.com>
Co-authored-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Jared Wen <w13431838023@gmail.com>
Co-authored-by: Xu Lu <572605156@qq.com>
Co-authored-by: xulusjb <fdukeshik@gmail.com>
Co-authored-by: Baoyuan Qi <qibaoyuan@xiaomi.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Ting FU <futing10@huawei.com>
Co-authored-by: developer-account <irteam@vllm-omni-dev-0.vllm-omni-dev.p-nb13557.svc.cluster.local>
Co-authored-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…llm-project#1161)

Signed-off-by: Sy03 <1370724210@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

Labels

ready label to trigger buildkite CI

10 participants