[Bugfix] Fix transformers 5.x compat issues in online TTS serving#1536

Merged: hsliuustc0106 merged 9 commits into vllm-project:main from linyueqian:bugfix/tts-transformers5-compat on Mar 3, 2026

@linyueqian
Collaborator

Summary

  • Remove fix_mistral_regex=True from AutoTokenizer.from_pretrained (parameter removed in transformers 5.x)
  • Add fallback for 'default' rope_type missing from ROPE_INIT_FUNCTIONS in transformers 5.x (inline standard sinusoidal RoPE)
  • Clamp num_cached_tokens to max(0, ...) in OmniGenerationScheduler to prevent negative value crash

These fixes are required for online TTS serving to work with the current environment (transformers 5.2.0, pinned via uv.lock).
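
For reference, the 'default' RoPE fallback in the second bullet reduces to the standard inverse-frequency formula, inv_freq[i] = 1 / rope_theta^(i / head_dim) for even i. A minimal pure-Python sketch of that formula follows. The function name `default_inv_freq` is hypothetical; the PR computes the same quantity with `torch.arange` on the target device.

```python
def default_inv_freq(head_dim: int, rope_theta: float) -> list[float]:
    """Standard RoPE inverse frequencies: 1 / theta ** (i / head_dim) for even i."""
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


# One frequency per pair of head dimensions.
freqs = default_inv_freq(head_dim=4, rope_theta=10000.0)
```

The result has `head_dim // 2` entries, and the first is always 1.0 because the exponent is zero.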

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 339b3ddb2b

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Summary

This PR fixes three compatibility issues with transformers 5.x that were breaking online TTS serving. The changes are minimal, focused, and address real breaking changes in the transformers library.

Pros:

  • Addresses actual breaking changes in transformers 5.x
  • Small, focused fixes (21 additions, 4 deletions)
  • Good inline documentation explaining the 'default' rope_type fallback
  • Defensive programming with the max(0, ...) clamp
  • Clear error message for unsupported rope types

Cons:

  • No test coverage for the new fallback logic
  • The num_cached_tokens negative value issue suggests a deeper problem upstream

Recommendation: Approve with suggestions for follow-up investigation.

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py
def _default_rope_init(config, device=None, seq_len=None, layer_type=None):
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    inv_freq = 1.0 / (
        config.rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32, device=device) / head_dim)
Collaborator

Good: Well-documented fallback

The inline implementation of 'default' RoPE is well-documented and correct. The comment clearly explains why this is needed (transformers 5.x removed 'default' from ROPE_INIT_FUNCTIONS).

Suggestion: Consider adding a reference to the transformers version where this changed:

# transformers>=5.0 removed 'default' from ROPE_INIT_FUNCTIONS (see transformers PR #xxxxx)

f"Unsupported rope_type '{self.rope_type}'. Expected one of {list(ROPE_INIT_FUNCTIONS)} or 'default'."
)

inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
Collaborator

Good: Clear error message

The error message provides helpful context about what rope types are supported. This will make debugging easier if an unsupported type is encountered.

events=request.take_events(),
kv_transfer_params=kv_transfer_params,
trace_headers=request.trace_headers,
num_cached_tokens=request.num_cached_tokens,
Collaborator

Issue: Symptom fix, not root cause

Clamping num_cached_tokens to max(0, ...) prevents the crash, but it's treating the symptom rather than the root cause. A negative num_cached_tokens suggests:

  1. There's a bug upstream where request.num_cached_tokens is being set to a negative value
  2. Or there's a logic error in how cached tokens are being counted

Recommendation:

  • Add a warning log when clamping occurs to help track down the root cause:
num_cached = request.num_cached_tokens
if num_cached < 0:
    logger.warning(f"Negative num_cached_tokens ({num_cached}) detected for request {request.request_id}, clamping to 0")
    num_cached = 0
num_cached_tokens=num_cached,
  • File a follow-up issue to investigate why num_cached_tokens can be negative

This defensive fix is fine for now, but understanding the root cause would prevent potential issues elsewhere.

num_cached_tokens=max(0, request.num_cached_tokens),
num_external_computed_tokens=request.num_external_computed_tokens,
routed_experts=routed_experts,
num_nans_in_logits=request.num_nans_in_logits,
Collaborator

Same issue here

Same recommendation as above - consider adding logging to track when this clamping occurs.

        config.rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32, device=device) / head_dim)
    )
    return inv_freq, 1.0

Collaborator

Nit: _default_rope_init doesn't close over anything — pull it out to module level so you're not creating a new function object per instance.

events=request.take_events(),
kv_transfer_params=kv_transfer_params,
trace_headers=request.trace_headers,
num_cached_tokens=request.num_cached_tokens,
Collaborator

+1 to adding a logger.warning when clamping fires. Silent clamps on negative values will mask whatever upstream bug is producing them.

Collaborator

@lishunyang12 lishunyang12 left a comment

Left a couple minor comments. The fixes look correct overall.

Signed-off-by: linyueqian <linyueqian@outlook.com>
…known types

Signed-off-by: linyueqian <linyueqian@outlook.com>
…hed_tokens

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the bugfix/tts-transformers5-compat branch from 50daf92 to d911ac2 Compare February 28, 2026 04:06
@linyueqian
Collaborator Author

@hsliuustc0106 check this again?

@linyueqian
Collaborator Author

Added two more commits to this PR:

1. Fix MRotaryEmbedding import and code predictor dtype/sampling issues (ea3ae58)

  • gpu_model_runner.py: Import OmniMRotaryEmbedding instead of upstream MRotaryEmbedding. After the vLLM 0.16.0 rebase, upstream renamed get_input_positions_tensor to get_next_input_positions_tensor, but OmniMRotaryEmbedding has custom omni-specific position logic in the old method. This only affects the else branch (non-mrope models).
  • qwen3_tts_code_predictor_vllm.py: Replace hardcoded torch.bfloat16 cast in prefill_logits with the model's actual weight dtype. On GPUs without bfloat16 support (compute capability < 8.0), vLLM auto-casts weights to float16 but the hardcoded bfloat16 input causes a dtype mismatch in rms_norm.
  • qwen3_tts_code_predictor_vllm.py: Cast logits to float32 before softmax/multinomial sampling to prevent NaN/Inf from lower-precision dtypes crashing torch.multinomial.
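
To illustrate why sampling is safer in a wider dtype, here is a small stdlib-only sketch. It shows how narrow the float16 range is (the largest finite value is 65504) and computes a softmax in Python floats. The PR does not state which intermediate value became non-finite, so this is only an illustration of the general hazard; the helper names are hypothetical and no torch code is involved.

```python
import math
import struct


def fits_in_float16(x: float) -> bool:
    """Return True if x can be packed as a finite IEEE 754 half-precision value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def softmax(logits: list[float]) -> list[float]:
    """Softmax in Python floats (float64), subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# exp(11) still fits in float16, exp(12) does not.
small_ok = fits_in_float16(math.exp(11))
large_ok = fits_in_float16(math.exp(12))
probs = softmax([12.0, 11.0, -3.0])
```

A sampler such as `torch.multinomial` needs finite, non-negative probabilities, which is what the wider-precision softmax guarantees here.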

2. Preserve fix_mistral_regex compat for transformers 4.x (622b5df)

  • The previous commit removed fix_mistral_regex=True entirely, but users on transformers 4.x still need it. Now conditionally passes it based on transformers.__version__.
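
A minimal sketch of version-gated keyword arguments, assuming the check is done on the major version parsed from a version string. The helper name `tokenizer_kwargs` is hypothetical and the PR's actual check may be written differently.

```python
def tokenizer_kwargs(transformers_version: str) -> dict:
    """Build extra tokenizer kwargs that are only valid on transformers 4.x."""
    major = int(transformers_version.split(".")[0])
    kwargs = {}
    if major < 5:
        # The parameter was removed in transformers 5.x, so pass it only on older versions.
        kwargs["fix_mistral_regex"] = True
    return kwargs


legacy = tokenizer_kwargs("4.57.1")
current = tokenizer_kwargs("5.2.0")
```

The resulting dict could then be splatted into the tokenizer call, for example `AutoTokenizer.from_pretrained(path, **tokenizer_kwargs(transformers.__version__))`.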

@linyueqian
Collaborator Author

@hsliuustc0106 Ready for another review when you get a chance. Two new commits added, addressing bugs found during Qwen3-TTS benchmarking on GPUs with compute capability 7.5.

@hsliuustc0106
Collaborator

Do we need to point out the version dependency in the examples?

@linyueqian
Collaborator Author

No version dependency to document — the fixes are all backward-compatible. The fix_mistral_regex kwarg is now conditionally passed based on transformers.__version__, so it works on both 4.x and 5.x without any user action. The other fixes (RoPE fallback, num_cached_tokens clamp, MRotaryEmbedding import, code predictor dtype/sampling) are transparent to users.

@hsliuustc0106
Collaborator

Code review

Found 1 issue:

  1. Missing regression test for recurring num_cached_tokens bug (conventions.md says "Bug fixes: Regression test? Root cause understood?")

This is the second PR fixing the same num_cached_tokens negative value crash.

The fact that a second fix is needed proves the first fix was incomplete. Without a regression test, there's no way to verify all code paths are covered. This bug causes a Prometheus counter crash (ValueError: Counters can only be incremented by non-negative amounts) in production.

Recommend adding a test that:

  1. Verifies num_cached_tokens is never negative after scheduling
  2. Covers both the schedule() and update_from_output() code paths
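
One lightweight shape such a check could take, sketched without the real scheduler: a hypothetical `clamp_num_cached_tokens` helper plus a stand-in counter that rejects negative increments with the same message quoted above. Both names are assumptions for illustration and neither exists in vLLM or prometheus_client.

```python
import logging

logger = logging.getLogger(__name__)


class NonNegativeCounter:
    """Stand-in for a metrics counter that rejects negative increments."""

    def __init__(self):
        self.value = 0

    def inc(self, amount):
        if amount < 0:
            raise ValueError("Counters can only be incremented by non-negative amounts.")
        self.value += amount


def clamp_num_cached_tokens(num_cached, req_id):
    """Clamp negative values to 0 and log them so the upstream cause stays visible."""
    if num_cached < 0:
        logger.warning("Negative num_cached_tokens (%d) for request %s, clamping to 0", num_cached, req_id)
        return 0
    return num_cached


counter = NonNegativeCounter()
for raw in (-1, 0, 7):
    # Without the clamp, the first value would raise ValueError in inc().
    counter.inc(clamp_num_cached_tokens(raw, "req-0"))
```

Testing the helper in isolation avoids mocking the scheduler pipeline, at the cost of not proving that every code path calls it.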

if new_token_ids or pooler_output is not None or kv_transfer_params or stopped:
    # Add EngineCoreOutput for this Request.
    num_cached = request.num_cached_tokens
    if num_cached < 0:
        logger.warning("Negative num_cached_tokens (%d) for request %s, clamping to 0", num_cached, req_id)
        num_cached = 0
    outputs[request.client_index].append(
🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@linyueqian
Collaborator Author

This is a defensive fix for a negative value coming from upstream vLLM's generation scheduler. Adding a regression test here would mean mocking the full scheduler pipeline, which is brittle across vLLM rebases. The warning log helps us track occurrences and investigate the root cause upstream.

Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM

@hsliuustc0106 hsliuustc0106 merged commit a3240fd into vllm-project:main Mar 3, 2026
7 checks passed
@linyueqian linyueqian deleted the bugfix/tts-transformers5-compat branch March 4, 2026 04:48
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 4, 2026
### vllm-omni-perf
- Source: [PR #1619](vllm-project/vllm-omni#1619) - [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context
- Changes:
  - Bug fix: [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context

### vllm-omni-contrib
- Source: [PR #1615](vllm-project/vllm-omni#1615) - [Doc] Fix links in the configuration doc
- Changes:
  - Bug fix: [Doc] Fix links in the configuration doc

### vllm-omni-perf
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation

### vllm-omni-image-gen
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Additions:
  - GLM-Image

### vllm-omni-api
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation

### vllm-omni-serving
- Source: [PR #1602](vllm-project/vllm-omni#1602) - [Bugfix] fix kernel error for qwen3-omni
- Changes:
  - Bug fix: [Bugfix] fix kernel error for qwen3-omni

### vllm-omni-perf
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-image-gen
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Additions:
  - HunyuanImage3
  - HunyuanImage3Pipeline
  - HunyuanImage-3

### vllm-omni-quantization
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-distributed
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-contrib
- Source: [PR #1576](vllm-project/vllm-omni#1576) - 0.16.0 release

### vllm-omni-audio-tts
- Source: [PR #1570](vllm-project/vllm-omni#1570) - [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio
- Changes:
  - Bug fix: [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio

### vllm-omni-api
- Source: [PR #1566](vllm-project/vllm-omni#1566) - [Bugfix] Import InputPreprocessor into Renderer
- Changes:
  - Bug fix: [Bugfix] Import InputPreprocessor into Renderer

### vllm-omni-perf
- Source: [PR #1565](vllm-project/vllm-omni#1565) - [BugFix]: fix a lot of bug
- Changes:
  - Bug fix: [BugFix]: fix a lot of bug

### vllm-omni-contrib
- Source: [PR #1564](vllm-project/vllm-omni#1564) - [NPU][Bugfix] Align GPU side and recover qwen3-tts
- Changes:
  - Bug fix: [NPU][Bugfix] Align GPU side and recover qwen3-tts

### vllm-omni-audio-tts
- Source: [PR #1564](vllm-project/vllm-omni#1564) - [NPU][Bugfix] Align GPU side and recover qwen3-tts
- Changes:
  - Bug fix: [NPU][Bugfix] Align GPU side and recover qwen3-tts

### vllm-omni-perf
- Source: [PR #1562](vllm-project/vllm-omni#1562) - [BugFix] Fix unexpected crash when init OmniDiffusion
- Changes:
  - Bug fix: [BugFix] Fix unexpected crash when init OmniDiffusion

### vllm-omni-api
- Source: [PR #1562](vllm-project/vllm-omni#1562) - [BugFix] Fix unexpected crash when init OmniDiffusion
- Changes:
  - Bug fix: [BugFix] Fix unexpected crash when init OmniDiffusion

### vllm-omni-quantization
- Source: [PR #1562](vllm-project/vllm-omni#1562) - [BugFix] Fix unexpected crash when init OmniDiffusion
- Changes:
  - Bug fix: [BugFix] Fix unexpected crash when init OmniDiffusion

### vllm-omni-distributed
- Source: [PR #1562](vllm-project/vllm-omni#1562) - [BugFix] Fix unexpected crash when init OmniDiffusion
- Changes:
  - Bug fix: [BugFix] Fix unexpected crash when init OmniDiffusion

### vllm-omni-api
- Source: [PR #1554](vllm-project/vllm-omni#1554) - fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio
- Changes:
  - Bug fix: fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio

### vllm-omni-cicd
- Source: [PR #1543](vllm-project/vllm-omni#1543) - [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage.

### vllm-omni-perf
- Source: [PR #1540](vllm-project/vllm-omni#1540) - Fix no embed text spk tokens
- Changes:
  - Bug fix: Fix no embed text spk tokens

### vllm-omni-distributed
- Source: [PR #1540](vllm-project/vllm-omni#1540) - Fix no embed text spk tokens
- Changes:
  - Bug fix: Fix no embed text spk tokens

### vllm-omni-perf
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-quantization
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-distributed
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-image-gen
- Source: [PR #1538](vllm-project/vllm-omni#1538) - [CI][skip ci]Update H100 image link based on #1518

### vllm-omni-perf
- Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving
- Changes:
  - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving

### vllm-omni-serving
- Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving
- Changes:
  - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving

### vllm-omni-cicd
- Source: [PR #1534](vllm-project/vllm-omni#1534) - [Debug] Merge vllm pull 35368

### vllm-omni-contrib
- Source: [PR #1530](vllm-project/vllm-omni#1530) - [Docs] update async chunk docs diagram [skip ci]

### vllm-omni-distributed
- Source: [PR #1524](vllm-project/vllm-omni#1524) - [BugFix] Restore talker's config
- Changes:
  - Bug fix: [BugFix] Restore talker's config

### vllm-omni-api
- Source: [PR #1522](vllm-project/vllm-omni#1522) - [Bugfix] Use uds for zmq address if not set --stage-id
- Changes:
  - New feature: [Bugfix] Use uds for zmq address if not set --stage-id

### vllm-omni-perf
- Source: [PR #1521](vllm-project/vllm-omni#1521) - Revert gpu_1 job to use regular image

### vllm-omni-perf
- Source: [PR #1518](vllm-project/vllm-omni#1518) - Use pull through cache image for H100 pool

### vllm-omni-perf
- Source: [PR #1515](vllm-project/vllm-omni#1515) - [Bugfix] fix offline text_to_image error from #1009
- Changes:
  - Bug fix: [Bugfix] fix offline text_to_image error from #1009

### vllm-omni-image-gen
- Source: [PR #1515](vllm-project/vllm-omni#1515) - [Bugfix] fix offline text_to_image error from #1009
- Changes:
  - Bug fix: [Bugfix] fix offline text_to_image error from #1009
- Additions:
  - num-images-per-prompt

### vllm-omni-quantization
- Source: [PR #1515](vllm-project/vllm-omni#1515) - [Bugfix] fix offline text_to_image error from #1009
- Changes:
  - Bug fix: [Bugfix] fix offline text_to_image error from #1009

### vllm-omni-distributed
- Source: [PR #1515](vllm-project/vllm-omni#1515) - [Bugfix] fix offline text_to_image error from #1009
- Changes:
  - Bug fix: [Bugfix] fix offline text_to_image error from #1009

### vllm-omni-api
- Source: [PR #1509](vllm-project/vllm-omni#1509) - [Chore] remove unused logger in omni_diffusion (#531)

### vllm-omni-perf
- Source: [PR #1505](vllm-project/vllm-omni#1505) - [Doc] Update installation instructions for vllm 0.16.0

### vllm-omni-quantization
- Source: [PR #1505](vllm-project/vllm-omni#1505) - [Doc] Update installation instructions for vllm 0.16.0

### vllm-omni-distributed
- Source: [PR #1505](vllm-project/vllm-omni#1505) - [Doc] Update installation instructions for vllm 0.16.0

### vllm-omni-contrib
- Source: [PR #1505](vllm-project/vllm-omni#1505) - [Doc] Update installation instructions for vllm 0.16.0

### vllm-omni-video-gen
- Source: [PR #1504](vllm-project/vllm-omni#1504) - [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading
- Changes:
  - New feature: [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading

### vllm-omni-perf
- Source: [PR #1504](vllm-project/vllm-omni#1504) - [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading
- Changes:
  - New feature: [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading

### vllm-omni-api
- Source: [PR #1504](vllm-project/vllm-omni#1504) - [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading
- Changes:
  - New feature: [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading

### vllm-omni-cicd
- Source: [PR #1504](vllm-project/vllm-omni#1504) - [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading
- Changes:
  - New feature: [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading

### vllm-omni-contrib
- Source: [PR #1500](vllm-project/vllm-omni#1500) - [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version

### vllm-omni-cicd
- Source: [PR #1492](vllm-project/vllm-omni#1492) - [Platform] Enable layerwise offload on all hardware

### vllm-omni-image-gen
- Source: [PR #1491](vllm-project/vllm-omni#1491) - [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep…

### vllm-omni-cicd
- Source: [PR #1488](vllm-project/vllm-omni#1488) - [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda

### vllm-omni-audio-tts
- Source: [PR #1482](vllm-project/vllm-omni#1482) - [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements
- Changes:
  - Bug fix: [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements

### vllm-omni-perf
- Source: [PR #1468](vllm-project/vllm-omni#1468) - [BugFix] process request.num_cached_tokens if it equals to the initial value
- Changes:
  - Bug fix: [BugFix] process request.num_cached_tokens if it equals to the initial value

### vllm-omni-audio-tts
- Source: [PR #1455](vllm-project/vllm-omni#1455) - [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration
- Changes:
  - Bug fix: [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration

### vllm-omni-cicd
- Source: [PR #1449](vllm-project/vllm-omni#1449) - [Test] Reduce Perf test case and fix modify stage config
- Changes:
  - Bug fix: [Test] Reduce Perf test case and fix modify stage config

### vllm-omni-cicd
- Source: [PR #1448](vllm-project/vllm-omni#1448) - [Bugfix] Race condition in MultiprocExecutor when concurent access to Scheduler
- Changes:
  - Bug fix: [Bugfix] Race condition in MultiprocExecutor when concurent access to Scheduler

### vllm-omni-cicd
- Source: [PR #1438](vllm-project/vllm-omni#1438) - [Qwen3TTS][Feat] Streaming output
- Changes:
  - New feature: [Qwen3TTS][Feat] Streaming output

### vllm-omni-api
- Source: [PR #1438](vllm-project/vllm-omni#1438) - [Qwen3TTS][Feat] Streaming output
- Changes:
  - New feature: [Qwen3TTS][Feat] Streaming output

### vllm-omni-contrib
- Source: [PR #1438](vllm-project/vllm-omni#1438) - [Qwen3TTS][Feat] Streaming output
- Changes:
  - New feature: [Qwen3TTS][Feat] Streaming output

### vllm-omni-audio-tts
- Source: [PR #1438](vllm-project/vllm-omni#1438) - [Qwen3TTS][Feat] Streaming output
- Changes:
  - New feature: [Qwen3TTS][Feat] Streaming output

### vllm-omni-cicd
- Source: [PR #1435](vllm-project/vllm-omni#1435) - [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning

### vllm-omni-video-gen
- Source: [PR #1433](vllm-project/vllm-omni#1433) - [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video

### vllm-omni-audio-tts
- Source: [PR #1433](vllm-project/vllm-omni#1433) - [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 7, 2026
### vllm-omni-api
- Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"
- Changes:
  - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"

### vllm-omni-contrib
- Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"
- Changes:
  - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"

### vllm-omni-api
- Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]:  Add vae-patch-parallel CLI argument in online serving
- Changes:
  - New feature: [Feature]:  Add vae-patch-parallel CLI argument in online serving

### vllm-omni-contrib
- Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]:  Add vae-patch-parallel CLI argument in online serving
- Changes:
  - New feature: [Feature]:  Add vae-patch-parallel CLI argument in online serving

### vllm-omni-contrib
- Source: [PR #1693](vllm-project/vllm-omni#1693) - [skip CI][Docs] Add TTS model developer guide
- Changes:
  - New feature: [skip CI][Docs] Add TTS model developer guide

### vllm-omni-audio-tts
- Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1
- Changes:
  - Bug fix: [MiMo-Audio] Bugfix tp lg than 1

### vllm-omni-distributed
- Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1
- Changes:
  - Bug fix: [MiMo-Audio] Bugfix tp lg than 1

### vllm-omni-perf
- Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1
- Changes:
  - Bug fix: [MiMo-Audio] Bugfix tp lg than 1

### vllm-omni-perf
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech

### vllm-omni-distributed
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech

### vllm-omni-api
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Additions:
  - `/v1/audio/speech`

### vllm-omni-quantization
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech

### vllm-omni-cicd
- Source: [PR #1683](vllm-project/vllm-omni#1683) - [CI] Remove high concurrency tests before issue #1374 fixed.
- Changes:
  - Bug fix: [CI] Remove high concurrency tests before issue #1374 fixed.

### vllm-omni-audio-tts
- Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS
- Changes:
  - New feature: Add non-async chunk support for Qwen3-TTS

### vllm-omni-cicd
- Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS
- Changes:
  - New feature: Add non-async chunk support for Qwen3-TTS

### vllm-omni-cicd
- Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type

### vllm-omni-perf
- Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type

### vllm-omni-serving
- Source: [PR #1675](vllm-project/vllm-omni#1675) - [Misc] remove logits_processor_pattern this field, because vllm have …

### vllm-omni-cicd
- Source: [PR #1666](vllm-project/vllm-omni#1666) - [Cleanup] Move cosyvoice3 tests to model subdirectory

### vllm-omni-audio-tts
- Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder
- Changes:
  - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder

### vllm-omni-cicd
- Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder
- Changes:
  - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder

### vllm-omni-distributed
- Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk

### vllm-omni-contrib
- Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk

### vllm-omni-quantization
- Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models
- Changes:
  - New feature: [UX] Add progress bar for diffusion models

### vllm-omni-perf
- Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models
- Changes:
  - New feature: [UX] Add progress bar for diffusion models

### vllm-omni-distributed
- Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project

### vllm-omni-quantization
- Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project

### vllm-omni-perf
- Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project

### vllm-omni-contrib
- Source: [PR #1649](vllm-project/vllm-omni#1649) - [Misc] update wechat

### vllm-omni-perf
- Source: [PR #1642](vllm-project/vllm-omni#1642) - [chore] add _repeated_blocks for regional compilation support
- Changes:
  - New feature: [chore] add _repeated_blocks for regional compilation support

### vllm-omni-api
- Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes
- Changes:
  - New feature: [Bugfix] Add TTS request validation to prevent engine crashes

### vllm-omni-cicd
- Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes
- Changes:
  - New feature: [Bugfix] Add TTS request validation to prevent engine crashes

### vllm-omni-image-gen
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Additions:
  - text-to-image
  - Text-to-Image
  - Flux

### vllm-omni-quantization
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Additions:
  - FP8 support or improvements

### vllm-omni-contrib
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer

### vllm-omni-perf
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer

### vllm-omni-contrib
- Source: [PR #1631](vllm-project/vllm-omni#1631) - [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup
- Changes:
  - Bug fix: [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup

### vllm-omni-cicd
- Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases

### vllm-omni-perf
- Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases

### vllm-omni-perf
- Source: [PR #1619](vllm-project/vllm-omni#1619) - [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context
- Changes:
  - Bug fix: [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context

### vllm-omni-perf
- Source: [PR #1617](vllm-project/vllm-omni#1617) - [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph
- Changes:
  - Performance improvement: [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph

### vllm-omni-contrib
- Source: [PR #1615](vllm-project/vllm-omni#1615) - [Doc] Fix links in the configuration doc
- Changes:
  - Bug fix: [Doc] Fix links in the configuration doc

### vllm-omni-audio-tts
- Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor
- Changes:
  - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor

### vllm-omni-perf
- Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor
- Changes:
  - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor

### vllm-omni-image-gen
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Additions:
  - GLM-Image

### vllm-omni-api
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation

### vllm-omni-perf
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation

### vllm-omni-contrib
- Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios from ByteDance

### vllm-omni-perf
- Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios from ByteDance

### vllm-omni-serving
- Source: [PR #1602](vllm-project/vllm-omni#1602) - [Bugfix] fix kernel error for qwen3-omni
- Changes:
  - Bug fix: [Bugfix] fix kernel error for qwen3-omni

### vllm-omni-distributed
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-image-gen
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Additions:
  - HunyuanImage3
  - HunyuanImage3Pipeline
  - HunyuanImage-3

### vllm-omni-quantization
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-perf
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-audio-tts
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-api
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-cicd
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-contrib
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-api
- Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor

### vllm-omni-serving
- Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor

### vllm-omni-distributed
- Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode
- Changes:
  - New feature: [Feature][Bagel] Add CFG parallel mode

### vllm-omni-cicd
- Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode
- Changes:
  - New feature: [Feature][Bagel] Add CFG parallel mode

### vllm-omni-perf
- Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode
- Changes:
  - New feature: [Feature][Bagel] Add CFG parallel mode

### vllm-omni-contrib
- Source: [PR #1576](vllm-project/vllm-omni#1576) - 0.16.0 release

### vllm-omni-audio-tts
- Source: [PR #1570](vllm-project/vllm-omni#1570) - [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio
- Changes:
  - Bug fix: [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio

### vllm-omni-api
- Source: [PR #1566](vllm-project/vllm-omni#1566) - [Bugfix] Import InputPreprocessor into Renderer
- Changes:
  - Bug fix: [Bugfix] Import InputPreprocessor into Renderer

### vllm-omni-distributed
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-quantization
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-perf
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-image-gen
- Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln
- Changes:
  - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln
- Additions:
  - mindiesd
  - Qwen-Image-Edit-2509

### vllm-omni-perf
- Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln
- Changes:
  - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln

### vllm-omni-serving
- Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving
- Changes:
  - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving

### vllm-omni-perf
- Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving
- Changes:
  - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[Bug]: When running Qwen3 TTS, encountered version compatibility issues between vllm-omni 0.16.0 and transformers.

4 participants