Feat/hyperclovax omni ad#2
Merged
with1015 merged 305 commits into model/hyperclovax-audio on Apr 6, 2026
Conversation
vllm-project#797) Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: dengyunyang <584797741@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: samithuang <285365963@qq.com> Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: lishunyang <lishunyang12@163.com> Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
…ct#927) Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
…oject#1036) Signed-off-by: Kyle Huang <yellowsea@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
…project#1043) Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
…-project#983) Signed-off-by: mxuax <mxuax@connect.ust.hk> Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com> Co-authored-by: Rein Yang <ruiruyang2@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com> Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
…llm-project#980) Signed-off-by: Lin, Fanli <fanli.lin@intel.com> Signed-off-by: Samit <285365963@qq.com> Co-authored-by: Samit <285365963@qq.com>
Signed-off-by: Fanli Lin <fanli.lin@intel.com> Signed-off-by: Fanli Lin <fanli0116@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ct#1075) Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…omni (vllm-project#1025) Signed-off-by: dengyunyang <584797741@qq.com>
…e configuration (vllm-project#987) Signed-off-by: Ding Zuhao <e1583181@u.nus.edu> Signed-off-by: jzz <e1583181@u.nus.edu>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
…llm-project#1554) Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…0 resource usage. (vllm-project#1543) Signed-off-by: yenuo26 <410167048@qq.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Model files:
- vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer
- vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline (HyperCLOVAXAudioPipeline) using the Unit-BigVGAN codec
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py: thinker2vision_decoder and thinker2audio_decoder: extract discrete tokens from LLM output; truncate/pad vision codes to 729 (27x27) for the decoder

Registry:
- vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and HyperCLOVAXAudioPipeline with post-process functions

Stage config:
- vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config. Stage 0: LLM thinker (TP=4, GPUs 0-3); Stage 1: vision decoder (GPU 4); Stage 2: audio decoder (GPU 5)

Bug fixes for HyperCLOVAX compatibility:
- diffusion/request.py: add an extra dict field to OmniDiffusionRequest so vision_tokens/audio_tokens from stage input processors reach the pipeline
- entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information into OmniDiffusionRequest.extra before creating the request
- entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where thinker2vision_decoder/thinker2audio_decoder return [])
- entrypoints/async_omni.py: handle the skipped sentinel in _process_single_result so text-only requests complete without crashing on Stage 1/2
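The truncate/pad step for vision codes can be sketched as follows. This is a minimal illustration, not the actual stage input processor; the helper name and pad id are assumptions, with only the target length of 729 (27x27) taken from the commit.

```python
VISION_GRID_TOKENS = 27 * 27  # 729 codes expected by the vision decoder

def fit_vision_codes(codes, pad_id=0, target=VISION_GRID_TOKENS):
    """Truncate or right-pad a list of discrete vision codes to the target length.

    `pad_id=0` is a hypothetical choice; the real processor may use a
    model-specific padding code.
    """
    if len(codes) >= target:
        return codes[:target]
    return codes + [pad_id] * (target - len(codes))
```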
- hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50 (matches OmniServe production defaults; 3.5 caused over-amplified autoguidance and shrunken/degraded output images)
- omni_stage.py: skip empty engine inputs for text-only requests
- async_omni_diffusion.py: extract OmniTokensPrompt.additional_information into OmniDiffusionRequest.extra (audio_tokens/vision_tokens)
- registry.py: fix HCX Omni diffusion model registration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml
- Assign GPU 5 to the audio decoder (Unit-BigVGAN / NCCosybigvganDecoder)
- Add runtime edge 0->2 (thinker -> audio decoder)
- Implement post-generation PCM chunk streaming for audio output (4800 samples / 200 ms per SSE event at 24 kHz, int16 base64-encoded)

Refs: github.com/vllm-project/pull/869 (already incorporated)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
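The PCM chunking scheme above (4800 samples = 200 ms at 24 kHz, int16, base64 per SSE event) can be sketched with a small stdlib-only helper. The function name is hypothetical; only the chunk geometry and encoding come from the commit.

```python
import base64
import struct

SAMPLE_RATE = 24_000
CHUNK_SAMPLES = 4_800  # 200 ms at 24 kHz

def pcm_chunks_b64(samples):
    """Yield base64-encoded little-endian int16 PCM chunks of up to 200 ms each.

    `samples` is a sequence of float PCM values in [-1.0, 1.0]; each value is
    clamped and scaled to int16 before packing.
    """
    samples = list(samples)
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        ints = [max(-32768, min(32767, int(s * 32767))) for s in chunk]
        raw = struct.pack(f"<{len(ints)}h", *ints)
        yield base64.b64encode(raw).decode("ascii")
```

Each yielded string would become the payload of one SSE event; the last chunk may be shorter than 200 ms.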
- config/model.py: try/except fallback for the AttentionBackendEnum import (vllm.v1.attention.backends.registry is absent in older vllm builds)
- pipeline_hyperclovax_audio.py: return the actual named_parameters() from load_weights() when using a MAR checkpoint so the diffusers_loader strict check passes (weights are loaded eagerly in __init__ via MAR extraction)
- qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs for check_interleaved_audio_video and merge_interleaved_embeddings, which are absent in older vllm qwen2_5_omni_thinker; these symbols are only exercised by Qwen models, not HyperCLOVAX

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
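The import-fallback pattern described for config/model.py looks roughly like this. The `None` sentinel is an assumption about how callers degrade; the commit only states that a try/except fallback exists.

```python
# Newer vllm exposes AttentionBackendEnum under vllm.v1.attention.backends.registry;
# older builds lack that module entirely, so the import must be guarded.
try:
    from vllm.v1.attention.backends.registry import AttentionBackendEnum
except ImportError:
    # Older vllm build: leave a sentinel so callers can branch on availability.
    AttentionBackendEnum = None
```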
- Add runtime edge from:1 to:2 (required for Stage-2 connector init; without it, AsyncOrchestrator cannot route to the audio decoder at runtime)
- Change model_subdir to model for Stage-2 engine_args to match the total-poc working reference config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output like LLM pipelines. Fix four locations:
1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output when final_res.outputs is empty (diffusion path).
2. Streaming audio path: apply the same fix for _final_res.outputs[0].
3. Both loops (for output in final_res.outputs): fall back to a single synthetic choice at index 0 when the outputs list is empty.
4. Handle bytes audio output from the HyperCLOVAXAudioPipeline post-process (it returns WAV bytes, not tensors like Qwen3-Omni).

Also fixes an audio-input (A2T) regression: skip diffusion prompt extraction when mm_data has audio content (added in a previous session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
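The fallback logic common to those locations can be sketched as a single helper. This is an illustrative reduction, not the serving code; the function name is hypothetical and the attribute names follow the commit's description.

```python
def resolve_audio_output(final_res):
    """Return the multimodal payload for a result, covering both shapes.

    LLM pipelines attach audio to outputs[0].multimodal_output; diffusion
    pipelines (e.g. the audio decoder) leave `outputs` empty and attach the
    payload to the request-level `multimodal_output` instead.
    """
    if getattr(final_res, "outputs", None):
        return final_res.outputs[0].multimodal_output
    return final_res.multimodal_output
```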
HyperCLOVAXAudioPipeline returns WAV bytes including the 44-byte header. The previous byte-offset splitting included the header in the first chunk, corrupting it. Fix: parse with soundfile to get float32 PCM, then convert to int16 chunks uniformly regardless of source type (bytes or tensor).

Verified: 136 audio chunks x 200 ms = 27.04 s of audio streamed correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
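The header-stripping fix can be illustrated with the stdlib. The commit uses soundfile; this sketch substitutes the standard `wave` module and assumes 16-bit mono WAV input, normalizing both source types to raw int16 PCM before chunking.

```python
import io
import struct
import wave

def to_int16_pcm(audio):
    """Normalize decoder output to raw little-endian int16 PCM.

    Accepts either WAV bytes (header included) or a sequence of float samples
    in [-1.0, 1.0]. Parsing via the wave module strips the 44-byte header
    instead of splitting the raw bytes at arbitrary offsets.
    """
    if isinstance(audio, (bytes, bytearray)):
        with wave.open(io.BytesIO(bytes(audio)), "rb") as wf:
            return wf.readframes(wf.getnframes())  # frames only, no header
    ints = [max(-32768, min(32767, int(s * 32767))) for s in audio]
    return struct.pack(f"<{len(ints)}h", *ints)
```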
- serving_chat.py: extract the last input_audio base64 from the request messages and inject it as ref_audio_b64 into the engine_prompt dict
- thinker2audio_decoder: read ref_audio_b64 from the prompt and pass it as ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline)
- hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot), which uses an ECAPA-TDNN speaker encoder instead of a finetuned ID lookup

Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis matching the voice characteristics of the original speaker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml with GPU 5, gpu_memory_utilization 0.4, and edge 0->2 from the thinker
- Fix thinker2audio_decoder: correct the audio token range (128606-135167), remap to [0, 6561) for BigVGAN input, and handle the empty-token case gracefully
- Fix the pipeline_hyperclovax_audio.py post_process_func signature and incorporate the PR#869 bug-fix patches for stable audio generation
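The token remap described above can be sketched as follows, treating 128606-135167 as a half-open range so that exactly 6561 codes map onto [0, 6561). The function name is hypothetical; the range boundaries come from the commit.

```python
AUDIO_TOKEN_START = 128_606  # first audio code id in the thinker vocabulary
AUDIO_TOKEN_END = 135_167    # exclusive end: 6561 distinct audio codes

def remap_audio_tokens(token_ids):
    """Keep only audio-range tokens and shift them into BigVGAN's [0, 6561) space.

    Returns an empty list when the thinker emitted no audio tokens, letting the
    caller skip Stage 2 gracefully instead of crashing.
    """
    return [t - AUDIO_TOKEN_START for t in token_ids
            if AUDIO_TOKEN_START <= t < AUDIO_TOKEN_END]
```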
…lization
- hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot, ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id). The zero-shot decoder required ref_audio (mel spectrogram), which is unavailable for text-only requests and incompatible with the finetuned decoder path.
- pipeline_hyperclovax_audio.py: guard ref_audio processing with 'not self.bigvgan.finetune'; the finetuned decoder has no ECAPA-TDNN encoder, so passing ref_audio bytes would crash with 'expected 100 channels'.
- omni_stage.py: add the HuggingFace modules cache (~/.cache/huggingface/modules) to sys.path before queue.get_nowait() in try_collect(). Stage 0 pickles outputs containing custom classes from transformers_modules (trust_remote_code), but the API server process doesn't have this path, causing deserialization failures that silently drop Stage-0 outputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
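The sys.path fix for the deserialization failures can be sketched as a small idempotent helper. The function name is hypothetical; the cache path and the rationale come from the commit.

```python
import os
import sys

def ensure_hf_modules_on_path():
    """Make trust_remote_code classes unpicklable in this process.

    Stage 0 pickles outputs that reference classes under transformers_modules,
    which live in the HuggingFace modules cache. A process that never imported
    the model (e.g. the API server) must add that directory to sys.path before
    unpickling, or deserialization fails and the outputs are silently dropped.
    """
    hf_modules = os.path.expanduser("~/.cache/huggingface/modules")
    if hf_modules not in sys.path:
        sys.path.insert(0, hf_modules)
```

Calling it just before `queue.get_nowait()` in `try_collect()` ensures the path is present in whichever process performs the unpickling.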
…quests
- hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN) for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed integer speaker_id and lost the input speaker's voice.
- pipeline_hyperclovax_audio.py: add a zero-mel fallback branch for the finetune=False + ref_audio=None case. When a text-only request arrives (no input audio, hence no ref_audio), ECAPA-TDNN receives a zero mel tensor [1, num_mels, 64] instead of crashing with 'expected 100 channels'. S2S requests always have ref_audio, so the zero-shot cloning path is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
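The zero-mel fallback can be sketched in pure Python, with nested lists standing in for a torch tensor. The helper name and the default of 100 mel channels are assumptions (the latter inferred from the 'expected 100 channels' error in the commit); the [1, num_mels, 64] shape is from the commit.

```python
NUM_MELS = 100  # assumed: ECAPA-TDNN's expected mel channel count

def ref_mel_or_zero(ref_audio_mel, num_mels=NUM_MELS, frames=64):
    """Return the reference mel features, or a zero mel of shape
    [1, num_mels, frames] when a text-only request carries no input audio.

    This keeps ECAPA-TDNN from crashing on the finetune=False path while
    leaving the zero-shot cloning path (ref_audio present) unchanged.
    """
    if ref_audio_mel is not None:
        return ref_audio_mel
    return [[[0.0] * frames for _ in range(num_mels)]]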
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>