feat: Add HyperCLOVAX Vision Decoder diffusion model support to vllm-omni#613
KilJaeeun wants to merge 4 commits into vllm-project:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 92e3b18bc6
```python
# 5. Initialize transformer2 for autoguidance (if available)
transformer2_path = os.path.join(model, "transformer2")
if os.path.exists(transformer2_path):
    self.transformer2 = HyperCLOVAXVisionTransformer2DModel(od_config=od_config)
else:
```
Instantiate transformer2 when using remote checkpoints
Autoguidance is effectively disabled for Hugging Face model IDs because transformer2 is only constructed when a local path <model>/transformer2 exists. When od_config.model is a remote repo (the common case), os.path.exists returns false, transformer2 remains None, and its weights are never added to weights_sources, so guidance_scale>0 has no effect even if the repo includes the transformer2 folder. This silently removes the advertised autoguidance unless users pre-download the repo locally.
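A minimal sketch of one possible fix, assuming huggingface_hub's file_exists helper; the function name _transformer2_available is hypothetical:

```python
import os

from huggingface_hub import file_exists


def _transformer2_available(model: str) -> bool:
    """True if a transformer2 component exists locally or in a remote HF repo."""
    # Local checkout: the subfolder is on disk.
    if os.path.isdir(os.path.join(model, "transformer2")):
        return True
    # Remote repo id (the common case): probe the Hub for the component config.
    try:
        return file_exists(model, "transformer2/config.json")
    except Exception:
        # Offline, gated, or not a repo id; fall back to no autoguidance.
        return False
```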
force-pushed 92e3b18 to 88a6ebb
Is there any related issue for this model?
@hsliuustc0106
force-pushed a75d7b1 to 4137133
@KilJaeeun Hey, the HyperCLOVAX Vision Decoder with single-stream transformer and autoguidance looks well-structured. Have you been able to validate end-to-end image generation with this? What's needed to get it merge-ready?
```python
if os.path.exists(vae_config_path):
    with open(vae_config_path) as f:
        config = json.load(f)
    # Use scaling_factor from config, default to 8 for AutoencoderKL
```
I might be misunderstanding the config here, but scaling_factor (the latent scaling float, e.g., 0.13025) seems different from vae_scale_factor (the spatial downsampling ratio, typically 8). VaeImageProcessor expects the spatial one. Would hardcoding 8 or reading a different field make more sense?
Fixed — the pipeline now reads block_out_channels from vae/config.json to compute the spatial downsampling ratio (2^(len-1)), and passes that to VaeImageProcessor(vae_scale_factor=...). The latent scaling_factor is read separately as self.vae_scaling_factor for use in _decode_latents. Both values are now distinct and correctly sourced.
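For reference, a sketch of the two distinct values under diffusers' AutoencoderKL config conventions (variable names and model_path are illustrative, not the PR's exact code):

```python
import json
import os

from diffusers.image_processor import VaeImageProcessor

model_path = "/path/to/HyperCLOVAX-vision-decoder"  # local checkout (assumption)

with open(os.path.join(model_path, "vae", "config.json")) as f:
    vae_config = json.load(f)

# Spatial downsampling ratio: one 2x downsample per block transition,
# so 4 blocks -> 2**(4 - 1) = 8.
vae_scale_factor = 2 ** (len(vae_config["block_out_channels"]) - 1)

# Latent scaling float (e.g. 0.13025), applied when decoding latents.
vae_scaling_factor = vae_config.get("scaling_factor", 0.13025)

image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)
```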
```python
    vocab_size=65536,
    embedding_dim=1536,
    token_length=729,
)
```
These values (vocab_size=65536, embedding_dim=1536, token_length=729) are hardcoded — would it be possible to read them from a config file so the pipeline stays flexible if the model changes?
Fixed — VisionTokenEmbedder initialization now reads all three values from token_embedder/config.json via _load_component_config("token_embedder"). The hardcoded values remain as fallbacks only when the config file is absent.
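A sketch of the config-driven initialization, assuming the helper and key names from the reply; the fallback values match the old hardcoded ones, and model_path is illustrative:

```python
import json
import os


def _load_component_config(model_path: str, component: str) -> dict:
    """Read <model>/<component>/config.json if present, else return {}."""
    path = os.path.join(model_path, component, "config.json")
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


cfg = _load_component_config(model_path, "token_embedder")
embedder = VisionTokenEmbedder(        # the PR's embedder class
    vocab_size=cfg.get("vocab_size", 65536),
    embedding_dim=cfg.get("embedding_dim", 1536),
    token_length=cfg.get("token_length", 729),
)
```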
```python
from collections.abc import Iterable

import numpy as np
import torch
```
Nit: a few unused imports here (Any, Dict, List, Optional, Union). Easy cleanup.
Fixed — all unused imports (Any, Dict, List, Optional, Union) have been removed. Ruff F401 passes cleanly.
```python
    FLASH_ATTN_AVAILABLE = torch.cuda.is_available()
except ImportError:
    FLASH_ATTN_AVAILABLE = False
    sdpa_kernel = None
```
I noticed FLASH_ATTN_AVAILABLE checks torch.cuda.is_available() rather than actual flash attention support. On older CUDA devices (compute capability < 8.0), the sdpa_kernel(SDPBackend.FLASH_ATTENTION) call might fail. Would a more specific check be better?
Fixed — _flash_attn_available() now checks torch.cuda.get_device_capability() and requires major >= 8 (Ampere+), rather than just cuda.is_available().
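The described check, roughly (a sketch; Flash Attention via SDPA requires an Ampere-or-newer GPU):

```python
import torch


def _flash_attn_available() -> bool:
    """Flash Attention via SDPA needs CUDA and compute capability >= 8.0."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```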
```python
def timestep_embedding(
    t: torch.Tensor,
    dim: int,
    max_period: float = 10000,
```
Both branches of the if/else seem to do the same thing (F.scaled_dot_product_attention(q, k, v)). Maybe simplify by removing the branching and letting PyTorch pick the backend? Or add a real availability check if the intent is to prefer flash attention.
Good catch on the clarity — the two branches are semantically different: the if branch wraps the call in sdpa_kernel(SDPBackend.FLASH_ATTENTION) which explicitly requests the Flash Attention kernel, while the else branch lets PyTorch select any available backend (math, memory-efficient, etc.). Added clarifying comments to make this distinction explicit.
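A sketch of the two paths with the added comments, assuming the torch.nn.attention context manager (PyTorch 2.3+); FLASH_ATTN_AVAILABLE is the module-level flag from this file:

```python
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel


def attention(q, k, v):
    if FLASH_ATTN_AVAILABLE:
        # Explicitly restrict SDPA to the Flash Attention kernel; the call
        # fails if that kernel cannot serve these inputs.
        with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
            return F.scaled_dot_product_attention(q, k, v)
    # Unrestricted: PyTorch picks any available backend
    # (flash, memory-efficient, or math).
    return F.scaled_dot_product_attention(q, k, v)
```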
| """ | ||
| mod, _ = block.modulation(vec) | ||
| x_mod = (1 + mod.scale) * block.pre_norm(x) + mod.shift | ||
| qkv, mlp = torch.split( |
I'm trying to follow the USP flow — it looks like _parallelize_attention_blocks replaces each block's forward, but _usp_transformer_forward also calls _usp_single_block_forward directly. So if something calls block.forward, it might go through USP twice? Could be intentional, but wanted to flag it.
The parallelize_transformer call in __init__ replaces self.transformer in-place — forward then calls the already-parallelized module directly. There is no second wrapping; the call site in forward always goes through the (possibly parallelized) self.transformer reference set during init.
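In outline (names simplified to illustrate the single-wrap claim; parallelize_transformer is the PR's helper):

```python
class PipelineOutline:
    def __init__(self, transformer):
        # Blocks are wrapped exactly once, at construction time.
        self.transformer = parallelize_transformer(transformer)

    def forward(self, x, vec):
        # forward always dispatches through the already-parallelized
        # reference, so no block forward goes through USP twice.
        return self.transformer(x, vec)
```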
```python
block._original_forward = block.forward

# Create new forward that uses USP
def make_usp_block_forward(blk):
```
Is create_parallel_transformer used somewhere I'm not seeing? I couldn't find any callers in the codebase. If it's planned for later that's fine, just wanted to check.
Removed — create_parallel_transformer had no callers and has been deleted from transformer_usp.py.
| """ | ||
| embeddings = torch.from_numpy(np.load(npy_path)).float() | ||
| vocab_size, embedding_dim = embeddings.shape | ||
|
|
Small thing — token_length=729 is hardcoded in from_numpy. Could this be inferred from the embeddings shape instead?
Fixed — token_length is now read from token_embedder/config.json and stored as self.token_length; the from_numpy method uses self.token_length instead of the literal 729.
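Roughly, with token_length threaded through instead of the literal (a free-function sketch; the weight-copy step assumes a standard nn.Embedding backing table, which may differ from the PR's internals):

```python
import numpy as np
import torch


def embedder_from_numpy(npy_path: str, token_length: int) -> "VisionTokenEmbedder":
    """Build the embedder from a saved table; the table's shape supplies
    vocab_size and embedding_dim, and token_length comes from
    token_embedder/config.json rather than a literal 729."""
    embeddings = torch.from_numpy(np.load(npy_path)).float()
    vocab_size, embedding_dim = embeddings.shape
    embedder = VisionTokenEmbedder(
        vocab_size=vocab_size,
        embedding_dim=embedding_dim,
        token_length=token_length,
    )
    embedder.embedding.weight.data.copy_(embeddings)  # assumes nn.Embedding
    return embedder
```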
```python
height = req.height or 768
width = req.width or 768
num_steps = req.num_inference_steps or 50
guidance_scale = req.guidance_scale or 0.0
```
I could be wrong about this, but I think FlowMatchEulerDiscreteScheduler already returns timesteps as float sigmas in [0, 1], so dividing by num_train_timesteps again might give a very small number. Could you double-check what self.scheduler.timesteps returns in this context?
FlowMatchEulerDiscreteScheduler.timesteps already returns normalized sigma values in [0, 1], so there is no division by num_train_timesteps. The fix passes t.item() directly as a float sigma. Added an explicit comment clarifying this.
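For anyone verifying, a quick check with the diffusers scheduler shows what .timesteps and .sigmas each hold under this configuration:

```python
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler()
scheduler.set_timesteps(num_inference_steps=4)

# .sigmas are normalized to [0, 1]; comparing the two tensors shows
# whether .timesteps needs rescaling before being passed to the model.
print(scheduler.timesteps)
print(scheduler.sigmas)
```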
any progress?
force-pushed 485a3af to c1a2475
I haven’t been able to contribute recently, as I’ve been busy with model production work at Naver Cloud. For now, I’ll first upload the changes up to vLLM version 16.
force-pushed c1a2475 to 727614e
Thanks for the detailed review @lishunyang12! All feedback has been addressed in the latest push.

Re: the USP double-call question — as noted in the inline reply above, parallelize_transformer replaces self.transformer in-place during __init__, so each block is wrapped exactly once.
force-pushed 7d4c063 to f947e96
force-pushed f947e96 to e56d534
Thank you @lishunyang12 for the thorough review! All 10 comments have now been addressed; a full summary is in the resolved threads above.

Could you please take another look when you get a chance? @lishunyang12
force-pushed 2f51caf to 52551db
Can you help fix the conflicts, please? I will review it again right after.
…upport

Stacks on top of vllm-project#869 (HyperCLOVAX audio decoder).

- Add HCXOmniForCausalLM thinker model (LLM stage, extends HCXVisionV2)
- Add HyperCLOVAXVisionPipeline diffusion model (TA-Tok decoder, 27×27 image tokens)
- Add hcx_omni.yaml 3-stage pipeline config (thinker TP=4 + vision/audio decoders)
- Add thinker2vision_decoder and thinker2audio_decoder stage input processors
- Add fan-out async pipeline topology (stage 0 → stage 1 AND stage 0 → stage 2)
- Add _stage0_is_llm guard in serving_chat to preserve HCX multimodal inputs
- Fix vLLM 0.18.0 compatibility (AttentionBackendEnum, _RUNNER_TASKS, TaskOption)
- Add E2E tests, unit tests, client demo, and benchmark scripts

Co-Authored-By: Hyunjoon Cho <with1015@github.com>
Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
The if/else branches in attention() are semantically distinct:

- if: sdpa_kernel(FLASH_ATTENTION) forces the Flash Attention kernel (Ampere+)
- else: lets PyTorch select any available SDPA backend

Added clarifying comments to make this explicit, addressing reviewer feedback.

Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
- Fan-out routing in omni.py: replace linear stage_id+1 with connector-key-based routing so thinker can send to stage-1 (vision) and stage-2 (audio) independently
- Bridge OmniTokensPrompt.additional_information to OmniDiffusionRequest.extra in omni_diffusion.py so vision_tokens and audio_tokens reach decoder pipelines
- Guard empty batch in omni_stage.py to avoid "empty request list" error when the thinker produces no tokens for a modality
- Add renderer sync and output_type param in omni_llm.py for base-class compatibility
- Add optional ExtractHiddenStatesProposer import and routed_experts_initialized attribute in ar/generation/model runners
- Fix clear_metadata kwarg (renamed from defer_finalize) in model runners
- Add _model. prefix in load_weights return set for strict-loading check
- Add embed_multimodal delegation in hcx_omni_thinker
- Add dummy tokens for HyperCLOVAX pipelines in diffusion_engine
Cover the three bugs fixed for HyperCLOVAX-SEED-Omni-8B e2e inference:

1. Fan-out routing (omni.py): verify that downstream_stage_ids is computed from connector keys so stage-0 fans out to stage-1 AND stage-2 independently, leaf stages return no downstream IDs, and linear topologies still work (see the routing sketch after these notes).
2. additional_information → extra bridge (omni_diffusion.py): verify that vision_tokens / audio_tokens stored in OmniTokensPrompt.additional_information are propagated to OmniDiffusionRequest.extra, first-occurrence wins for batched prompts, and non-dict values are skipped gracefully.
3. Empty batch guard: verify that text-only thinker output results in any_forwarded=False (no diffusion stages invoked), and that per-modality skipping works correctly when only one of vision/audio tokens is present.
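A hedged sketch of the connector-key fan-out routing referenced above; the connector mapping shape ((src, dst) stage pairs keyed by connector name) is an assumption, not the actual vllm-omni structure:

```python
def downstream_stage_ids(stage_id: int, connectors: dict) -> list:
    """Fan-out targets come from connector keys mapped to (src, dst) stage
    pairs, instead of the old linear stage_id + 1 assumption."""
    return sorted(dst for (src, dst) in connectors.values() if src == stage_id)


# Stage 0 (thinker) fans out to both decoders; stages 1 and 2 are leaves.
connectors = {
    "thinker2vision_decoder": (0, 1),
    "thinker2audio_decoder": (0, 2),
}
assert downstream_stage_ids(0, connectors) == [1, 2]
assert downstream_stage_ids(1, connectors) == []
```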
force-pushed f025500 to 09ff661
Summary
Add HyperCLOVAX-SEED-Omni-8B vision decoder and full 3-stage pipeline support.

Changes
New Files
- vllm_omni/diffusion/models/hyperclovax_vision/pipeline_hyperclovax_vision.py — TA-Tok 729 discrete codes → 512×512 image via single-stream diffusion transformer
- hyperclovax_vision_transformer.py — HyperCLOVAXVisionTransformer2DModel
- layers.py — Flash-attention compatible attention layers
- vision_token_embedder.py — Discrete token → continuous embedding
- transformer_usp.py — USP (Unified Sequence Parallel) transformer
- vllm_omni/model_executor/models/hcx_omni/hcx_omni.py — HCXOmniForCausalLM (thinker LLM stage, extends HCXVisionV2)
- hcx_omni_thinker.py — Vision/audio encoder integration

Other

- vllm_omni/model_executor/stage_configs/hcx_omni.yaml — 3-stage pipeline config (thinker TP=4 + vision/audio decoders)
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py — Stage I/O processors
- benchmarks/hcx-omni/ — Latency/throughput benchmark scripts
- examples/online_serving/hcx_omni/ — Client demo and server launch scripts
- tests/e2e/, tests/unit/ — E2E and unit tests

Modified Files

- vllm_omni/diffusion/registry.py — Register HyperCLOVAXVisionPipeline (audio entry is in [Model]: Add HyperCLOVAX Audio Decoder support to vllm-omni #869)

PR dependency
Co-Authored-By: Hyunjoon Cho with1015@github.com
Test Plan
- pytest tests/unit/model_executor/test_hcx_omni_processing.py
- tests/e2e/online_serving/test_hcx_omni.py