feat: Add HyperCLOVAX Vision Decoder diffusion model support to vllm-omni#613
KilJaeeun wants to merge 4 commits into vllm-project:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 92e3b18bc6
```python
# 5. Initialize transformer2 for autoguidance (if available)
transformer2_path = os.path.join(model, "transformer2")
if os.path.exists(transformer2_path):
    self.transformer2 = HyperCLOVAXVisionTransformer2DModel(od_config=od_config)
else:
```
Instantiate transformer2 when using remote checkpoints
Autoguidance is effectively disabled for Hugging Face model IDs because transformer2 is only constructed when a local path <model>/transformer2 exists. When od_config.model is a remote repo (the common case), os.path.exists returns false, transformer2 remains None, and its weights are never added to weights_sources, so guidance_scale>0 has no effect even if the repo includes the transformer2 folder. This silently removes the advertised autoguidance unless users pre-download the repo locally.
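A minimal sketch of one possible fix, assuming huggingface_hub's file_exists helper; the function name _transformer2_available is hypothetical:

```python
import os

from huggingface_hub import file_exists


def _transformer2_available(model: str) -> bool:
    """True if a transformer2 component exists locally or in a remote HF repo."""
    # Local checkout: the subfolder is on disk.
    if os.path.isdir(os.path.join(model, "transformer2")):
        return True
    # Remote repo id (the common case): probe the Hub for the component config.
    try:
        return file_exists(model, "transformer2/config.json")
    except Exception:
        # Offline, gated, or not a repo id; fall back to no autoguidance.
        return False
```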
force-pushed 92e3b18 to 88a6ebb
Is there any related issue for this model?
@hsliuustc0106
force-pushed a75d7b1 to 4137133
@KilJaeeun Hey, the HyperCLOVAX Vision Decoder with single-stream transformer and autoguidance looks well-structured. Have you been able to validate end-to-end image generation with this? What's needed to get it merge-ready?
```python
if os.path.exists(vae_config_path):
    with open(vae_config_path) as f:
        config = json.load(f)
    # Use scaling_factor from config, default to 8 for AutoencoderKL
```
I might be misunderstanding the config here, but scaling_factor (the latent scaling float, e.g., 0.13025) seems different from vae_scale_factor (the spatial downsampling ratio, typically 8). VaeImageProcessor expects the spatial one. Would hardcoding 8 or reading a different field make more sense?
Fixed — the pipeline now reads block_out_channels from vae/config.json to compute the spatial downsampling ratio (2^(len-1)), and passes that to VaeImageProcessor(vae_scale_factor=...). The latent scaling_factor is read separately as self.vae_scaling_factor for use in _decode_latents. Both values are now distinct and correctly sourced.
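For reference, a sketch of the two distinct values under diffusers' AutoencoderKL config conventions (variable names and model_path are illustrative, not the PR's exact code):

```python
import json
import os

from diffusers.image_processor import VaeImageProcessor

model_path = "/path/to/HyperCLOVAX-vision-decoder"  # local checkout (assumption)

with open(os.path.join(model_path, "vae", "config.json")) as f:
    vae_config = json.load(f)

# Spatial downsampling ratio: one 2x downsample per block transition,
# so 4 blocks -> 2**(4 - 1) = 8.
vae_scale_factor = 2 ** (len(vae_config["block_out_channels"]) - 1)

# Latent scaling float (e.g. 0.13025), applied when decoding latents.
vae_scaling_factor = vae_config.get("scaling_factor", 0.13025)

image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)
```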
```python
    vocab_size=65536,
    embedding_dim=1536,
    token_length=729,
)
```
These values (vocab_size=65536, embedding_dim=1536, token_length=729) are hardcoded — would it be possible to read them from a config file so the pipeline stays flexible if the model changes?
Fixed — VisionTokenEmbedder initialization now reads all three values from token_embedder/config.json via _load_component_config("token_embedder"). The hardcoded values remain as fallbacks only when the config file is absent.
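A sketch of the config-driven initialization, assuming the helper and key names from the reply; the fallback values match the old hardcoded ones, and model_path is illustrative:

```python
import json
import os


def _load_component_config(model_path: str, component: str) -> dict:
    """Read <model>/<component>/config.json if present, else return {}."""
    path = os.path.join(model_path, component, "config.json")
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


cfg = _load_component_config(model_path, "token_embedder")
embedder = VisionTokenEmbedder(        # the PR's embedder class
    vocab_size=cfg.get("vocab_size", 65536),
    embedding_dim=cfg.get("embedding_dim", 1536),
    token_length=cfg.get("token_length", 729),
)
```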
```python
from collections.abc import Iterable

import numpy as np
import torch
```
Nit: a few unused imports here (Any, Dict, List, Optional, Union). Easy cleanup.
Fixed — all unused imports (Any, Dict, List, Optional, Union) have been removed. Ruff F401 passes cleanly.
```python
    FLASH_ATTN_AVAILABLE = torch.cuda.is_available()
except ImportError:
    FLASH_ATTN_AVAILABLE = False
    sdpa_kernel = None
```
I noticed FLASH_ATTN_AVAILABLE checks torch.cuda.is_available() rather than actual flash attention support. On older CUDA devices (compute capability < 8.0), the sdpa_kernel(SDPBackend.FLASH_ATTENTION) call might fail. Would a more specific check be better?
Fixed — _flash_attn_available() now checks torch.cuda.get_device_capability() and requires major >= 8 (Ampere+), rather than just cuda.is_available().
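The described check, roughly (a sketch; Flash Attention via SDPA requires an Ampere-or-newer GPU):

```python
import torch


def _flash_attn_available() -> bool:
    """Flash Attention via SDPA needs CUDA and compute capability >= 8.0."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```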
```python
def timestep_embedding(
    t: torch.Tensor,
    dim: int,
    max_period: float = 10000,
```
Both branches of the if/else seem to do the same thing (F.scaled_dot_product_attention(q, k, v)). Maybe simplify by removing the branching and letting PyTorch pick the backend? Or add a real availability check if the intent is to prefer flash attention.
Good catch on the clarity — the two branches are semantically different: the if branch wraps the call in sdpa_kernel(SDPBackend.FLASH_ATTENTION) which explicitly requests the Flash Attention kernel, while the else branch lets PyTorch select any available backend (math, memory-efficient, etc.). Added clarifying comments to make this distinction explicit.
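A sketch of the two paths with the added comments, assuming the torch.nn.attention context manager (PyTorch 2.3+); FLASH_ATTN_AVAILABLE is the module-level flag from this file:

```python
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel


def attention(q, k, v):
    if FLASH_ATTN_AVAILABLE:
        # Explicitly restrict SDPA to the Flash Attention kernel; the call
        # fails if that kernel cannot serve these inputs.
        with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
            return F.scaled_dot_product_attention(q, k, v)
    # Unrestricted: PyTorch picks any available backend
    # (flash, memory-efficient, or math).
    return F.scaled_dot_product_attention(q, k, v)
```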
| """ | ||
| mod, _ = block.modulation(vec) | ||
| x_mod = (1 + mod.scale) * block.pre_norm(x) + mod.shift | ||
| qkv, mlp = torch.split( |
I'm trying to follow the USP flow — it looks like _parallelize_attention_blocks replaces each block's forward, but _usp_transformer_forward also calls _usp_single_block_forward directly. So if something calls block.forward, it might go through USP twice? Could be intentional, but wanted to flag it.
The parallelize_transformer call in __init__ replaces self.transformer in-place — forward then calls the already-parallelized module directly. There is no second wrapping; the call site in forward always goes through the (possibly parallelized) self.transformer reference set during init.
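In outline (names simplified to illustrate the single-wrap claim; parallelize_transformer is the PR's helper):

```python
class PipelineOutline:
    def __init__(self, transformer):
        # Blocks are wrapped exactly once, at construction time.
        self.transformer = parallelize_transformer(transformer)

    def forward(self, x, vec):
        # forward always dispatches through the already-parallelized
        # reference, so no block forward goes through USP twice.
        return self.transformer(x, vec)
```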
```python
block._original_forward = block.forward

# Create new forward that uses USP
def make_usp_block_forward(blk):
```
Is create_parallel_transformer used somewhere I'm not seeing? I couldn't find any callers in the codebase. If it's planned for later that's fine, just wanted to check.
Removed — create_parallel_transformer had no callers and has been deleted from transformer_usp.py.
| """ | ||
| embeddings = torch.from_numpy(np.load(npy_path)).float() | ||
| vocab_size, embedding_dim = embeddings.shape | ||
|
|
Small thing — token_length=729 is hardcoded in from_numpy. Could this be inferred from the embeddings shape instead?
Fixed — token_length is now read from token_embedder/config.json and stored as self.token_length; the from_numpy method uses self.token_length instead of the literal 729.
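Roughly, with token_length threaded through instead of the literal (a free-function sketch; the weight-copy step assumes a standard nn.Embedding backing table, which may differ from the PR's internals):

```python
import numpy as np
import torch


def embedder_from_numpy(npy_path: str, token_length: int) -> "VisionTokenEmbedder":
    """Build the embedder from a saved table; the table's shape supplies
    vocab_size and embedding_dim, and token_length comes from
    token_embedder/config.json rather than a literal 729."""
    embeddings = torch.from_numpy(np.load(npy_path)).float()
    vocab_size, embedding_dim = embeddings.shape
    embedder = VisionTokenEmbedder(
        vocab_size=vocab_size,
        embedding_dim=embedding_dim,
        token_length=token_length,
    )
    embedder.embedding.weight.data.copy_(embeddings)  # assumes nn.Embedding
    return embedder
```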
```python
height = req.height or 768
width = req.width or 768
num_steps = req.num_inference_steps or 50
guidance_scale = req.guidance_scale or 0.0
```
I could be wrong about this, but I think FlowMatchEulerDiscreteScheduler already returns timesteps as float sigmas in [0, 1], so dividing by num_train_timesteps again might give a very small number. Could you double-check what self.scheduler.timesteps returns in this context?
FlowMatchEulerDiscreteScheduler.timesteps already returns normalized sigma values in [0, 1], so there is no division by num_train_timesteps. The fix passes t.item() directly as a float sigma. Added an explicit comment clarifying this.
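For anyone verifying, a quick check with the diffusers scheduler shows what .timesteps and .sigmas each hold under this configuration:

```python
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler()
scheduler.set_timesteps(num_inference_steps=4)

# .sigmas are normalized to [0, 1]; comparing the two tensors shows
# whether .timesteps needs rescaling before being passed to the model.
print(scheduler.timesteps)
print(scheduler.sigmas)
```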
any progress?
force-pushed 485a3af to c1a2475
I haven’t been able to contribute recently, as I’ve been busy with model production work at Naver Cloud. For now, I’ll first upload the changes up to vLLM version 16.
force-pushed c1a2475 to 727614e
Thanks for the detailed review @lishunyang12! All feedback has been addressed in the latest push.

Re: the USP double-call question — as noted in the inline reply above, parallelize_transformer replaces self.transformer in-place during __init__, so each block is wrapped exactly once.
force-pushed 7d4c063 to f947e96
force-pushed f947e96 to e56d534
Thank you @lishunyang12 for the thorough review! All 10 comments have now been addressed; a full summary is in the resolved threads above.

Could you please take another look when you get a chance? @lishunyang12
force-pushed 2f51caf to 52551db
Can you help fix the conflicts, please? I will review it again right after.
…upport

Stacks on top of vllm-project#869 (HyperCLOVAX audio decoder).

- Add HCXOmniForCausalLM thinker model (LLM stage, extends HCXVisionV2)
- Add HyperCLOVAXVisionPipeline diffusion model (TA-Tok decoder, 27×27 image tokens)
- Add hcx_omni.yaml 3-stage pipeline config (thinker TP=4 + vision/audio decoders)
- Add thinker2vision_decoder and thinker2audio_decoder stage input processors
- Add fan-out async pipeline topology (stage 0 → stage 1 AND stage 0 → stage 2)
- Add _stage0_is_llm guard in serving_chat to preserve HCX multimodal inputs
- Fix vLLM 0.18.0 compatibility (AttentionBackendEnum, _RUNNER_TASKS, TaskOption)
- Add E2E tests, unit tests, client demo, and benchmark scripts

Co-Authored-By: Hyunjoon Cho <with1015@github.com>
Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
The if/else branches in attention() are semantically distinct:

- if: sdpa_kernel(FLASH_ATTENTION) forces the Flash Attention kernel (Ampere+)
- else: lets PyTorch select any available SDPA backend

Added clarifying comments to make this explicit, addressing reviewer feedback.

Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
- Fan-out routing in omni.py: replace linear stage_id+1 with connector-key-based routing so thinker can send to stage-1 (vision) and stage-2 (audio) independently
- Bridge OmniTokensPrompt.additional_information to OmniDiffusionRequest.extra in omni_diffusion.py so vision_tokens and audio_tokens reach decoder pipelines
- Guard empty batch in omni_stage.py to avoid "empty request list" error when the thinker produces no tokens for a modality
- Add renderer sync and output_type param in omni_llm.py for base-class compatibility
- Add optional ExtractHiddenStatesProposer import and routed_experts_initialized attribute in ar/generation/model runners
- Fix clear_metadata kwarg (renamed from defer_finalize) in model runners
- Add _model. prefix in load_weights return set for strict-loading check
- Add embed_multimodal delegation in hcx_omni_thinker
- Add dummy tokens for HyperCLOVAX pipelines in diffusion_engine
Cover the three bugs fixed for HyperCLOVAX-SEED-Omni-8B e2e inference:

1. Fan-out routing (omni.py): verify that downstream_stage_ids is computed from connector keys so stage-0 fans out to stage-1 AND stage-2 independently, leaf stages return no downstream IDs, and linear topologies still work (see the routing sketch after these notes).
2. additional_information → extra bridge (omni_diffusion.py): verify that vision_tokens / audio_tokens stored in OmniTokensPrompt.additional_information are propagated to OmniDiffusionRequest.extra, first-occurrence wins for batched prompts, and non-dict values are skipped gracefully.
3. Empty batch guard: verify that text-only thinker output results in any_forwarded=False (no diffusion stages invoked), and that per-modality skipping works correctly when only one of vision/audio tokens is present.
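A hedged sketch of the connector-key fan-out routing referenced above; the connector mapping shape ((src, dst) stage pairs keyed by connector name) is an assumption, not the actual vllm-omni structure:

```python
def downstream_stage_ids(stage_id: int, connectors: dict) -> list:
    """Fan-out targets come from connector keys mapped to (src, dst) stage
    pairs, instead of the old linear stage_id + 1 assumption."""
    return sorted(dst for (src, dst) in connectors.values() if src == stage_id)


# Stage 0 (thinker) fans out to both decoders; stages 1 and 2 are leaves.
connectors = {
    "thinker2vision_decoder": (0, 1),
    "thinker2audio_decoder": (0, 2),
}
assert downstream_stage_ids(0, connectors) == [1, 2]
assert downstream_stage_ids(1, connectors) == []
```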
force-pushed f025500 to 09ff661
Summary
Add HyperCLOVAX-SEED-Omni-8B vision decoder and full 3-stage pipeline support.

Changes
New Files
- vllm_omni/diffusion/models/hyperclovax_vision/pipeline_hyperclovax_vision.py — TA-Tok 729 discrete codes → 512×512 image via single-stream diffusion transformer
- hyperclovax_vision_transformer.py — HyperCLOVAXVisionTransformer2DModel
- layers.py — Flash-attention compatible attention layers
- vision_token_embedder.py — Discrete token → continuous embedding
- transformer_usp.py — USP (Unified Sequence Parallel) transformer
- vllm_omni/model_executor/models/hcx_omni/hcx_omni.py — HCXOmniForCausalLM (thinker LLM stage, extends HCXVisionV2)
- hcx_omni_thinker.py — Vision/audio encoder integration

Other

- vllm_omni/model_executor/stage_configs/hcx_omni.yaml — 3-stage pipeline config (thinker TP=4 + vision/audio decoders)
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py — Stage I/O processors
- benchmarks/hcx-omni/ — Latency/throughput benchmark scripts
- examples/online_serving/hcx_omni/ — Client demo and server launch scripts
- tests/e2e/, tests/unit/ — E2E and unit tests

Modified Files

- vllm_omni/diffusion/registry.py — Register HyperCLOVAXVisionPipeline (audio entry is in [Model]: Add HyperCLOVAX Audio Decoder support to vllm-omni #869)

PR dependency
Co-Authored-By: Hyunjoon Cho with1015@github.com
Test Plan
- pytest tests/unit/model_executor/test_hcx_omni_processing.py
- tests/e2e/online_serving/test_hcx_omni.py