
feat: Add HyperCLOVAX Vision Decoder diffusion model support to vllm-omni #613

Open
KilJaeeun wants to merge 4 commits into vllm-project:main from KilJaeeun:feat/hyperclovax_vision

Conversation

@KilJaeeun

@KilJaeeun KilJaeeun commented Jan 4, 2026

Summary

Add HyperCLOVAX-SEED-Omni-8B vision decoder and full 3-stage pipeline support.

Depends on #869 (HyperCLOVAX audio decoder) — please merge that first.

Changes

New Files

vllm_omni/diffusion/models/hyperclovax_vision/

  • pipeline_hyperclovax_vision.py — TA-Tok 729 discrete codes → 512×512 image via single-stream diffusion transformer
  • hyperclovax_vision_transformer.py — HyperCLOVAXVisionTransformer2DModel
  • layers.py — Flash-attention compatible attention layers
  • vision_token_embedder.py — Discrete token → continuous embedding
  • transformer_usp.py — USP (Unified Sequence Parallel) transformer

vllm_omni/model_executor/models/hcx_omni/

  • hcx_omni.py — HCXOmniForCausalLM (thinker LLM stage, extends HCXVisionV2)
  • hcx_omni_thinker.py — Vision/audio encoder integration

Other

  • vllm_omni/model_executor/stage_configs/hcx_omni.yaml — 3-stage pipeline config (thinker TP=4 + vision/audio decoders)
  • vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py — Stage I/O processors
  • benchmarks/hcx-omni/ — Latency/throughput benchmark scripts
  • examples/online_serving/hcx_omni/ — Client demo and server launch scripts
  • tests/e2e/, tests/unit/ — E2E and unit tests

Modified Files

PR dependency

Co-Authored-By: Hyunjoon Cho with1015@github.com

Test Plan

@KilJaeeun KilJaeeun changed the title from "feat: add hyperclovax_vision" to "feat: Add HyperCLOVAX Vision Decoder diffusion model support to vllm-omni" on Jan 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92e3b18bc6


Comment on lines +125 to +129
# 5. Initialize transformer2 for autoguidance (if available)
transformer2_path = os.path.join(model, "transformer2")
if os.path.exists(transformer2_path):
    self.transformer2 = HyperCLOVAXVisionTransformer2DModel(od_config=od_config)
else:


P1: Instantiate transformer2 when using remote checkpoints

Autoguidance is effectively disabled for Hugging Face model IDs because transformer2 is only constructed when a local path <model>/transformer2 exists. When od_config.model is a remote repo (the common case), os.path.exists returns false, transformer2 remains None, and its weights are never added to weights_sources, so guidance_scale>0 has no effect even if the repo includes the transformer2 folder. This silently removes the advertised autoguidance unless users pre-download the repo locally.
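
One possible way to handle this, sketched under the assumption that huggingface_hub is available; the helper name is hypothetical and not the PR's actual code:

```python
import os

def _has_transformer2(model: str) -> bool:
    """True if a transformer2/ component exists locally or in the Hugging Face repo."""
    if os.path.isdir(os.path.join(model, "transformer2")):
        return True
    try:
        from huggingface_hub import HfFileSystem
        return HfFileSystem().exists(f"{model}/transformer2/config.json")
    except Exception:
        # huggingface_hub missing or repo unreachable: degrade to no autoguidance.
        return False
```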


@KilJaeeun KilJaeeun force-pushed the feat/hyperclovax_vision branch from 92e3b18 to 88a6ebb on January 4, 2026 11:15
@hsliuustc0106
Collaborator

Is there any related issue for this model?

@KilJaeeun
Author

@hsliuustc0106
Hi! I created one: #631

@KilJaeeun KilJaeeun force-pushed the feat/hyperclovax_vision branch from a75d7b1 to 4137133 on January 6, 2026 01:18
@ZJY0516 ZJY0516 linked an issue Jan 8, 2026 that may be closed by this pull request
@lishunyang12
Collaborator

@KilJaeeun Hey, the HyperCLOVAX Vision Decoder with single-stream transformer and autoguidance looks well-structured. Have you been able to validate end-to-end image generation with this? What's needed to get it merge-ready?

Collaborator

@lishunyang12 lishunyang12 left a comment


Thanks for the contribution — left a few thoughts inline, mostly questions about some things I noticed.

if os.path.exists(vae_config_path):
    with open(vae_config_path) as f:
        config = json.load(f)
    # Use scaling_factor from config, default to 8 for AutoencoderKL
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


I might be misunderstanding the config here, but scaling_factor (the latent scaling float, e.g., 0.13025) seems different from vae_scale_factor (the spatial downsampling ratio, typically 8). VaeImageProcessor expects the spatial one. Would hardcoding 8 or reading a different field make more sense?

Author


Fixed — the pipeline now reads block_out_channels from vae/config.json to compute the spatial downsampling ratio (2^(len-1)), and passes that to VaeImageProcessor(vae_scale_factor=...). The latent scaling_factor is read separately as self.vae_scaling_factor for use in _decode_latents. Both values are now distinct and correctly sourced.
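
For illustration, a sketch of how the two distinct values could be derived from vae/config.json under standard diffusers conventions; the config path is a placeholder, not the PR's code:

```python
import json

from diffusers.image_processor import VaeImageProcessor

vae_config_path = "/path/to/model/vae/config.json"  # placeholder path

with open(vae_config_path) as f:
    vae_config = json.load(f)

# Spatial downsampling ratio: AutoencoderKL halves the resolution once per extra
# block, so a config with 4 block_out_channels entries gives 2 ** 3 = 8.
vae_scale_factor = 2 ** (len(vae_config["block_out_channels"]) - 1)

# Latent scaling float (e.g. 0.13025 or 0.18215), used only when decoding latents.
vae_scaling_factor = vae_config.get("scaling_factor", 0.18215)

image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)
```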

vocab_size=65536,
embedding_dim=1536,
token_length=729,
)
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


These values (vocab_size=65536, embedding_dim=1536, token_length=729) are hardcoded — would it be possible to read them from a config file so the pipeline stays flexible if the model changes?

Author


Fixed — VisionTokenEmbedder initialization now reads all three values from token_embedder/config.json via _load_component_config("token_embedder"). The hardcoded values remain as fallbacks only when the config file is absent.
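
A rough sketch of that config-with-fallback pattern; the real _load_component_config signature in the PR may differ, the model path is a placeholder, and the defaults mirror the previous hardcoded values:

```python
import json
import os

def _load_component_config(model_path: str, component: str) -> dict:
    """Read <model_path>/<component>/config.json if present, else return an empty dict."""
    config_path = os.path.join(model_path, component, "config.json")
    if not os.path.exists(config_path):
        return {}
    with open(config_path) as f:
        return json.load(f)

cfg = _load_component_config("/path/to/HyperCLOVAX-SEED-Omni-8B", "token_embedder")
vocab_size = cfg.get("vocab_size", 65536)        # fallback: previous hardcoded value
embedding_dim = cfg.get("embedding_dim", 1536)   # fallback: previous hardcoded value
token_length = cfg.get("token_length", 729)      # 27 x 27 TA-Tok grid
```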

from collections.abc import Iterable

import numpy as np
import torch
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


Nit: a few unused imports here (Any, Dict, List, Optional, Union). Easy cleanup.

Author


Fixed — all unused imports (Any, Dict, List, Optional, Union) have been removed. Ruff F401 passes cleanly.

    FLASH_ATTN_AVAILABLE = torch.cuda.is_available()
except ImportError:
    FLASH_ATTN_AVAILABLE = False
    sdpa_kernel = None
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


I noticed FLASH_ATTN_AVAILABLE checks torch.cuda.is_available() rather than actual flash attention support. On older CUDA devices (compute capability < 8.0), the sdpa_kernel(SDPBackend.FLASH_ATTENTION) call might fail. Would a more specific check be better?

Author


Fixed — _flash_attn_available() now checks torch.cuda.get_device_capability() and requires major >= 8 (Ampere+), rather than just cuda.is_available().
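
For reference, a capability-aware check along the lines described above; behaviour as stated in the reply, though the exact code in the PR may differ:

```python
import torch

def _flash_attn_available() -> bool:
    """Flash Attention via SDPA needs an NVIDIA GPU with compute capability >= 8.0."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # Ampere (SM80) or newer
```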

def timestep_embedding(
    t: torch.Tensor,
    dim: int,
    max_period: float = 10000,
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


Both branches of the if/else seem to do the same thing (F.scaled_dot_product_attention(q, k, v)). Maybe simplify by removing the branching and letting PyTorch pick the backend? Or add a real availability check if the intent is to prefer flash attention.

Author


Good catch on the clarity — the two branches are semantically different: the if branch wraps the call in sdpa_kernel(SDPBackend.FLASH_ATTENTION) which explicitly requests the Flash Attention kernel, while the else branch lets PyTorch select any available backend (math, memory-efficient, etc.). Added clarifying comments to make this distinction explicit.
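
A minimal sketch of the distinction (torch >= 2.3 API); the function wrapper and argument names are illustrative, not the PR's actual attention code:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              flash_available: bool) -> torch.Tensor:
    if flash_available:
        # Explicitly restrict SDPA to the Flash Attention kernel (Ampere+ only).
        with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
            return F.scaled_dot_product_attention(q, k, v)
    # Let PyTorch pick any available backend (math, memory-efficient, ...).
    return F.scaled_dot_product_attention(q, k, v)
```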

"""
mod, _ = block.modulation(vec)
x_mod = (1 + mod.scale) * block.pre_norm(x) + mod.shift
qkv, mlp = torch.split(
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


I'm trying to follow the USP flow — it looks like _parallelize_attention_blocks replaces each block's forward, but _usp_transformer_forward also calls _usp_single_block_forward directly. So if something calls block.forward, it might go through USP twice? Could be intentional, but wanted to flag it.

Author


The parallelize_transformer call in __init__ replaces self.transformer in-place — forward then calls the already-parallelized module directly. There is no second wrapping; the call site in forward always goes through the (possibly parallelized) self.transformer reference set during init.
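
To make the wrap-once behaviour concrete, here is a runnable toy sketch of the pattern being described; all class internals and helper bodies are simplified stand-ins, and only the parallelize_transformer name comes from the PR:

```python
import torch
import torch.nn as nn

def parallelize_transformer(blocks: nn.ModuleList) -> nn.ModuleList:
    """Stand-in for the PR's USP helper: wrap each block's forward exactly once."""
    for block in blocks:
        block._original_forward = block.forward
        # Closure mirrors make_usp_block_forward: capture the block explicitly.
        block.forward = (lambda blk: lambda x: blk._original_forward(x))(block)
    return blocks

class VisionDecoderModel(nn.Module):                      # hypothetical class name
    def __init__(self, num_blocks: int = 2, use_usp: bool = True):
        super().__init__()
        self.transformer = nn.ModuleList(nn.Linear(8, 8) for _ in range(num_blocks))
        if use_usp:
            # self.transformer is replaced exactly once, during __init__.
            self.transformer = parallelize_transformer(self.transformer)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every call dispatches through the reference captured above, so the
        # wrapping is applied at most once regardless of how forward is reached.
        for block in self.transformer:
            x = block(x)
        return x

print(VisionDecoderModel()(torch.randn(1, 8)).shape)      # torch.Size([1, 8])
```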

block._original_forward = block.forward

# Create new forward that uses USP
def make_usp_block_forward(blk):
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


Is create_parallel_transformer used somewhere I'm not seeing? I couldn't find any callers in the codebase. If it's planned for later that's fine, just wanted to check.

Author


Removed — create_parallel_transformer had no callers and has been deleted from transformer_usp.py.

"""
embeddings = torch.from_numpy(np.load(npy_path)).float()
vocab_size, embedding_dim = embeddings.shape

Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


Small thing — token_length=729 is hardcoded in from_numpy. Could this be inferred from the embeddings shape instead?

Author


Fixed — token_length is now read from token_embedder/config.json and stored as self.token_length; the from_numpy method uses self.token_length instead of the literal 729.

height = req.height or 768
width = req.width or 768
num_steps = req.num_inference_steps or 50
guidance_scale = req.guidance_scale or 0.0
Collaborator

@lishunyang12 lishunyang12 Feb 22, 2026


I could be wrong about this, but I think FlowMatchEulerDiscreteScheduler already returns timesteps as float sigmas in [0, 1], so dividing by num_train_timesteps again might give a very small number. Could you double-check what self.scheduler.timesteps returns in this context?

Author


FlowMatchEulerDiscreteScheduler.timesteps already returns normalized sigma values in [0, 1], so there is no division by num_train_timesteps. The fix passes t.item() directly as a float sigma. Added an explicit comment clarifying this.

@hsliuustc0106
Collaborator

Any progress?

@KilJaeeun KilJaeeun force-pushed the feat/hyperclovax_vision branch 2 times, most recently from 485a3af to c1a2475 on April 2, 2026 02:14
@KilJaeeun
Author

I haven’t been able to contribute recently, as I’ve been busy with model production work at Naver Cloud. For now, I’ll first upload the changes up to vLLM version 16.

@KilJaeeun KilJaeeun force-pushed the feat/hyperclovax_vision branch from c1a2475 to 727614e on April 6, 2026 01:51
@KilJaeeun
Author

Thanks for the detailed review @lishunyang12! All feedback has been addressed in the latest push:

  1. vae_scale_factor vs scaling_factor — Fixed. VaeImageProcessor now receives the spatial downsampling ratio computed from block_out_channels (2^(len-1)), not the latent scaling_factor.

  2. Hardcoded vocab_size/embedding_dim/token_length — Fixed. These are now read from token_embedder/config.json, with the previous values as fallback defaults.

  3. sigma double-division — Fixed. FlowMatchEulerDiscreteScheduler.timesteps already returns sigmas in [0, 1]; removed the erroneous / num_train_timesteps.

  4. FLASH_ATTN_AVAILABLE check — Fixed. Now checks torch.cuda.get_device_capability() >= (8, 0) (Ampere+) instead of just torch.cuda.is_available().

  5. from_numpy() hardcoded token_length=729 — Fixed. Exposed as an explicit parameter with a documented default (27×27 for HyperCLOVAX-SEED-Omni-8B).

  6. Unused create_parallel_transformer() — Removed.

Re: the USP double-call question — _parallelize_attention_blocks replaces each block.forward with _usp_single_block_forward, and _usp_transformer_forward calls block.forward (the replaced version). There is no double-application.

@KilJaeeun
Author

Thank you @lishunyang12 for the thorough review! All 10 comments have now been addressed. Here is a full summary:

| # | Comment | Resolution |
|---|---------|------------|
| 1 | transformer2 not instantiated for HF remote checkpoints | Fixed — uses HfFileSystem to probe for the transformer2/ subfolder in the HF repo; falls back gracefully if unavailable or huggingface_hub is not installed |
| 2 | scaling_factor vs vae_scale_factor mismatch | Fixed — spatial downsampling ratio derived from block_out_channels in vae/config.json; latent scaling_factor stored separately as vae_scaling_factor |
| 3 | Hardcoded vocab_size, embedding_dim, token_length | Fixed — all three read from token_embedder/config.json via _load_component_config; hardcoded values remain only as fallbacks |
| 4 | Unused imports (Any, Dict, List, etc.) | Fixed — removed; ruff F401 passes cleanly |
| 5 | FLASH_ATTN_AVAILABLE checks cuda.is_available(), not compute capability | Fixed — _flash_attn_available() now checks torch.cuda.get_device_capability() and requires major >= 8 (Ampere+) |
| 6 | Both flash-attn if/else branches doing the same thing | Clarified — the if branch uses sdpa_kernel(SDPBackend.FLASH_ATTENTION) to explicitly force Flash Attention; the else branch lets PyTorch pick any available backend. Added comments to make this explicit (latest commit 2f51cafc) |
| 7 | USP double-wrapping concern | No change needed — parallelize_transformer replaces self.transformer in __init__; forward always calls the same (possibly parallelized) reference, so no double-wrapping occurs |
| 8 | create_parallel_transformer dead code | Removed — no callers found anywhere in the codebase |
| 9 | token_length=729 hardcoded in from_numpy | Fixed — stored as self.token_length from config; from_numpy uses self.token_length |
| 10 | Sigma double-division by num_train_timesteps | Fixed — FlowMatchEulerDiscreteScheduler.timesteps already returns normalized sigma in [0, 1]; passes t.item() directly with a clarifying comment |

Could you please take another look when you get a chance? @lishunyang12

@KilJaeeun KilJaeeun force-pushed the feat/hyperclovax_vision branch from 2f51caf to 52551db on April 7, 2026 01:09
@KilJaeeun KilJaeeun requested a review from lishunyang12 April 7, 2026 07:54
@hsliuustc0106
Collaborator

Can you help fix the conflicts, please? I will review it again right after that.

jaeeun.kil and others added 4 commits April 16, 2026 16:38
…upport

Stacks on top of vllm-project#869 (HyperCLOVAX audio decoder).

- Add HCXOmniForCausalLM thinker model (LLM stage, extends HCXVisionV2)
- Add HyperCLOVAXVisionPipeline diffusion model (TA-Tok decoder, 27×27 image tokens)
- Add hcx_omni.yaml 3-stage pipeline config (thinker TP=4 + vision/audio decoders)
- Add thinker2vision_decoder and thinker2audio_decoder stage input processors
- Add fan-out async pipeline topology (stage 0 → stage 1 AND stage 0 → stage 2)
- Add _stage0_is_llm guard in serving_chat to preserve HCX multimodal inputs
- Fix vLLM 0.18.0 compatibility (AttentionBackendEnum, _RUNNER_TASKS, TaskOption)
- Add E2E tests, unit tests, client demo, and benchmark scripts

Co-Authored-By: Hyunjoon Cho <with1015@github.com>

Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
The if/else branches in attention() are semantically distinct:
- if: sdpa_kernel(FLASH_ATTENTION) forces the Flash Attention kernel (Ampere+)
- else: lets PyTorch select any available SDPA backend

Added clarifying comments to make this explicit, addressing reviewer feedback.

Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
- Fan-out routing in omni.py: replace linear stage_id+1 with
  connector-key-based routing so thinker can send to stage-1 (vision)
  and stage-2 (audio) independently
- Bridge OmniTokensPrompt.additional_information to
  OmniDiffusionRequest.extra in omni_diffusion.py so vision_tokens and
  audio_tokens reach decoder pipelines
- Guard empty batch in omni_stage.py to avoid "empty request list" error
  when the thinker produces no tokens for a modality
- Add renderer sync and output_type param in omni_llm.py for base-class
  compatibility
- Add optional ExtractHiddenStatesProposer import and
  routed_experts_initialized attribute in ar/generation/model runners
- Fix clear_metadata kwarg (renamed from defer_finalize) in model runners
- Add _model. prefix in load_weights return set for strict-loading check
- Add embed_multimodal delegation in hcx_omni_thinker
- Add dummy tokens for HyperCLOVAX pipelines in diffusion_engine
Cover the three bugs fixed for HyperCLOVAX-SEED-Omni-8B e2e inference:

1. Fan-out routing (omni.py):
   Verify that downstream_stage_ids is computed from connector keys so
   stage-0 fans out to stage-1 AND stage-2 independently, leaf stages
   return no downstream IDs, and linear topologies still work.

2. additional_information → extra bridge (omni_diffusion.py):
   Verify that vision_tokens / audio_tokens stored in
   OmniTokensPrompt.additional_information are propagated to OmniDiffusionRequest.extra,
   first-occurrence wins for batched prompts, and non-dict values are
   skipped gracefully.

3. Empty batch guard:
   Verify that text-only thinker output results in any_forwarded=False
   (no diffusion stages invoked), and that per-modality skipping works
   correctly when only one of vision/audio tokens is present.
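
To make the fan-out routing concrete, a small self-contained sketch of deriving downstream stages from connector keys; the connector names appear in the PR, but the (source, destination) mapping and function shape here are illustrative and the real vllm-omni structures differ in detail:

```python
def downstream_stage_ids(stage_id: int, connectors: dict[str, tuple[int, int]]) -> list[int]:
    """Stages fed by `stage_id`, derived from connector keys instead of stage_id + 1."""
    return sorted(dst for src, dst in connectors.values() if src == stage_id)

# Connector names taken from the PR; the stage-id pairs are illustrative.
connectors = {
    "thinker2vision_decoder": (0, 1),
    "thinker2audio_decoder": (0, 2),
}

assert downstream_stage_ids(0, connectors) == [1, 2]  # thinker fans out to both decoders
assert downstream_stage_ids(1, connectors) == []      # leaf stages have no downstream IDs
```
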
@KilJaeeun KilJaeeun force-pushed the feat/hyperclovax_vision branch from f025500 to 09ff661 on April 16, 2026 07:38


Development

Successfully merging this pull request may close these issues.

[New Model]: Add HyperCLOVAX Vision Decoder diffusion model
