Sky reels model #1266

Open

gDINESH13 wants to merge 6 commits into vllm-project:main from gDINESH13:SkyReels_model

Conversation

gDINESH13 (Contributor) commented Feb 8, 2026


Purpose

Closes #1093
Adds the new SkyReels-V3 model.

Test Plan

I'm currently unable to test the full model due to hardware limitations (MacBook M1 without NVIDIA GPU support).

What I've Verified

  • Code compiles without errors
  • All imports resolve correctly
  • Pipeline structure matches existing models (Wan2.2)
  • Configuration files are valid
  • Example scripts are syntactically correct

Testing Request

I would appreciate it if reviewers with access to NVIDIA GPUs could help test the implementation using:

python examples/offline_inference/skyreels_v3/image_to_video.py \
    --model Skywork/SkyReels-V3-R2V-14B \
    --image test_image.jpg \
    --prompt "A cinematic video of a person walking" \
    --output-format mp4

Test Result

Not available yet; see the Testing Request above.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Copilot AI left a comment

Pull request overview

Adds initial support for the Skywork SkyReels-V3 R2V (image-to-video) diffusion model to vllm-omni, including a new pipeline/transformer implementation, registry wiring, a stage config, and offline inference examples.

Changes:

  • Added a new SkyReels-V3 R2V diffusion pipeline + transformer implementation under vllm_omni/diffusion/models/skyreels_v3/.
  • Registered the pipeline and its pre/post-process hooks in the diffusion registry.
  • Added a stage config YAML and offline inference example + README for SkyReels-V3.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

Summary per file:

  • vllm_omni/model_executor/stage_configs/skyreels_v3_r2v.yaml: New single-stage diffusion config for running SkyReels-V3 R2V.
  • vllm_omni/diffusion/registry.py: Registers the SkyReels pipeline and pre/post-process function hooks.
  • vllm_omni/diffusion/models/skyreels_v3/skyreels_v3_transformer.py: New transformer + rotary embedding implementation for SkyReels-V3.
  • vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py: New R2V diffusion pipeline using UMT5 + CLIP + Wan VAE + FlowUniPC scheduler.
  • vllm_omni/diffusion/models/skyreels_v3/__init__.py: Exposes the new pipeline and helper functions.
  • examples/offline_inference/skyreels_v3/image_to_video.py: Adds an offline inference script for SkyReels-V3 R2V.
  • examples/offline_inference/skyreels_v3/README.md: Documentation for running the new offline example(s).


Comment on lines +426 to +427
latents = latents / self.vae.config.scaling_factor
video = self.vae.decode(latents).sample
Copilot AI, Feb 8, 2026

Decoding uses latents / self.vae.config.scaling_factor followed by self.vae.decode(latents).sample, but AutoencoderKLWan decoding elsewhere in the repo uses latents_mean/latents_std normalization and decode(..., return_dict=False). Using the wrong normalization will produce incorrect outputs (or fail if scaling_factor isn't present). Align the decode path with the established Wan2.2 decode logic for AutoencoderKLWan.

Suggested change:

-    latents = latents / self.vae.config.scaling_factor
-    video = self.vae.decode(latents).sample
+    latents_mean = torch.as_tensor(
+        self.vae.config.latents_mean, device=latents.device, dtype=latents.dtype
+    )
+    latents_std = torch.as_tensor(
+        self.vae.config.latents_std, device=latents.device, dtype=latents.dtype
+    )
+    latents = latents * latents_std + latents_mean
+    video = self.vae.decode(latents, return_dict=False)[0]

Comment on lines +242 to +245

# Load scheduler
self.scheduler = loader.load_scheduler(FlowUniPCMultistepScheduler, "scheduler")

Copilot AI, Feb 8, 2026

The stage config sets flow_shift, but this pipeline loads the scheduler from disk and never applies od_config.flow_shift. If flow_shift is meant to be configurable (as it is for Wan2.2), initialize/override the scheduler with shift=od_config.flow_shift (or remove the setting from the stage config to avoid a no-op knob).
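For illustration, one way to wire the knob through (an untested sketch; it assumes FlowUniPCMultistepScheduler follows the diffusers ConfigMixin pattern, so from_config can override shift the way the Wan2.2 pipelines do):

# Load the scheduler, then rebuild it with the configured shift so the
# stage-config `flow_shift` setting is not a silent no-op.
scheduler = loader.load_scheduler(FlowUniPCMultistepScheduler, "scheduler")
if getattr(od_config, "flow_shift", None) is not None:
    scheduler = FlowUniPCMultistepScheduler.from_config(
        scheduler.config, shift=od_config.flow_shift
    )
self.scheduler = scheduler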

Comment on lines +133 to +148
outputs = model.generate(
    prompts=[
        {
            "prompt": args.prompt,
            "multi_modal_data": {"image": image},
        }
    ],
    sampling_params={
        "height": args.height,
        "width": args.width,
        "num_frames": args.num_frames,
        "num_inference_steps": args.num_inference_steps,
        "guidance_scale": args.guidance_scale,
        "seed": args.seed,
    },
)
Copilot AI, Feb 8, 2026

This example passes a raw dict to OmniDiffusion.generate(..., sampling_params=...), but OmniDiffusionRequest expects an OmniDiffusionSamplingParams dataclass; passing a dict will crash in OmniDiffusionRequest.__post_init__. Construct OmniDiffusionSamplingParams(**{...}) (or otherwise follow the patterns used by other offline examples).
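For reference, a sketch of the suggested construction (untested; the import path below is a guess and should be taken from the other offline examples):

# Hypothetical import path; adjust to wherever the dataclass actually lives.
from vllm_omni.diffusion.data import OmniDiffusionSamplingParams

sampling_params = OmniDiffusionSamplingParams(
    height=args.height,
    width=args.width,
    num_frames=args.num_frames,
    num_inference_steps=args.num_inference_steps,
    guidance_scale=args.guidance_scale,
    seed=args.seed,
)
outputs = model.generate(
    prompts=[{"prompt": args.prompt, "multi_modal_data": {"image": image}}],
    sampling_params=sampling_params,
)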

Comment on lines +152 to +153
video_frames = output.outputs[0] # Get the video frames

Copilot AI, Feb 8, 2026

OmniDiffusion.generate returns OmniRequestOutput objects; in diffusion mode the generated frames/images are exposed via output.images or output.multimodal_output, not output.outputs. Accessing output.outputs[0] will always be empty here. Update the example to read the generated video data from the diffusion output fields (and handle the actual type returned by the SkyReels post-processor).

Suggested change:

-    video_frames = output.outputs[0]  # Get the video frames
+    # OmniDiffusion returns OmniRequestOutput; in diffusion mode the
+    # generated frames are exposed via `images` or `multimodal_output`,
+    # not `outputs`.
+    video_data = None
+    if getattr(output, "images", None):
+        # `images` is typically a list where each element is a sequence of frames.
+        video_data = output.images[0]
+    elif getattr(output, "multimodal_output", None):
+        # Fallback: some pipelines may populate `multimodal_output` instead.
+        video_data = output.multimodal_output[0]
+    else:
+        raise ValueError("No video data found in diffusion output.")
+    # The SkyReels post-processor may wrap frames, e.g. {"video": frames}.
+    if isinstance(video_data, dict) and "video" in video_data:
+        video_frames = video_data["video"]
+    else:
+        video_frames = video_data

Comment on lines +131 to +136
# Get position indices
seq_len = t * h * w
freqs_cos = self.freqs_cos[:seq_len] # type: ignore
freqs_sin = self.freqs_sin[:seq_len] # type: ignore

return apply_rotary_emb_skyreels(hidden_states, freqs_cos, freqs_sin)
Copilot AI, Feb 8, 2026

SkyReelsRotaryPosEmbed.forward() currently ignores patch_size and the (t, h, w) decomposition and just slices the first t*h*w positions from precomputed buffers. For 3D rotary this is likely incompatible with the established approach in WanRotaryPosEmbed (which splits temporal/height/width dims and expands them into a (t,h,w) grid). If SkyReels-V3 expects Wan-style 3D rotary, this will produce incorrect position encodings and break weight compatibility; consider reusing the Wan implementation pattern to construct cos/sin for (T',H',W') and return them for attention application.
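To make the Wan-style approach concrete, an untested sketch (the per-axis buffer names freqs_cos_t/h/w and the patch_size unpacking are assumptions for illustration, not attributes of this implementation):

# Decompose the sequence into a (t, h, w) grid and build per-position
# cos/sin by expanding per-axis tables, instead of slicing the first
# t*h*w rows of a single 1D table.
p_t, p_h, p_w = self.patch_size
t, h, w = num_frames // p_t, height // p_h, width // p_w

def expand_grid(f_t, f_h, f_w):
    f_t = f_t[:t].view(t, 1, 1, -1).expand(t, h, w, -1)
    f_h = f_h[:h].view(1, h, 1, -1).expand(t, h, w, -1)
    f_w = f_w[:w].view(1, 1, w, -1).expand(t, h, w, -1)
    return torch.cat([f_t, f_h, f_w], dim=-1).reshape(t * h * w, -1)

freqs_cos = expand_grid(self.freqs_cos_t, self.freqs_cos_h, self.freqs_cos_w)
freqs_sin = expand_grid(self.freqs_sin_t, self.freqs_sin_h, self.freqs_sin_w)
return apply_rotary_emb_skyreels(hidden_states, freqs_cos, freqs_sin)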

Comment on lines +382 to +387
latents_shape = (
    batch_size * num_videos_per_prompt,
    num_channels_latents,
    num_frames,
    height // self.vae_scale_factor,
    width // self.vae_scale_factor,
Copilot AI, Feb 8, 2026

Latent shape and VAE decoding look inconsistent with AutoencoderKLWan: the temporal latent length should be (num_frames - 1) // vae_scale_factor_temporal + 1 (and spatial uses scale_factor_spatial), but the code allocates latents with num_frames directly and only uses a single vae_scale_factor. This is likely to break decoding or produce incorrect frame counts. Use separate temporal/spatial scale factors like the Wan2.2 pipelines and compute num_latent_frames accordingly.

Suggested change:

-    latents_shape = (
-        batch_size * num_videos_per_prompt,
-        num_channels_latents,
-        num_frames,
-        height // self.vae_scale_factor,
-        width // self.vae_scale_factor,
+    # Use separate temporal and spatial VAE scale factors, as required by AutoencoderKLWan.
+    vae_scale_factor_temporal = getattr(self.vae.config, "scale_factor_temporal", 1)
+    vae_scale_factor_spatial = getattr(
+        self.vae.config,
+        "scale_factor_spatial",
+        getattr(self, "vae_scale_factor", 1),
+    )
+    # Latent temporal length is typically shorter than the decoded frame count.
+    num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
+    latents_shape = (
+        batch_size * num_videos_per_prompt,
+        num_channels_latents,
+        num_latent_frames,
+        height // vae_scale_factor_spatial,
+        width // vae_scale_factor_spatial,

freqs = torch.outer(t, freqs)
# Repeat interleave for real representation
freqs_cos = freqs.cos().repeat_interleave(2, dim=-1)
freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
Copilot AI, Feb 8, 2026

SkyReelsRotaryPosEmbed._get_1d_rotary_pos_embed returns freqs_cos/sin in the original freqs_dtype (float64 on CUDA), unlike the existing WanRotaryPosEmbed implementation which casts these buffers to float32. Keeping float64 rotary buffers will upcast rotary multiply ops and can significantly hurt performance/memory. Cast the generated cos/sin to float32 (or to the model dtype) before registering/using them, matching the Wan2.2 implementation.

Suggested change:

-freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
+freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
+# Cast to float32 to avoid float64 rotary buffers (matches Wan2.2 behavior)
+freqs_cos = freqs_cos.to(torch.float32)
+freqs_sin = freqs_sin.to(torch.float32)

Comment on lines +172 to +183
# Resize image to target dimensions
image = image.resize(
    (request.sampling_params.width, request.sampling_params.height),  # type: ignore
    PIL.Image.Resampling.LANCZOS,
)
prompt["multi_modal_data"]["image"] = image  # type: ignore

# Preprocess for VAE
prompt["additional_information"]["preprocessed_image"] = video_processor.preprocess(
    image, height=request.sampling_params.height, width=request.sampling_params.width
)
request.prompts[i] = prompt
Copilot AI, Feb 8, 2026

The pre-process function stores preprocessed_image using diffusers.video_processor.VideoProcessor.preprocess(...), but the pipeline later feeds that tensor into CLIPVisionModel. CLIP expects inputs prepared by CLIPImageProcessor (size/normalization), so passing video-processor tensors (likely full-res 480x832) will either error (positional embeddings) or produce incorrect embeddings. Store CLIP pixel values (via self.image_processor) or keep the PIL image and apply CLIPImageProcessor inside encode_image.
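An untested sketch of the second option, running CLIPImageProcessor inside encode_image (it assumes the pipeline keeps a CLIPImageProcessor next to the CLIPVisionModel, mirroring the Wan2.2 I2V layout):

from transformers import CLIPImageProcessor

# Load alongside the CLIP vision tower (the subfolder name is an
# assumption about the checkpoint layout):
image_processor = CLIPImageProcessor.from_pretrained(model, subfolder="image_processor")

def encode_image(image, image_encoder, device, dtype):
    # CLIPImageProcessor resizes to the vision tower's expected resolution
    # and applies CLIP's normalization, which the VAE video processor does not.
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device=device, dtype=dtype)
    return image_encoder(pixel_values=pixel_values).last_hidden_state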

Comment on lines +265 to +276
text_inputs = self.tokenizer(
    prompt,
    padding="max_length",
    max_length=self.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to(device)

# Encode
prompt_embeds = self.text_encoder(text_input_ids)[0]

Copilot AI, Feb 8, 2026

encode_prompt tokenizes with padding/truncation but does not pass attention_mask into UMT5EncoderModel. This makes the encoder attend to pad tokens and diverges from the pattern used in other video pipelines (e.g., Wan2.2), degrading prompt embeddings. Pass attention_mask (and ideally trim/pack to seq_lens if desired).
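A sketch of the fix (untested; UMT5EncoderModel's standard transformers forward accepts attention_mask, and the final masking step mirrors what the Wan2.2 pipelines do):

text_inputs = self.tokenizer(
    prompt,
    padding="max_length",
    max_length=self.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to(device)
attention_mask = text_inputs.attention_mask.to(device)

# Pass the mask so the encoder does not attend to pad tokens.
prompt_embeds = self.text_encoder(text_input_ids, attention_mask=attention_mask)[0]
# Zero out embeddings at pad positions so they don't leak downstream.
prompt_embeds = prompt_embeds * attention_mask.unsqueeze(-1).to(prompt_embeds.dtype)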

model = od_config.model

# Load model components
loader = DiffusersPipelineLoader(model, local_files_only=od_config.local_files_only)
Copilot AI, Feb 8, 2026

Keyword argument 'local_files_only' is not a supported parameter of DiffusersPipelineLoader.__init__.

Suggested change:

-loader = DiffusersPipelineLoader(model, local_files_only=od_config.local_files_only)
+loader = DiffusersPipelineLoader(model)

Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
david6666666 mentioned this pull request Feb 9, 2026
Comment thread vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py Outdated
@@ -0,0 +1,195 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
wtomin (Collaborator)

How is this file different from examples/offline_inference/image_to_video/image_to_video.py?

Is it necessary to create a new script for SkyReel?

gDINESH13 (Contributor, Author)

Hey @wtomin, since I'm adding a new model, I thought it would be easier to have a ready-to-run example to try out.
Do you suggest removing it?

wtomin (Collaborator)

If SkyReels V3 falls into one of [image-to-video, text-to-video], I think we can reuse the existing offline inference script to avoid repetition.

gDINESH13 (Contributor, Author)

OK, I will check and remove it shortly.

tzhouam added the "new model" label Feb 9, 2026
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
gDINESH13 requested a review from wtomin February 9, 2026 14:40
Gaohan123 (Collaborator) left a comment

Could you please post some generation results in the PR description?

gDINESH13 (Contributor, Author)

Hi @Gaohan123, thank you for taking a look at this PR. I had access to a cloud GPU when I raised the PR, but unfortunately I don't have it anymore.
I have attached the steps I carried out when I tried executing the inference example; if you have time, could you please try that script?

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.



Comment thread vllm_omni/diffusion/models/skyreels_v3/skyreels_v3_transformer.py Outdated
Comment thread vllm_omni/diffusion/models/skyreels_v3/skyreels_v3_transformer.py Outdated
Comment thread examples/offline_inference/skyreels_v3/image_to_video.py Outdated
Comment thread vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py Outdated
Comment thread vllm_omni/diffusion/models/skyreels_v3/skyreels_v3_transformer.py
Comment thread vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py Outdated
Comment thread vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py Outdated
Comment thread vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py Outdated
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
gDINESH13 requested a review from Gaohan123 February 19, 2026 01:38
hsliuustc0106 (Collaborator)

@vllm-omni-reviewer

gDINESH13 (Contributor, Author)

Hi all, just a gentle follow-up on this PR. I wanted to check if anyone has had a chance to take a further look at this.
Happy to make changes if needed. Thanks!

wtomin (Collaborator)

wtomin commented Mar 9, 2026

I feel like this PR is still in progress. As you mentioned, SkyReels models support:

  • Image-to-Video (I2V): Generate videos from reference images
  • Video-to-Video (V2V): Transform existing videos
  • Audio-to-Video (A2V): Generate videos guided by audio

This PR only contains the Image-to-Video pipeline and its documentation. What about the other tasks?

BTW, please also update the online serving documents.

Gaohan123 (Collaborator)

@gDINESH13 Sorry for the late reply. Do you have resources now? Any updates?

gDINESH13 (Contributor, Author)

Hi @Gaohan123, sorry, I was completely offline for a while. I still don't have GPU resources, so it would be great to get some help here. @wtomin, yes it does support those tasks, but I can't test them on my device; the hardware blocker right now is my M1 Mac.
If possible, can we open a new issue to create scripts for those cases?


Labels

new model add new model

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[New Model]: Skywork / SkyReels

6 participants