Sky reels model #1266
Conversation
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Pull request overview
Adds initial support for the Skywork SkyReels-V3 R2V (image-to-video) diffusion model to vllm-omni, including a new pipeline/transformer implementation, registry wiring, a stage config, and offline inference examples.
Changes:
- Added a new SkyReels-V3 R2V diffusion pipeline + transformer implementation under `vllm_omni/diffusion/models/skyreels_v3/`.
- Registered the pipeline and its pre/post-process hooks in the diffusion registry.
- Added a stage config YAML and offline inference example + README for SkyReels-V3.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `vllm_omni/model_executor/stage_configs/skyreels_v3_r2v.yaml` | New single-stage diffusion config for running SkyReels-V3 R2V. |
| `vllm_omni/diffusion/registry.py` | Registers the SkyReels pipeline and pre/post-process function hooks. |
| `vllm_omni/diffusion/models/skyreels_v3/skyreels_v3_transformer.py` | New transformer + rotary embedding implementation for SkyReels-V3. |
| `vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py` | New R2V diffusion pipeline using UMT5 + CLIP + Wan VAE + FlowUniPC scheduler. |
| `vllm_omni/diffusion/models/skyreels_v3/__init__.py` | Exposes the new pipeline and helper functions. |
| `examples/offline_inference/skyreels_v3/image_to_video.py` | Adds an offline inference script for SkyReels-V3 R2V. |
| `examples/offline_inference/skyreels_v3/README.md` | Documentation for running the new offline example(s). |
    latents = latents / self.vae.config.scaling_factor
    video = self.vae.decode(latents).sample
Decoding uses latents / self.vae.config.scaling_factor followed by self.vae.decode(latents).sample, but AutoencoderKLWan decoding elsewhere in the repo uses latents_mean/latents_std normalization and decode(..., return_dict=False). Using the wrong normalization will produce incorrect outputs (or fail if scaling_factor isn't present). Align the decode path with the established Wan2.2 decode logic for AutoencoderKLWan.
Suggested change:

    latents_mean = torch.as_tensor(
        self.vae.config.latents_mean, device=latents.device, dtype=latents.dtype
    )
    latents_std = torch.as_tensor(
        self.vae.config.latents_std, device=latents.device, dtype=latents.dtype
    )
    latents = latents * latents_std + latents_mean
    video = self.vae.decode(latents, return_dict=False)[0]
    # Load scheduler
    self.scheduler = loader.load_scheduler(FlowUniPCMultistepScheduler, "scheduler")
The stage config sets flow_shift, but this pipeline loads the scheduler from disk and never applies od_config.flow_shift. If flow_shift is meant to be configurable (as it is for Wan2.2), initialize/override the scheduler with shift=od_config.flow_shift (or remove the setting from the stage config to avoid a no-op knob).
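If the knob is kept, a minimal sketch of wiring it through after loading could look like the following. This assumes `FlowUniPCMultistepScheduler` is a diffusers-style `ConfigMixin` scheduler that accepts a `shift` entry in its config (as the Wan-style flow-matching UniPC schedulers do); verify against the actual class before adopting it.

    # Sketch only: override the loaded scheduler's shift with the stage-config value.
    self.scheduler = loader.load_scheduler(FlowUniPCMultistepScheduler, "scheduler")
    if getattr(od_config, "flow_shift", None) is not None:
        self.scheduler = FlowUniPCMultistepScheduler.from_config(
            self.scheduler.config, shift=od_config.flow_shift
        )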
    outputs = model.generate(
        prompts=[
            {
                "prompt": args.prompt,
                "multi_modal_data": {"image": image},
            }
        ],
        sampling_params={
            "height": args.height,
            "width": args.width,
            "num_frames": args.num_frames,
            "num_inference_steps": args.num_inference_steps,
            "guidance_scale": args.guidance_scale,
            "seed": args.seed,
        },
    )
This example passes a raw dict to OmniDiffusion.generate(..., sampling_params=...), but OmniDiffusionRequest expects an OmniDiffusionSamplingParams dataclass; passing a dict will crash in OmniDiffusionRequest.__post_init__. Construct OmniDiffusionSamplingParams(**{...}) (or otherwise follow the patterns used by other offline examples).
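A minimal sketch of the fix is below; the import path for `OmniDiffusionSamplingParams` is a guess, so copy the exact import and field names from the existing offline examples.

    # Hypothetical import path; mirror the other offline inference examples.
    from vllm_omni import OmniDiffusionSamplingParams

    sampling_params = OmniDiffusionSamplingParams(
        height=args.height,
        width=args.width,
        num_frames=args.num_frames,
        num_inference_steps=args.num_inference_steps,
        guidance_scale=args.guidance_scale,
        seed=args.seed,
    )
    outputs = model.generate(
        prompts=[{"prompt": args.prompt, "multi_modal_data": {"image": image}}],
        sampling_params=sampling_params,
    )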
    video_frames = output.outputs[0]  # Get the video frames
OmniDiffusion.generate returns OmniRequestOutput objects; in diffusion mode the generated frames/images are exposed via output.images or output.multimodal_output, not output.outputs. Accessing output.outputs[0] will always be empty here. Update the example to read the generated video data from the diffusion output fields (and handle the actual type returned by the SkyReels post-processor).
Suggested change:

    # OmniDiffusion returns OmniRequestOutput; in diffusion mode the
    # generated frames are exposed via `images` or `multimodal_output`,
    # not `outputs`.
    video_data = None
    if getattr(output, "images", None):
        # `images` is typically a list where each element is a sequence of frames.
        video_data = output.images[0]
    elif getattr(output, "multimodal_output", None):
        # Fallback: some pipelines may populate `multimodal_output` instead.
        video_data = output.multimodal_output[0]
    else:
        raise ValueError("No video data found in diffusion output.")
    # The SkyReels post-processor may wrap frames, e.g. {"video": frames}.
    if isinstance(video_data, dict) and "video" in video_data:
        video_frames = video_data["video"]
    else:
        video_frames = video_data
    # Get position indices
    seq_len = t * h * w
    freqs_cos = self.freqs_cos[:seq_len]  # type: ignore
    freqs_sin = self.freqs_sin[:seq_len]  # type: ignore

    return apply_rotary_emb_skyreels(hidden_states, freqs_cos, freqs_sin)
SkyReelsRotaryPosEmbed.forward() currently ignores patch_size and the (t, h, w) decomposition and just slices the first t*h*w positions from precomputed buffers. For 3D rotary this is likely incompatible with the established approach in WanRotaryPosEmbed (which splits temporal/height/width dims and expands them into a (t,h,w) grid). If SkyReels-V3 expects Wan-style 3D rotary, this will produce incorrect position encodings and break weight compatibility; consider reusing the Wan implementation pattern to construct cos/sin for (T',H',W') and return them for attention application.
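For reference, a rough sketch of the Wan-style expansion is shown below. The helper and buffer names are illustrative (not the actual `WanRotaryPosEmbed` API); it assumes the head dim is split into temporal/height/width parts and that per-axis cos (and, identically, sin) tables are precomputed for the post-patchify grid sizes (T', H', W').

    import torch

    def build_3d_rotary(cos_t: torch.Tensor, cos_h: torch.Tensor, cos_w: torch.Tensor,
                        t: int, h: int, w: int) -> torch.Tensor:
        # cos_t/cos_h/cos_w have shape (max_len, axis_dim) for the temporal,
        # height and width axes; the result is a flattened (t*h*w, head_dim) table.
        ct = cos_t[:t].view(t, 1, 1, -1).expand(t, h, w, -1)
        ch = cos_h[:h].view(1, h, 1, -1).expand(t, h, w, -1)
        cw = cos_w[:w].view(1, 1, w, -1).expand(t, h, w, -1)
        # Concatenate per-axis frequencies along the head dim, then flatten the grid.
        return torch.cat([ct, ch, cw], dim=-1).reshape(t * h * w, -1)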
    latents_shape = (
        batch_size * num_videos_per_prompt,
        num_channels_latents,
        num_frames,
        height // self.vae_scale_factor,
        width // self.vae_scale_factor,
Latent shape and VAE decoding look inconsistent with AutoencoderKLWan: the temporal latent length should be (num_frames - 1) // vae_scale_factor_temporal + 1 (and spatial uses scale_factor_spatial), but the code allocates latents with num_frames directly and only uses a single vae_scale_factor. This is likely to break decoding or produce incorrect frame counts. Use separate temporal/spatial scale factors like the Wan2.2 pipelines and compute num_latent_frames accordingly.
Suggested change:

    # Use separate temporal and spatial VAE scale factors, as required by AutoencoderKLWan.
    vae_scale_factor_temporal = getattr(self.vae.config, "scale_factor_temporal", 1)
    vae_scale_factor_spatial = getattr(
        self.vae.config,
        "scale_factor_spatial",
        getattr(self, "vae_scale_factor", 1),
    )
    # Latent temporal length is typically shorter than the decoded frame count.
    num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
    latents_shape = (
        batch_size * num_videos_per_prompt,
        num_channels_latents,
        num_latent_frames,
        height // vae_scale_factor_spatial,
        width // vae_scale_factor_spatial,
    freqs = torch.outer(t, freqs)
    # Repeat interleave for real representation
    freqs_cos = freqs.cos().repeat_interleave(2, dim=-1)
    freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
SkyReelsRotaryPosEmbed._get_1d_rotary_pos_embed returns freqs_cos/sin in the original freqs_dtype (float64 on CUDA), unlike the existing WanRotaryPosEmbed implementation which casts these buffers to float32. Keeping float64 rotary buffers will upcast rotary multiply ops and can significantly hurt performance/memory. Cast the generated cos/sin to float32 (or to the model dtype) before registering/using them, matching the Wan2.2 implementation.
Suggested change:

    freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
    # Cast to float32 to avoid float64 rotary buffers (matches Wan2.2 behavior)
    freqs_cos = freqs_cos.to(torch.float32)
    freqs_sin = freqs_sin.to(torch.float32)
    # Resize image to target dimensions
    image = image.resize(
        (request.sampling_params.width, request.sampling_params.height),  # type: ignore
        PIL.Image.Resampling.LANCZOS,
    )
    prompt["multi_modal_data"]["image"] = image  # type: ignore

    # Preprocess for VAE
    prompt["additional_information"]["preprocessed_image"] = video_processor.preprocess(
        image, height=request.sampling_params.height, width=request.sampling_params.width
    )
    request.prompts[i] = prompt
The pre-process function stores preprocessed_image using diffusers.video_processor.VideoProcessor.preprocess(...), but the pipeline later feeds that tensor into CLIPVisionModel. CLIP expects inputs prepared by CLIPImageProcessor (size/normalization), so passing video-processor tensors (likely full-res 480x832) will either error (positional embeddings) or produce incorrect embeddings. Store CLIP pixel values (via self.image_processor) or keep the PIL image and apply CLIPImageProcessor inside encode_image.
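A minimal sketch of the second option (keeping the PIL image and doing CLIP preprocessing inside `encode_image`), assuming the pipeline holds a transformers `CLIPImageProcessor` as `self.image_processor` and the CLIP vision model as `self.image_encoder` (attribute names are illustrative):

    def encode_image(self, image, device, dtype):
        # CLIP-specific resizing/normalization, instead of the VideoProcessor tensor.
        pixel_values = self.image_processor(images=image, return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device=device, dtype=dtype)
        outputs = self.image_encoder(pixel_values=pixel_values, output_hidden_states=True)
        # Wan-style I2V pipelines typically use the penultimate hidden state.
        return outputs.hidden_states[-2]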
    text_inputs = self.tokenizer(
        prompt,
        padding="max_length",
        max_length=self.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids.to(device)

    # Encode
    prompt_embeds = self.text_encoder(text_input_ids)[0]
encode_prompt tokenizes with padding/truncation but does not pass attention_mask into UMT5EncoderModel. This makes the encoder attend to pad tokens and diverges from the pattern used in other video pipelines (e.g., Wan2.2), degrading prompt embeddings. Pass attention_mask (and ideally trim/pack to seq_lens if desired).
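A minimal sketch of the change inside `encode_prompt` (mirroring the Wan2.2 pattern; whether to also zero out padded positions afterwards is a design choice):

    text_inputs = self.tokenizer(
        prompt,
        padding="max_length",
        max_length=self.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids.to(device)
    attention_mask = text_inputs.attention_mask.to(device)

    # Encode with the attention mask so pad tokens are ignored.
    prompt_embeds = self.text_encoder(text_input_ids, attention_mask=attention_mask)[0]
    # Optionally mask out embeddings at padded positions, as the Wan2.2 pipelines do.
    prompt_embeds = prompt_embeds * attention_mask.unsqueeze(-1).to(prompt_embeds.dtype)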
    model = od_config.model

    # Load model components
    loader = DiffusersPipelineLoader(model, local_files_only=od_config.local_files_only)
Keyword argument `local_files_only` is not a supported parameter name of `DiffusersPipelineLoader.__init__`.
Suggested change:

    loader = DiffusersPipelineLoader(model)
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
    @@ -0,0 +1,195 @@
    #!/usr/bin/env python3
    # SPDX-License-Identifier: Apache-2.0
How is this file different from examples/offline_inference/image_to_video/image_to_video.py?
Is it necessary to create a new script for SkyReels?
Hey @wtomin, I thought that since I'm adding a new model it would be easier to have an implementation ready to try out.
Do you suggest removing this?
If SkyReels V3 falls into one of [image-to-video, text-to-video], I think we can reuse the offline inference script to avoid repetition.
OK, I will check and remove it in a while.
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Gaohan123 left a comment:
Could you please post some generation results in the PR description?
Hi @Gaohan123, thank you for taking a look at this PR. I had access to a cloud GPU when I raised the PR, but unfortunately I don't have it anymore.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
@vllm-omni-reviewer
Hi all, just a gentle follow-up on this PR. I wanted to check if anyone had a chance to take a further look at this.
I feel like this PR is still in progress. As you mentioned, SkyReels models support:
This PR only contains image-to-video pipelines and documents. What about other tasks? BTW, please also update the online serving documents.
@gDINESH13 Sorry for the late reply. Do you have resources now? Any updates?
Hi @Gaohan123, sorry, I was completely offline for a while. I don't have resources; it would be great if I could get some help here. @wtomin yes it does, but I can't test those with my device. It's the hardware blocker I have right now with an M1 Mac.
Purpose
Closes #1093
Add new model SkyReels-V3
Test Plan
I'm currently unable to test the full model due to hardware limitations (MacBook M1 without NVIDIA GPU support).
What I've Verified
Testing Request
I would appreciate it if reviewers with access to NVIDIA GPUs could help test the implementation using:
    python examples/offline_inference/skyreels_v3/image_to_video.py \
        --model Skywork/SkyReels-V3-R2V-14B \
        --image test_image.jpg \
        --prompt "A cinematic video of a person walking" \
        --output-format mp4

Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.