Sky reels model #1266
Conversation
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Pull request overview
Adds initial support for the Skywork SkyReels-V3 R2V (image-to-video) diffusion model to vllm-omni, including a new pipeline/transformer implementation, registry wiring, a stage config, and offline inference examples.
Changes:
- Added a new SkyReels-V3 R2V diffusion pipeline + transformer implementation under `vllm_omni/diffusion/models/skyreels_v3/`.
- Registered the pipeline and its pre/post-process hooks in the diffusion registry.
- Added a stage config YAML and offline inference example + README for SkyReels-V3.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `vllm_omni/model_executor/stage_configs/skyreels_v3_r2v.yaml` | New single-stage diffusion config for running SkyReels-V3 R2V. |
| `vllm_omni/diffusion/registry.py` | Registers the SkyReels pipeline and pre/post-process function hooks. |
| `vllm_omni/diffusion/models/skyreels_v3/skyreels_v3_transformer.py` | New transformer + rotary embedding implementation for SkyReels-V3. |
| `vllm_omni/diffusion/models/skyreels_v3/pipeline_skyreels_v3_r2v.py` | New R2V diffusion pipeline using UMT5 + CLIP + Wan VAE + FlowUniPC scheduler. |
| `vllm_omni/diffusion/models/skyreels_v3/__init__.py` | Exposes the new pipeline and helper functions. |
| `examples/offline_inference/skyreels_v3/image_to_video.py` | Adds an offline inference script for SkyReels-V3 R2V. |
| `examples/offline_inference/skyreels_v3/README.md` | Documentation for running the new offline example(s). |
    latents = latents / self.vae.config.scaling_factor
    video = self.vae.decode(latents).sample
Decoding uses latents / self.vae.config.scaling_factor followed by self.vae.decode(latents).sample, but AutoencoderKLWan decoding elsewhere in the repo uses latents_mean/latents_std normalization and decode(..., return_dict=False). Using the wrong normalization will produce incorrect outputs (or fail if scaling_factor isn't present). Align the decode path with the established Wan2.2 decode logic for AutoencoderKLWan.
Suggested change:

    latents_mean = torch.as_tensor(
        self.vae.config.latents_mean, device=latents.device, dtype=latents.dtype
    )
    latents_std = torch.as_tensor(
        self.vae.config.latents_std, device=latents.device, dtype=latents.dtype
    )
    latents = latents * latents_std + latents_mean
    video = self.vae.decode(latents, return_dict=False)[0]
    # Load scheduler
    self.scheduler = loader.load_scheduler(FlowUniPCMultistepScheduler, "scheduler")
The stage config sets flow_shift, but this pipeline loads the scheduler from disk and never applies od_config.flow_shift. If flow_shift is meant to be configurable (as it is for Wan2.2), initialize/override the scheduler with shift=od_config.flow_shift (or remove the setting from the stage config to avoid a no-op knob).
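If the knob is kept, a minimal sketch of wiring it through after loading could look like the following. This assumes `FlowUniPCMultistepScheduler` is a diffusers-style `ConfigMixin` scheduler that accepts a `shift` entry in its config (as the Wan-style flow-matching UniPC schedulers do); verify against the actual class before adopting it.

    # Sketch only: override the loaded scheduler's shift with the stage-config value.
    self.scheduler = loader.load_scheduler(FlowUniPCMultistepScheduler, "scheduler")
    if getattr(od_config, "flow_shift", None) is not None:
        self.scheduler = FlowUniPCMultistepScheduler.from_config(
            self.scheduler.config, shift=od_config.flow_shift
        )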
    outputs = model.generate(
        prompts=[
            {
                "prompt": args.prompt,
                "multi_modal_data": {"image": image},
            }
        ],
        sampling_params={
            "height": args.height,
            "width": args.width,
            "num_frames": args.num_frames,
            "num_inference_steps": args.num_inference_steps,
            "guidance_scale": args.guidance_scale,
            "seed": args.seed,
        },
    )
This example passes a raw dict to OmniDiffusion.generate(..., sampling_params=...), but OmniDiffusionRequest expects an OmniDiffusionSamplingParams dataclass; passing a dict will crash in OmniDiffusionRequest.__post_init__. Construct OmniDiffusionSamplingParams(**{...}) (or otherwise follow the patterns used by other offline examples).
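A minimal sketch of the fix is below; the import path for `OmniDiffusionSamplingParams` is a guess, so copy the exact import and field names from the existing offline examples.

    # Hypothetical import path; mirror the other offline inference examples.
    from vllm_omni import OmniDiffusionSamplingParams

    sampling_params = OmniDiffusionSamplingParams(
        height=args.height,
        width=args.width,
        num_frames=args.num_frames,
        num_inference_steps=args.num_inference_steps,
        guidance_scale=args.guidance_scale,
        seed=args.seed,
    )
    outputs = model.generate(
        prompts=[{"prompt": args.prompt, "multi_modal_data": {"image": image}}],
        sampling_params=sampling_params,
    )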
    video_frames = output.outputs[0]  # Get the video frames
OmniDiffusion.generate returns OmniRequestOutput objects; in diffusion mode the generated frames/images are exposed via output.images or output.multimodal_output, not output.outputs. Accessing output.outputs[0] will always be empty here. Update the example to read the generated video data from the diffusion output fields (and handle the actual type returned by the SkyReels post-processor).
Suggested change:

    # OmniDiffusion returns OmniRequestOutput; in diffusion mode the
    # generated frames are exposed via `images` or `multimodal_output`,
    # not `outputs`.
    video_data = None
    if getattr(output, "images", None):
        # `images` is typically a list where each element is a sequence of frames.
        video_data = output.images[0]
    elif getattr(output, "multimodal_output", None):
        # Fallback: some pipelines may populate `multimodal_output` instead.
        video_data = output.multimodal_output[0]
    else:
        raise ValueError("No video data found in diffusion output.")
    # The SkyReels post-processor may wrap frames, e.g. {"video": frames}.
    if isinstance(video_data, dict) and "video" in video_data:
        video_frames = video_data["video"]
    else:
        video_frames = video_data
    # Get position indices
    seq_len = t * h * w
    freqs_cos = self.freqs_cos[:seq_len]  # type: ignore
    freqs_sin = self.freqs_sin[:seq_len]  # type: ignore

    return apply_rotary_emb_skyreels(hidden_states, freqs_cos, freqs_sin)
SkyReelsRotaryPosEmbed.forward() currently ignores patch_size and the (t, h, w) decomposition and just slices the first t*h*w positions from precomputed buffers. For 3D rotary this is likely incompatible with the established approach in WanRotaryPosEmbed (which splits temporal/height/width dims and expands them into a (t,h,w) grid). If SkyReels-V3 expects Wan-style 3D rotary, this will produce incorrect position encodings and break weight compatibility; consider reusing the Wan implementation pattern to construct cos/sin for (T',H',W') and return them for attention application.
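For reference, a rough sketch of the Wan-style expansion is shown below. The helper and buffer names are illustrative (not the actual `WanRotaryPosEmbed` API); it assumes the head dim is split into temporal/height/width parts and that per-axis cos (and, identically, sin) tables are precomputed for the post-patchify grid sizes (T', H', W').

    import torch

    def build_3d_rotary(cos_t: torch.Tensor, cos_h: torch.Tensor, cos_w: torch.Tensor,
                        t: int, h: int, w: int) -> torch.Tensor:
        # cos_t/cos_h/cos_w have shape (max_len, axis_dim) for the temporal,
        # height and width axes; the result is a flattened (t*h*w, head_dim) table.
        ct = cos_t[:t].view(t, 1, 1, -1).expand(t, h, w, -1)
        ch = cos_h[:h].view(1, h, 1, -1).expand(t, h, w, -1)
        cw = cos_w[:w].view(1, 1, w, -1).expand(t, h, w, -1)
        # Concatenate per-axis frequencies along the head dim, then flatten the grid.
        return torch.cat([ct, ch, cw], dim=-1).reshape(t * h * w, -1)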
    latents_shape = (
        batch_size * num_videos_per_prompt,
        num_channels_latents,
        num_frames,
        height // self.vae_scale_factor,
        width // self.vae_scale_factor,
Latent shape and VAE decoding look inconsistent with AutoencoderKLWan: the temporal latent length should be (num_frames - 1) // vae_scale_factor_temporal + 1 (and spatial uses scale_factor_spatial), but the code allocates latents with num_frames directly and only uses a single vae_scale_factor. This is likely to break decoding or produce incorrect frame counts. Use separate temporal/spatial scale factors like the Wan2.2 pipelines and compute num_latent_frames accordingly.
Suggested change:

    # Use separate temporal and spatial VAE scale factors, as required by AutoencoderKLWan.
    vae_scale_factor_temporal = getattr(self.vae.config, "scale_factor_temporal", 1)
    vae_scale_factor_spatial = getattr(
        self.vae.config,
        "scale_factor_spatial",
        getattr(self, "vae_scale_factor", 1),
    )
    # Latent temporal length is typically shorter than the decoded frame count.
    num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
    latents_shape = (
        batch_size * num_videos_per_prompt,
        num_channels_latents,
        num_latent_frames,
        height // vae_scale_factor_spatial,
        width // vae_scale_factor_spatial,
    freqs = torch.outer(t, freqs)
    # Repeat interleave for real representation
    freqs_cos = freqs.cos().repeat_interleave(2, dim=-1)
    freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
SkyReelsRotaryPosEmbed._get_1d_rotary_pos_embed returns freqs_cos/sin in the original freqs_dtype (float64 on CUDA), unlike the existing WanRotaryPosEmbed implementation which casts these buffers to float32. Keeping float64 rotary buffers will upcast rotary multiply ops and can significantly hurt performance/memory. Cast the generated cos/sin to float32 (or to the model dtype) before registering/using them, matching the Wan2.2 implementation.
Suggested change:

    freqs_sin = freqs.sin().repeat_interleave(2, dim=-1)
    # Cast to float32 to avoid float64 rotary buffers (matches Wan2.2 behavior)
    freqs_cos = freqs_cos.to(torch.float32)
    freqs_sin = freqs_sin.to(torch.float32)
    # Resize image to target dimensions
    image = image.resize(
        (request.sampling_params.width, request.sampling_params.height),  # type: ignore
        PIL.Image.Resampling.LANCZOS,
    )
    prompt["multi_modal_data"]["image"] = image  # type: ignore

    # Preprocess for VAE
    prompt["additional_information"]["preprocessed_image"] = video_processor.preprocess(
        image, height=request.sampling_params.height, width=request.sampling_params.width
    )
    request.prompts[i] = prompt
The pre-process function stores preprocessed_image using diffusers.video_processor.VideoProcessor.preprocess(...), but the pipeline later feeds that tensor into CLIPVisionModel. CLIP expects inputs prepared by CLIPImageProcessor (size/normalization), so passing video-processor tensors (likely full-res 480x832) will either error (positional embeddings) or produce incorrect embeddings. Store CLIP pixel values (via self.image_processor) or keep the PIL image and apply CLIPImageProcessor inside encode_image.
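A minimal sketch of the second option (keeping the PIL image and doing CLIP preprocessing inside `encode_image`), assuming the pipeline holds a transformers `CLIPImageProcessor` as `self.image_processor` and the CLIP vision model as `self.image_encoder` (attribute names are illustrative):

    def encode_image(self, image, device, dtype):
        # CLIP-specific resizing/normalization, instead of the VideoProcessor tensor.
        pixel_values = self.image_processor(images=image, return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device=device, dtype=dtype)
        outputs = self.image_encoder(pixel_values=pixel_values, output_hidden_states=True)
        # Wan-style I2V pipelines typically use the penultimate hidden state.
        return outputs.hidden_states[-2]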
    text_inputs = self.tokenizer(
        prompt,
        padding="max_length",
        max_length=self.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids.to(device)

    # Encode
    prompt_embeds = self.text_encoder(text_input_ids)[0]
encode_prompt tokenizes with padding/truncation but does not pass attention_mask into UMT5EncoderModel. This makes the encoder attend to pad tokens and diverges from the pattern used in other video pipelines (e.g., Wan2.2), degrading prompt embeddings. Pass attention_mask (and ideally trim/pack to seq_lens if desired).
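A minimal sketch of the change inside `encode_prompt` (mirroring the Wan2.2 pattern; whether to also zero out padded positions afterwards is a design choice):

    text_inputs = self.tokenizer(
        prompt,
        padding="max_length",
        max_length=self.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids.to(device)
    attention_mask = text_inputs.attention_mask.to(device)

    # Encode with the attention mask so pad tokens are ignored.
    prompt_embeds = self.text_encoder(text_input_ids, attention_mask=attention_mask)[0]
    # Optionally mask out embeddings at padded positions, as the Wan2.2 pipelines do.
    prompt_embeds = prompt_embeds * attention_mask.unsqueeze(-1).to(prompt_embeds.dtype)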
    model = od_config.model

    # Load model components
    loader = DiffusersPipelineLoader(model, local_files_only=od_config.local_files_only)
Keyword argument `local_files_only` is not a supported parameter name of `DiffusersPipelineLoader.__init__`.
Suggested change:

    loader = DiffusersPipelineLoader(model)
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
    @@ -0,0 +1,195 @@
    #!/usr/bin/env python3
    # SPDX-License-Identifier: Apache-2.0
How is this file different from examples/offline_inference/image_to_video/image_to_video.py?
Is it necessary to create a new script for SkyReels?
Hey @wtomin, I thought that since I'm adding a new model it would be easier to have an implementation ready to try out.
Do you suggest removing this?
If SkyReels V3 falls into one of [image-to-video, text-to-video], I think we can reuse the offline inference script to avoid repetition.
OK, I will check and remove it in a while.
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Gaohan123 left a comment:
Could you please post some generation results in the PR description?
Hi @Gaohan123, thank you for taking a look at this PR. I had access to a cloud GPU when I raised the PR, but unfortunately I don't have it anymore.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
@vllm-omni-reviewer
Hi all, just a gentle follow-up on this PR. I wanted to check if anyone had a chance to take a further look at this.
I feel like this PR is still in progress. As you mentioned, SkyReels models support:
This PR only contains image-to-video pipelines and documents. What about other tasks? BTW, please also update the online serving documents.
@gDINESH13 Sorry for the late reply. Do you have resources now? Any updates?
Hi @Gaohan123, sorry, I was completely offline for a while. I don't have resources; it would be great if I could get some help here. @wtomin yes it does, but I can't test those with my device. It's the hardware blocker I have right now with an M1 Mac.
Purpose
Closes #1093
Add new model SkyReels-V3
Test Plan
I'm currently unable to test the full model due to hardware limitations (MacBook M1 without NVIDIA GPU support).
What I've Verified
Testing Request
I would appreciate it if reviewers with access to NVIDIA GPUs could help test the implementation using:
    python examples/offline_inference/skyreels_v3/image_to_video.py \
        --model Skywork/SkyReels-V3-R2V-14B \
        --image test_image.jpg \
        --prompt "A cinematic video of a person walking" \
        --output-format mp4

Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.