Sky reels model #1266
base: main
Changes from all commits: cdc74ca, 4b0e05e, 2e72647, 635e304, fc4d64a, a7a2156
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,144 @@ | ||
| # SkyReels-V3 Offline Inference Examples | ||
| | ||
| This directory contains examples for using the SkyReels-V3 multimodal video generation models with vLLM-Omni. | ||
| | ||
| ## Models | ||
| | ||
| SkyReels-V3 is a family of multimodal video generation models that support: | ||
| | ||
| - **Image-to-Video (R2V)**: Generate videos from reference images | ||
| - **Video-to-Video (V2V)**: Transform existing videos | ||
| - **Audio-to-Video (A2V)**: Generate videos guided by audio | ||
| | ||
| ### Available Models | ||
| | ||
| - `Skywork/SkyReels-V3-R2V-14B`: Image-to-Video (14B parameters) | ||
| - `Skywork/SkyReels-V3-V2V-14B`: Video-to-Video (14B parameters) | ||
| - `Skywork/SkyReels-V3-A2V-19B`: Audio-to-Video (19B parameters) | ||
| | ||
| ## Installation | ||
| | ||
| Install the required dependencies: | ||
| | ||
| ```bash | ||
| pip install vllm-omni | ||
| pip install imageio imageio-ffmpeg  # For video I/O | ||
| ``` | ||
| | ||
| ## Usage | ||
| | ||
| ### Image-to-Video (R2V) | ||
| | ||
| Generate a video from a reference image: | ||
| | ||
| ```bash | ||
| python image_to_video.py \ | ||
|     --model Skywork/SkyReels-V3-R2V-14B \ | ||
|     --image path/to/your/image.jpg \ | ||
|     --prompt "A person walking through a beautiful garden" \ | ||
|     --height 480 \ | ||
|     --width 832 \ | ||
|     --num-frames 81 \ | ||
|     --num-inference-steps 50 \ | ||
|     --guidance-scale 7.5 \ | ||
|     --seed 42 \ | ||
|     --output-dir ./outputs/skyreels_v3 \ | ||
|     --output-format mp4 | ||
| ``` | ||
| | ||
| ### Parameters | ||
| | ||
| - `--model`: Model name or path (default: `Skywork/SkyReels-V3-R2V-14B`) | ||
| - `--image`: Path to the reference image (required) | ||
| - `--prompt`: Text prompt describing the desired video (default: `"A cinematic video"`) | ||
| - `--negative-prompt`: Negative prompt describing content to avoid (optional, default: empty) | ||
| - `--height`: Video height in pixels (default: 480) | ||
| - `--width`: Video width in pixels (default: 832) | ||
| - `--num-frames`: Number of frames to generate (default: 81) | ||
| - `--num-inference-steps`: Number of denoising steps (default: 50; more steps generally improve quality but take longer) | ||
| - `--guidance-scale`: Classifier-free guidance scale (default: 7.5; higher values follow the prompt more closely) | ||
| - `--seed`: Random seed for reproducibility (default: 42) | ||
| - `--output-dir`: Output directory for generated videos (default: `./outputs/skyreels_v3`) | ||
| - `--output-format`: Output format: `mp4`, `gif`, or `frames` (default: `mp4`) | ||
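The `--guidance-scale` parameter above refers to classifier-free guidance. As a rough illustration of what the scale does, here is a minimal sketch of the commonly used combination rule. It is a generic formulation, not code taken from this PR or from the SkyReels pipeline; `apply_cfg` and the toy values are hypothetical.

```python
def apply_cfg(uncond, cond, guidance_scale):
    """Combine unconditional and prompt-conditioned predictions.

    Generic classifier-free guidance rule (an assumption about the usual
    formulation, not the pipeline's actual code):
        guided = uncond + scale * (cond - uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy two-element "noise predictions" (hypothetical values).
uncond = [0.0, 1.0]
cond = [1.0, 1.0]

print(apply_cfg(uncond, cond, 1.0))  # scale 1.0 reproduces the conditioned prediction
print(apply_cfg(uncond, cond, 7.5))  # larger scales push further toward the prompt
```

Under this rule, a scale of 1.0 returns the conditioned prediction unchanged, and larger values amplify the difference between the two predictions.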
| | ||
| ## Examples | ||
| | ||
| ### Basic Image-to-Video | ||
| | ||
| ```bash | ||
| python image_to_video.py \ | ||
|     --image examples/sample_image.jpg \ | ||
|     --prompt "A cinematic video of the scene" | ||
| ``` | ||
| | ||
| ### High-Quality Generation | ||
| | ||
| ```bash | ||
| python image_to_video.py \ | ||
|     --image examples/sample_image.jpg \ | ||
|     --prompt "A dramatic video with dynamic camera movement" \ | ||
|     --num-inference-steps 100 \ | ||
|     --guidance-scale 9.0 \ | ||
|     --num-frames 121 | ||
| ``` | ||
| | ||
| ### Generate GIF | ||
| | ||
| ```bash | ||
| python image_to_video.py \ | ||
|     --image examples/sample_image.jpg \ | ||
|     --prompt "A looping animation" \ | ||
|     --output-format gif \ | ||
|     --num-frames 49 | ||
| ``` | ||
| | ||
| ## Tips | ||
| | ||
| 1. **Image Quality**: Use high-quality reference images for best results | ||
| 2. **Aspect Ratio**: The model works best at roughly 16:9 (e.g., 832x480, which is exactly 26:15) | ||
| 3. **Frame Count**: More frames produce longer videos but slow down generation | ||
| 4. **Guidance Scale**: | ||
|    - Lower (3-5): More creative, less adherence to prompt | ||
|    - Medium (7-9): Balanced | ||
|    - Higher (10+): Strong prompt adherence, may reduce quality | ||
| 5. **Inference Steps**: 50 steps is usually sufficient; use 100+ for highest quality | ||
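To make the frame-count and aspect-ratio tips concrete, the arithmetic can be sketched as follows. The 24 fps figure is the rate the example script in this PR uses when writing MP4 files (GIFs are written at 12 fps); the helper names `clip_seconds` and `aspect_ratio` are hypothetical.

```python
from fractions import Fraction


def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback duration of a clip with `num_frames` frames at `fps` frames per second."""
    return num_frames / fps


def aspect_ratio(width: int, height: int) -> Fraction:
    """Exact width:height ratio, reduced to lowest terms."""
    return Fraction(width, height)


print(clip_seconds(81, 24))    # 3.375 seconds for the default MP4 settings
print(aspect_ratio(832, 480))  # 26/15
print(float(aspect_ratio(832, 480)), 16 / 9)  # close to, but not exactly, 16:9
```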
| | ||
| ## Performance | ||
| | ||
| - **GPU Memory**: ~24GB VRAM required for R2V-14B model | ||
| - **Generation Time**: ~2-5 minutes for 81 frames on A100 GPU | ||
| - **Batch Size**: Currently supports batch size of 1 | ||
| | ||
| ## Troubleshooting | ||
| | ||
| ### Out of Memory | ||
| | ||
| If you encounter OOM errors: | ||
| - Reduce `--num-frames` | ||
| - Reduce `--height` and `--width` | ||
| - Use a smaller model variant if available | ||
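When reducing `--height` and `--width` together, it helps to keep the aspect ratio roughly constant. The sketch below shows one way to do that. The rounding to a multiple of 16 is an assumption made for illustration (the README does not state which sizes the model accepts), and `scaled_resolution` is a hypothetical helper.

```python
def scaled_resolution(width: int, height: int, scale: float, multiple: int = 16):
    """Scale both dimensions by `scale`, rounding each down to a multiple of `multiple`.

    The default multiple of 16 is an illustrative assumption, not a documented
    requirement of SkyReels-V3.
    """
    def snap(value: int) -> int:
        return max(multiple, int(value * scale) // multiple * multiple)

    return snap(width), snap(height)


width, height = scaled_resolution(832, 480, 0.75)
print(width, height)  # 624 352
print(f"--width {width} --height {height}")
```

At a scale of 0.75, the frame holds roughly 55% of the original pixel count.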
| | ||
| ### Poor Quality | ||
| | ||
| If the output quality is poor: | ||
| - Increase `--num-inference-steps` (try 75-100) | ||
| - Adjust `--guidance-scale` (try 8-10) | ||
| - Use a higher quality reference image | ||
| - Refine your prompt to be more specific | ||
| | ||
| ## Citation | ||
| | ||
| If you use SkyReels-V3 in your research, please cite: | ||
| | ||
| ```bibtex | ||
| @article{skyreels2025, | ||
|   title={SkyReels-V3: Multimodal Video Generation with Unified In-Context Learning}, | ||
|   author={Skywork Team}, | ||
|   journal={arXiv preprint}, | ||
|   year={2025} | ||
| } | ||
| ``` | ||
| | ||
| ## License | ||
| | ||
| SkyReels-V3 models are released under the Skywork License. Please refer to the model card on Hugging Face for details. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,194 @@ | ||
| #!/usr/bin/env python3 | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| | ||
| """ | ||
| SkyReels-V3 Image-to-Video (R2V) Offline Inference Example. | ||
| | ||
| This script demonstrates how to use the SkyReels-V3 R2V model to generate | ||
| videos from reference images using the vLLM-Omni framework. | ||
| | ||
| Usage: | ||
|     python image_to_video.py --model Skywork/SkyReels-V3-R2V-14B \ | ||
|         --image path/to/image.jpg \ | ||
|         --prompt "A person walking in the park" | ||
| """ | ||
| | ||
| import argparse | ||
| import os | ||
| from pathlib import Path | ||
| | ||
| from PIL import Image | ||
| | ||
| from vllm_omni.entrypoints.omni_diffusion import OmniDiffusion | ||
| from vllm_omni.inputs.data import OmniDiffusionSamplingParams | ||
| from vllm_omni.outputs import OmniRequestOutput | ||
| | ||
| | ||
| def main(): | ||
|     parser = argparse.ArgumentParser(description="SkyReels-V3 Image-to-Video Generation") | ||
|     parser.add_argument( | ||
|         "--model", | ||
|         type=str, | ||
|         default="Skywork/SkyReels-V3-R2V-14B", | ||
|         help="Model name or path (default: Skywork/SkyReels-V3-R2V-14B)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--image", | ||
|         type=str, | ||
|         required=True, | ||
|         help="Path to the reference image", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--prompt", | ||
|         type=str, | ||
|         default="A cinematic video", | ||
|         help="Text prompt describing the desired video", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--negative-prompt", | ||
|         type=str, | ||
|         default="", | ||
|         help="Negative prompt (optional)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--height", | ||
|         type=int, | ||
|         default=480, | ||
|         help="Video height (default: 480)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--width", | ||
|         type=int, | ||
|         default=832, | ||
|         help="Video width (default: 832)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--num-frames", | ||
|         type=int, | ||
|         default=81, | ||
|         help="Number of frames to generate (default: 81)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--num-inference-steps", | ||
|         type=int, | ||
|         default=50, | ||
|         help="Number of denoising steps (default: 50)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--guidance-scale", | ||
|         type=float, | ||
|         default=7.5, | ||
|         help="Guidance scale for classifier-free guidance (default: 7.5)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--seed", | ||
|         type=int, | ||
|         default=42, | ||
|         help="Random seed for reproducibility (default: 42)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--output-dir", | ||
|         type=str, | ||
|         default="./outputs/skyreels_v3", | ||
|         help="Output directory for generated videos (default: ./outputs/skyreels_v3)", | ||
|     ) | ||
|     parser.add_argument( | ||
|         "--output-format", | ||
|         type=str, | ||
|         default="mp4", | ||
|         choices=["mp4", "gif", "frames"], | ||
|         help="Output format: mp4, gif, or frames (default: mp4)", | ||
|     ) | ||
| | ||
|     args = parser.parse_args() | ||
| | ||
|     # Create output directory | ||
|     output_dir = Path(args.output_dir) | ||
|     output_dir.mkdir(parents=True, exist_ok=True) | ||
| | ||
|     # Load reference image | ||
|     if not os.path.exists(args.image): | ||
|         raise FileNotFoundError(f"Image not found: {args.image}") | ||
| | ||
|     image = Image.open(args.image).convert("RGB") | ||
|     print(f"Loaded reference image: {args.image} ({image.size})") | ||
| | ||
|     # Initialize the model | ||
|     print(f"Loading SkyReels-V3 model: {args.model}") | ||
|     model = OmniDiffusion( | ||
|         model=args.model, | ||
|         model_class_name="SkyReelsV3R2VPipeline", | ||
|         trust_remote_code=True, | ||
|     ) | ||
| | ||
|     # Prepare the request | ||
|     print(f"\nGenerating video with prompt: '{args.prompt}'") | ||
|     print("Parameters:") | ||
|     print(f"  - Resolution: {args.width}x{args.height}") | ||
|     print(f"  - Frames: {args.num_frames}") | ||
|     print(f"  - Steps: {args.num_inference_steps}") | ||
|     print(f"  - Guidance Scale: {args.guidance_scale}") | ||
|     print(f"  - Seed: {args.seed}") | ||
| | ||
|     # Generate video | ||
|     # NOTE: args.negative_prompt is parsed above but is not forwarded to generate() here. | ||
|     outputs = model.generate( | ||
|         prompts=[ | ||
|             { | ||
|                 "prompt": args.prompt, | ||
|                 "multi_modal_data": {"image": image}, | ||
|             } | ||
|         ], | ||
|         sampling_params=OmniDiffusionSamplingParams( | ||
|             height=args.height, | ||
|             width=args.width, | ||
|             num_frames=args.num_frames, | ||
|             num_inference_steps=args.num_inference_steps, | ||
|             guidance_scale=args.guidance_scale, | ||
|             seed=args.seed, | ||
|         ), | ||
|     ) | ||
Comment on lines +135 to +150
| | ||
|     # Save the generated video | ||
|     for idx, output in enumerate(outputs): | ||
|         # Extract video frames from OmniRequestOutput | ||
|         video_frames = None | ||
|         if isinstance(output, OmniRequestOutput): | ||
|             # In diffusion mode, output.images is the full list of frames | ||
|             if hasattr(output, "images") and output.images: | ||
|                 video_frames = output.images | ||
|             else: | ||
|                 raise ValueError("No video data found in diffusion output.") | ||
|         else: | ||
|             raise TypeError(f"Unexpected output type: {type(output)}") | ||
| | ||
|         if args.output_format == "mp4": | ||
|             output_path = output_dir / f"video_{idx:04d}.mp4" | ||
|             # Save as MP4 video (imageio is imported lazily; "frames" output does not need it) | ||
|             import imageio | ||
| | ||
|             imageio.mimsave(output_path, video_frames, fps=24, codec="libx264") | ||
|             print(f"\nSaved video to: {output_path}") | ||
| | ||
|         elif args.output_format == "gif": | ||
|             output_path = output_dir / f"video_{idx:04d}.gif" | ||
|             # Save as GIF | ||
|             import imageio | ||
| | ||
|             imageio.mimsave(output_path, video_frames, fps=12) | ||
|             print(f"\nSaved GIF to: {output_path}") | ||
| | ||
|         else:  # frames | ||
|             frames_dir = output_dir / f"video_{idx:04d}_frames" | ||
|             frames_dir.mkdir(exist_ok=True) | ||
|             # Save individual frames; accept either PIL images or array-like frames | ||
|             for frame_idx, frame in enumerate(video_frames): | ||
|                 frame_path = frames_dir / f"frame_{frame_idx:04d}.png" | ||
|                 pil_frame = frame if isinstance(frame, Image.Image) else Image.fromarray(frame) | ||
|                 pil_frame.save(frame_path) | ||
|             print(f"\nSaved {len(video_frames)} frames to: {frames_dir}") | ||
| | ||
|     print("\nGeneration complete!") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
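The argument parser in the script above accepts any integer for `--height`, `--width`, and `--num-frames`, so a zero or negative value would only fail later, after the model has loaded. One way to reject such values at parse time is an argparse `type=` callable. This is a minimal sketch, not part of the PR; `positive_int` and `build_parser` are hypothetical names, and only the three sizing flags are shown.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse `type=` callable that rejects zero and negative values."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    # Defaults mirror the ones used by the example script.
    parser = argparse.ArgumentParser(description="Validated sizing arguments (sketch)")
    parser.add_argument("--height", type=positive_int, default=480)
    parser.add_argument("--width", type=positive_int, default=832)
    parser.add_argument("--num-frames", type=positive_int, default=81)
    return parser


args = build_parser().parse_args(["--num-frames", "49"])
print(args.width, args.height, args.num_frames)  # 832 480 49
```

With this in place, an invocation such as `--height 0` exits with a usage error before any model weights are loaded.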
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| """SkyReels-V3 multimodal video generation models.""" | ||
| | ||
| from .pipeline_skyreels_v3_r2v import ( | ||
|     SkyReelsV3R2VPipeline, | ||
|     get_skyreels_v3_r2v_post_process_func, | ||
|     get_skyreels_v3_r2v_pre_process_func, | ||
| ) | ||
| from .skyreels_v3_transformer import SkyReelsTransformer3DModel | ||
| | ||
| __all__ = [ | ||
|     "SkyReelsV3R2VPipeline", | ||
|     "get_skyreels_v3_r2v_post_process_func", | ||
|     "get_skyreels_v3_r2v_pre_process_func", | ||
|     "SkyReelsTransformer3DModel", | ||
| ] |
How is this file different from `examples/offline_inference/image_to_video/image_to_video.py`? Is it necessary to create a new script for SkyReels?
Hey @wtomin, since I am adding a new model, I thought it would be much easier to have an implementation ready to try out.
Do you suggest removing this?
If skyreel v3 falls into one of [image-to-video, text-to-video], I think we can reuse the offline inference script to avoid repetition.
OK, I will check and remove it shortly.