144 changes: 144 additions & 0 deletions examples/offline_inference/skyreels_v3/README.md
@@ -0,0 +1,144 @@
# SkyReels-V3 Offline Inference Examples

This directory contains examples for using the SkyReels-V3 multimodal video generation models with vLLM-Omni.

## Models

SkyReels-V3 is a family of multimodal video generation models that support:

- **Image-to-Video (R2V, reference-to-video)**: Generate videos from reference images
- **Video-to-Video (V2V)**: Transform existing videos
- **Audio-to-Video (A2V)**: Generate videos guided by audio

### Available Models

- `Skywork/SkyReels-V3-R2V-14B`: Image-to-Video (14B parameters)
- `Skywork/SkyReels-V3-V2V-14B`: Video-to-Video (14B parameters)
- `Skywork/SkyReels-V3-A2V-19B`: Audio-to-Video (19B parameters)

## Installation

Install the required dependencies:

```bash
pip install vllm-omni
pip install imageio imageio-ffmpeg # For video I/O
```

## Usage

### Image-to-Video (R2V)

Generate a video from a reference image:

```bash
python image_to_video.py \
--model Skywork/SkyReels-V3-R2V-14B \
--image path/to/your/image.jpg \
--prompt "A person walking through a beautiful garden" \
--height 480 \
--width 832 \
--num-frames 81 \
--num-inference-steps 50 \
--guidance-scale 7.5 \
--seed 42 \
--output-dir ./outputs/skyreels_v3 \
--output-format mp4
```

### Parameters

- `--model`: Model name or path (default: `Skywork/SkyReels-V3-R2V-14B`)
- `--image`: Path to the reference image (required)
- `--prompt`: Text prompt describing the desired video
- `--negative-prompt`: Negative prompt to avoid certain content (optional)
- `--height`: Video height in pixels (default: 480)
- `--width`: Video width in pixels (default: 832)
- `--num-frames`: Number of frames to generate (default: 81)
- `--num-inference-steps`: Number of denoising steps (default: 50, higher = better quality but slower)
- `--guidance-scale`: Classifier-free guidance scale (default: 7.5, higher = more prompt adherence)
- `--seed`: Random seed for reproducibility (default: 42)
- `--output-dir`: Output directory for generated videos
- `--output-format`: Output format: `mp4`, `gif`, or `frames`

## Examples

### Basic Image-to-Video

```bash
python image_to_video.py \
--image examples/sample_image.jpg \
--prompt "A cinematic video of the scene"
```

### High-Quality Generation

```bash
python image_to_video.py \
--image examples/sample_image.jpg \
--prompt "A dramatic video with dynamic camera movement" \
--num-inference-steps 100 \
--guidance-scale 9.0 \
--num-frames 121
```

### Generate GIF

```bash
python image_to_video.py \
--image examples/sample_image.jpg \
--prompt "A looping animation" \
--output-format gif \
--num-frames 49
```

## Tips

1. **Image Quality**: Use high-quality reference images for best results
2. **Aspect Ratio**: The model works best at aspect ratios close to 16:9 (e.g., 832x480, which is about 1.73:1)
3. **Frame Count**: More frames = longer videos but slower generation
4. **Guidance Scale**:
   - Lower (3-5): More creative, less adherence to prompt
   - Medium (7-9): Balanced
   - Higher (10+): Strong prompt adherence, may reduce quality
5. **Inference Steps**: 50 steps is usually sufficient; 100+ for highest quality
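
The aspect-ratio tip can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper names `aspect_ratio` and `relative_gap` are hypothetical, and the code says nothing about which resolutions the model actually accepts.

```python
from fractions import Fraction


def aspect_ratio(width: int, height: int) -> Fraction:
    """Return width:height as a reduced fraction."""
    return Fraction(width, height)


def relative_gap(width: int, height: int,
                 target: Fraction = Fraction(16, 9)) -> float:
    """Return the distance from the target ratio, as a fraction of the target."""
    return float(abs(aspect_ratio(width, height) - target) / target)


if __name__ == "__main__":
    print(aspect_ratio(832, 480))           # 26/15
    print(f"{relative_gap(832, 480):.1%}")  # 2.5%
```

So the default 832x480 is slightly narrower than true 16:9, which is why "close to 16:9" is the more precise reading of the tip.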

## Performance

- **GPU Memory**: ~24GB VRAM required for the R2V-14B model
- **Generation Time**: ~2-5 minutes for 81 frames on an A100 GPU
- **Batch Size**: Currently supports a batch size of 1

## Troubleshooting

### Out of Memory

If you encounter OOM errors:
- Reduce `--num-frames`
- Reduce `--height` and `--width`
- Use a smaller model variant if available
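
As a rough guide to how much the first two options help, the sketch below compares the number of generated pixels (frames x height x width) against the README defaults. Treating memory and compute as proportional to this product is a simplifying assumption, not a measured property of SkyReels-V3, and `relative_volume` is a hypothetical helper.

```python
def relative_volume(num_frames: int, height: int, width: int,
                    base: tuple[int, int, int] = (81, 480, 832)) -> float:
    """Return frames*height*width relative to the README defaults."""
    base_frames, base_height, base_width = base
    return (num_frames * height * width) / (base_frames * base_height * base_width)


if __name__ == "__main__":
    # Fewer frames at the default resolution.
    print(round(relative_volume(49, 480, 832), 2))
    # Default frame count at half the height and half the width.
    print(round(relative_volume(81, 240, 416), 2))
```

Under this assumption, halving both dimensions cuts the pixel volume to a quarter, while dropping from 81 to 49 frames removes roughly 40% of it.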

### Poor Quality

If the output quality is poor:
- Increase `--num-inference-steps` (try 75-100)
- Adjust `--guidance-scale` (try 8-10)
- Use a higher quality reference image
- Refine your prompt to be more specific

## Citation

If you use SkyReels-V3 in your research, please cite:

```bibtex
@article{skyreels2025,
title={SkyReels-V3: Multimodal Video Generation with Unified In-Context Learning},
author={Skywork Team},
journal={arXiv preprint},
year={2025}
}
```

## License

SkyReels-V3 models are released under the Skywork License. Please refer to the model card on Hugging Face for details.
194 changes: 194 additions & 0 deletions examples/offline_inference/skyreels_v3/image_to_video.py
@@ -0,0 +1,194 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0

Collaborator

How is this file different from examples/offline_inference/image_to_video/image_to_video.py?

Is it necessary to create a new script for SkyReels?

Contributor Author

Hey @wtomin, I thought that since I am adding a new model, it would be much easier to have an implementation ready to try out.
Do you suggest removing this?

Collaborator

If SkyReels-V3 falls into one of [image-to-video, text-to-video], I think we can reuse the existing offline inference script to avoid repetition.

Contributor Author

OK, I will check and remove it in a while.

# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""
SkyReels-V3 Image-to-Video (R2V) Offline Inference Example.

This script demonstrates how to use the SkyReels-V3 R2V model to generate
videos from reference images using the vLLM-Omni framework.

Usage:
python image_to_video.py --model Skywork/SkyReels-V3-R2V-14B \
--image path/to/image.jpg \
--prompt "A person walking in the park"
"""

import argparse
import os
from pathlib import Path

from PIL import Image

from vllm_omni.entrypoints.omni_diffusion import OmniDiffusion
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.outputs import OmniRequestOutput


def main():
    parser = argparse.ArgumentParser(description="SkyReels-V3 Image-to-Video Generation")
    parser.add_argument(
        "--model",
        type=str,
        default="Skywork/SkyReels-V3-R2V-14B",
        help="Model name or path (default: Skywork/SkyReels-V3-R2V-14B)",
    )
    parser.add_argument(
        "--image",
        type=str,
        required=True,
        help="Path to the reference image",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default="A cinematic video",
        help="Text prompt describing the desired video",
    )
    parser.add_argument(
        "--negative-prompt",
        type=str,
        default="",
        help="Negative prompt (optional)",
    )
    parser.add_argument(
        "--height",
        type=int,
        default=480,
        help="Video height (default: 480)",
    )
    parser.add_argument(
        "--width",
        type=int,
        default=832,
        help="Video width (default: 832)",
    )
    parser.add_argument(
        "--num-frames",
        type=int,
        default=81,
        help="Number of frames to generate (default: 81)",
    )
    parser.add_argument(
        "--num-inference-steps",
        type=int,
        default=50,
        help="Number of denoising steps (default: 50)",
    )
    parser.add_argument(
        "--guidance-scale",
        type=float,
        default=7.5,
        help="Guidance scale for classifier-free guidance (default: 7.5)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed for reproducibility (default: 42)",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="./outputs/skyreels_v3",
        help="Output directory for generated videos (default: ./outputs/skyreels_v3)",
    )
    parser.add_argument(
        "--output-format",
        type=str,
        default="mp4",
        choices=["mp4", "gif", "frames"],
        help="Output format: mp4, gif, or frames (default: mp4)",
    )

    args = parser.parse_args()

    # Create output directory
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load reference image
    if not os.path.exists(args.image):
        raise FileNotFoundError(f"Image not found: {args.image}")

    image = Image.open(args.image).convert("RGB")
    print(f"Loaded reference image: {args.image} ({image.size})")

    # Initialize the model
    print(f"Loading SkyReels-V3 model: {args.model}")
    model = OmniDiffusion(
        model=args.model,
        model_class_name="SkyReelsV3R2VPipeline",
        trust_remote_code=True,
    )

    # Prepare the request
    print(f"\nGenerating video with prompt: '{args.prompt}'")
    print("Parameters:")
    print(f" - Resolution: {args.width}x{args.height}")
    print(f" - Frames: {args.num_frames}")
    print(f" - Steps: {args.num_inference_steps}")
    print(f" - Guidance Scale: {args.guidance_scale}")
    print(f" - Seed: {args.seed}")

    # Generate video
    # NOTE: args.negative_prompt is parsed above but is not forwarded to the request here.
    outputs = model.generate(
        prompts=[
            {
                "prompt": args.prompt,
                "multi_modal_data": {"image": image},
            }
        ],
        sampling_params=OmniDiffusionSamplingParams(
            height=args.height,
            width=args.width,
            num_frames=args.num_frames,
            num_inference_steps=args.num_inference_steps,
            guidance_scale=args.guidance_scale,
            seed=args.seed,
        ),
    )
Comment on lines +135 to +150

Copilot AI Feb 8, 2026

This example passes a raw dict to OmniDiffusion.generate(..., sampling_params=...), but OmniDiffusionRequest expects an OmniDiffusionSamplingParams dataclass; passing a dict will crash in OmniDiffusionRequest.__post_init__. Construct OmniDiffusionSamplingParams(**{...}) (or otherwise follow the patterns used by other offline examples).
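
The failure mode described in this comment can be reproduced with plain dataclasses. The sketch below is a minimal illustration only: `SamplingParams` and `Request` are hypothetical stand-ins, not the actual `OmniDiffusionSamplingParams` or `OmniDiffusionRequest` classes, and the validation in `__post_init__` is assumed for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class SamplingParams:
    """Hypothetical stand-in for a sampling-params dataclass."""
    height: int = 480
    width: int = 832
    num_frames: int = 81


@dataclass
class Request:
    """Hypothetical stand-in for a request that validates its params."""
    prompt: str
    sampling_params: SamplingParams

    def __post_init__(self) -> None:
        # Attribute access works on the dataclass but not on a plain dict.
        if self.sampling_params.num_frames <= 0:
            raise ValueError("num_frames must be positive")


raw = {"height": 480, "width": 832, "num_frames": 81}

try:
    Request(prompt="demo", sampling_params=raw)  # a dict has no .num_frames
except AttributeError as exc:
    print(f"dict failed: {exc}")

# Unpacking the dict into the dataclass is the pattern the comment suggests.
request = Request(prompt="demo", sampling_params=SamplingParams(**raw))
print(request.sampling_params.num_frames)
```

Type annotations on dataclass fields are not enforced at runtime, so the dict is accepted by the constructor and only fails once `__post_init__` touches an attribute.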

    # Save the generated video
    for idx, output in enumerate(outputs):
        # Extract video frames from OmniRequestOutput
        video_frames = None
        if isinstance(output, OmniRequestOutput):
            # In diffusion mode, output.images is the full list of frames
            if hasattr(output, "images") and output.images:
                video_frames = output.images
            else:
                raise ValueError("No video data found in diffusion output.")
        else:
            raise TypeError(f"Unexpected output type: {type(output)}")

        if args.output_format == "mp4":
            output_path = output_dir / f"video_{idx:04d}.mp4"
            # Save as MP4 video
            import imageio

            imageio.mimsave(output_path, video_frames, fps=24, codec="libx264")
            print(f"\nSaved video to: {output_path}")

        elif args.output_format == "gif":
            output_path = output_dir / f"video_{idx:04d}.gif"
            # Save as GIF
            import imageio

            imageio.mimsave(output_path, video_frames, fps=12)
            print(f"\nSaved GIF to: {output_path}")

        else:  # frames
            frames_dir = output_dir / f"video_{idx:04d}_frames"
            frames_dir.mkdir(exist_ok=True)
            # Save individual frames
            for frame_idx, frame in enumerate(video_frames):
                frame_path = frames_dir / f"frame_{frame_idx:04d}.png"
                Image.fromarray(frame).save(frame_path)
            print(f"\nSaved {len(video_frames)} frames to: {frames_dir}")

    print("\nGeneration complete!")


if __name__ == "__main__":
    main()
15 changes: 15 additions & 0 deletions vllm_omni/diffusion/models/skyreels_v3/__init__.py
@@ -0,0 +1,15 @@
"""SkyReels-V3 multimodal video generation models."""

from .pipeline_skyreels_v3_r2v import (
    SkyReelsV3R2VPipeline,
    get_skyreels_v3_r2v_post_process_func,
    get_skyreels_v3_r2v_pre_process_func,
)
from .skyreels_v3_transformer import SkyReelsTransformer3DModel

__all__ = [
    "SkyReelsV3R2VPipeline",
    "get_skyreels_v3_r2v_post_process_func",
    "get_skyreels_v3_r2v_pre_process_func",
    "SkyReelsTransformer3DModel",
]