6 changes: 6 additions & 0 deletions docs/diffusion/performance/index.md
@@ -30,6 +30,12 @@ cache/index
profiling
```

## Current Baseline Snapshot

For Ring SP benchmark details, see:

- [Ring SP Performance](ring_sp_performance.md)

## References

- [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
67 changes: 67 additions & 0 deletions docs/diffusion/performance/ring_sp_performance.md
@@ -0,0 +1,67 @@
# Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)

This page compares Ring SP (ring sequence parallelism) on two GPUs against a single-GPU baseline for `Wan2.2-TI2V-5B-Diffusers`, using the following configurations:

- Parallel config: `sp=2, ulysses=1, ring=2` (short: `u1r2`)
- Baseline config: `sp=1, ulysses=1, ring=1` (short: `u1r1`)
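
The short names follow the pattern `u<ulysses>r<ring>`. As a minimal sketch, the helper below builds that name and checks that `sp` equals `ulysses * ring`. That relation holds for both configs listed here, but treating it as a general rule is an assumption, and `SPConfig` is a hypothetical helper for illustration, not part of SGLang.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPConfig:
    """Hypothetical container for the three degrees used on this page."""

    sp: int
    ulysses: int
    ring: int

    def __post_init__(self) -> None:
        if min(self.sp, self.ulysses, self.ring) < 1:
            raise ValueError("all degrees must be >= 1")
        # Assumption: the total SP degree is the product of the two sub-degrees.
        if self.sp != self.ulysses * self.ring:
            raise ValueError(
                f"sp={self.sp} does not equal ulysses*ring="
                f"{self.ulysses * self.ring}"
            )

    @property
    def short_name(self) -> str:
        # Matches the shorthand used on this page, e.g. "u1r2".
        return f"u{self.ulysses}r{self.ring}"


ring_sp = SPConfig(sp=2, ulysses=1, ring=2)
baseline = SPConfig(sp=1, ulysses=1, ring=1)
print(ring_sp.short_name, baseline.short_name)
```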

## Benchmark Setup

- Model: `Wan2.2-TI2V-5B-Diffusers`
- GPU: 2 x RTX 40-series (48 GB each)

## Online Serving

### Ring SP (`u1r2`)

```bash
sglang serve \
--model-type diffusion \
--model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
--port 8898
```

### Baseline (`u1r1`)

```bash
sglang serve \
--model-type diffusion \
--model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
--port 8898
```

## Benchmarks

### Benchmark Disclaimer

These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.

### Stage Time Breakdown

| Stage / Metric | `u1r2` (s) | `u1r1` baseline (s) | Speedup (`u1r1` / `u1r2`) |
|---|---:|---:|---:|
| InputValidation | 0.1060 | 0.1029 | 0.97x |
| TextEncoding | 1.3965 | 2.2261 | 1.59x |
| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
| Denoising | 52.6358 | 71.6785 | 1.36x |
| Decoding | 7.6708 | 13.4314 | 1.75x |
| **Total** | **63.74** | **90.63** | **1.42x** |
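
The speedup values above are consistent with dividing the baseline time by the Ring SP time. The sketch below recomputes them from the timings in the table and also prints the seconds saved per stage. The function names are illustrative, and the inputs are the rounded table values, so the last digit may differ from a calculation on raw measurements.

```python
# Stage timings in seconds, copied from the table: (u1r2, u1r1 baseline).
STAGE_SECONDS = {
    "InputValidation": (0.1060, 0.1029),
    "TextEncoding": (1.3965, 2.2261),
    "LatentPreparation": (0.0002, 0.0002),
    "TimestepPreparation": (0.0003, 0.0004),
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
    "Total": (63.74, 90.63),
}


def speedup(ring_s: float, baseline_s: float) -> float:
    """Baseline time divided by Ring SP time; values above 1 mean Ring SP is faster."""
    return baseline_s / ring_s


def saved_seconds(ring_s: float, baseline_s: float) -> float:
    """Absolute time saved by Ring SP relative to the baseline."""
    return baseline_s - ring_s


for stage, (ring_s, baseline_s) in STAGE_SECONDS.items():
    print(
        f"{stage:<20} {speedup(ring_s, baseline_s):.2f}x  "
        f"saved {saved_seconds(ring_s, baseline_s):.4f}s"
    )
```

Looking at saved seconds alongside the ratio helps separate stages with a large relative gain but negligible absolute impact (such as `TimestepPreparation`) from the stages that dominate end-to-end latency.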

### Memory Usage

| Memory Metric | `u1r2` (GB) | `u1r1` baseline (GB) | Delta (`u1r2` - `u1r1`) |
|---|---:|---:|---:|
| Peak GPU Memory | 20.07 | 27.40 | -7.33 |
| Peak Allocated | 13.35 | 20.40 | -7.05 |
| Memory Overhead | 6.72 | 7.00 | -0.28 |
| Overhead Ratio | 33.5% | 25.6% | +7.9pp |
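
The two derived rows can be reproduced from the first two. The sketch below assumes that Memory Overhead is Peak GPU Memory minus Peak Allocated and that Overhead Ratio is that overhead divided by Peak GPU Memory. Both definitions are inferred from the table values rather than stated by the benchmark. Because the inputs are rounded, the recomputed ratio can differ from the reported one in the last digit (the baseline works out to about 25.5% rather than 25.6%).

```python
def memory_overhead(peak_gb: float, allocated_gb: float) -> tuple[float, float]:
    """Return (overhead in GB, overhead as a percentage of peak).

    Assumed definitions, inferred from the table values:
      overhead = peak - allocated
      ratio    = overhead / peak
    """
    overhead_gb = peak_gb - allocated_gb
    ratio_pct = 100.0 * overhead_gb / peak_gb
    return overhead_gb, ratio_pct


# (Peak GPU Memory, Peak Allocated) in GB, copied from the table.
MEMORY_GB = {
    "u1r2": (20.07, 13.35),
    "u1r1": (27.40, 20.40),
}

for name, (peak_gb, allocated_gb) in MEMORY_GB.items():
    overhead_gb, ratio_pct = memory_overhead(peak_gb, allocated_gb)
    print(f"{name}: overhead {overhead_gb:.2f} GB ({ratio_pct:.1f}% of peak)")
```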

## Summary

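- End-to-end latency improves from `90.63s` to `63.74s` (`1.42x`).
- Main gains come from `Denoising` (`1.36x`, about 19.04s saved) and `Decoding` (`1.75x`, about 5.76s saved); `Denoising` accounts for most of the absolute saving.
- Peak memory drops under Ring SP (`Peak GPU Memory -7.33GB`, `Peak Allocated -7.05GB`).
- The memory overhead ratio rises from `25.6%` to `33.5%` (`+7.9pp`) even though absolute overhead falls slightly (`-0.28GB`), because peak memory shrinks more than the overhead does. Future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.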
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -86,6 +86,8 @@ Its core features include:
diffusion/compatibility_matrix
diffusion/api/cli
diffusion/api/openai_api
diffusion/performance/index
diffusion/performance/ring_sp_performance
diffusion/performance/attention_backends
diffusion/performance/cache/index
diffusion/quantization