sgl-project · ping1jing2 · Mar 30, 2026 · Mar 20, 2026 · Mar 20, 2026 · Mar 30, 2026
diff --git a/docs/diffusion/performance/index.md b/docs/diffusion/performance/index.md
@@ -30,6 +30,12 @@ cache/index
 profiling
 ```
 
+## Current Baseline Snapshot
+
+For Ring SP benchmark details, see:
+
+- [Ring SP Performance](ring_sp_performance.md)
+
 ## References
 
 - [Cache-DiT Repository](https://github.com/vipshop/cache-dit)

diff --git a/docs/diffusion/performance/ring_sp_performance.md b/docs/diffusion/performance/ring_sp_performance.md
@@ -0,0 +1,67 @@
+# Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)
+
+This page reports Ring-SP performance for `Wan2.2-TI2V-5B-Diffusers` using:
+
+- Parallel config: `sp=2, ulysses=1, ring=2` (short: `u1r2`)
+- Baseline config: `sp=1, ulysses=1, ring=1` (short: `u1r1`)
+
+## Benchmark Setup
+
+- Model: `Wan2.2-TI2V-5B-Diffusers`
+- GPU: `48G RTX40 series * 2`
+
+## Online Serving
+
+### Ring SP (`u1r2`)
+
+```bash
+sglang serve \
+  --model-type diffusion \
+  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
+  --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
+  --port 8898
+```
+
+### Baseline (`u1r1`)
+
+```bash
+sglang serve \
+  --model-type diffusion \
+  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
+  --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
+  --port 8898
+```
+
+## Benchmarks
+
+### Benchmark Disclaimer
+
+These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.
+
+### Stage Time Breakdown
+
+| Stage / Metric | `u1r2` (s) | `u1r1` baseline (s) | Speedup |
+|---|---:|---:|---:|
+| InputValidation | 0.1060 | 0.1029 | 0.97x |
+| TextEncoding | 1.3965 | 2.2261 | 1.59x |
+| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
+| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
+| Denoising | 52.6358 | 71.6785 | 1.36x |
+| Decoding | 7.6708 | 13.4314 | 1.75x |
+| **Total** | **63.74** | **90.63** | **1.42x** |
+
+### Memory Usage
+
+| Memory Metric | `u1r2` (GB) | `u1r1` baseline (GB) | Delta |
+|---|---:|---:|---:|
+| Peak GPU Memory | 20.07 | 27.40 | -7.33 |
+| Peak Allocated | 13.35 | 20.40 | -7.05 |
+| Memory Overhead | 6.72 | 7.00 | -0.28 |
+| Overhead Ratio | 33.5% | 25.6% | +7.9pp |
+
+## Summary
+
+- End-to-end latency improves from `90.63s` to `63.74s` (`1.42x`).
+- Main gains come from `Denoising` (`1.36x`) and `Decoding` (`1.75x`).
+- Absolute memory usage drops noticeably on Ring-SP (`Peak GPU Memory -7.33GB`, `Peak Allocated -7.05GB`).
+- Overhead ratio rises (`+7.9pp`), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.
diff --git a/docs/index.rst b/docs/index.rst
@@ -86,6 +86,8 @@ Its core features include:
    diffusion/compatibility_matrix
    diffusion/api/cli
    diffusion/api/openai_api
+   diffusion/performance/index
+   diffusion/performance/ring_sp_performance
    diffusion/performance/attention_backends
    diffusion/performance/cache/index
    diffusion/quantization