Merged
1 change: 1 addition & 0 deletions docs/advanced_features/server_arguments.md
@@ -373,6 +373,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--kt-max-deferred-experts-per-token` | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | `None` | Type: int |

## Diffusion LLM

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--dllm-algorithm` | The diffusion LLM algorithm, such as LowConfidence. | `None` | Type: str |
4 changes: 2 additions & 2 deletions docs/basic_usage/diffusion.md
@@ -4,7 +4,7 @@ SGLang supports two categories of diffusion models for different use cases. This

## Image & Video Generation Models

For generating images and videos from text prompts, SGLang supports [many](../supported_models/image_generation/diffusion_models.md#image-generation-models) models like:
For generating images and videos from text prompts, SGLang supports [many](../diffusion/compatibility_matrix.md) models like:

- **FLUX, Qwen-Image** - High-quality image generation
- **Wan 2.2, HunyuanVideo** - Video generation
@@ -16,4 +16,4 @@ python3 -m sglang.launch_server \
--host 0.0.0.0 --port 30000
```

**Full model list:** [Diffusion Models](../supported_models/image_generation/diffusion_models.md)
**Full model list:** [Diffusion Models](../diffusion/compatibility_matrix.md)
@@ -5,7 +5,6 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
## Prerequisites

- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
- Python 3.11+ if you plan to use the OpenAI Python SDK.


## Supported Arguments
@@ -35,15 +34,15 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
- `--seed {SEED}`: Random seed for reproducible generation


#### Image/Video Configuration
**Image/Video Configuration**

- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task


#### Output Options
**Output Options**

- `--output-path {PATH}`: Directory to save the generated image/video
- `--save-output`: Whether to save the image/video to disk
@@ -168,7 +167,7 @@ When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.

#### Configuration
**Configuration**

Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.

Expand All @@ -183,7 +182,7 @@ export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```
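
As a sketch of how these variables could map onto a `boto3` client: the helper below is hypothetical (not part of SGLang), the access-key variable name is an assumption since only the two variables above are shown, and SGLang's actual wiring may differ.

```python
import os

# Hypothetical helper: translate the SGLANG_S3_* environment variables into
# boto3.client("s3", ...) keyword arguments. The access-key variable name and
# the mapping itself are illustrative assumptions.
def s3_client_kwargs(env=os.environ):
    kwargs = {
        "aws_access_key_id": env.get("SGLANG_S3_ACCESS_KEY_ID"),  # assumed name
        "aws_secret_access_key": env.get("SGLANG_S3_SECRET_ACCESS_KEY"),
    }
    endpoint = env.get("SGLANG_S3_ENDPOINT_URL")
    if endpoint:  # e.g. a MinIO deployment, as in the snippet above
        kwargs["endpoint_url"] = endpoint
    return kwargs

# With boto3 installed (pip install boto3):
# import boto3
# s3 = boto3.client("s3", **s3_client_kwargs())
```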

See [Environment Variables Documentation](environment_variables.md) for more details.
See [Environment Variables Documentation](../environment_variables.md) for more details.

## Generate

@@ -2,6 +2,10 @@

The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.

## Prerequisites

- Python 3.11+ if you plan to use the OpenAI Python SDK.

## Serve

Launch the server using the `sglang serve` command.
@@ -25,7 +29,7 @@ sglang serve "${SERVER_ARGS[@]}"
- **--model-path**: Path to the model or model ID.
- **--port**: HTTP port to listen on (default: `30000`).

#### Get Model Information
**Get Model Information**

**Endpoint:** `GET /models`

@@ -59,7 +63,7 @@ curl -sS -X GET "http://localhost:30010/models"

The server implements an OpenAI-compatible Images API under the `/v1/images` namespace.

#### Create an image
**Create an image**

**Endpoint:** `POST /v1/images/generations`

@@ -100,7 +104,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/generations" \
> **Note**
> The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error.
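
A request body for this endpoint can be sketched as follows. Field names follow the OpenAI Images API; which fields the server honors is an assumption, and `response_format` is kept at `b64_json` since `url` is rejected here.

```python
import json

# Sketch of a POST /v1/images/generations body (OpenAI-style field names;
# server support for each field is assumed, not confirmed).
def image_generation_body(prompt, model="Qwen/Qwen-Image", size="1024x1024", n=1):
    if not prompt:
        raise ValueError("prompt is required")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "size": size,
        "n": n,
        "response_format": "b64_json",  # "url" returns a 400 on this endpoint
    })

body = image_generation_body("A cat wearing sunglasses")
# curl -sS -X POST "http://localhost:30010/v1/images/generations" \
#   -H "Content-Type: application/json" -d "$body"
```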

#### Edit an image
**Edit an image**

**Endpoint:** `POST /v1/images/edits`

@@ -130,7 +134,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/edits" \
-F "response_format=url"
```

#### Download image content
**Download image content**

When `response_format=url` is used with `POST /v1/images/edits`, the API returns a relative URL like `/v1/images/<IMAGE_ID>/content`.

@@ -148,7 +152,7 @@ curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \

The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace.

#### Create a video
**Create a video**

**Endpoint:** `POST /v1/videos`

@@ -178,7 +182,7 @@ curl -sS -X POST "http://localhost:30010/v1/videos" \
}'
```
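
Since the request body above is truncated, here is a hedged sketch of one: the field names are assumed from the OpenAI Videos API (the server may accept a different subset), and the model ID is taken from the compatibility matrix.

```python
import json

# Sketch of a POST /v1/videos body. Field names ("seconds", "size") are
# assumptions borrowed from the OpenAI Videos API, not confirmed server fields.
def video_creation_body(prompt,
                        model="FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers",
                        seconds=4, size="1280x720"):
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "seconds": seconds,
        "size": size,
    })

body = video_creation_body("A timelapse of clouds over mountains")
# curl -sS -X POST "http://localhost:30010/v1/videos" \
#   -H "Authorization: Bearer sk-proj-1234567890" \
#   -H "Content-Type: application/json" -d "$body"
```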

#### List videos
**List videos**

**Endpoint:** `GET /v1/videos`

@@ -197,7 +201,7 @@ curl -sS -X GET "http://localhost:30010/v1/videos" \
-H "Authorization: Bearer sk-proj-1234567890"
```

#### Download video content
**Download video content**

**Endpoint:** `GET /v1/videos/{video_id}/content`

@@ -239,7 +243,7 @@ The server supports dynamic loading, merging, and unmerging of LoRA adapters.
- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one
- Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost
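
The switch rule can be sketched as a request plan: unmerge first, then set. The endpoints are the ones documented below; the `lora_path` field name is an assumption for illustration, since the full payload is not shown here.

```python
# Sketch of the documented LoRA switch order: unmerge the active adapter,
# then set the new one. The "lora_path" payload field is an assumed name.
def switch_lora_plan(new_lora_path):
    return [
        ("POST", "/v1/unmerge_lora_weights", None),
        ("POST", "/v1/set_lora", {"lora_path": new_lora_path}),
    ]

for method, path, payload in switch_lora_plan("/models/loras/style_a"):
    print(method, path, payload)
```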

#### Set LoRA Adapter
**Set LoRA Adapter**

Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.

@@ -301,7 +305,7 @@ curl -X POST http://localhost:30010/v1/set_lora \
> - Multiple LoRAs applied to the same target will be merged in order


#### Merge LoRA Weights
**Merge LoRA Weights**

Manually merges the currently set LoRA weights into the base model.

@@ -323,7 +327,7 @@ curl -X POST http://localhost:30010/v1/merge_lora_weights \
```


#### Unmerge LoRA Weights
**Unmerge LoRA Weights**

Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA.

@@ -336,7 +340,7 @@ curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
-H "Content-Type: application/json"
```

#### List LoRA Adapters
**List LoRA Adapters**

Returns loaded LoRA adapters and current application status per module.

@@ -1,5 +1,4 @@

## Perf baseline generation script
## Perf Baseline Generation Script

`python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`.
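
The aggregation step can be sketched roughly as follows. The function and field names are hypothetical; the real logic lives in the script at the path above.

```python
import json
from statistics import mean

# Hypothetical reduction of per-request timings (milliseconds) into one entry
# of the "scenarios" section of perf_baselines.json, mirroring the description.
def aggregate_scenario(name, e2e_ms, denoise_step_ms):
    return {
        name: {
            "e2e_ms": round(mean(e2e_ms), 2),
            "denoise_step_ms": round(mean(denoise_step_ms), 2),
            "samples": len(e2e_ms),
        }
    }

baselines = {"scenarios": aggregate_scenario(
    "qwen_image_1024", [812.0, 798.5, 805.3], [40.1, 39.8, 40.3])}
print(json.dumps(baselines, indent=2))
```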

@@ -16,7 +16,7 @@ default parameters when initializing and generating videos.

### Video Generation Models

| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear AttentionSLA| Sage Sparse Linear AttentionSageSLA|
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention (SLA) | Sage Sparse Linear Attention (SageSLA) |
|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|
| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
@@ -34,8 +34,8 @@ default parameters when initializing and generating videos.
| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |

**Note**: <br>
1.Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.<br>
**Note**:
1. Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
2. SageSLA is based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`.

### Image Generation Models
@@ -55,7 +55,7 @@

This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline.

> Important: \
> Important:
> LoRAs that are not listed here are not necessarily incompatible.
> In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions.
> The entries below simply reflect configurations that have been manually validated by the SGLang team.
Expand Up @@ -2,7 +2,7 @@

This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`).

## 1. Commit Message Convention
## Commit Message Convention

We follow a structured commit message format to maintain a clean history.

@@ -21,7 +21,7 @@ We follow a structured commit message format to maintain a clean history.
- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc.
- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature").

## 2. Performance Reporting
## Performance Reporting

For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report.

@@ -45,7 +45,7 @@
```
4. **Paste**: paste the table into the PR description
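
For the latency and memory rows, the relative change can be computed with a small helper like this (a generic sketch, not part of the repo):

```python
# Percent improvement of `new` over `baseline` for a "smaller is better"
# metric such as latency or memory (positive result = improvement).
def pct_change(baseline, new):
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline * 100.0

# e.g. end-to-end latency dropping from 12.4 s to 10.8 s:
print(f"{pct_change(12.4, 10.8):+.1f}%")
```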

## 3. CI-Based Change Protection
## CI-Based Change Protection

Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that:

@@ -1,11 +1,11 @@
## Caching Acceleration

These variables configure caching acceleration for Diffusion Transformer (DiT) models.
SGLang supports multiple caching strategies - see [caching documentation](cache/caching.md) for an overview.
SGLang supports multiple caching strategies - see [caching documentation](performance/cache/index.md) for an overview.

### Cache-DiT Configuration

See [cache-dit documentation](cache/cache_dit.md) for detailed configuration.
See [cache-dit documentation](performance/cache/cache_dit.md) for detailed configuration.

| Environment Variable | Default | Description |
|-------------------------------------|---------|------------------------------------------|
98 changes: 98 additions & 0 deletions docs/diffusion/index.md
@@ -0,0 +1,98 @@
# SGLang Diffusion

SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides a unified end-to-end pipeline with optimized kernels and an efficient scheduler loop.

## Key Features

- **Broad Model Support**: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
- **Fast Inference**: Optimized kernels, efficient scheduler loop, and Cache-DiT acceleration
- **Ease of Use**: OpenAI-compatible API, CLI, and Python SDK
- **Multi-Platform**: NVIDIA GPUs (H100, H200, A100, B200, 4090) and AMD GPUs (MI300X, MI325X)

---

## Quick Start

### Installation

```bash
uv pip install "sglang[diffusion]" --prerelease=allow
```

See [Installation Guide](installation.md) for more installation methods and ROCm-specific instructions.

### Basic Usage

Generate an image with the CLI:

```bash
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains" \
--save-output
```

Or start a server with the OpenAI-compatible API:

```bash
sglang serve --model-path Qwen/Qwen-Image --port 30010
```

---

## Documentation

### Getting Started

- **[Installation](installation.md)** - Install SGLang Diffusion via pip, uv, Docker, or from source
- **[Compatibility Matrix](compatibility_matrix.md)** - Supported models and optimization compatibility

### Usage

- **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
- **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management

### Performance Optimization

- **[Performance Overview](performance/index.md)** - Overview of all performance optimization strategies
- **[Attention Backends](performance/attention_backends.md)** - Available attention backends (FlashAttention, SageAttention, etc.)
- **[Caching Strategies](performance/cache/index.md)** - Cache-DiT and TeaCache acceleration
- **[Profiling](performance/profiling.md)** - Profiling techniques with PyTorch Profiler and Nsight Systems

### Reference

- **[Environment Variables](environment_variables.md)** - Configuration via environment variables
- **[Support New Models](support_new_models.md)** - Guide for adding new diffusion models
- **[Contributing](contributing.md)** - Contribution guidelines and commit message conventions
- **[CI Performance](ci_perf.md)** - Performance baseline generation script

---

## CLI Quick Reference

### Generate (one-off generation)

```bash
sglang generate --model-path <MODEL> --prompt "<PROMPT>" --save-output
```

### Serve (HTTP server)

```bash
sglang serve --model-path <MODEL> --port 30010
```

### Enable Cache-DiT acceleration

```bash
SGLANG_CACHE_DIT_ENABLED=true sglang generate --model-path <MODEL> --prompt "<PROMPT>"
```

---

## References

- [SGLang GitHub](https://github.com/sgl-project/sglang)
- [Cache-DiT](https://github.com/vipshop/cache-dit)
- [FastVideo](https://github.com/hao-ai-lab/FastVideo)
- [xDiT](https://github.com/xdit-project/xDiT)
- [Diffusers](https://github.com/huggingface/diffusers)