Merged
1 change: 1 addition & 0 deletions docs/advanced_features/server_arguments.md
@@ -373,6 +373,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--kt-max-deferred-experts-per-token` | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | `None` | Type: int |

## Diffusion LLM

| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--dllm-algorithm` | The diffusion LLM algorithm, such as LowConfidence. | `None` | Type: str |
4 changes: 2 additions & 2 deletions docs/basic_usage/diffusion.md
@@ -4,7 +4,7 @@ SGLang supports two categories of diffusion models for different use cases. This

## Image & Video Generation Models

For generating images and videos from text prompts, SGLang supports [many](../supported_models/image_generation/diffusion_models.md#image-generation-models) models like:
For generating images and videos from text prompts, SGLang supports [many](../diffusion/compatibility_matrix.md) models like:

- **FLUX, Qwen-Image** - High-quality image generation
- **Wan 2.2, HunyuanVideo** - Video generation
@@ -16,4 +16,4 @@ python3 -m sglang.launch_server \
--host 0.0.0.0 --port 30000
```

**Full model list:** [Diffusion Models](../supported_models/image_generation/diffusion_models.md)
**Full model list:** [Diffusion Models](../diffusion/compatibility_matrix.md)
@@ -5,7 +5,6 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
## Prerequisites

- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
- Python 3.11+ if you plan to use the OpenAI Python SDK.


## Supported Arguments
@@ -35,15 +34,15 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
- `--seed {SEED}`: Random seed for reproducible generation


#### Image/Video Configuration
**Image/Video Configuration**

- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
- `--num-frames {NUM_FRAMES}`: Number of frames to generate
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task


#### Output Options
**Output Options**

- `--output-path {PATH}`: Directory to save the generated image/video
- `--save-output`: Whether to save the image/video to disk
@@ -168,7 +167,7 @@ When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.

#### Configuration
**Configuration**

Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.

Expand All @@ -183,7 +182,7 @@ export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```
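
As a sketch of how these variables could map onto a `boto3` client: the helper below is hypothetical (not part of SGLang), the access-key variable name is an assumption since only the two variables above are shown, and SGLang's actual wiring may differ.

```python
import os

# Hypothetical helper: translate the SGLANG_S3_* environment variables into
# boto3.client("s3", ...) keyword arguments. The access-key variable name and
# the mapping itself are illustrative assumptions.
def s3_client_kwargs(env=os.environ):
    kwargs = {
        "aws_access_key_id": env.get("SGLANG_S3_ACCESS_KEY_ID"),  # assumed name
        "aws_secret_access_key": env.get("SGLANG_S3_SECRET_ACCESS_KEY"),
    }
    endpoint = env.get("SGLANG_S3_ENDPOINT_URL")
    if endpoint:  # e.g. a MinIO deployment, as in the snippet above
        kwargs["endpoint_url"] = endpoint
    return kwargs

# With boto3 installed (pip install boto3):
# import boto3
# s3 = boto3.client("s3", **s3_client_kwargs())
```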

See [Environment Variables Documentation](environment_variables.md) for more details.
See [Environment Variables Documentation](../environment_variables.md) for more details.

## Generate

@@ -2,6 +2,10 @@

The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.

## Prerequisites

- Python 3.11+ if you plan to use the OpenAI Python SDK.

## Serve

Launch the server using the `sglang serve` command.
@@ -25,7 +29,7 @@ sglang serve "${SERVER_ARGS[@]}"
- **--model-path**: Path to the model or model ID.
- **--port**: HTTP port to listen on (default: `30000`).

#### Get Model Information
**Get Model Information**

**Endpoint:** `GET /models`

@@ -59,7 +63,7 @@ curl -sS -X GET "http://localhost:30010/models"

The server implements an OpenAI-compatible Images API under the `/v1/images` namespace.

#### Create an image
**Create an image**

**Endpoint:** `POST /v1/images/generations`

@@ -100,7 +104,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/generations" \
> **Note**
> The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error.
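
A request body for this endpoint can be sketched as follows. Field names follow the OpenAI Images API; which fields the server honors is an assumption, and `response_format` is kept at `b64_json` since `url` is rejected here.

```python
import json

# Sketch of a POST /v1/images/generations body (OpenAI-style field names;
# server support for each field is assumed, not confirmed).
def image_generation_body(prompt, model="Qwen/Qwen-Image", size="1024x1024", n=1):
    if not prompt:
        raise ValueError("prompt is required")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "size": size,
        "n": n,
        "response_format": "b64_json",  # "url" returns a 400 on this endpoint
    })

body = image_generation_body("A cat wearing sunglasses")
# curl -sS -X POST "http://localhost:30010/v1/images/generations" \
#   -H "Content-Type: application/json" -d "$body"
```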

#### Edit an image
**Edit an image**

**Endpoint:** `POST /v1/images/edits`

@@ -130,7 +134,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/edits" \
-F "response_format=url"
```

#### Download image content
**Download image content**

When `response_format=url` is used with `POST /v1/images/edits`, the API returns a relative URL like `/v1/images/<IMAGE_ID>/content`.

@@ -148,7 +152,7 @@ curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \

The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace.

#### Create a video
**Create a video**

**Endpoint:** `POST /v1/videos`

@@ -178,7 +182,7 @@ curl -sS -X POST "http://localhost:30010/v1/videos" \
}'
```
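
Since the request body above is truncated, here is a hedged sketch of one: the field names are assumed from the OpenAI Videos API (the server may accept a different subset), and the model ID is taken from the compatibility matrix.

```python
import json

# Sketch of a POST /v1/videos body. Field names ("seconds", "size") are
# assumptions borrowed from the OpenAI Videos API, not confirmed server fields.
def video_creation_body(prompt,
                        model="FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers",
                        seconds=4, size="1280x720"):
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "seconds": seconds,
        "size": size,
    })

body = video_creation_body("A timelapse of clouds over mountains")
# curl -sS -X POST "http://localhost:30010/v1/videos" \
#   -H "Authorization: Bearer sk-proj-1234567890" \
#   -H "Content-Type: application/json" -d "$body"
```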

#### List videos
**List videos**

**Endpoint:** `GET /v1/videos`

@@ -197,7 +201,7 @@ curl -sS -X GET "http://localhost:30010/v1/videos" \
-H "Authorization: Bearer sk-proj-1234567890"
```

#### Download video content
**Download video content**

**Endpoint:** `GET /v1/videos/{video_id}/content`

@@ -239,7 +243,7 @@ The server supports dynamic loading, merging, and unmerging of LoRA adapters.
- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one
- Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost
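
The switch rule can be sketched as a request plan: unmerge first, then set. The endpoints are the ones documented below; the `lora_path` field name is an assumption for illustration, since the full payload is not shown here.

```python
# Sketch of the documented LoRA switch order: unmerge the active adapter,
# then set the new one. The "lora_path" payload field is an assumed name.
def switch_lora_plan(new_lora_path):
    return [
        ("POST", "/v1/unmerge_lora_weights", None),
        ("POST", "/v1/set_lora", {"lora_path": new_lora_path}),
    ]

for method, path, payload in switch_lora_plan("/models/loras/style_a"):
    print(method, path, payload)
```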

#### Set LoRA Adapter
**Set LoRA Adapter**

Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.

@@ -301,7 +305,7 @@ curl -X POST http://localhost:30010/v1/set_lora \
> - Multiple LoRAs applied to the same target will be merged in order


#### Merge LoRA Weights
**Merge LoRA Weights**

Manually merges the currently set LoRA weights into the base model.

@@ -323,7 +327,7 @@ curl -X POST http://localhost:30010/v1/merge_lora_weights \
```


#### Unmerge LoRA Weights
**Unmerge LoRA Weights**

Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA.

@@ -336,7 +340,7 @@ curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
-H "Content-Type: application/json"
```

#### List LoRA Adapters
**List LoRA Adapters**

Returns loaded LoRA adapters and current application status per module.

@@ -1,5 +1,4 @@

## Perf baseline generation script
## Perf Baseline Generation Script

`python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`.
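
The aggregation step can be sketched roughly as follows. The function and field names are hypothetical; the real logic lives in the script at the path above.

```python
import json
from statistics import mean

# Hypothetical reduction of per-request timings (milliseconds) into one entry
# of the "scenarios" section of perf_baselines.json, mirroring the description.
def aggregate_scenario(name, e2e_ms, denoise_step_ms):
    return {
        name: {
            "e2e_ms": round(mean(e2e_ms), 2),
            "denoise_step_ms": round(mean(denoise_step_ms), 2),
            "samples": len(e2e_ms),
        }
    }

baselines = {"scenarios": aggregate_scenario(
    "qwen_image_1024", [812.0, 798.5, 805.3], [40.1, 39.8, 40.3])}
print(json.dumps(baselines, indent=2))
```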

@@ -16,7 +16,7 @@ default parameters when initializing and generating videos.

### Video Generation Models

| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear AttentionSLA| Sage Sparse Linear AttentionSageSLA|
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention (SLA) | Sage Sparse Linear Attention (SageSLA) |
|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|
| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
@@ -34,8 +34,8 @@ default parameters when initializing and generating videos.
| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |

**Note**: <br>
1.Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.<br>
**Note**:
1. Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
2. SageSLA is based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`.

### Image Generation Models
@@ -55,7 +55,7 @@

This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline.

> Important: \
> Important:
> LoRAs that are not listed here are not necessarily incompatible.
> In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions.
> The entries below simply reflect configurations that have been manually validated by the SGLang team.
Expand Up @@ -2,7 +2,7 @@

This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`).

## 1. Commit Message Convention
## Commit Message Convention

We follow a structured commit message format to maintain a clean history.

@@ -21,7 +21,7 @@ We follow a structured commit message format to maintain a clean history.
- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc.
- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature").

## 2. Performance Reporting
## Performance Reporting

For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report.

@@ -45,7 +45,7 @@
```
4. **Paste**: paste the table into the PR description
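
For the latency and memory rows, the relative change can be computed with a small helper like this (a generic sketch, not part of the repo):

```python
# Percent improvement of `new` over `baseline` for a "smaller is better"
# metric such as latency or memory (positive result = improvement).
def pct_change(baseline, new):
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline * 100.0

# e.g. end-to-end latency dropping from 12.4 s to 10.8 s:
print(f"{pct_change(12.4, 10.8):+.1f}%")
```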

## 3. CI-Based Change Protection
## CI-Based Change Protection

Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that:

@@ -1,11 +1,11 @@
## Caching Acceleration

These variables configure caching acceleration for Diffusion Transformer (DiT) models.
SGLang supports multiple caching strategies - see [caching documentation](cache/caching.md) for an overview.
SGLang supports multiple caching strategies - see [caching documentation](performance/cache/index.md) for an overview.

### Cache-DiT Configuration

See [cache-dit documentation](cache/cache_dit.md) for detailed configuration.
See [cache-dit documentation](performance/cache/cache_dit.md) for detailed configuration.

| Environment Variable | Default | Description |
|-------------------------------------|---------|------------------------------------------|
98 changes: 98 additions & 0 deletions docs/diffusion/index.md
@@ -0,0 +1,98 @@
# SGLang Diffusion

SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides a unified end-to-end pipeline with optimized kernels and an efficient scheduler loop.

## Key Features

- **Broad Model Support**: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
- **Fast Inference**: Optimized kernels, efficient scheduler loop, and Cache-DiT acceleration
- **Ease of Use**: OpenAI-compatible API, CLI, and Python SDK
- **Multi-Platform**: NVIDIA GPUs (H100, H200, A100, B200, 4090) and AMD GPUs (MI300X, MI325X)

---

## Quick Start

### Installation

```bash
uv pip install "sglang[diffusion]" --prerelease=allow
```

See [Installation Guide](installation.md) for more installation methods and ROCm-specific instructions.

### Basic Usage

Generate an image with the CLI:

```bash
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains" \
--save-output
```

Or start a server with the OpenAI-compatible API:

```bash
sglang serve --model-path Qwen/Qwen-Image --port 30010
```

---

## Documentation

### Getting Started

- **[Installation](installation.md)** - Install SGLang Diffusion via pip, uv, Docker, or from source
- **[Compatibility Matrix](compatibility_matrix.md)** - Supported models and optimization compatibility

### Usage

- **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
- **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management

### Performance Optimization

- **[Performance Overview](performance/index.md)** - Overview of all performance optimization strategies
- **[Attention Backends](performance/attention_backends.md)** - Available attention backends (FlashAttention, SageAttention, etc.)
- **[Caching Strategies](performance/cache/index.md)** - Cache-DiT and TeaCache acceleration
- **[Profiling](performance/profiling.md)** - Profiling techniques with PyTorch Profiler and Nsight Systems

### Reference

- **[Environment Variables](environment_variables.md)** - Configuration via environment variables
- **[Support New Models](support_new_models.md)** - Guide for adding new diffusion models
- **[Contributing](contributing.md)** - Contribution guidelines and commit message conventions
- **[CI Performance](ci_perf.md)** - Performance baseline generation script

---

## CLI Quick Reference

### Generate (one-off generation)

```bash
sglang generate --model-path <MODEL> --prompt "<PROMPT>" --save-output
```

### Serve (HTTP server)

```bash
sglang serve --model-path <MODEL> --port 30010
```

### Enable Cache-DiT acceleration

```bash
SGLANG_CACHE_DIT_ENABLED=true sglang generate --model-path <MODEL> --prompt "<PROMPT>"
```

---

## References

- [SGLang GitHub](https://github.com/sgl-project/sglang)
- [Cache-DiT](https://github.com/vipshop/cache-dit)
- [FastVideo](https://github.com/hao-ai-lab/FastVideo)
- [xDiT](https://github.com/xdit-project/xDiT)
- [Diffusers](https://github.com/huggingface/diffusers)