Merged
50 commits
2369818
Update documentation
OrangeRedeng Mar 22, 2026
75de00b
Fix lint issue
OrangeRedeng Mar 22, 2026
49d06e1
Merge remote-tracking branch 'sglang' into update_npu_quantization_doc
OrangeRedeng Mar 23, 2026
50c13a2
Update quantization.md
OrangeRedeng Mar 23, 2026
094145c
Update ascend_npu_quantization.md
TamirBaydasov Mar 23, 2026
1970044
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 23, 2026
c007afd
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 24, 2026
1e528da
Update quantization.md
OrangeRedeng Mar 24, 2026
f4f3cad
Update quantization.md
OrangeRedeng Mar 24, 2026
24318e1
Update quantization.md
OrangeRedeng Mar 24, 2026
1586790
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
0bc72b8
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 24, 2026
6a6b9d7
Update README.md
OrangeRedeng Mar 24, 2026
dd36297
fix Lint issue
OrangeRedeng Mar 24, 2026
32fd60a
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
63ce5f1
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 24, 2026
0c1646b
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
fe897d9
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
390a7ce
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
f7460ff
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
723dd6f
Merge remote-tracking branch 'sglang' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
ab78c20
Update quantization.md
OrangeRedeng Mar 25, 2026
055288a
Update quantization.md
OrangeRedeng Mar 25, 2026
67906c5
Update quantization.md
OrangeRedeng Mar 25, 2026
4d9c431
Update quantization.md
OrangeRedeng Mar 25, 2026
0a23f99
Update quantization.md
OrangeRedeng Mar 25, 2026
ca7d7a4
Update quantization.md
OrangeRedeng Mar 25, 2026
4ddda82
Update quantization.md
OrangeRedeng Mar 25, 2026
0d09d78
Update quantization.md
OrangeRedeng Mar 25, 2026
8491d8b
Update quantization.md
OrangeRedeng Mar 25, 2026
30dfef2
Update quantization.md
OrangeRedeng Mar 25, 2026
739e25d
Update quantization.md
OrangeRedeng Mar 25, 2026
fadbe9c
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
ac9831b
fix Lint issue
OrangeRedeng Mar 25, 2026
c8e53d4
Update quantization.md
OrangeRedeng Mar 25, 2026
5fac616
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
d4c5810
fix Lint issue
OrangeRedeng Mar 25, 2026
bb543b7
Update ascend_npu_qwen3_5_examples.md
OrangeRedeng Mar 25, 2026
4302b85
Update ascend_npu_glm5_examples.md
OrangeRedeng Mar 25, 2026
94d862f
Fix lint issue
OrangeRedeng Mar 25, 2026
6251b82
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
037c347
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
57cf070
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
76d2eb9
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
7c762c5
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
878e0cf
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 26, 2026
ea03e31
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 27, 2026
b55dde7
Update SKILL.md
OrangeRedeng Mar 27, 2026
d5744ce
Update README.md
OrangeRedeng Mar 27, 2026
48c6e1e
Update SKILL.md
OrangeRedeng Mar 27, 2026
20 changes: 19 additions & 1 deletion .claude/skills/write-sglang-test/SKILL.md
@@ -92,9 +92,22 @@ Defined in `python/sglang/test/test_utils.py`:
| `stage-c-test-large-8-gpu-amd` | `linux-mi325-8gpu-sglang` | 8-GPU MI325 scaling and integration |
| `stage-c-test-large-8-gpu-amd-mi35x` | `linux-mi35x-gpu-8` | 8-GPU MI35x scaling (2 partitions) |


### Per-commit (Ascend NPU)

| Suite | Runner (label) | Description |
| --- | --- | --- |
| `per-commit-1-npu-a2` | `linux-aarch64-a2-1` | 1-NPU LLM CI machine |
| `per-commit-2-npu-a2` | `linux-aarch64-a2-2` | 2-NPU LLM CI machine |
| `per-commit-4-npu-a3` | `linux-aarch64-a3-4` | 4-NPU LLM CI machine |
| `per-commit-16-npu-a3` | `linux-aarch64-a3-16` | 16-NPU LLM CI machine |
| `multimodal-gen-test-1-npu-a3` | `linux-aarch64-a3-2` | 1-NPU multimodal CI machine |
| `multimodal-gen-test-2-npu-a3` | `linux-aarch64-a3-16` | 2-NPU multimodal CI machine |
| `multimodal-gen-test-8-npu-a3` | `linux-aarch64-a3-16` | 8-NPU multimodal CI machine |

#### Nightly

Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml` and `nightly-test-amd.yml`, not `pr-test.yml`. Examples:
Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml`, `nightly-test-amd.yml` and `nightly-test-npu.yml`, not `pr-test.yml`. Examples:

- `nightly-1-gpu` (CUDA)
- `nightly-kernel-1-gpu` (CUDA, JIT kernel full grids)
@@ -103,6 +116,11 @@ Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../
- `nightly-eval-vlm-2-gpu` (CUDA)
- `nightly-amd` (AMD)
- `nightly-amd-8-gpu-mi35x` (AMD)
- `nightly-1-npu-a3` (NPU)
- `nightly-2-npu-a3` (NPU)
- `nightly-4-npu-a3` (NPU)
- `nightly-8-npu-a3` (NPU)
- `nightly-16-npu-a3` (NPU)

> **Note**: Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`.

121 changes: 96 additions & 25 deletions docs/advanced_features/quantization.md
@@ -19,32 +19,35 @@ to guard against abnormal quantization loss regressions.

## Platform Compatibility

The following table summarizes quantization method support across NVIDIA and AMD GPUs.

| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Notes |
|--------|:-----------:|:-------------------------------:|-------|
| `fp8` | Yes | Yes | Aiter or Triton backend on AMD |
| `mxfp4` | Yes | Yes | Requires CDNA3/CDNA4 with MXFP support; uses Aiter |
| `blockwise_int8` | Yes | Yes | Triton-based, works on both platforms |
| `w8a8_int8` | Yes | Yes | |
| `w8a8_fp8` | Yes | Yes | Aiter or Triton FP8 on AMD |
| `awq` | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA) |
| `gptq` | Yes | Yes | Uses Triton or vLLM kernels on AMD |
| `compressed-tensors` | Yes | Yes | Aiter paths for FP8/MoE on AMD |
| `quark` | Yes | Yes | AMD Quark quantization; Aiter GEMM paths on AMD |
| `auto-round` | Yes | Yes | Platform-agnostic (Intel auto-round) |
| `quark_int4fp8_moe` | No | Yes | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) |
| `awq_marlin` | Yes | No | Marlin kernels are CUDA-only |
| `gptq_marlin` | Yes | No | Marlin kernels are CUDA-only |
| `gguf` | Yes | No | CUDA-only kernels in sgl-kernel |
| `modelopt` / `modelopt_fp8` | Yes (Hopper/SM90+) | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); requires NVIDIA hardware |
| `modelopt_fp4` | Yes (Blackwell/SM100+) | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); native FP4 on Blackwell (B200, GB200) |
| `petit_nvfp4` | No | Yes (MI250/MI300X/MI325X) | Enables NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel); use `modelopt_fp4` on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. See [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/) and [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html). |
| `bitsandbytes` | Yes | Experimental | Depends on bitsandbytes ROCm support |
| `torchao` (`int4wo`, etc.) | Yes | Partial | `int4wo` not supported on AMD; other methods may work |
The following table summarizes quantization method support across NVIDIA GPUs, AMD GPUs, and Ascend NPUs.

| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Ascend NPUs (A2/A3) | Notes |
|--------|:-----------:|:-------------------------------:|:-----------------------:|-------|
| `fp8` | Yes | Yes | WIP | Aiter or Triton backend on AMD |
| `mxfp4` | Yes | Yes | WIP | Requires CDNA3/CDNA4 with MXFP support; uses Aiter |
| `blockwise_int8` | Yes | Yes | No | Triton-based; works on NVIDIA and AMD |
| `w8a8_int8` | Yes | Yes | No | |
| `w8a8_fp8` | Yes | Yes | No | Aiter or Triton FP8 on AMD |
| `awq` | Yes | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA); CANN kernels on Ascend |
| `gptq` | Yes | Yes | Yes | Uses Triton or vLLM kernels on AMD; CANN kernels on Ascend |
| `compressed-tensors` | Yes | Yes | Partial | Aiter paths for FP8/MoE on AMD; CANN kernels on Ascend (`FP8` not supported yet) |
| `quark` | Yes | Yes | No | AMD Quark quantization; Aiter GEMM paths on AMD |
| `auto-round` | Yes | Yes | Partial | Platform-agnostic (Intel auto-round); CANN kernels on Ascend |
| `quark_int4fp8_moe` | No | Yes | No | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) |
| `awq_marlin` | Yes | No | No | Marlin kernels are CUDA-only |
| `gptq_marlin` | Yes | No | No | Marlin kernels are CUDA-only |
| `gguf` | Yes | No | WIP | CUDA-only kernels in sgl-kernel |
| `modelopt` / `modelopt_fp8` | Yes (Hopper/SM90+) | No | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); requires NVIDIA hardware |
| `modelopt_fp4` | Yes (Blackwell/SM100+) | No | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); native FP4 on Blackwell (B200, GB200) |
| `petit_nvfp4` | No | Yes (MI250/MI300X/MI325X) | No | Enables NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel); use `modelopt_fp4` on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. See [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/) and [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html). |
| `bitsandbytes` | Yes | Experimental | No | Depends on bitsandbytes ROCm support |
| `torchao` (`int4wo`, etc.) | Yes | Partial | No | `int4wo` not supported on AMD; other methods may work |
| `modelslim` | No | No | Yes | Ascend-only quantization; uses CANN kernels |

On AMD, several of these methods use [Aiter](https://github.com/ROCm/aiter) for acceleration -- set `SGLANG_USE_AITER=1` where noted. See [AMD GPU setup](../platforms/amd_gpu.md) for installation and configuration details.

On Ascend, a variety of per-layer quantization configurations are supported; see [Ascend NPU quantization](../platforms/ascend/ascend_npu_quantization.md) for details.

## GEMM Backends for FP4/FP8 Quantization

:::{note}
@@ -81,7 +84,7 @@ When FlashInfer is unavailable for NVFP4, sgl-kernel CUTLASS is used as an autom

To load already quantized models, simply load the model weights and config. **Again, if the model has been quantized offline,
there's no need to add `--quantization` argument when starting the engine. The quantization method will be parsed from the
downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.**
downloaded Hugging Face or msModelSlim config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.**

```bash
python3 -m sglang.launch_server \
@@ -319,7 +322,6 @@ For detailed usage and supported model architectures, see [NVIDIA Model Optimize

SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment.


##### Installation

First, install ModelOpt:
@@ -477,6 +479,74 @@ model_loader.load_model(model_config=model_config, device_config=DeviceConfig())
- **Calibration-based**: Uses calibration datasets for optimal quantization quality
- **Production Ready**: Enterprise-grade quantization with NVIDIA support

#### Using [ModelSlim](https://gitcode.com/Ascend/msmodelslim)

MindStudio ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware.

- **Installation**

```bash
# Clone repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
```

- **LLM quantization**

Download the original floating-point weights of the model. Taking Qwen3-32B as an example, you can obtain the original model weights from [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B). Then install any remaining model-specific dependencies (refer to the Hugging Face model card).
> Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech).

_Traditional quantization methods require calibration data files (`.jsonl` format) to be prepared for the calibration step._
```text
Qwen3-32B/       # floating-point model downloaded from the official HF (or ModelScope) repo
msmodelslim/     # msmodelslim repo
|----- lab_calib # calibration data folder (put your dataset here in .jsonl format, or use pre-prepared ones)
       |----- some file (such as laos_calib.jsonl)
|----- lab_practice # best-practice folder with configs for quantization
       |----- model folder (such as qwen3_5_moe) # folder with quantization configs
              |----- quant config (such as qwen3_5_moe_w8a8.yaml) # quantization config
       |----- other folders
output_folder/   # generated by the command below
|----- quant_model_weights-00001-of-0001.safetensors # quantized weights
|----- quant_model_description.json # description of the quantization method for each layer (W4A4_DYNAMIC, etc.)
|----- other files (such as config.json, tokenizer.json, etc.)
```
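The `quant_model_description.json` file can also be inspected programmatically. Below is a minimal sketch, assuming the file is a JSON object mapping layer names to method strings (the exact schema may differ between msmodelslim versions); the layer names here are hypothetical:

```python
import json
from collections import Counter


def summarize_quant_description(path):
    # Assumed schema: JSON object mapping layer names to quantization
    # method strings such as "W8A8" or "W8A8_DYNAMIC".
    with open(path) as f:
        desc = json.load(f)
    return Counter(str(v) for v in desc.values())


# Hypothetical example content, written out for illustration only.
example = {
    "model.layers.0.self_attn.q_proj.weight": "W8A8",
    "model.layers.0.mlp.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.up_proj.weight": "W8A8_DYNAMIC",
}
with open("quant_model_description.json", "w") as f:
    json.dump(example, f)

print(summarize_quant_description("quant_model_description.json"))
# Counter({'W8A8_DYNAMIC': 2, 'W8A8': 1})
```

A quick summary like this is useful for verifying that the quantization config you selected was actually applied to the expected layers.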
Run one-click quantization (recommended):
```bash
msmodelslim quant \
--model_path ${MODEL_PATH} \
--save_path ${SAVE_PATH} \
--device npu:0,1 \
--model_type Qwen3-32B \
--quant_type w8a8 \
--trust_remote_code True
```

- **Usage Example**
```bash
python3 -m sglang.launch_server \
--model-path $PWD/Qwen3-32B-w8a8 \
--port 30000 --host 0.0.0.0
```

- **Available Quantization Methods**:
  - [x] `W4A4_DYNAMIC` linear with online quantization of activations
  - [x] `W8A8` linear with offline quantization of activations
  - [x] `W8A8_DYNAMIC` linear with online quantization of activations
  - [x] `W4A4_DYNAMIC` MoE with online quantization of activations
  - [x] `W4A8_DYNAMIC` MoE with online quantization of activations
  - [x] `W8A8_DYNAMIC` MoE with online quantization of activations
  - [ ] `W4A8` linear TBD
  - [ ] `W4A16` linear TBD
  - [ ] `W48A16` linear TBD
  - [ ] `W4A16` MoE in progress
  - [ ] `W8A16` MoE in progress
  - [ ] `KV Cache` in progress
  - [ ] `Attention` in progress


For more detailed model quantization examples, as well as information about model support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/README.md) section in the ModelSlim repo.

## Online Quantization

To enable online quantization, you can simply specify `--quantization` in the command line. For example, you can launch the server with the following command to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
@@ -529,3 +599,4 @@ Other layers (e.g. projections in the attention layers) have their weights quant
- [Torchao: PyTorch Architecture Optimization](https://github.com/pytorch/ao)
- [vLLM Quantization](https://docs.vllm.ai/en/latest/quantization/)
- [auto-round](https://github.com/intel/auto-round)
- [ModelSlim](https://gitcode.com/Ascend/msmodelslim)
2 changes: 1 addition & 1 deletion docs/basic_usage/deepseek_v3.md
@@ -74,7 +74,7 @@ Detailed commands for reference:
- [16 x A100 (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
- [4 x Atlas 800I A3 (int8)](../platforms/ascend_npu_deepseek_example.md#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3)
- [4 x Atlas 800I A3 (int8)](../platforms/ascend/ascend_npu_deepseek_example.md#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3)

### Download Weights
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights.
2 changes: 1 addition & 1 deletion docs/diffusion/installation.md
@@ -84,7 +84,7 @@ pip install -e "python[all_musa]"

## Platform-Specific: Ascend NPU

For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend_npu.md).
For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend/ascend_npu.md).

Quick test:

68 changes: 67 additions & 1 deletion docs/diffusion/quantization.md
@@ -44,7 +44,8 @@ backend.
|------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------|--------------------------------------------------------------|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `fp8` | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported |
| `nvfp4-modelopt` | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | `--transformer-weights-path` | FLUX.2 | `comfy-kitchen` optional on Blackwell | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |
| `msmodelslim` | Pre-quantized msmodelslim transformer weights | `--model-path` | Wan2.2 family | None | Currently only compatible with the Ascend NPU family and supports both `w8a8` and `w4a4` |

## NVFP4

@@ -171,3 +172,68 @@ sglang generate \
as `4` or `8`.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x)
or SM12x GPUs. Hopper (SM90) is currently rejected.

## [ModelSlim](https://gitcode.com/Ascend/msmodelslim)

MindStudio ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware.

- **Installation**

```bash
# Clone repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
```

- **Multimodal_sd quantization**

Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, you can obtain the original model weights from [Wan2.2-T2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B). Then install any remaining model-specific dependencies (refer to the ModelScope model card).
> Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech).

Run one-click quantization (recommended):

```bash
msmodelslim quant \
--model_path /path/to/wan2_2_float_weights \
--save_path /path/to/wan2_2_quantized_weights \
--device npu \
--model_type Wan2_2 \
--quant_type w8a8 \
--trust_remote_code True
```

For more detailed examples of quantization of models, as well as information about their support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/multimodal_sd/README.md) section in ModelSLim repo.

> Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim.

- **Auto-Detection and different formats**

For msmodelslim checkpoints, it is enough to specify only `--model-path`; the quantization method is detected automatically for each layer by parsing the `quant_model_description.json` config.
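As a rough illustration of this auto-detection, a checkpoint directory can be probed for the description file. This is a hedged sketch of the idea only (the real detection logic in SGLang may differ), using a hypothetical stand-in directory:

```python
import json
from pathlib import Path


def detect_msmodelslim(model_path):
    # A checkpoint is treated as msmodelslim-quantized when a
    # quant_model_description.json sits next to the weights.
    desc_file = Path(model_path) / "quant_model_description.json"
    if not desc_file.is_file():
        return None  # plain floating-point checkpoint
    with open(desc_file) as f:
        # Assumed schema: per-layer method map, e.g. {"layer": "w8a8"}
        return json.load(f)


# Demo with a hypothetical stand-in checkpoint directory.
ckpt = Path("demo_ckpt")
ckpt.mkdir(exist_ok=True)
(ckpt / "quant_model_description.json").write_text('{"proj": "w8a8"}')
print(detect_msmodelslim(ckpt))           # {'proj': 'w8a8'}
print(detect_msmodelslim("no_such_dir"))  # None
```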

For `Wan2.2`, only the `Diffusers` weight storage format is supported, whereas ModelSlim saves the quantized model in the original `Wan2.2` format.
To convert between them, use the `python/sglang/multimodal_gen/tools/wan_repack.py` script:

```bash
python wan_repack.py \
--input-path {path_to_quantized_model} \
--output-path {path_to_converted_model}
```

After that, copy all files from the original `Diffusers` checkpoint except the `transformer`/`transformer_2` folders (the repacked output already provides those).
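The copy step above can be scripted. A minimal sketch with hypothetical stand-in paths, assuming the repacked output already contains the quantized `transformer`/`transformer_2` folders:

```python
import shutil
from pathlib import Path


def copy_non_transformer_files(src, dst, skip=("transformer", "transformer_2")):
    # Copy everything from the original Diffusers checkpoint except the
    # transformer folders, which wan_repack.py has already produced.
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for entry in src.iterdir():
        if entry.name in skip:
            continue
        if entry.is_dir():
            shutil.copytree(entry, dst / entry.name, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, dst / entry.name)


# Demo with toy directories standing in for the real checkpoints.
src = Path("demo_diffusers_ckpt")
(src / "transformer").mkdir(parents=True, exist_ok=True)
(src / "vae").mkdir(exist_ok=True)
(src / "model_index.json").write_text("{}")
(src / "vae" / "config.json").write_text("{}")
copy_non_transformer_files(src, "demo_converted_ckpt")
```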

- **Usage Example**

With auto-detected flow:

```bash
sglang generate \
--model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \
--prompt "a beautiful sunset" \
--save-output
```

- **Available Quantization Methods**:
  - [x] `W4A4_DYNAMIC` linear with online quantization of activations
  - [x] `W8A8` linear with offline quantization of activations
  - [x] `W8A8_DYNAMIC` linear with online quantization of activations
  - [ ] `mxfp8` linear in progress
2 changes: 1 addition & 1 deletion docs/get_started/install.md
@@ -2,7 +2,7 @@

You can install SGLang using one of the methods below.
This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md), and [Intel XPU](../platforms/xpu.md).
For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend/ascend_npu.md), and [Intel XPU](../platforms/xpu.md).

## Method 1: With pip or uv

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -99,7 +99,7 @@ Its core features include:
platforms/cpu_server.md
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu_support.rst
platforms/ascend/ascend_npu_support.rst
platforms/xpu.md

.. toctree::