
Releases: sgl-project/sglang

Release v0.4.6

27 Apr 21:47
84022c0

Highlights

  • Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.); a launch sketch follows this list. #4709 (comment)
  • PD disaggregation with Mooncake and NIXL transfer backends #4880 #5477 #4655
  • DeepSeek performance improvements: enable DeepGEMM by default and add several kernel fusions. #5580 #5628
  • Update torch to 2.6.0 and fix the torch.compile cache. #5417 #5213
  • Preliminary support for Blackwell #5303
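
As a rough illustration of the new default, the sketch below launches a server with no attention flag (FlashAttention3 is picked automatically in v0.4.6) and then shows how a backend could still be pinned explicitly. The model path, port, and the --attention-backend flag value are assumptions for illustration, not taken from these notes.

    # FlashAttention3 is the default in v0.4.6, so no extra flag is needed:
    python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
    # Pinning a backend explicitly (flag value assumed), e.g. to fall back to FlashInfer:
    python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend flashinfer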

Thanks very much to the LinkedIn team, Alibaba Cloud, the Mooncake team, the NVIDIA team, the AMD team, the PyTorch team, Ant Group, the Baseten team, the Oracle team, the Meituan team, the iFlytek MaaS team, and the open-source community users for their contributions!

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

Coming Soon

  • Large scale expert parallelism + PD disaggregation #4734 #5524
  • Pipeline Parallelism #5724
  • MLA Cutlass Backend #5390

What's Changed


Release v0.4.5

07 Apr 08:33
57f9960

Highlights

The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, the FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.

New Features

  • Llama 4 Support: We added support for the Llama 4 models with accuracy matching official benchmark numbers, achieving zero-shot scores of 75.2 on the MMLU Pro dataset for Llama-4-Scout-17B-16E-Instruct and 80.7 for Llama-4-Maverick-17B-128E-Instruct. #5092

  • FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709

  • EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput; a launch sketch follows this list. Learn more in our documentation and the EAGLE3 paper. #4247

  • DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.

  • Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
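
The following is a minimal launch sketch for EAGLE3 speculative decoding, assuming SGLang's speculative-decoding server flags; the target and draft model paths and the step, top-k, and draft-token values are illustrative placeholders rather than recommended settings.

    # Hypothetical example values; tune with the bench_speculative script for your model and hardware.
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --speculative-algorithm EAGLE3 \
      --speculative-draft-model-path <path-to-eagle3-draft-model> \
      --speculative-num-steps 5 \
      --speculative-eagle-topk 8 \
      --speculative-num-draft-tokens 32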

Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!

Coming Soon

  • Disaggregated Prefill and Decoding: #4655

  • Llama 4 Optimization: #5118

  • EP Enhancement: #4734

  • FA3 Enhancement: #4709

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

What's Changed


Release v0.4.4

13 Mar 18:21
6aaeb84

Highlights

The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and torch.compile optimizations on H200, it can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!

Thanks very much to the xAI team, NVIDIA team, AMD team, LinkedIn team, Baseten team, Oracle team, Meituan team, and the open-source community users for their contributions!

Beyond the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang to accelerate DeepSeek R1 inference. We are very happy to have received recognition and adoption from these teams!

There will surely be bugs that we'll be discovering and quickly patching in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!

Optimizations

  • AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog

  • Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations; enable with --enable-flashinfer-mla (a combined launch sketch follows this list).

  • Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script, compatible with radix cache and chunked prefill.

  • DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures; enable with export SGL_ENABLE_JIT_DEEPGEMM=1.

  • Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models.

  • Other Optimizations:

    • Blackwell architecture Block Scale FP8 GEMM support

    • Support page size greater than 1 #4356

    • Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89

    • Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) #4390
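
Putting the two switches quoted above together, a launch might look like the sketch below. The --enable-flashinfer-mla flag and the SGL_ENABLE_JIT_DEEPGEMM variable come from these notes; the DeepSeek model path, tensor-parallel size, and --trust-remote-code flag are illustrative assumptions.

    # Enable JIT DeepGEMM on Hopper (from the notes above)
    export SGL_ENABLE_JIT_DEEPGEMM=1
    # Launch DeepSeek R1 with the FlashInfer MLA backend (model path and TP size are placeholders)
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-R1 \
      --tp 8 \
      --trust-remote-code \
      --enable-flashinfer-mla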

Coming Soon

  • Integrate Flash Attention #4385

  • Integrate FlashMLA #4384

  • EAGLE 2 optimization #4383

  • EAGLE 3 day one support #4247

  • Integrate DeepEP #4232

  • Prefill and Decoding Disaggregation

What's Changed


Release v0.4.3

14 Feb 02:50
e0b9a42

Highlights

The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. In the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines. We stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!

Performance Improvements

DeepSeek V3/R1 Optimizations

  • Pioneering integration of FlashInfer MLA Attention delivers a 4x performance improvement for long-context scenarios (special thanks to the FlashInfer team, @yzh119) #3550
  • Added torch.compile support for FP8, achieving 50 tokens/s for online inference #3232
  • Implemented CUTLASS block-wise FP8 for enhanced efficiency

Architecture Enhancements

  • Upgraded to FlashInfer v0.2
  • Enabled Flash Attention 3 by default for prefill
  • Extended EAGLE 2 support:
    • Enhanced integration with FlashInfer backend
    • Added support in Triton backend

New Features

  • Introduced Function Calling capabilities
  • Added regex pattern support in the XGrammar backend (see the request sketch after this list)
  • Implemented custom sampling processor for flexible inference control
  • Integrated LoRA support in Triton backend
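
As a sketch of the regex support in the XGrammar backend, the request below asks a running server (assumed at localhost:30000) to constrain its output to an IPv4-shaped string via the native /generate endpoint; the endpoint shape, prompt, and pattern are assumptions for illustration.

    # Constrained generation: the "regex" field in sampling_params restricts the output format
    curl -s http://localhost:30000/generate \
      -H "Content-Type: application/json" \
      -d '{
            "text": "The IP address of localhost is ",
            "sampling_params": {
              "max_new_tokens": 32,
              "regex": "(\\d{1,3}\\.){3}\\d{1,3}"
            }
          }'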

What's Changed


Release v0.4.1

25 Dec 23:27
efc52f8

Highlights

  • We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3, currently the strongest open-source LLM, even surpassing GPT-4o.

    The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. We had already supported MLA optimization and DP attention, making SGLang one of the best open-source LLM engines for running DeepSeek models.

    Special thanks to Meituan's Search & Recommend Platform Team @ispobock @HandH1998 and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.

  • Various improvements to the cache-aware sglang router, torchao integration, and server termination

  • Added a standalone package, sgl-kernel, for supporting more custom kernels in the codebase.

What's Changed


Release v0.4.0

04 Dec 02:14
f8b0326

Highlights

blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/

We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:

  • Zero-overhead batch scheduler: 1.1x increase in throughput.
  • Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
  • Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
  • Fast structured outputs with xgrammar: up to 10x faster (see the launch sketch after this list).
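
For the data-parallelism-attention and xgrammar items, a launch could look like the sketch below, assuming the --enable-dp-attention, --dp-size, and --grammar-backend flags; the DeepSeek model path and parallelism sizes are placeholders, not tuned settings.

    # Data parallelism attention plus the xgrammar grammar backend (values illustrative)
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V2-Lite \
      --trust-remote-code \
      --tp 2 \
      --dp-size 2 \
      --enable-dp-attention \
      --grammar-backend xgrammar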

What's Changed


Release v0.3.6

22 Nov 11:36
9a00e6f

Highlights

  • Reduce CPU overhead by enabling overlap scheduler by default. 1.1x higher throughput. (#2105, #2067, #2095)
  • Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (#1970, #2061)
  • Cache-aware load balancer. 4x higher cache hit rate (#1934)
  • Support xgrammar backend for grammar-guided decoding (#2056)
  • Support Prometheus metrics (#1853, #1981) (see the launch sketch after this list)
  • Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
  • Support graceful termination (#1838) and watchdog (#1816)
  • Support notebook-style documentation (https://sgl-project.github.io/)
  • Add an offline benchmark script (#1968)
  • Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
  • New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
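
For the Prometheus metrics item, the sketch below assumes an --enable-metrics flag and a /metrics endpoint on the server port; the model path and port are placeholders.

    # Launch with metrics export enabled (flag assumed from the Prometheus item above)
    python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000 --enable-metrics
    # Scrape the Prometheus text endpoint (endpoint path assumed)
    curl -s http://localhost:30000/metrics | head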

What's Changed


Release v0.3.4.post1

22 Oct 04:30
1f26e8b

Highlights

  • Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
    • Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
  • Added Engine API for offline inference with reduced overhead. Usage. #1614 #1567
  • Added an overlap scheduler for reducing CPU overhead #1738
  • New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM 4 (#1736).
  • Added support for reward models #1525.
  • Added support for Intel XPU #1480.
  • Improved stability for greedy decoding #1589.
  • Accelerated multi-LoRA serving #1587.

What's Changed


Release v0.3.2

02 Oct 17:19
37c5899

Highlights

  • Support torch.compile and CUDA graphs for the Triton attention backend and DeepSeek MLA #1442 #1422
  • Initial support for multi-LoRA serving #1307 (see the launch sketch after this list)
  • Integrate torchao for quantization #1341
  • Optimize the CPU scheduler overhead
  • Multiple critical bug fixes for Llama and LLaVA (tokenizer, modality)
  • Support AMD backend #1420
  • New models: MiniCPM3, OLMoE
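
For the multi-LoRA and torchao items, a combined launch might look like the sketch below; the --lora-paths and --torchao-config flags, the adapter name and path, and the int4 weight-only setting are assumptions for illustration.

    # Serve a base model with one named LoRA adapter and torchao int4 weight-only quantization
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-2-7b-hf \
      --lora-paths my_adapter=/path/to/lora_adapter \
      --torchao-config int4wo-128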

What's Changed


Release v0.3.0

19 Sep 10:09
5ab9418

Highlights

Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.

  • Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
  • Up to 1.5x lower latency with torch.compile on small batch sizes
  • Support for interleaved text and multi-image/video in LLaVA-OneVision
  • Support for interleaved window attention and 2x longer context length in Gemma-2
  • Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mixed); see the launch sketch after this list.
  • Added multi-GPU accuracy and performance tests, and nightly accuracy tests for more models.
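
Since chunked prefill is now on by default, the sketch below only adjusts its chunk size; the --chunked-prefill-size flag, the model path, and the value shown are assumptions for illustration.

    # Chunked prefill is on by default; this only changes the chunk size (value is a placeholder)
    python -m sglang.launch_server --model-path google/gemma-2-9b-it --chunked-prefill-size 4096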

What's Changed
