Please join our Slack channel: https://slack.sglang.ai. For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at [email protected].
The SGLang team is honored to announce that well-known companies and teams, including AMD, NVIDIA, Microsoft Azure, Baseten, Novita AI, ByteDance Volcengine, DataCrunch, Hyperbolic, Vultr, and RunPod, have adopted SGLang for running DeepSeek V3 and R1.
🎉 From July to December 2024, the SGLang team shipped three major releases: v0.2, v0.3, and v0.4. For detailed optimization insights, please refer to the corresponding blog posts listed below.
🚀 We're proud to announce that SGLang has been adopted as:
- The dominant LLM engine by AMD
- The default LLM engine by xAI
For more information, please check out AMD's ROCm 6.3 official announcement and xAI's presentation at the AMD Advancing AI Conference 2024.
- [2024-12-04] SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs
- [2024-09-04] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision
- [2024-07-25] Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)
- [2024-02-05] Fast JSON Decoding for Local LLMs with Compressed Finite State Machine
- [2024-01-17] Fast and Expressive LLM Inference with RadixAttention and SGLang
- [2024-11-13] SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD GPUs
- [2025-01-21] Accelerating LLM Inference with GemLite, TorchAO and SGLang
- [2025-01-15] Efficient LLM Inference with SGLang
- [2025-01-15] Cache-Aware Load Balancer in SGLang
- [2025-01-15] SGLang DeepSeek Model Optimizations
- [2024-12-21] SGLang v0.4 Optimization
- [2024-11-10] SGLang Performance Optimization
- [2024-10-16] SGLang Overview & CPU Overhead Hiding
- [2024-10-16] Faster Constrained Decoding
- [2024-10-16] SGLang DeepSeek MLA
- [2024-10-16] Universal LLM deployment and low-latency serving in MLC LLM
- [2024-10-16] XGrammar: Flexible And Efficient Structured Generation Engine for Large Language Models
- [2024-10-16] Review of the first LMSYS online meetup: Efficient LLM Deployment and Serving
- [2024-10-10] Efficient LLM Inference with SGLang
- [2025-01-25] A fair and efficient scheduling algorithm
- [2024-11-30] Update Weights From Distributed
- [2024-11-16] SGLang Router and Side-Channel KV Cache Attack
- [2024-11-02] Quantization on AMD
- [2024-10-05] SGLang Double Sparsity
- [2024-09-21] SGLang DeepSeek MLA
- SGLang v0.2: Faster Interface and Runtime for LLM Inference
Follow our YouTube channel for recordings of our talks and developer syncs.
- [2024-11-10] SGLang Performance Optimization
- [2024-10-16] The First SGLang Online Meetup
- [2024-10-10] Efficient LLM Inference with SGLang
- [2025-01-25] SGLang Developer Sync 20250125
- [2024-12-28] SGLang Developer Sync 20241228
- [2024-12-14] SGLang Developer Sync 20241214
- [2024-11-30] SGLang Developer Sync 20241130
- [2024-11-16] SGLang Developer Sync 20241116
- [2024-11-03] SGLang Developer Sync 20241103
- [2024-10-19] SGLang Developer Sync 20241019
- [2024-10-05] SGLang Developer Sync 20241005
- [2024-09-21] SGLang Developer Sync 20240921
[NeurIPS 2024] SGLang: Efficient Execution of Structured Language Model Programs