
Development Roadmap (2025 H1) #4042

Open · 1 of 58 tasks

zhyncs opened this issue Mar 4, 2025 · 8 comments


@zhyncs (Member) commented Mar 4, 2025

Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (join the Bi-weekly Development Meeting). The previous 2024 Q4 roadmap can be found in #1487.

Focus

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Parallelism

Caching

Kernel

Quantization

RL Framework integration

Core refactor

Speculative decoding

Multi-LoRA serving

Hardware

  • Blackwell support @merrymercy
  • AMD GPU @HaiShaw
    • CK kernels
    • aiter integration
  • More backends (Intel XPU, TPU)

Model coverage

Function Calling

Others

@artetaout

Hi, regarding "Integrate TransformerEngine layers": which kind of TE layers do you want to integrate?

@Swipe4057

As part of the long context optimizations, will the implementation of HiP attention (#3930) be considered?

@zhaochenyang20 (Collaborator)

@Swipe4057 Thanks. We will review this and merge it.

@Zhuohao-Li

> Hi, regarding "Integrate TransformerEngine layers": which kind of TE layers do you want to integrate?

Hi @artetaout, right now it is layernorm_mlp; we also plan to borrow components from te.linear.
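
As a rough sketch (not SGLang's actual integration; the model dimensions below are made up), the fused TE layer in question can be used roughly like this:

```python
# Rough sketch of TransformerEngine's fused LayerNormMLP (illustrative
# only; not SGLang's integration, and the dimensions are hypothetical).
# te.LayerNormMLP fuses the pre-MLP LayerNorm with the two MLP GEMMs
# and the activation in between.
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size = 4096, 11008  # hypothetical sizes

layernorm_mlp = te.LayerNormMLP(
    hidden_size=hidden_size,
    ffn_hidden_size=ffn_hidden_size,
    params_dtype=torch.bfloat16,
).cuda()

# Only the last (hidden) dimension matters for the math here.
x = torch.randn(1024, 8, hidden_size, device="cuda", dtype=torch.bfloat16)
y = layernorm_mlp(x)  # output has the same shape as the input
```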

@SandroPats

Hi @zhyncs, could you please elaborate a bit on your plans for unsloth model support? Will you be supporting unsloth's 1.58-bit dynamic quantization for deepseek-r1?

@zhyncs (Member, Author) commented Mar 11, 2025

> Hi @zhyncs, could you please elaborate a bit on your plans for unsloth model support? Will you be supporting unsloth's 1.58-bit dynamic quantization for deepseek-r1?

Hi @SandroPats, please join https://slack.sglang.ai and discuss in #quantization. Thanks!

@artetaout

> Hi @artetaout, right now it is layernorm_mlp; we also plan to borrow components from te.linear.

Do we plan to get a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them, but didn't see an improvement in bf16.

@Zhuohao-Li

> Do we plan to get a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them, but didn't see an improvement in bf16.

In TE, if you need to enable TP overlap in inference only, you have to split the sequences manually (SP/TP). I guess that's perhaps why you did not see an improvement. You can join https://slack.sglang.ai/ and find me to discuss further.
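
For readers following along, here is a rough generic-PyTorch sketch of the manual sequence split described above (this is not TE's overlap API; the helper names are hypothetical):

```python
# Hypothetical helpers illustrating the manual sequence split (SP
# alongside TP) described above. Plain torch.distributed, not
# TransformerEngine's overlap machinery; assumes an initialized
# process group.
import torch
import torch.distributed as dist

def split_sequence(hidden: torch.Tensor) -> torch.Tensor:
    """Return this rank's contiguous shard of the sequence dimension.

    hidden: (seq_len, batch, hidden); seq_len must divide evenly by
    the TP world size (pad beforehand if it does not).
    """
    world_size, rank = dist.get_world_size(), dist.get_rank()
    seq_len = hidden.shape[0]
    assert seq_len % world_size == 0, "pad the sequence first"
    chunk = seq_len // world_size
    return hidden[rank * chunk:(rank + 1) * chunk]

def gather_sequence(local: torch.Tensor) -> torch.Tensor:
    """Inverse of split_sequence: all-gather the shards back together."""
    shards = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, local.contiguous())
    return torch.cat(shards, dim=0)
```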
