
Development Roadmap (2025 H1) #4042

Open · 1 of 58 tasks

zhyncs opened this issue Mar 4, 2025 · 8 comments


@zhyncs (Member) commented Mar 4, 2025

Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (join the Bi-weekly Development Meeting). The previous 2024 Q4 roadmap can be found in #1487.

Focus

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Parallelism

Caching

Kernel

Quantization

RL Framework integration

Core refactor

Speculative decoding

Multi-LoRA serving

Hardware

  • Blackwell support @merrymercy
  • AMD GPU @HaiShaw
    • CK kernels
    • aiter integration
  • More backends (Intel XPU, TPU)

Model coverage

Function Calling

Others

@artetaout

Hi, regarding "Integrate TransformerEngine layers": which kind of TE layers do you want to integrate?

@Swipe4057

As part of the long context optimizations, will the implementation of HiP attention (#3930) be considered?

@zhaochenyang20 (Collaborator)

@Swipe4057 Thanks. We will review this and merge it.

@Zhuohao-Li

> Hi, regarding "Integrate TransformerEngine layers": which kind of TE layers do you want to integrate?

Hi @artetaout, right now it is layernorm_mlp; we also plan to borrow components from te.linear.
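
As a rough sketch (not SGLang's actual integration; the model dimensions below are made up), the fused TE layer in question can be used roughly like this:

```python
# Rough sketch of TransformerEngine's fused LayerNormMLP (illustrative
# only; not SGLang's integration, and the dimensions are hypothetical).
# te.LayerNormMLP fuses the pre-MLP LayerNorm with the two MLP GEMMs
# and the activation in between.
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size = 4096, 11008  # hypothetical sizes

layernorm_mlp = te.LayerNormMLP(
    hidden_size=hidden_size,
    ffn_hidden_size=ffn_hidden_size,
    params_dtype=torch.bfloat16,
).cuda()

# Only the last (hidden) dimension matters for the math here.
x = torch.randn(1024, 8, hidden_size, device="cuda", dtype=torch.bfloat16)
y = layernorm_mlp(x)  # output has the same shape as the input
```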

@SandroPats

Hi @zhyncs, could you please elaborate a bit on your plans for unsloth model support? Will you be supporting unsloth's 1.58-bit dynamic quantization for deepseek-r1?

@zhyncs (Member, Author) commented Mar 11, 2025

> Hi @zhyncs, could you please elaborate a bit on your plans for unsloth model support? Will you be supporting unsloth's 1.58-bit dynamic quantization for deepseek-r1?

Hi @SandroPats, please join https://slack.sglang.ai and discuss in #quantization. Thanks!

@artetaout

> Hi @artetaout, right now it is layernorm_mlp; we also plan to borrow components from te.linear.

Do we plan to get a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them, but didn't see an improvement in bf16.

@Zhuohao-Li

> Do we plan to get a performance improvement via te.layernorm_mlp or te.layernorm_linear? I've integrated them, but didn't see an improvement in bf16.

In TE, if you need to enable TP overlap in inference only, you have to split the sequences manually (SP/TP). I guess that's perhaps why you did not see an improvement. You can join https://slack.sglang.ai/ and find me to discuss further.
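
For readers following along, here is a rough generic-PyTorch sketch of the manual sequence split described above (this is not TE's overlap API; the helper names are hypothetical):

```python
# Hypothetical helpers illustrating the manual sequence split (SP
# alongside TP) described above. Plain torch.distributed, not
# TransformerEngine's overlap machinery; assumes an initialized
# process group.
import torch
import torch.distributed as dist

def split_sequence(hidden: torch.Tensor) -> torch.Tensor:
    """Return this rank's contiguous shard of the sequence dimension.

    hidden: (seq_len, batch, hidden); seq_len must divide evenly by
    the TP world size (pad beforehand if it does not).
    """
    world_size, rank = dist.get_world_size(), dist.get_rank()
    seq_len = hidden.shape[0]
    assert seq_len % world_size == 0, "pad the sequence first"
    chunk = seq_len // world_size
    return hidden[rank * chunk:(rank + 1) * chunk]

def gather_sequence(local: torch.Tensor) -> torch.Tensor:
    """Inverse of split_sequence: all-gather the shards back together."""
    shards = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, local.contiguous())
    return torch.cat(shards, dim=0)
```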
