LightLLM v1.0.0 Release!
New Features
- Cross-Process Request Object:
  - Retained and optimized the previous three-process architecture design.
  - Introduced a request object that can be accessed across processes, significantly reducing inter-process communication overhead (see the sketch after this item).
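
The release notes don't describe the mechanism, but one common way to make a request object visible to multiple processes is to back its mutable state with a named shared-memory segment. The following is a minimal sketch of that idea; the class name, field layout, and naming convention are all illustrative assumptions, not LightLLM's actual API:

```python
import numpy as np
from multiprocessing import shared_memory

class ShmRequest:
    """Hypothetical sketch: request state lives in a named shared-memory
    segment, so the router, scheduler, and detokenizer processes can all
    read/write the same fields without pickling messages over pipes/queues."""

    FIELDS = ("status", "prompt_len", "generated_len", "finished")

    def __init__(self, req_id: int, create: bool = True):
        name = f"lightllm_req_{req_id}"  # hypothetical naming convention
        size = len(self.FIELDS) * np.dtype(np.int64).itemsize
        if create:
            self.shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        else:
            # Another process attaches to the existing segment by name.
            self.shm = shared_memory.SharedMemory(name=name)
        self._buf = np.ndarray((len(self.FIELDS),), dtype=np.int64, buffer=self.shm.buf)

    def get(self, field: str) -> int:
        return int(self._buf[self.FIELDS.index(field)])

    def set(self, field: str, value: int) -> None:
        # NOTE: a real implementation would need locks/atomics for concurrent writers.
        self._buf[self.FIELDS.index(field)] = value

    def close(self, unlink: bool = False) -> None:
        self.shm.close()
        if unlink:  # only the owning process should unlink the segment
            self.shm.unlink()
```

A worker process would attach with `ShmRequest(req_id, create=False)` and update fields such as `generated_len` in place, instead of sending a message back through the scheduler.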
- Folding of Scheduling and Model Inference:
  - Folded scheduling into the model inference loop, significantly reducing communication overhead between the scheduler and modelrpc (see the sketch below).
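
Conceptually, the folding turns scheduling from an RPC between processes into a plain function call inside the inference loop. A minimal sketch of that shape follows; `Req`, `schedule`, and `step`, along with the greedy admission policy, are illustrative stand-ins, not LightLLM's actual implementation:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    req_id: int
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens

def schedule(waiting: deque, running: list, max_batch: int = 8) -> list:
    """Greedy admission; a stand-in for the real scheduling policy."""
    admitted = []
    while waiting and len(running) + len(admitted) < max_batch:
        admitted.append(waiting.popleft())
    return admitted

def step(waiting: deque, running: list) -> list:
    running += schedule(waiting, running)  # in-process call, no IPC round-trip
    for r in running:
        r.generated += 1                   # stand-in for one fused decode step
    return [r for r in running if not r.finished]

waiting = deque(Req(i, max_new_tokens=3) for i in range(10))
running: list = []
while waiting or running:
    running = step(waiting, running)
print("all requests finished")
```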
- CacheTensorManager:
  - A new class that manages the allocation and release of Torch tensors within the framework.
  - Maximizes tensor sharing across layers at runtime and enhances memory sharing between different CUDA graphs (see the sketch below).
  - On an 8x80GB H100 machine running the DeepSeek-v2 model, LightLLM can launch 200 CUDA graphs concurrently without running out of memory (OOM).
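
The notes don't spell out how the manager works internally, but a common pattern for this kind of sharing is a pool keyed by tensor shape, dtype, and device that hands out buffers and takes them back on release. The sketch below shows that pattern; the class name mirrors the feature, but `alloc`, `release`, and the keying scheme are assumptions:

```python
from collections import defaultdict
import torch

class CacheTensorManager:
    """Hypothetical sketch: a pool that hands out tensors keyed by
    (shape, dtype, device) and takes them back on release, so buffers
    can be reused across layers and shared by multiple CUDA graphs
    instead of each graph holding its own private workspace."""

    def __init__(self):
        self._free = defaultdict(list)  # (shape, dtype, device) -> [tensor, ...]

    def alloc(self, shape, dtype=torch.float16, device="cuda"):
        key = (tuple(shape), dtype, str(device))
        if self._free[key]:
            return self._free[key].pop()  # reuse: no new device allocation
        return torch.empty(shape, dtype=dtype, device=device)

    def release(self, t: torch.Tensor) -> None:
        self._free[(tuple(t.shape), t.dtype, str(t.device))].append(t)

# Layer i releases its buffer before layer i+1 allocates, so one physical
# buffer can serve every layer; CUDA graphs captured against the same
# pooled addresses can then share memory as well.
manager = CacheTensorManager()
device = "cuda" if torch.cuda.is_available() else "cpu"
for layer in range(4):
    buf = manager.alloc((8, 4096), device=device)
    # ... layer computation writing into buf ...
    manager.release(buf)
```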
- PD-Disaggregation Prototype:
  - Dynamic registration of P (prefill) and D (decode) nodes (see the sketch below).
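
Dynamic registration typically means workers announce themselves to a registry with heartbeats, so nodes can join or leave at runtime without restarting the cluster. Here is a minimal sketch of that protocol; `PDRegistry`, `NodeInfo`, and the heartbeat timeout are illustrative, not LightLLM's actual wire format:

```python
import time
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    node_id: str
    role: str   # "P" (prefill) or "D" (decode)
    addr: str
    last_heartbeat: float = field(default_factory=time.time)

class PDRegistry:
    """Hypothetical central registry the router consults to find live nodes."""

    TIMEOUT_S = 10.0  # a node missing heartbeats this long is considered gone

    def __init__(self):
        self._nodes: dict[str, NodeInfo] = {}

    def register(self, node_id: str, role: str, addr: str) -> None:
        assert role in ("P", "D")
        self._nodes[node_id] = NodeInfo(node_id, role, addr)

    def heartbeat(self, node_id: str) -> None:
        self._nodes[node_id].last_heartbeat = time.time()

    def alive(self, role: str) -> list[NodeInfo]:
        now = time.time()
        return [n for n in self._nodes.values()
                if n.role == role and now - n.last_heartbeat < self.TIMEOUT_S]

registry = PDRegistry()
registry.register("p0", "P", "10.0.0.1:8000")
registry.register("d0", "D", "10.0.0.2:8000")
print([n.node_id for n in registry.alive("P")])  # router picks a live prefill node
```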
- Fastest DeepSeek-R1 performance on H200.
For more details, stay tuned to our blog at https://www.light-ai.top/lightllm-blog/. Our thanks go to outstanding projects like vllm, sglang, and trtllm; LightLLM also leverages some of the high-performance quantization kernels from vllm. We look forward to collaborating with the community to drive the growth of open source.