LightLLM v1.0.0 Release!
New Features
- Cross-Process Request Object:
  - Retained and optimized the previous three-process architecture design.
  - Introduced a request object that can be accessed across processes, significantly reducing inter-process communication overhead (see the sketch after this item).
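
The release notes don't describe the mechanism, but one common way to make a request object visible to multiple processes is to back its mutable state with a named shared-memory segment. The following is a minimal sketch of that idea; the class name, field layout, and naming convention are all illustrative assumptions, not LightLLM's actual API:

```python
import numpy as np
from multiprocessing import shared_memory

class ShmRequest:
    """Hypothetical sketch: request state lives in a named shared-memory
    segment, so the router, scheduler, and detokenizer processes can all
    read/write the same fields without pickling messages over pipes/queues."""

    FIELDS = ("status", "prompt_len", "generated_len", "finished")

    def __init__(self, req_id: int, create: bool = True):
        name = f"lightllm_req_{req_id}"  # hypothetical naming convention
        size = len(self.FIELDS) * np.dtype(np.int64).itemsize
        if create:
            self.shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        else:
            # Another process attaches to the existing segment by name.
            self.shm = shared_memory.SharedMemory(name=name)
        self._buf = np.ndarray((len(self.FIELDS),), dtype=np.int64, buffer=self.shm.buf)

    def get(self, field: str) -> int:
        return int(self._buf[self.FIELDS.index(field)])

    def set(self, field: str, value: int) -> None:
        # NOTE: a real implementation would need locks/atomics for concurrent writers.
        self._buf[self.FIELDS.index(field)] = value

    def close(self, unlink: bool = False) -> None:
        self.shm.close()
        if unlink:  # only the owning process should unlink the segment
            self.shm.unlink()
```

A worker process would attach with `ShmRequest(req_id, create=False)` and update fields such as `generated_len` in place, instead of sending a message back through the scheduler.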
- Folding of Scheduling and Model Inference:
  - Folded scheduling into the model inference loop, significantly reducing communication overhead between the scheduler and modelrpc (see the sketch below).
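
Conceptually, the folding turns scheduling from an RPC between processes into a plain function call inside the inference loop. A minimal sketch of that shape follows; `Req`, `schedule`, and `step`, along with the greedy admission policy, are illustrative stand-ins, not LightLLM's actual implementation:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    req_id: int
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens

def schedule(waiting: deque, running: list, max_batch: int = 8) -> list:
    """Greedy admission; a stand-in for the real scheduling policy."""
    admitted = []
    while waiting and len(running) + len(admitted) < max_batch:
        admitted.append(waiting.popleft())
    return admitted

def step(waiting: deque, running: list) -> list:
    running += schedule(waiting, running)  # in-process call, no IPC round-trip
    for r in running:
        r.generated += 1                   # stand-in for one fused decode step
    return [r for r in running if not r.finished]

waiting = deque(Req(i, max_new_tokens=3) for i in range(10))
running: list = []
while waiting or running:
    running = step(waiting, running)
print("all requests finished")
```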
- CacheTensorManager:
  - A new class that manages the allocation and release of Torch tensors within the framework.
  - Maximizes tensor sharing across layers at runtime and enhances memory sharing between different CUDA graphs (see the sketch below).
  - On an 8x80GB H100 machine running the DeepSeek-v2 model, LightLLM can launch 200 CUDA graphs concurrently without running out of memory (OOM).
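
The notes don't spell out how the manager works internally, but a common pattern for this kind of sharing is a pool keyed by tensor shape, dtype, and device that hands out buffers and takes them back on release. The sketch below shows that pattern; the class name mirrors the feature, but `alloc`, `release`, and the keying scheme are assumptions:

```python
from collections import defaultdict
import torch

class CacheTensorManager:
    """Hypothetical sketch: a pool that hands out tensors keyed by
    (shape, dtype, device) and takes them back on release, so buffers
    can be reused across layers and shared by multiple CUDA graphs
    instead of each graph holding its own private workspace."""

    def __init__(self):
        self._free = defaultdict(list)  # (shape, dtype, device) -> [tensor, ...]

    def alloc(self, shape, dtype=torch.float16, device="cuda"):
        key = (tuple(shape), dtype, str(device))
        if self._free[key]:
            return self._free[key].pop()  # reuse: no new device allocation
        return torch.empty(shape, dtype=dtype, device=device)

    def release(self, t: torch.Tensor) -> None:
        self._free[(tuple(t.shape), t.dtype, str(t.device))].append(t)

# Layer i releases its buffer before layer i+1 allocates, so one physical
# buffer can serve every layer; CUDA graphs captured against the same
# pooled addresses can then share memory as well.
manager = CacheTensorManager()
device = "cuda" if torch.cuda.is_available() else "cpu"
for layer in range(4):
    buf = manager.alloc((8, 4096), device=device)
    # ... layer computation writing into buf ...
    manager.release(buf)
```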
- PD-Disaggregation Prototype:
  - Dynamic registration of P (prefill) and D (decode) nodes (see the sketch below).
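
Dynamic registration typically means workers announce themselves to a registry with heartbeats, so nodes can join or leave at runtime without restarting the cluster. Here is a minimal sketch of that protocol; `PDRegistry`, `NodeInfo`, and the heartbeat timeout are illustrative, not LightLLM's actual wire format:

```python
import time
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    node_id: str
    role: str   # "P" (prefill) or "D" (decode)
    addr: str
    last_heartbeat: float = field(default_factory=time.time)

class PDRegistry:
    """Hypothetical central registry the router consults to find live nodes."""

    TIMEOUT_S = 10.0  # a node missing heartbeats this long is considered gone

    def __init__(self):
        self._nodes: dict[str, NodeInfo] = {}

    def register(self, node_id: str, role: str, addr: str) -> None:
        assert role in ("P", "D")
        self._nodes[node_id] = NodeInfo(node_id, role, addr)

    def heartbeat(self, node_id: str) -> None:
        self._nodes[node_id].last_heartbeat = time.time()

    def alive(self, role: str) -> list[NodeInfo]:
        now = time.time()
        return [n for n in self._nodes.values()
                if n.role == role and now - n.last_heartbeat < self.TIMEOUT_S]

registry = PDRegistry()
registry.register("p0", "P", "10.0.0.1:8000")
registry.register("d0", "D", "10.0.0.2:8000")
print([n.node_id for n in registry.alive("P")])  # router picks a live prefill node
```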
- Fastest DeepSeek-R1 performance on H200.
For more details, stay tuned to our blog at https://www.light-ai.top/lightllm-blog/. Our thanks go to outstanding projects like vllm, sglang, and trtllm; LightLLM also leverages some of the high-performance quantization kernels from vllm. We look forward to collaborating with the community to drive the growth of open source.