
LightLLM v1.0.0 Release!

@shihaobai released this 18 Feb 05:09 · commit 2d768aa

New Features

  • Cross-Process Request Object:

    • Retained and optimized the previous three-process architecture design.
    • Introduced a request object that can be accessed across processes, significantly reducing inter-process communication overhead (a minimal sketch follows this item).
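
A minimal sketch of the idea, not LightLLM's actual implementation (the `SharedRequest` class and its field layout are hypothetical): the request's mutable state lives in a named shared-memory block, so any process can attach by name and read or update fields in place instead of exchanging messages per token.

```python
# Hypothetical sketch: request state kept in shared memory that the router,
# scheduler, and detokenizer processes can all open by name and mutate in
# place, avoiding per-token IPC messages.
import struct
from multiprocessing import shared_memory

class SharedRequest:
    # Fixed layout: request_id, prompt_len, generated_len, finished (all int64).
    FMT = "qqqq"
    SIZE = struct.calcsize(FMT)

    def __init__(self, name=None, request_id=0, prompt_len=0):
        if name is None:  # creator process allocates the block
            self.shm = shared_memory.SharedMemory(create=True, size=self.SIZE)
            struct.pack_into(self.FMT, self.shm.buf, 0, request_id, prompt_len, 0, 0)
        else:             # other processes attach to the existing block by name
            self.shm = shared_memory.SharedMemory(name=name)

    def fields(self):
        return struct.unpack_from(self.FMT, self.shm.buf, 0)

    def advance(self, finished=False):
        rid, plen, glen, _ = self.fields()
        struct.pack_into(self.FMT, self.shm.buf, 0, rid, plen, glen + 1, int(finished))

req = SharedRequest(request_id=7, prompt_len=1024)
# Another process would attach with: SharedRequest(name=req.shm.name)
req.advance()
assert req.fields()[2] == 1
req.shm.close(); req.shm.unlink()  # creator cleans up when the request finishes
```

A real implementation would also need per-field synchronization; the point here is only that reads and writes go through shared memory rather than an RPC channel.
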
  • Folding of Scheduling and Model Inference:

    • Overlapped scheduling with model inference so the two stages run concurrently, significantly reducing communication overhead between the scheduler and modelrpc (see the sketch after this item).
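
A rough illustration of the folding, using hypothetical stand-ins `schedule_next_batch` and `run_model` rather than LightLLM's API: while the model executes batch N, the scheduler is already assembling batch N+1, so the two stages no longer take strict turns over an RPC boundary.

```python
# Hypothetical sketch of folded scheduling: batch N+1 is prepared on a worker
# thread while batch N runs, instead of alternating schedule -> RPC -> run.
import time
from concurrent.futures import ThreadPoolExecutor

def schedule_next_batch(step):
    time.sleep(0.001)                      # stand-in for queue scan / memory planning
    return [f"req-{step}-{i}" for i in range(4)]

def run_model(batch):
    time.sleep(0.005)                      # stand-in for a forward pass
    return [f"token:{r}" for r in batch]

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(schedule_next_batch, 0)
    for step in range(1, 4):
        batch = pending.result()           # batch prepared during the previous pass
        pending = pool.submit(schedule_next_batch, step)  # overlaps with run_model
        outputs = run_model(batch)
        print(step, outputs[0])
    pending.result()                       # drain the final scheduled batch
```
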
  • CacheTensorManager:

    • New class to manage the allocation and release of Torch tensors within the framework.
    • Maximizes tensor sharing across layers at runtime and enhances memory sharing between different CUDA graphs.
    • On an 8x80GB H100 machine running the DeepSeek-v2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM); a simplified sketch follows this item.
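
The class name comes from the release note, but the interface below is an assumed simplification: a pool keyed by `(shape, dtype)` that hands the same backing storage back out on reuse, which is the kind of mechanism that lets layers, and buffers referenced by multiple captured CUDA graphs, share memory.

```python
# Simplified sketch of a cache tensor manager (interface assumed, not
# LightLLM's actual API): tensors are pooled by (shape, dtype) and reused,
# so successive layers / CUDA graphs share the same backing buffers.
from collections import defaultdict
import torch

class CacheTensorManager:
    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = device
        self.free = defaultdict(list)      # (shape, dtype) -> reusable tensors

    def alloc(self, shape, dtype=torch.float16):
        key = (tuple(shape), dtype)
        if self.free[key]:
            return self.free[key].pop()    # reuse an existing buffer
        return torch.empty(shape, dtype=dtype, device=self.device)

    def release(self, tensor):
        self.free[(tuple(tensor.shape), tensor.dtype)].append(tensor)

mgr = CacheTensorManager()
a = mgr.alloc((8, 4096))
mgr.release(a)                             # layer i is done with its buffer
b = mgr.alloc((8, 4096))                   # layer i+1 gets the same storage back
assert a.data_ptr() == b.data_ptr()
```
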
  • PD-Disaggregation Prototype:

    • Dynamic registration of P (prefill) and D (decode) nodes (a minimal sketch follows this item).
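
A minimal sketch of dynamic registration (the `PDRegistry` below is hypothetical, not LightLLM's actual protocol): prefill (P) and decode (D) workers announce themselves at runtime, and the router dispatches against whatever is currently registered.

```python
# Hypothetical sketch: P/D workers register and unregister dynamically, so
# nodes can join or leave a running deployment without reconfiguration.
from dataclasses import dataclass, field

@dataclass
class PDRegistry:
    prefill_nodes: dict = field(default_factory=dict)  # node_id -> address
    decode_nodes: dict = field(default_factory=dict)

    def register(self, node_id, role, address):
        pool = self.prefill_nodes if role == "P" else self.decode_nodes
        pool[node_id] = address

    def unregister(self, node_id):
        self.prefill_nodes.pop(node_id, None)
        self.decode_nodes.pop(node_id, None)

registry = PDRegistry()
registry.register("p0", "P", "10.0.0.1:8000")  # prefill node joins at runtime
registry.register("d0", "D", "10.0.0.2:8000")  # decode node joins at runtime
```
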
  • Fastest DeepSeek-R1 performance on H200:

    • Baselines: sglang==0.4.3, vllm==0.7.2, trtllm==0.17.0.
    • Workload: num_clients = 100; each request has an input length of 1024 tokens, and output lengths follow a Gaussian distribution with a mean of 128 tokens (a rough reproduction sketch follows this list).
    • [Figure: DeepSeek-R1 benchmark results on H200]
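
For reference, the described workload can be generated roughly as follows; the standard deviation of the output-length distribution is an assumption, since only the mean of 128 is stated.

```python
# Rough reproduction of the benchmark workload described above. The output
# std-dev is assumed; the release notes only give the mean.
import random

NUM_CLIENTS = 100
INPUT_LEN = 1024
OUTPUT_MEAN = 128
OUTPUT_STD = 32  # assumption, not stated in the release

requests = [
    {"input_len": INPUT_LEN,
     "output_len": max(1, round(random.gauss(OUTPUT_MEAN, OUTPUT_STD)))}
    for _ in range(NUM_CLIENTS)
]
```
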

For more details, stay tuned to our blog at https://www.light-ai.top/lightllm-blog/. We thank outstanding projects such as vllm, sglang, and trtllm; LightLLM also leverages some of the high-performance quantization kernels from vllm. We hope to collaborate in driving the growth of the open-source community.