[Feature] implement async LoRA prefetch #14190
glenliu21 wants to merge 5 commits into sgl-project:main
Conversation
Summary of Changes

Hello @glenliu21, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces an asynchronous LoRA prefetching feature designed to improve the performance of LoRA adapter loading. By anticipating future LoRA adapter needs based on the scheduler's waiting queue, the system can proactively load those adapters in the background using a dedicated thread pool and CUDA stream. This optimization aims to significantly reduce the time requests spend waiting for LoRA weights, improving overall latency and responsiveness, as evidenced by the provided benchmarks.
Code Review
This pull request introduces an asynchronous LoRA prefetching mechanism, which significantly improves performance by reducing TTFT and E2E latency. The implementation follows the S-LoRA paper's prefetch policy and is well-integrated into the existing architecture. The changes are spread across multiple files, including the LoRA backends, memory pool, and scheduler, and are accompanied by a correctness test. My review focuses on a potential issue in the eviction policy and a suggestion for making the prefetch trigger more configurable. Overall, this is a great feature with a solid implementation.
/gemini review
Code Review
This pull request introduces an asynchronous LoRA prefetching mechanism, which significantly improves TTFT and E2E latency according to the benchmarks. The implementation is well-structured, with clear separation of concerns between the scheduler, memory pool, and LoRA backends. The use of a separate CUDA stream and a thread pool for prefetching is a good approach for achieving asynchrony.
My review includes a few suggestions to improve maintainability and robustness, such as making the prefetch interval configurable, refactoring duplicated code, and ensuring graceful shutdown of the thread pool. Overall, this is a great feature addition with a solid implementation.
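The separate-thread-plus-side-stream pattern the review praises can be sketched as a small, CPU-only simulation (names like `load_adapter_weights`, `prefetch`, and `wait_for` are illustrative placeholders; the real implementation copies adapter weights under a dedicated `torch.cuda.Stream` so the copy overlaps with compute on the default stream):

```python
import time
from concurrent.futures import ThreadPoolExecutor

loaded = set()    # adapters resident in the (simulated) GPU memory pool
in_flight = {}    # adapter name -> Future for a pending background load

def load_adapter_weights(name):
    # Placeholder for the real weight copy, which would run under a
    # dedicated torch.cuda.Stream to overlap with the main stream.
    time.sleep(0.01)
    return name

executor = ThreadPoolExecutor(max_workers=1)

def prefetch(name):
    """Schedule an adapter load in the background unless it is already
    loaded or a load for it is already pending (deduplication)."""
    if name in loaded or name in in_flight:
        return
    in_flight[name] = executor.submit(load_adapter_weights, name)

def wait_for(name):
    """Block until an adapter is usable (called when a batch needs it)."""
    if name in in_flight:
        loaded.add(in_flight.pop(name).result())

prefetch("adapter1")
prefetch("adapter1")      # deduplicated: still only one pending load
wait_for("adapter1")
executor.shutdown(wait=True)
```

The key property is that `prefetch` returns immediately, so the scheduler loop never blocks on adapter loading; only a batch that actually needs a still-loading adapter pays the wait in `wait_for`.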
Thank you @glenliu21 for the great work! I wonder, in the benchmark done by @ConnorLi96, what was
Thanks for the PR! I'd like to help validate the prefetch feature with a fair comparison. Based on our initial tests and design, I think there are a few things we need to consider.

First, the test configurations: both configs use the same memory footprint (16 adapter slots).

Test Scenarios

We'll test with varying adapter counts to stress the eviction/loading path:

- Scenario 1: Uniform distribution
- Scenario 2: Zipf distribution (α=1.5)
- Scenario 3: Bursty traffic

```
# Sequential bursts:
# - First 100 requests: adapter1-10
# - Next 100 requests: adapter11-30
# - Next 100 requests: adapter31-50
```

Expected: prefetch should anticipate each burst and preload the adapters, reducing batch intervals, and it should beat plain LRU since each burst uses different adapters.

Let me know your thoughts! Happy to run these tests and share detailed results! @glenliu21 @lifuhuang @Fridge003
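For the Zipf scenario above, a request trace can be generated with standard-library sampling; this is a hypothetical helper (the function name and seed are illustrative, not part of the PR), where adapter k is drawn with weight 1/k^α:

```python
import random

def zipf_adapter_trace(num_adapters, num_requests, alpha=1.5, seed=0):
    """Sample a request trace where adapter k is chosen with weight 1/k**alpha,
    so a few 'hot' adapters dominate while the tail stresses eviction."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** alpha) for k in range(1, num_adapters + 1)]
    names = [f"adapter{k}" for k in range(1, num_adapters + 1)]
    return rng.choices(names, weights=weights, k=num_requests)

# 300 requests over 50 adapters, heavily skewed toward adapter1/adapter2
trace = zipf_adapter_trace(num_adapters=50, num_requests=300)
```

With α=1.5 the head adapters stay resident under any reasonable policy, so the interesting measurements come from how often tail adapters trigger loads on the critical path.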
/gemini review
Code Review
This pull request introduces LoRA adapter prefetching functionality to improve performance. Key changes include:

- A new `max_loras_prefetch` parameter across various components: `ServerArgs`, `BaseLoRABackend` and its implementations (`AscendLoRABackend`, `ChunkedSgmvLoRABackend`, `TritonLoRABackend`), `LoRAManager`, and `LoRAMemoryPool`.
- The `LoRAMemoryPool` is expanded to accommodate prefetched LoRAs, and a dedicated CUDA stream is introduced for asynchronous prefetching.
- The `Scheduler` now includes logic to identify and submit LoRA prefetch tasks using a `ThreadPoolExecutor`, and `ScheduleBatch` and `ModelWorkerBatch` are updated to support this prefetching.
- The `ModelRunner` and `TPWorker` expose new methods to handle prefetch batches.

Review comments highlight the need to explicitly shut down the `ThreadPoolExecutor` in the `Scheduler` to prevent resource leaks, suggest refactoring a complex ternary expression for `stream_ctx` in `mem_pool.py` into a more readable if/elif/else block, and recommend using `self.lora_backend.max_loras_total` for consistency when initializing `lora_ranks` and `scalings` in `lora_manager.py`.
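The suggested `stream_ctx` refactor could look roughly like this (a hypothetical sketch; the function name, parameters, and branch conditions are illustrative, not the actual `mem_pool.py` code):

```python
import contextlib

def get_stream_ctx(prefetch_stream, is_prefetch: bool, device_is_cuda: bool):
    """Return the context manager under which LoRA weight copies run,
    replacing a nested ternary with explicit branches."""
    if device_is_cuda and is_prefetch:
        # Prefetch path: copy weights on the dedicated side stream so the
        # copy overlaps with compute on the default stream.
        import torch  # only needed on the CUDA prefetch path
        return torch.cuda.stream(prefetch_stream)
    elif device_is_cuda:
        # Normal GPU path: stay on the current (default) stream.
        return contextlib.nullcontext()
    else:
        # CPU / non-CUDA device: there is no stream to switch.
        return contextlib.nullcontext()

ctx = get_stream_ctx(None, is_prefetch=False, device_is_cuda=True)
```

Explicit branches make it obvious that only the prefetch-on-CUDA case diverges from the default stream, which is exactly the distinction the review asks to surface.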
```python
self.max_loras_per_batch = server_args.max_loras_per_batch
self.max_loras_prefetch = server_args.max_loras_prefetch
self.prefetch_loras_in_flight = set()
self.lora_prefetch_executor = futures.ThreadPoolExecutor(max_workers=1)
```
The ThreadPoolExecutor for LoRA prefetching is created but never shut down. This can lead to resource leaks, as the worker thread may not terminate cleanly. It's best practice to explicitly shut down the executor when it's no longer needed, for example, in a shutdown method for the Scheduler class which calls self.lora_prefetch_executor.shutdown(wait=True).
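The shutdown pattern suggested above could look like this minimal sketch (the `Scheduler` here is a stand-in, not the real sglang class, but the attribute name matches the diff):

```python
from concurrent.futures import ThreadPoolExecutor

class Scheduler:
    """Minimal stand-in illustrating explicit executor shutdown."""

    def __init__(self):
        self.lora_prefetch_executor = ThreadPoolExecutor(max_workers=1)

    def shutdown(self):
        # Wait for any in-flight prefetch to finish, then release the
        # worker thread so the process can exit cleanly.
        self.lora_prefetch_executor.shutdown(wait=True)

s = Scheduler()
fut = s.lora_prefetch_executor.submit(lambda: 42)
s.shutdown()  # returns only after the pending task has completed
```

`shutdown(wait=True)` guarantees no prefetch task is silently abandoned mid-copy, which matters more here than the thread itself, since an interrupted weight copy could leave the memory pool in an inconsistent state.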
Please see #15512 for the updated design.
Motivation
This PR addresses #8712. I used the prefetch policy described in S-LoRA, where LoRA adapters are prefetched based on the requests in the Scheduler's waiting queue.
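The policy of choosing what to prefetch from the waiting queue can be sketched as follows (a simplified illustration under assumed names; `select_prefetch_targets`, the queue representation, and the sets are hypothetical, not the PR's actual code):

```python
def select_prefetch_targets(waiting_queue, loaded, in_flight, max_loras_prefetch):
    """Scan the waiting queue in order and collect up to max_loras_prefetch
    adapters that are neither resident nor already being fetched.

    waiting_queue: adapter name per queued request (None = request has no LoRA).
    """
    targets = []
    for name in waiting_queue:
        if len(targets) >= max_loras_prefetch:
            break
        if name is None or name in loaded or name in in_flight or name in targets:
            continue
        targets.append(name)
    return targets

queue = ["a1", "a2", "a1", None, "a3", "a4"]
picked = select_prefetch_targets(queue, loaded={"a2"}, in_flight={"a3"},
                                 max_loras_prefetch=2)
# "a2" is already loaded and "a3" is already in flight, so the scan
# picks "a1" and then "a4".
```

Scanning in queue order means the prefetcher prioritizes the adapters that will be needed soonest, which is what lets background loads complete before their requests are scheduled.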
Modifications

- Added `max_loras_prefetch` as a server argument
- Added a `ForwardBatch` used as a LoRA prefetch batch, consisting of the requests next in line in the `Scheduler`'s waiting queue
- Used a `ThreadPoolExecutor` and a separate `torch.cuda.Stream` to enable async prefetching

Accuracy Tests
Benchmarking and Profiling
@ConnorLi96 ran the following commands to benchmark LoRA prefetching:
```shell
for i in {1..16}; do
  curl -s -X POST http://0.0.0.0:30001/load_lora_adapter \
    -H 'Content-Type: application/json' \
    -d "{\"lora_name\": \"adapter${i}\", \"lora_path\": \"/workspace/adapters/llama_3_1_8B_adapter\"}"
  echo " ✓ adapter${i}"
done
```

```shell
python3 -m sglang.bench_serving --backend sglang --base-url http://localhost:30001/ \
  --dataset-name random --num-prompts 100 --request-rate 4 \
  --random-input-len 2048 --random-output-len 1024 \
  --disable-ignore-eos --disable-tqdm \
  --lora-name adapter1 adapter2 adapter3 adapter4 adapter5 adapter6 adapter7 adapter8 \
              adapter9 adapter10 adapter11 adapter12 adapter13 adapter14 adapter15 adapter16
```

This yielded the following results:
Before
After
These show about a 31% decrease in TTFT and a 27% decrease in E2E latency.
Checklist