
[Feature] implement async LoRA prefetch#14190

Closed
glenliu21 wants to merge 5 commits into sgl-project:main from glenliu21:lora_prefetch

Conversation

@glenliu21
Contributor

Motivation

This PR addresses #8712. I used the prefetch policy described in S-LoRA, where LoRA adapters are prefetched based on the requests in the Scheduler's waiting queue.

Modifications

  • Added max_loras_prefetch as a server argument
  • Implemented creation of a ForwardBatch as a LoRA prefetch batch, consisting of the requests next in line on the Scheduler's waiting queue
  • Implemented the LoRA prefetch backend in LoRAManager, the memory pool, and the LoRA backend
  • Used a ThreadPoolExecutor and a separate torch.cuda.Stream to enable async prefetching
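A minimal sketch of the prefetch flow described above, with hypothetical names (`load_adapter`, `prefetch_from_waiting_queue`) standing in for the real Scheduler/LoRAManager methods; the actual implementation also issues the weight copies on a separate torch.cuda.Stream:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_LORAS_PREFETCH = 2            # mirrors the new server argument

loaded = {"adapter1"}             # adapters already resident in the memory pool
in_flight = set()                 # prefetches submitted but not yet finished
executor = ThreadPoolExecutor(max_workers=1)

def load_adapter(name):
    # Stand-in for the host->device weight copy; in the PR this work runs on
    # a dedicated CUDA stream so it can overlap with the compute stream.
    loaded.add(name)
    in_flight.discard(name)
    return name

def prefetch_from_waiting_queue(waiting_queue):
    """Submit loads for adapters the waiting queue will need next (S-LoRA policy)."""
    futs, submitted = [], 0
    for lora_name in waiting_queue:
        if submitted >= MAX_LORAS_PREFETCH:
            break
        if lora_name not in loaded and lora_name not in in_flight:
            in_flight.add(lora_name)
            futs.append(executor.submit(load_adapter, lora_name))
            submitted += 1
    return futs

futs = prefetch_from_waiting_queue(["adapter1", "adapter2", "adapter3", "adapter4"])
print(sorted(f.result() for f in futs))  # → ['adapter2', 'adapter3']
executor.shutdown(wait=True)
```

adapter1 is skipped because it is already loaded, and adapter4 is skipped because the per-round prefetch budget is exhausted.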

Accuracy Tests

  • Added a basic, end-to-end test to ensure that enabling LoRA prefetching doesn't change expected outputs

Benchmarking and Profiling

@ConnorLi96 ran the following commands to benchmark LoRA prefetching:

  1. for i in {1..16}; do curl -s -X POST http://0.0.0.0:30001/load_lora_adapter -H 'Content-Type: application/json' -d "{\"lora_name\": \"adapter${i}\", \"lora_path\": \"/workspace/adapters/llama_3_1_8B_adapter\"}"; echo " ✓ adapter${i}"; done
  2. python3 -m sglang.bench_serving --backend sglang --base-url http://localhost:30001/ --dataset-name random --num-prompts 100 --request-rate 4 --random-input-len 2048 --random-output-len 1024 --disable-ignore-eos --disable-tqdm --lora-name adapter1 adapter2 adapter3 adapter4 adapter5 adapter6 adapter7 adapter8 adapter9 adapter10 adapter11 adapter12 adapter13 adapter14 adapter15 adapter16

This yielded the following results:

Before

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22579.58  
Median E2E Latency (ms):                 22261.88  
---------------Time to First Token----------------
Mean TTFT (ms):                          16157.50  
Median TTFT (ms):                        15918.48  
P99 TTFT (ms):                           34927.59  

After

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17620.85  
Median E2E Latency (ms):                 16273.82  
---------------Time to First Token----------------
Mean TTFT (ms):                          11926.84  
Median TTFT (ms):                        10865.44  
P99 TTFT (ms):                           26765.88  

These show about a 31% decrease in median TTFT and a 27% decrease in median E2E latency.

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @glenliu21, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an asynchronous LoRA prefetching feature designed to enhance the performance of LoRA adapter loading within the system. By anticipating future LoRA adapter needs based on the scheduler's waiting queue, the system can proactively load these adapters in the background using a dedicated thread pool and CUDA stream. This optimization aims to significantly decrease the time requests spend waiting for LoRA weights, leading to improved overall latency and responsiveness, as evidenced by the provided benchmarks.

Highlights

  • Asynchronous LoRA Prefetching: Implemented a new mechanism to asynchronously prefetch LoRA adapters, based on the S-LoRA policy, to reduce latency.
  • Configurable Prefetching: Introduced a "max_loras_prefetch" server argument to control the number of LoRA adapters prefetched.
  • Performance Improvement: Benchmarking results demonstrate a significant reduction in Time to First Token (31%) and End-to-End Latency (27%).
  • Integrated Backend Support: Prefetching logic is integrated into the LoRAManager, memory pool, and various LoRA backends (Ascend, Chunked, Triton).
  • Accuracy Validation: Basic end-to-end tests confirm that enabling LoRA prefetching does not alter expected outputs.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces an asynchronous LoRA prefetching mechanism, which significantly improves performance by reducing TTFT and E2E latency. The implementation follows the S-LoRA paper's prefetch policy and is well-integrated into the existing architecture. The changes are spread across multiple files, including the LoRA backends, memory pool, and scheduler, and are accompanied by a correctness test. My review focuses on a potential issue in the eviction policy and a suggestion for making the prefetch trigger more configurable. Overall, this is a great feature with a solid implementation.

@glenliu21
Contributor Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces an asynchronous LoRA prefetching mechanism, which significantly improves TTFT and E2E latency according to the benchmarks. The implementation is well-structured, with clear separation of concerns between the scheduler, memory pool, and LoRA backends. The use of a separate CUDA stream and a thread pool for prefetching is a good approach for achieving asynchrony.

My review includes a few suggestions to improve maintainability and robustness, such as making the prefetch interval configurable, refactoring duplicated code, and ensuring graceful shutdown of the thread pool. Overall, this is a great feature addition with a solid implementation.

@lifuhuang
Collaborator

Thank you @glenliu21 for the great work!

I wonder, in the benchmark done by @ConnorLi96, what was max-loras-per-batch configured to? Can we run a set of benchmarks where the max-loras-per-batch of the control is twice as large as that of the treatment, to make sure we are separating the gains from:

  1. Smart prefetch mechanism (which is what we hope to see), and
  2. Gains coming from double memory buffer

@ConnorLi96
Contributor

ConnorLi96 commented Dec 6, 2025

Thanks for the PR! I'd like to help validate the prefetch feature with a fair comparison. Based on our initial tests and design, I think there are a few things we need to consider:

  1. Let's test scenarios where # adapters > cache capacity, which is more aligned with production serverless endpoints.
  2. I think we can use LRU as a baseline to evaluate the performance of prefetching, because with the same number of slots used by both methods, we expect prefetching to perform better than pure LRU.
  3. Let's use different workloads to compare them, like the 3 scenarios listed below.
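A pure-LRU baseline like the one proposed here can be approximated with an OrderedDict (a toy sketch; SGLang's memory pool manages GPU slots, this only illustrates the eviction order):

```python
from collections import OrderedDict

class LRUAdapterPool:
    """Toy LRU cache over adapter slots (the no-prefetch baseline)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()          # adapter name -> "weights" (stubbed)

    def get(self, name):
        if name in self.slots:
            self.slots.move_to_end(name)    # cache hit: refresh recency
            return self.slots[name]
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict least recently used
        self.slots[name] = name             # cache miss: "load" the adapter
        return self.slots[name]

pool = LRUAdapterPool(capacity=2)
for req in ["a1", "a2", "a1", "a3"]:
    pool.get(req)
print(list(pool.slots))  # → ['a1', 'a3']
```

a2 is evicted when a3 arrives, because a1 was touched more recently.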

So firstly, we will have Test Configurations like:

| Config       | max_loras_per_batch | max_loras_prefetch | max_loaded_loras | Strategy       |
|--------------|---------------------|--------------------|------------------|----------------|
| A (Prefetch) | 8                   | 8                  | 16               | Async prefetch |
| B (Baseline) | 16                  | 0                  | 16               | LRU only       |

Both configs use the same memory footprint (16 adapter slots).

Test Scenarios

We'll test with varying adapter counts to stress the eviction/loading path:

Scenario 1: Uniform Distribution
Expected: Prefetch should reduce tail latency (P99 TTFT) as adapter count increases.

Scenario 2: Zipf Distribution (α=1.5)
80% of requests hit 20% of adapters (closer to a realistic workload)
Expected: LRU might still be competitive since hot adapters stay cached. Prefetch could help with cold adapters.

Scenario 3: Bursty Traffic

# Sequential bursts: 
# - First 100 requests: adapter1-10
# - Next 100 requests: adapter11-30
# - Next 100 requests: adapter31-50

Expected: Prefetch should anticipate each burst and preload its adapters, reducing batch intervals, and should beat LRU since the bursts touch disjoint adapter sets.
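The three scenarios could be generated with something like the following (illustrative helper names; adapter naming follows the curl loop above):

```python
import random

def uniform_workload(n_requests, n_adapters, rng):
    # Scenario 1: every adapter equally likely.
    return [f"adapter{rng.randrange(1, n_adapters + 1)}" for _ in range(n_requests)]

def zipf_workload(n_requests, n_adapters, alpha, rng):
    # Scenario 2: P(rank k) ∝ 1 / k**alpha, so a few "hot" adapters dominate.
    names = [f"adapter{k}" for k in range(1, n_adapters + 1)]
    weights = [1.0 / k ** alpha for k in range(1, n_adapters + 1)]
    return rng.choices(names, weights=weights, k=n_requests)

def bursty_workload(bursts):
    # Scenario 3: sequential phases, each cycling through its own adapter range.
    # bursts: list of (n_requests, adapter_lo, adapter_hi) phases.
    out = []
    for n, lo, hi in bursts:
        pool = [f"adapter{k}" for k in range(lo, hi + 1)]
        out.extend(pool[i % len(pool)] for i in range(n))
    return out

rng = random.Random(0)
burst = bursty_workload([(100, 1, 10), (100, 11, 30), (100, 31, 50)])
print(len(burst), burst[0], burst[100], burst[200])  # → 300 adapter1 adapter11 adapter31
```

Each generated list can be fed to bench_serving as the per-request lora-name sequence.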

Let me know your thoughts! Happy to run these tests and share detailed results! @glenliu21 @lifuhuang @Fridge003

@glenliu21
Contributor Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces LoRA adapter prefetching functionality to improve performance. Key changes include adding a max_loras_prefetch parameter across various components like ServerArgs, BaseLoRABackend and its implementations (AscendLoRABackend, ChunkedSgmvLoRABackend, TritonLoRABackend), LoRAManager, and LoRAMemoryPool. The LoRAMemoryPool is expanded to accommodate prefetched LoRAs, and a dedicated CUDA stream is introduced for asynchronous prefetching. The Scheduler now includes logic to identify and submit LoRA prefetch tasks using a ThreadPoolExecutor, and ScheduleBatch and ModelWorkerBatch are updated to support this prefetching. The ModelRunner and TPWorker expose new methods to handle prefetch batches.

Review comments highlight the need to explicitly shut down the ThreadPoolExecutor in the Scheduler to prevent resource leaks, suggest refactoring a complex ternary expression for stream_ctx in mem_pool.py into a more readable if/elif/else block, and recommend using self.lora_backend.max_loras_total for consistency when initializing lora_ranks and scalings in lora_manager.py.

self.max_loras_per_batch = server_args.max_loras_per_batch
self.max_loras_prefetch = server_args.max_loras_prefetch
self.prefetch_loras_in_flight = set()
self.lora_prefetch_executor = futures.ThreadPoolExecutor(max_workers=1)


Severity: high

The ThreadPoolExecutor for LoRA prefetching is created but never shut down. This can lead to resource leaks, as the worker thread may not terminate cleanly. It's best practice to explicitly shut down the executor when it's no longer needed, for example, in a shutdown method for the Scheduler class which calls self.lora_prefetch_executor.shutdown(wait=True).
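The suggested fix, as a minimal sketch (the Scheduler skeleton here is a stand-in for the real class, not the PR's actual code):

```python
from concurrent import futures

class Scheduler:
    def __init__(self):
        # Same single-worker executor as in the snippet above.
        self.lora_prefetch_executor = futures.ThreadPoolExecutor(max_workers=1)

    def shutdown(self):
        # Drain in-flight prefetches, then release the worker thread.
        self.lora_prefetch_executor.shutdown(wait=True)

s = Scheduler()
fut = s.lora_prefetch_executor.submit(lambda: "prefetched")
s.shutdown()
print(fut.result())  # → prefetched
```

shutdown(wait=True) blocks until queued tasks finish, so no prefetch is silently dropped on exit.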

@glenliu21
Contributor Author

Please see #15512 for the updated design.

@glenliu21 glenliu21 closed this Dec 23, 2025
@glenliu21 glenliu21 deleted the lora_prefetch branch January 24, 2026 15:35