[Feature] implement async LoRA prefetch #14190
glenliu21 wants to merge 5 commits into sgl-project:main
Conversation
Summary of Changes

Hello @glenliu21, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces an asynchronous LoRA prefetching feature designed to improve the performance of LoRA adapter loading. By anticipating future LoRA adapter needs based on the scheduler's waiting queue, the system can proactively load those adapters in the background using a dedicated thread pool and CUDA stream. This optimization aims to significantly reduce the time requests spend waiting for LoRA weights, improving overall latency and responsiveness, as evidenced by the provided benchmarks.
Code Review
This pull request introduces an asynchronous LoRA prefetching mechanism, which significantly improves performance by reducing TTFT and E2E latency. The implementation follows the S-LoRA paper's prefetch policy and is well-integrated into the existing architecture. The changes are spread across multiple files, including the LoRA backends, memory pool, and scheduler, and are accompanied by a correctness test. My review focuses on a potential issue in the eviction policy and a suggestion for making the prefetch trigger more configurable. Overall, this is a great feature with a solid implementation.
/gemini review
Code Review
This pull request introduces an asynchronous LoRA prefetching mechanism, which significantly improves TTFT and E2E latency according to the benchmarks. The implementation is well-structured, with clear separation of concerns between the scheduler, memory pool, and LoRA backends. The use of a separate CUDA stream and a thread pool for prefetching is a good approach for achieving asynchrony.
My review includes a few suggestions to improve maintainability and robustness, such as making the prefetch interval configurable, refactoring duplicated code, and ensuring graceful shutdown of the thread pool. Overall, this is a great feature addition with a solid implementation.
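The separate-thread-plus-side-stream pattern the review praises can be sketched as a small, CPU-only simulation (names like `load_adapter_weights`, `prefetch`, and `wait_for` are illustrative placeholders; the real implementation copies adapter weights under a dedicated `torch.cuda.Stream` so the copy overlaps with compute on the default stream):

```python
import time
from concurrent.futures import ThreadPoolExecutor

loaded = set()    # adapters resident in the (simulated) GPU memory pool
in_flight = {}    # adapter name -> Future for a pending background load

def load_adapter_weights(name):
    # Placeholder for the real weight copy, which would run under a
    # dedicated torch.cuda.Stream to overlap with the main stream.
    time.sleep(0.01)
    return name

executor = ThreadPoolExecutor(max_workers=1)

def prefetch(name):
    """Schedule an adapter load in the background unless it is already
    loaded or a load for it is already pending (deduplication)."""
    if name in loaded or name in in_flight:
        return
    in_flight[name] = executor.submit(load_adapter_weights, name)

def wait_for(name):
    """Block until an adapter is usable (called when a batch needs it)."""
    if name in in_flight:
        loaded.add(in_flight.pop(name).result())

prefetch("adapter1")
prefetch("adapter1")      # deduplicated: still only one pending load
wait_for("adapter1")
executor.shutdown(wait=True)
```

The key property is that `prefetch` returns immediately, so the scheduler loop never blocks on adapter loading; only a batch that actually needs a still-loading adapter pays the wait in `wait_for`.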
Thank you @glenliu21 for the great work! I wonder, in the benchmark done by @ConnorLi96, what was
Thanks for the PR! I'd like to help validate the prefetch feature with a fair comparison. Based on our initial tests and design, I think there are a few things we need to consider.

First, the test configurations: both configs use the same memory footprint (16 adapter slots).

Test Scenarios

We'll test with varying adapter counts to stress the eviction/loading path:

- Scenario 1: Uniform distribution
- Scenario 2: Zipf distribution (α=1.5)
- Scenario 3: Bursty traffic

```
# Sequential bursts:
# - First 100 requests: adapter1-10
# - Next 100 requests: adapter11-30
# - Next 100 requests: adapter31-50
```

Expected: prefetch should anticipate each burst and preload the adapters, reducing batch intervals, and it should beat plain LRU since each burst uses different adapters.

Let me know your thoughts! Happy to run these tests and share detailed results! @glenliu21 @lifuhuang @Fridge003
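For the Zipf scenario above, a request trace can be generated with standard-library sampling; this is a hypothetical helper (the function name and seed are illustrative, not part of the PR), where adapter k is drawn with weight 1/k^α:

```python
import random

def zipf_adapter_trace(num_adapters, num_requests, alpha=1.5, seed=0):
    """Sample a request trace where adapter k is chosen with weight 1/k**alpha,
    so a few 'hot' adapters dominate while the tail stresses eviction."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** alpha) for k in range(1, num_adapters + 1)]
    names = [f"adapter{k}" for k in range(1, num_adapters + 1)]
    return rng.choices(names, weights=weights, k=num_requests)

# 300 requests over 50 adapters, heavily skewed toward adapter1/adapter2
trace = zipf_adapter_trace(num_adapters=50, num_requests=300)
```

With α=1.5 the head adapters stay resident under any reasonable policy, so the interesting measurements come from how often tail adapters trigger loads on the critical path.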
/gemini review
Code Review
This pull request introduces LoRA adapter prefetching functionality to improve performance. Key changes include:

- A new `max_loras_prefetch` parameter across various components: `ServerArgs`, `BaseLoRABackend` and its implementations (`AscendLoRABackend`, `ChunkedSgmvLoRABackend`, `TritonLoRABackend`), `LoRAManager`, and `LoRAMemoryPool`.
- The `LoRAMemoryPool` is expanded to accommodate prefetched LoRAs, and a dedicated CUDA stream is introduced for asynchronous prefetching.
- The `Scheduler` now includes logic to identify and submit LoRA prefetch tasks using a `ThreadPoolExecutor`, and `ScheduleBatch` and `ModelWorkerBatch` are updated to support this prefetching.
- The `ModelRunner` and `TPWorker` expose new methods to handle prefetch batches.

Review comments highlight the need to explicitly shut down the `ThreadPoolExecutor` in the `Scheduler` to prevent resource leaks, suggest refactoring a complex ternary expression for `stream_ctx` in `mem_pool.py` into a more readable if/elif/else block, and recommend using `self.lora_backend.max_loras_total` for consistency when initializing `lora_ranks` and `scalings` in `lora_manager.py`.
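The suggested `stream_ctx` refactor could look roughly like this (a hypothetical sketch; the function name, parameters, and branch conditions are illustrative, not the actual `mem_pool.py` code):

```python
import contextlib

def get_stream_ctx(prefetch_stream, is_prefetch: bool, device_is_cuda: bool):
    """Return the context manager under which LoRA weight copies run,
    replacing a nested ternary with explicit branches."""
    if device_is_cuda and is_prefetch:
        # Prefetch path: copy weights on the dedicated side stream so the
        # copy overlaps with compute on the default stream.
        import torch  # only needed on the CUDA prefetch path
        return torch.cuda.stream(prefetch_stream)
    elif device_is_cuda:
        # Normal GPU path: stay on the current (default) stream.
        return contextlib.nullcontext()
    else:
        # CPU / non-CUDA device: there is no stream to switch.
        return contextlib.nullcontext()

ctx = get_stream_ctx(None, is_prefetch=False, device_is_cuda=True)
```

Explicit branches make it obvious that only the prefetch-on-CUDA case diverges from the default stream, which is exactly the distinction the review asks to surface.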
```python
self.max_loras_per_batch = server_args.max_loras_per_batch
self.max_loras_prefetch = server_args.max_loras_prefetch
self.prefetch_loras_in_flight = set()
self.lora_prefetch_executor = futures.ThreadPoolExecutor(max_workers=1)
```
The ThreadPoolExecutor for LoRA prefetching is created but never shut down. This can lead to resource leaks, as the worker thread may not terminate cleanly. It's best practice to explicitly shut down the executor when it's no longer needed, for example, in a shutdown method for the Scheduler class which calls self.lora_prefetch_executor.shutdown(wait=True).
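The shutdown pattern suggested above could look like this minimal sketch (the `Scheduler` here is a stand-in, not the real sglang class, but the attribute name matches the diff):

```python
from concurrent.futures import ThreadPoolExecutor

class Scheduler:
    """Minimal stand-in illustrating explicit executor shutdown."""

    def __init__(self):
        self.lora_prefetch_executor = ThreadPoolExecutor(max_workers=1)

    def shutdown(self):
        # Wait for any in-flight prefetch to finish, then release the
        # worker thread so the process can exit cleanly.
        self.lora_prefetch_executor.shutdown(wait=True)

s = Scheduler()
fut = s.lora_prefetch_executor.submit(lambda: 42)
s.shutdown()  # returns only after the pending task has completed
```

`shutdown(wait=True)` guarantees no prefetch task is silently abandoned mid-copy, which matters more here than the thread itself, since an interrupted weight copy could leave the memory pool in an inconsistent state.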
Please see #15512 for the updated design.
Motivation
This PR addresses #8712. I used the prefetch policy described in S-LoRA, where LoRA adapters are prefetched based on the requests in the Scheduler's waiting queue.
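The policy of choosing what to prefetch from the waiting queue can be sketched as follows (a simplified illustration under assumed names; `select_prefetch_targets`, the queue representation, and the sets are hypothetical, not the PR's actual code):

```python
def select_prefetch_targets(waiting_queue, loaded, in_flight, max_loras_prefetch):
    """Scan the waiting queue in order and collect up to max_loras_prefetch
    adapters that are neither resident nor already being fetched.

    waiting_queue: adapter name per queued request (None = request has no LoRA).
    """
    targets = []
    for name in waiting_queue:
        if len(targets) >= max_loras_prefetch:
            break
        if name is None or name in loaded or name in in_flight or name in targets:
            continue
        targets.append(name)
    return targets

queue = ["a1", "a2", "a1", None, "a3", "a4"]
picked = select_prefetch_targets(queue, loaded={"a2"}, in_flight={"a3"},
                                 max_loras_prefetch=2)
# "a2" is already loaded and "a3" is already in flight, so the scan
# picks "a1" and then "a4".
```

Scanning in queue order means the prefetcher prioritizes the adapters that will be needed soonest, which is what lets background loads complete before their requests are scheduled.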
Modifications

- Added `max_loras_prefetch` as a server argument
- Added a `ForwardBatch` used as a LoRA prefetch batch, consisting of the requests next in line in the `Scheduler`'s waiting queue
- Used a `ThreadPoolExecutor` and a separate `torch.cuda.Stream` to enable async prefetching

Accuracy Tests
Benchmarking and Profiling
@ConnorLi96 ran the following commands to benchmark LoRA prefetching:
```shell
for i in {1..16}; do
  curl -s -X POST http://0.0.0.0:30001/load_lora_adapter \
    -H 'Content-Type: application/json' \
    -d "{\"lora_name\": \"adapter${i}\", \"lora_path\": \"/workspace/adapters/llama_3_1_8B_adapter\"}"
  echo " ✓ adapter${i}"
done
```

```shell
python3 -m sglang.bench_serving --backend sglang --base-url http://localhost:30001/ \
  --dataset-name random --num-prompts 100 --request-rate 4 \
  --random-input-len 2048 --random-output-len 1024 \
  --disable-ignore-eos --disable-tqdm \
  --lora-name adapter1 adapter2 adapter3 adapter4 adapter5 adapter6 adapter7 adapter8 \
              adapter9 adapter10 adapter11 adapter12 adapter13 adapter14 adapter15 adapter16
```

This yielded the following results:
Before
After
These show about a 31% decrease in TTFT and a 27% decrease in E2E latency.
Checklist