
[Feature] overlap LoRA weight loading with compute #15512

Merged
Fridge003 merged 31 commits into sgl-project:main from glenliu21:lora_pipeline
Jan 19, 2026

Conversation

@glenliu21 (Contributor) commented Dec 20, 2025

Motivation

In #14190, I tried out a method to make LoRA weight loading asynchronous (see #8712). However, that method required allocating extra GPU memory for LoRA weight storage, which is not ideal. This PR instead makes LoRA weight loading asynchronous at no extra memory cost by pipelining the loading of LoRA weights. To illustrate:

In the current implementation, loading in a new batch of LoRA adapters blocks all forward computation:

GPU Compute
R1:  │            idle                                                       │████████ RUN R1..R4 ████████████│
R2:  │            idle                                                       │████████ RUN R1..R4 ████████████│
R3:  │            idle                                                       │████████ RUN R1..R4 ████████████│
R4:  │            idle                                                       │████████ RUN R1..R4 ████████████│

PCIe / LoRA Load
     │[ LOAD R1, R2, R3, R4 (blocking before run) ]                          │
     │─████████████████████──────────────────────────────────────────────────|

With this PR, we pipeline this process so that forward compute can overlap with LoRA weight loading:

GPU Compute
R1:  │      idle       │█████ RUN R1 ████│███ RUN R1+R2 ███│█ RUN R1+R2+R3 ██│RUN R1..R4 █████│
R2:  │      idle       │      idle       │███ RUN R1+R2 ███│█ RUN R1+R2+R3 ██│RUN R1..R4 █████│
R3:  │      idle       │      idle       │      idle       │█ RUN R1+R2+R3 ██│RUN R1..R4 █████│
R4:  │      idle       │      idle       │      idle       │       idle      │RUN R1..R4 █████│

PCIe / LoRA Load
     │ [ LOAD R1 ]     │ [ LOAD R2 ]     │ [ LOAD R3 ]     │ [ LOAD R4 ]     │
     │█████████────────│█████████────────│█████████────────│█████████────────│
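The latency effect of this schedule can be sketched with a toy timing model (illustrative numbers only, not SGLang code): under blocking loads, every request's prefill waits for all N adapter loads, while under pipelining, request i can start as soon as its own adapter is resident.

```python
# Toy model of request start times (ms) under the two schedules above.
# Numbers are illustrative; real stages also include each RUN's duration.

def blocking_start_times(load_ms, n):
    # Baseline: all n requests wait for every adapter load to finish.
    return [n * load_ms] * n

def pipelined_start_times(load_ms, n):
    # Pipelined: request i starts right after its own adapter loads.
    return [(i + 1) * load_ms for i in range(n)]

n, load_ms = 4, 100
print(blocking_start_times(load_ms, n))   # [400, 400, 400, 400]
print(pipelined_start_times(load_ms, n))  # [100, 200, 300, 400]
```

In this example the mean time-to-start drops from 400 ms to 250 ms, which mirrors the TTFT improvements seen in the benchmarks.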

Modifications

  • Add --enable-lora-overlap-loading as a server argument
  • Pin LoRA weights and introduce a separate CUDA stream (load_stream) to ensure adapter loading happens on a separate stream from forward_stream
  • Modify existing batch selection logic in Scheduler and add required bookkeeping logic
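The prefetching pattern can be sketched roughly as follows. This is a minimal, hypothetical sketch: the class and method names are illustrative, not SGLang's actual API. In the real implementation the copy runs on the dedicated load_stream from pinned host buffers, and the forward stream synchronizes with the load before a batch that uses the adapter runs; here a worker thread stands in for the load stream.

```python
# Hypothetical sketch of a prefetcher: a worker loads adapter weights
# while the main thread keeps scheduling/computing.
from concurrent.futures import ThreadPoolExecutor

class LoRAPrefetcher:
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}  # adapter name -> Future

    def prefetch(self, name, load_fn):
        # Kick off an async load; in SGLang this would enqueue an H2D
        # copy from pinned host memory onto the separate load_stream.
        if name not in self._pending:
            self._pending[name] = self._pool.submit(load_fn, name)

    def ready(self, name):
        fut = self._pending.get(name)
        return fut is not None and fut.done()

    def wait(self, name):
        # Synchronization point before running a batch that needs this
        # adapter (analogous to forward_stream waiting on load_stream).
        return self._pending[name].result()

prefetcher = LoRAPrefetcher()
prefetcher.prefetch("adapter0", lambda name: f"{name} weights")
print(prefetcher.wait("adapter0"))  # adapter0 weights
```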

Accuracy Tests

Implemented an end-to-end test that verifies that outputs for workloads run with --enable-lora-overlap-loading match those of the same workloads without overlap loading enabled.
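The shape of that check can be sketched like this (a hedged sketch, not the actual test file; `generate` stands in for a request against a launched server with and without the flag):

```python
# Sketch of the equivalence check: run the same prompts with and without
# overlap loading and require identical outputs.
def outputs_match(generate, prompts, lora_names):
    baseline = [generate(p, lora, overlap=False) for p, lora in zip(prompts, lora_names)]
    overlapped = [generate(p, lora, overlap=True) for p, lora in zip(prompts, lora_names)]
    return baseline == overlapped

# With a deterministic stand-in backend, the check passes:
fake = lambda prompt, lora, overlap: f"{lora}:{prompt}"
print(outputs_match(fake, ["hi", "yo"], ["adapter0", "adapter1"]))  # True
```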

Benchmarking and Profiling

The hardware I used was a single H200 GPU. I benchmarked with the following scripts:

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --max-loaded-loras 16 \
    --max-loras-per-batch 8 \
    --enable-lora-overlap-loading \
    --lora-paths \
        adapter0=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter1=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter2=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter3=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter4=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter5=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter6=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter7=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter8=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter9=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter10=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter11=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter12=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter13=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter14=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter15=mkopecki/chess-lora-adapter-llama-3.1-8b

Note that mkopecki/chess-lora-adapter-llama-3.1-8b is a large adapter (>1GB). I used it mainly to see how well this implementation performs in the best case and to establish a proof of concept. For the baseline (main) runs, the --enable-lora-overlap-loading flag was omitted.

python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30000 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 4 \
  --random-input-len 512 \
  --random-output-len 512 \
  --lora-request-distribution distinct \
  --lora-name \
    adapter0 \
    adapter1 \
    adapter2 \
    adapter3 \
    adapter4 \
    adapter5 \
    adapter6 \
    adapter7 \
    adapter8 \
    adapter9 \
    adapter10 \
    adapter11 \
    adapter12 \
    adapter13 \
    adapter14 \
    adapter15

We use the distinct request distribution to maximize adapter cache misses. We also use relatively small values for random-input-len and random-output-len so that the bottleneck is LoRA weight loading rather than compute.

Benchmark Comparison: Main vs PR

Metric                    main Run 1   PR Run 1   % decrease
Mean E2E Latency (ms)     5768.42      4529.93    21.5%
Median E2E Latency (ms)   5468.30      4137.39    24.3%
Mean TTFT (ms)            2220.95      1533.80    30.9%
Median TTFT (ms)          750.58       164.11     78.1%
P99 TTFT (ms)             9248.78      5724.13    38.1%
Median TPOT (ms)          13.93        7.76       44.3%

Overall, we see meaningful decreases in E2E latency and TTFT.

Running benchmarks with the profiler also confirms that weight loading is overlapped with compute. The first trace shows GPU compute being blocked by weight loading:

[profiler trace: compute blocked during weight loading]

And the second shows weights being loaded concurrently with GPU computation:

[profiler trace: weight loading overlapped with compute]

Benchmarking with Smaller Adapters

I also ran benchmarks using smaller adapters to get a more representative picture of the standard LoRA use case. In the script below, adapter sizes range from tens to hundreds of MB, with most under 200MB:

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --max-loaded-loras 16 \
    --max-loras-per-batch 8 \
    --enable-lora-overlap-loading \
    --lora-paths \
        adapter0=faridlazuarda/valadapt-llama-3.1-8B-it-chinese \
        adapter1=LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching \
        adapter2=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
        adapter3=pbevan11/llama-3.1-8b-ocr-correction \
        adapter4=reissbaker/llama-3.1-8b-abliterated-lora \
        adapter5=Roblox/Llama-3.1-8B-Instruct-RobloxGuard-1.0 \
        adapter6=nvidia/llama-3.1-nemoguard-8b-topic-control \
        adapter7=Kawon/llama3.1-food-finetune_v14_r8 \
        adapter8=faridlazuarda/valadapt-llama-3.1-8B-it-chinese \
        adapter9=LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching \
        adapter10=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
        adapter11=pbevan11/llama-3.1-8b-ocr-correction \
        adapter12=reissbaker/llama-3.1-8b-abliterated-lora \
        adapter13=Roblox/Llama-3.1-8B-Instruct-RobloxGuard-1.0 \
        adapter14=nvidia/llama-3.1-nemoguard-8b-topic-control \
        adapter15=Kawon/llama3.1-food-finetune_v14_r8

This produces the following results:

Metric                    main Run 1   PR Run 1   % decrease
Mean E2E Latency (ms)     3744.45      3408.13    8.98%
Median E2E Latency (ms)   3569.53      3215.31    9.92%
Mean TTFT (ms)            1231.55      1013.58    17.7%
Median TTFT (ms)          194.21       126.46     34.88%
P99 TTFT (ms)             4819.61      3974.23    17.54%
Median TPOT (ms)          9.69         9.29       4.13%

Checklist


@Fridge003 (Collaborator):

/tag-and-rerun-ci

@glenliu21 (Contributor, Author):

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces a significant performance optimization by overlapping LoRA weight loading with GPU computation. The approach involves a new LoRAPrefetcher to handle asynchronous loading, which is a solid design for separating concerns. The changes are consistent and well-motivated, with impressive benchmark improvements. I have a couple of suggestions to improve maintainability by reducing code duplication and simplifying a method signature. Overall, this is a great contribution.

@github-actions github-actions bot added the lora label Dec 23, 2025
@glenliu21 (Contributor, Author):

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces an excellent optimization by overlapping LoRA weight loading with GPU computation, which significantly improves performance, especially for scenarios with frequent LoRA adapter switching. The implementation is well-structured, using a dedicated LoRAPrefetcher with a ThreadPoolExecutor and a separate CUDA stream to handle asynchronous loading. The changes to the scheduler and memory pool are correct and necessary for this feature.

My review includes a couple of suggestions for improvement:

  • Adding a shutdown mechanism for the ThreadPoolExecutor to prevent potential resource leaks in a long-running server.
  • Refactoring duplicated code in ScheduleBatch to improve maintainability.

Overall, this is a great contribution that brings substantial performance gains.

@Fridge003 (Collaborator):

The accuracy error still exists:
https://github.com/sgl-project/sglang/actions/runs/20468910505/job/58819496065?pr=15512
Maybe there is a bug in weight synchronization?

@Fridge003 (Collaborator):

Can we add a server arg for controlling this feature?

@glenliu21 (Contributor, Author):

> The accuracy error still exists https://github.com/sgl-project/sglang/actions/runs/20468910505/job/58819496065?pr=15512 Maybe there is a bug in weight synchronization?

Yes, there's an issue with weight synchronization currently - I'm taking a look now.

> Can we add a server arg for controlling this feature?

Yes, I can add an enable-lora-prefetch argument. In the future (probably in a follow-up PR), we should add some statistics tracking so that we only load large adapters asynchronously.

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces a significant performance improvement by overlapping LoRA weight loading with GPU computation. The implementation is well-structured, introducing a LoRAPrefetcher class that manages asynchronous loading on a separate CUDA stream. The changes to the scheduler to accommodate prefetching states are logical and correct. The addition of end-to-end tests to verify correctness is also a great addition. My feedback includes a couple of minor suggestions to improve code maintainability.

@Fridge003 (Collaborator):

The benchmark used large adapters as a showcase. But I'm curious how this feature behaves in normal cases (like Llama-8B with common small adapters). In normal cases it shouldn't cause a performance regression.

@glenliu21 (Contributor, Author):

> The benchmark used large adapters as a showcase. But I'm curious how this feature behaves in normal cases (like Llama-8B with common small adapters). In normal cases it shouldn't cause a performance regression.

Please see the updated description for a benchmark run with smaller adapters.

One thing to note is that there are potentially cases in which enabling overlap loading results in lower throughput. This is because distinct LoRA adapters must run their own individual prefill steps, whereas in the baseline case, we wait for all adapters to load then run one single prefill step. So if the time it takes to load an adapter is much less than the time it takes to run a prefill step, then we can see throughput decrease.
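That trade-off can be sketched with a toy cost model (hypothetical numbers; actual behavior depends on adapter size, hardware, and batch composition, and the model simplifies the combined prefill to one step):

```python
# Toy cost model (ms) for n distinct adapters. Baseline: load all adapters
# back to back, then run one combined prefill. Overlap: prefill i starts as
# soon as adapter i has loaded AND the previous prefill has finished, so
# the n separate prefills serialize on the GPU.

def baseline_total(load_ms, prefill_ms, n):
    return n * load_ms + prefill_ms

def overlap_total(load_ms, prefill_ms, n):
    gpu_free = 0
    for i in range(n):
        load_done = (i + 1) * load_ms     # loads are serial on PCIe
        gpu_free = max(gpu_free, load_done) + prefill_ms
    return gpu_free

# Large adapters: same makespan, but requests start (and finish) earlier.
print(baseline_total(100, 10, 4), overlap_total(100, 10, 4))  # 410 410
# Small adapters: the serialized prefills make the overall run longer.
print(baseline_total(5, 50, 4), overlap_total(5, 50, 4))      # 70 205
```

When loads dominate, overlap loading does not hurt total time and improves latency; when prefills dominate, the extra serialized prefills can cost throughput, matching the observation above.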

"\n",
"* `enable_lora`: Enable LoRA support for the model. This argument is automatically set to True if `--lora-paths` is provided for backward compatibility.\n",
"\n",
"* `enable_lora_overlap_loading`: Enable asynchronous LoRA weight loading in order to overlap H2D transfers with GPU compute. This should be enabled if you find that your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters.\n",
(Collaborator):

Can we add an example for LoRA overlap loading in the section below? (Can be updated in a following PR.)

(Contributor, Author):

Please see #17464.

@Fridge003 Fridge003 merged commit ad1b4e4 into sgl-project:main Jan 19, 2026
198 of 211 checks passed
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request Jan 19, 2026
@glenliu21 glenliu21 deleted the lora_pipeline branch January 21, 2026 02:27

Labels

documentation, lora, run-ci
