[Feature] overlap LoRA weight loading with compute by glenliu21 · Pull Request #15512 · sgl-project/sglang

glenliu21 · 2025-12-20T04:05:05Z

Motivation

In #14190, I tried out a method to make LoRA weight loading asynchronous (see #8712). However, one issue is that this method required allocating more GPU memory for LoRA weight storage, which is not ideal. This PR instead makes LoRA weight loading truly free and asynchronous by pipelining the loading of LoRA weights. To illustrate:

In the current implementation, loading in a new batch of LoRA adapters blocks all forward computation:

GPU Compute
R1:  │            idle                                                       │████████ RUN R1..R4 ████████████│
R2:  │            idle                                                       │████████ RUN R1..R4 ████████████│
R3:  │            idle                                                       │████████ RUN R1..R4 ████████████│
R4:  │            idle                                                       │████████ RUN R1..R4 ████████████│

PCIe / LoRA Load
     │[ LOAD R1, R2, R3, R4 (blocking before run) ]                          │
     │─████████████████████──────────────────────────────────────────────────|

With this PR, we pipeline this process so that forward compute can overlap with LoRA weight loading:

GPU Compute
R1:  │      idle       │█████ RUN R1 ████│███ RUN R1+R2 ███│█ RUN R1+R2+R3 ██│RUN R1..R4 █████│
R2:  │      idle       │      idle       │███ RUN R1+R2 ███│█ RUN R1+R2+R3 ██│RUN R1..R4 █████│
R3:  │      idle       │      idle       │      idle       │█ RUN R1+R2+R3 ██│RUN R1..R4 █████│
R4:  │      idle       │      idle       │      idle       │       idle      │RUN R1..R4 █████│

PCIe / LoRA Load
     │ [ LOAD R1 ]     │ [ LOAD R2 ]     │ [ LOAD R3 ]     │ [ LOAD R4 ]     │
     │█████████────────│█████████────────│█████████────────│█████████────────│

Modifications

Add --enable-lora-overlap-loading as a server argument
Pin LoRA weights and introduce a separate CUDA stream (load_stream) to ensure adapter loading happens on a separate stream from forward_stream
Modify existing batch selection logic in Scheduler and add required bookkeeping logic

Accuracy Tests

Implemented an end-to-end test that checks that outputs for workloads with --enable-lora-overlap-loading match workloads without overlap loading enabled.

Benchmarking and Profiling

The hardware I used was a single H200 GPU. I benchmarked with the following scripts:

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --max-loaded-loras 16 \
    --max-loras-per-batch 8 \
    --enable-lora-overlap-loading \ # Disable as needed
    --lora-paths \
        adapter0=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter1=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter2=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter3=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter4=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter5=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter6=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter7=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter8=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter9=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter10=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter11=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter12=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter13=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter14=mkopecki/chess-lora-adapter-llama-3.1-8b \
        adapter15=mkopecki/chess-lora-adapter-llama-3.1-8b

Note that mkopecki/chess-lora-adapter-llama-3.1-8b is a large adapter (>1GB). However, I used it mainly to see how well this implementation can perform in the best case and to establish it as a proof of concept.

python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30000 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 4 \
  --random-input-len 512 \
  --random-output-len 512 \
  --lora-request-distribution distinct \
  --lora-name \
    adapter0 \
    adapter1 \
    adapter2 \
    adapter3 \
    adapter4 \
    adapter5 \
    adapter6 \
    adapter7 \
    adapter8 \
    adapter9 \
    adapter10 \
    adapter11 \
    adapter12 \
    adapter13 \
    adapter14 \
    adapter15

We use distinct for the request distribution to maximize cache misses. Furthermore, we use a relatively small random-input-len and random-output-len so that the bottleneck is our LoRA weight loading, not compute.

Benchmark Comparison: Main vs PR

Metric	`main` Run 1	PR Run 1	% decrease
Mean E2E Latency	5768.42	4529.93	21.5%
Median E2E Latency	5468.30	4137.39	24.3%
Mean TTFT	2220.95	1533.80	30.9%
Median TTFT	750.58	164.11	78.1%
P99 TTFT	9248.78	5724.13	38.1%
Median TPOT	13.93	7.76	44.3%

Overall, we see meaningful decreases in E2E latency and TTFT.

Running benchmarks with the profiler also shows that weight loading is overlapped with compute. This first image shows GPU compute being blocked by weight loading:

And this image shows weights being loading concurrently with GPU computation:

Benchmarking with Smaller Adapters

I also ran benchmarks using smaller adapters to get a more representative benchmark for the standard LoRA use case. In the below script, the adapter sizes range from 10s to 100s of MB, with most being less than 200MB:

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --max-loaded-loras 16 \
    --max-loras-per-batch 8 \
    --enable-lora-overlap-loading \
    --lora-paths \
        adapter0=faridlazuarda/valadapt-llama-3.1-8B-it-chinese \
        adapter1=LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching \
        adapter2=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
        adapter3=pbevan11/llama-3.1-8b-ocr-correction \
        adapter4=reissbaker/llama-3.1-8b-abliterated-lora \
        adapter5=Roblox/Llama-3.1-8B-Instruct-RobloxGuard-1.0 \
        adapter6=nvidia/llama-3.1-nemoguard-8b-topic-control \
        adapter7=Kawon/llama3.1-food-finetune_v14_r8 \
        adapter8=faridlazuarda/valadapt-llama-3.1-8B-it-chinese \
        adapter9=LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching \
        adapter10=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
        adapter11=pbevan11/llama-3.1-8b-ocr-correction \
        adapter12=reissbaker/llama-3.1-8b-abliterated-lora \
        adapter13=Roblox/Llama-3.1-8B-Instruct-RobloxGuard-1.0 \
        adapter14=nvidia/llama-3.1-nemoguard-8b-topic-control \
        adapter15=Kawon/llama3.1-food-finetune_v14_r8

This produces the following results:

Metric	`main` Run 1	PR Run 1	% decrease
Mean E2E Latency	3744.45	3408.13	8.98%
Median E2E Latency	3569.53	3215.31	9.92%
Mean TTFT	1231.55	1013.58	17.7%
Median TTFT	194.21	126.46	34.88%
P99 TTFT	4819.61	3974.23	17.54%
Median TPOT	9.69	9.29	4.13%

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-20T04:05:08Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Fridge003 · 2025-12-21T05:18:07Z

/tag-and-rerun-ci

Fridge003 · 2025-12-21T20:35:56Z

There is accuracy issue https://github.com/sgl-project/sglang/actions/runs/20389111310/job/58633965556?pr=15512

python/sglang/srt/managers/scheduler.py

glenliu21 · 2025-12-23T16:32:53Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant performance optimization by overlapping LoRA weight loading with GPU computation. The approach involves a new LoRAPrefetcher to handle asynchronous loading, which is a solid design for separating concerns. The changes are consistent and well-motivated, with impressive benchmark improvements. I have a couple of suggestions to improve maintainability by reducing code duplication and simplifying a method signature. Overall, this is a great contribution.

python/sglang/srt/managers/schedule_batch.py

python/sglang/srt/managers/scheduler.py

glenliu21 · 2025-12-23T16:40:06Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces an excellent optimization by overlapping LoRA weight loading with GPU computation, which significantly improves performance, especially for scenarios with frequent LoRA adapter switching. The implementation is well-structured, using a dedicated LoRAPrefetcher with a ThreadPoolExecutor and a separate CUDA stream to handle asynchronous loading. The changes to the scheduler and memory pool are correct and necessary for this feature.

My review includes a couple of suggestions for improvement:

Adding a shutdown mechanism for the ThreadPoolExecutor to prevent potential resource leaks in a long-running server.
Refactoring duplicated code in ScheduleBatch to improve maintainability.

Overall, this is a great contribution that brings substantial performance gains.

python/sglang/srt/lora/lora_prefetcher.py

python/sglang/srt/managers/schedule_batch.py

Fridge003 · 2025-12-24T08:41:45Z

The accuracy error still exists
https://github.com/sgl-project/sglang/actions/runs/20468910505/job/58819496065?pr=15512
Maybe there is bug on weight synchronization?

Fridge003 · 2025-12-24T09:04:20Z

Can we add a server arg for controlling this feature?

glenliu21 · 2025-12-24T15:11:19Z

The accuracy error still exists https://github.com/sgl-project/sglang/actions/runs/20468910505/job/58819496065?pr=15512 Maybe there is bug on weight synchronization?

Yes, there's an issue with weight synchronization currently - I'm taking a look now.

Can we add a server arg for controlling this feature?

Yes, I can add an enable-lora-prefetch argument. In the future (probably in a follow-up PR), we should add some statistics tracking so that we only load large adapters asynchronously.

gemini-code-assist

Code Review

This pull request introduces a significant performance improvement by overlapping LoRA weight loading with GPU computation. The implementation is well-structured, introducing a LoRAPrefetcher class that manages asynchronous loading on a separate CUDA stream. The changes to the scheduler to accommodate prefetching states are logical and correct. The addition of end-to-end tests to verify correctness is also a great addition. My feedback includes a couple of minor suggestions to improve code maintainability.

python/sglang/srt/lora/lora.py

python/sglang/srt/server_args.py

test/registered/lora/test_lora_prefetch.py

docs/advanced_features/lora.ipynb

test/registered/lora/test_lora_prefetch.py

python/sglang/srt/lora/lora.py

python/sglang/srt/managers/scheduler.py

Fridge003 · 2026-01-03T12:03:08Z

The benchmark used large adaptors as showcase.
But I'm curious how this feature behaves on normal cases (like llama-8b, with some common small adaptors)
On the normal cases it shouldn't cause performance regression

docs/advanced_features/lora.ipynb

python/sglang/srt/lora/lora_prefetcher.py

python/sglang/srt/server_args.py

…t code

glenliu21 · 2026-01-05T03:31:16Z

The benchmark used large adaptors as showcase. But I'm curious how this feature behaves on normal cases (like llama-8b, with some common small adaptors) On the normal cases it shouldn't cause performance regression

Please see the updated description for a benchmark ran with smaller adapters.

One thing to note is that there are potentially cases in which enabling overlap loading results in lower throughput. This is because distinct LoRA adapters must run their own individual prefill steps, whereas in the baseline case, we wait for all adapters to load then run one single prefill step. So if the time it takes to load an adapter is much less than the time it takes to run a prefill step, then we can see throughput decrease.

Fridge003 · 2026-01-18T17:52:10Z

docs/advanced_features/lora.ipynb

    "\n",
    "* `enable_lora`: Enable LoRA support for the model. This argument is automatically set to True if `--lora-paths` is provided for backward compatibility.\n",
    "\n",
+    "* `enable_lora_overlap_loading`: Enable asynchronous LoRA weight loading in order to overlap H2D transfers with GPU compute. This should be enabled if you find that your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters.\n",


Can we add an example for lora overlap loading in below section (can be updated in a following PR)

Please see #17464.

* fix(ci): recover from corrupted MMMU parquet cache (sgl-project#17256) * [diffusion] feat: support default 4-step inference for Flux2-Klein distilled models (sgl-project#17225) Signed-off-by: Lancer <maruixiang6688@gmail.com> * Add runner utilization report workflow (sgl-project#17234) * cli: support sglang version (sgl-project#17250) * Use swa radix cache and memory pool for gpt-oss model (sgl-project#17261) * [VLM][Reland] Refactor load_mm_data to improve performance (sgl-project#16152) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> * [Tiny] Improve docs (sgl-project#17264) * [diffusion] fix: set guidance_scale default to None (sgl-project#17182) * Tiny fix comment typo (sgl-project#17287) * [SPEC_V2] Enable cudagraph draft_extend for trtllm_mla_backend and Acclen Fix for DP under cudagraph mode (sgl-project#16974) * Add kl test for swa radix cache (sgl-project#17281) * fix: Handle multiple named chat templates in HuggingFace tokenizers (sgl-project#17236) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> * Move radix cache related tests (sgl-project#17295) * [Refactor] Add `-fp4-gemm-backend` to replace `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (sgl-project#16534) Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com> * [Bugfix] Fix PD accuracy when MTP is not configured on the prefill node (sgl-project#17212) Co-authored-by: Shangming Cai <csmthu@gmail.com> * [Diffusion] Apply jit qk_norm to flux1 (sgl-project#17296) * [Refactor] Split out deepseek v2 weight loader function into mixin (sgl-project#16649) * [NPU]Support GPT-OSS for NPU (sgl-project#14197) * [jit-kernel] Add CuTe DSL GDN Decode Kernel (sgl-project#15631) Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> * [GLM 4.7] Add RTX 6000 Pro aka sm120 (sgl-project#17235) Co-authored-by: root <root@ubuntu-nvidia.localdomain> * Update CODEOWNERS for multimodal_gen (sgl-project#17308) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [Feature] overlap LoRA weight loading with compute (sgl-project#15512) * [PD] Optimize MHA models pp util calculation logic (sgl-project#17306) * [Minor] Correct sglang version when installing from source (sgl-project#17315) * Use dsv3 optimized routing `fused_topk_deepseek` instead of `moe_fused_gate` (sgl-project#15347) * [DeepSeek v3.2] Opt MTP decode cuda batch sizes and nsa implementation (sgl-project#16961) * Update code sync scripts (sgl-project#17319) * [Auto Sync] Update tokenizer_manager.py (20260119) (sgl-project#17317) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * support new qwen3_coder_detector (sgl-project#16744) Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com> * Fix kernel selection in biased_grouped_topk_gpu (sgl-project#17325) * KV Cache Events with Attention DP bug fix (sgl-project#16030) (sgl-project#16412) * [Perf] fuse q, k norm for Flux2Attention (sgl-project#17241) Co-authored-by: Minglei Zhu <zminglei@linkedin.com> * [CI] Add partition to stage-b-test-large-1-gpu (11->12) (sgl-project#17245) * fix(ci): rate limit and permission errors in trace publishing (sgl-project#17238) * Revert "[Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)" (sgl-project#17332) * Migrate performance, accuracy, and quantization tests to CI registry (sgl-project#17177) Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> * Inclusion of nvfp4 blockscale in EPLB Rebalance (sgl-project#17158) * [Refactor] Set `fp4-gemm-backend=auto` on SM100 and rename `fp4-gemm-backend` with `flashinfer_` prefix (sgl-project#17309) * [Diffusion] Apply qknorm to flux2 and apply lightx2v rms_norm_one_pass kernel(without residual) (sgl-project#17305) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Fix v32 continue_final_message not work (sgl-project#16567) * Evict swa kv cache during decoding (sgl-project#17220) * [RadixTree][1/N Refactor]: Support unified match_prefix params (sgl-project#17142) Co-authored-by: yizhang2077 <1109276519@qq.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> * [AMD CI] Migrate and Add More Testcases (sgl-project#17116) Co-authored-by: yctseng0211 <yctseng@amd.com> * [AMD] CI - add partitions for stage-b-test-small-1-gpu-amd (sgl-project#17345) * Restore deepseek_v2.py to main's code, except the utils * Ran `pre-commit` --------- Signed-off-by: Lancer <maruixiang6688@gmail.com> Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Hudson Xing <1277646412@qq.com> Co-authored-by: Lancer <402430575@qq.com> Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Ke Bao <ispobaoke@gmail.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu> Co-authored-by: Changyi Yang <112288487+ChangyiYang@users.noreply.github.com> Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com> Co-authored-by: Ch3ngY1 <91232537+Ch3ngY1@users.noreply.github.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: Jerry Ji <jerryjilol@gmail.com> Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> Co-authored-by: Koushik Dutta <koush@koushikdutta.com> Co-authored-by: root <root@ubuntu-nvidia.localdomain> Co-authored-by: Glen Liu <62917497+glenliu21@users.noreply.github.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Lee Nau <lnau@nvidia.com> Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Gaoji Liu <34803073+attack204@users.noreply.github.com> Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com> Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com> Co-authored-by: Kartik Ramesh <kartikx2000@gmail.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com> Co-authored-by: Minglei Zhu <zminglei@linkedin.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Shu Wang <shuw@nvidia.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: zhangheng <hzh0425@apache.org> Co-authored-by: yizhang2077 <1109276519@qq.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com> Co-authored-by: yctseng0211 <yctseng@amd.com>

glenliu21 requested review from Fridge003, Ying1123, hnyls2002, ispobock, lifuhuang, merrymercy and xiezhq-hermann as code owners December 20, 2025 04:05

overlap lora weight loading with compute

2c9353d

glenliu21 force-pushed the lora_pipeline branch from 5925391 to 2c9353d Compare December 20, 2025 04:24

github-actions bot added the run-ci label Dec 21, 2025

ConnorLi96 suggested changes Dec 21, 2025

View reviewed changes

python/sglang/srt/managers/scheduler.py Outdated Show resolved Hide resolved

move logic to LoRAPrefetcher

1aebbc7

glenliu21 mentioned this pull request Dec 23, 2025

[Feature] implement async LoRA prefetch #14190

Closed

6 tasks

gemini-code-assist bot reviewed Dec 23, 2025

View reviewed changes

python/sglang/srt/managers/schedule_batch.py Outdated Show resolved Hide resolved

python/sglang/srt/managers/scheduler.py Outdated Show resolved Hide resolved

add lora_prefetcher.py

bff918e

github-actions bot added the lora label Dec 23, 2025

gemini-code-assist bot reviewed Dec 23, 2025

View reviewed changes

python/sglang/srt/lora/lora_prefetcher.py Outdated Show resolved Hide resolved

python/sglang/srt/managers/schedule_batch.py Outdated Show resolved Hide resolved

merge main

bc9cf43

glenliu21 and others added 4 commits December 28, 2025 11:33

fix weight sync issue

b442c8b

fix

8a674b5

Merge branch 'main' into lora_pipeline

ee707d6

Merge branch 'main' into lora_pipeline

071465f

gemini-code-assist bot reviewed Dec 31, 2025

View reviewed changes

python/sglang/srt/lora/lora.py Show resolved Hide resolved

python/sglang/srt/server_args.py Outdated Show resolved Hide resolved

glenliu21 and others added 4 commits January 1, 2026 21:38

improve test

044d789

Merge branch 'main' into lora_pipeline

fb31cf7

add tp test

db1a67e

Merge branch 'main' into lora_pipeline

daa8403

glenliu21 mentioned this pull request Jan 3, 2026

enhance LoRA tests and fix base model LoRA eviction in Scheduler #16333

Merged

2 tasks

Fridge003 reviewed Jan 3, 2026

View reviewed changes

ConnorLi96 approved these changes Jan 4, 2026

View reviewed changes

docs/advanced_features/lora.ipynb Outdated Show resolved Hide resolved

python/sglang/srt/lora/lora_prefetcher.py Outdated Show resolved Hide resolved

python/sglang/srt/server_args.py Outdated Show resolved Hide resolved

rename lora_prefetcher to lora_overlap_loader; reorganize and reforma…

ea43f6d

…t code

glenliu21 and others added 12 commits January 4, 2026 22:31

Merge branch 'main' into lora_pipeline

30930f5

Merge branch 'main' into lora_pipeline

2892f63

Merge branch 'main' into lora_pipeline

1500be6

Merge branch 'main' into lora_pipeline

3293996

max_loaded_loras fix

f87bdf6

Merge branch 'main' into lora_pipeline

354ff9c

Merge branch 'main' into lora_pipeline

77609c0

fix server arg description

8b69562

max_loaded_loras arg

0aa9226

Merge branch 'main' into lora_pipeline

0409f43

Merge branch 'main' into lora_pipeline

ec6cecf

Merge branch 'main' into lora_pipeline

ae79420

Fridge003 mentioned this pull request Jan 14, 2026

Development Roadmap (2026 Q1) #12780

Open

Fridge003 reviewed Jan 18, 2026

View reviewed changes

Fridge003 approved these changes Jan 18, 2026

View reviewed changes

Fridge003 merged commit ad1b4e4 into sgl-project:main Jan 19, 2026
198 of 211 checks passed

glenliu21 deleted the lora_pipeline branch January 21, 2026 02:27

glenliu21 mentioned this pull request Jan 21, 2026

add documentation example for LoRA overlap loading and cleanup unused function #17464

Merged

3 tasks

Conversation

glenliu21 commented Dec 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Benchmark Comparison: Main vs PR

Benchmarking with Smaller Adapters

Checklist

Uh oh!

gemini-code-assist bot commented Dec 20, 2025

Uh oh!

Fridge003 commented Dec 21, 2025

Uh oh!

Fridge003 commented Dec 21, 2025

Uh oh!

Uh oh!

glenliu21 commented Dec 23, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

glenliu21 commented Dec 23, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Fridge003 commented Dec 24, 2025

Uh oh!

Fridge003 commented Dec 24, 2025

Uh oh!

glenliu21 commented Dec 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Fridge003 commented Jan 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

glenliu21 commented Jan 5, 2026

Uh oh!

Fridge003 Jan 18, 2026

Choose a reason for hiding this comment

Uh oh!

glenliu21 Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

glenliu21 commented Dec 20, 2025 •

edited

Loading