
UPSTREAM PR #19280: fix: only reset LoRa configs when they have changed from previous batch (#1142)

Open
loci-dev wants to merge 1 commit into main from loci/pr-19280-bug-fix-memory-leak

Conversation


@loci-dev loci-dev commented Feb 3, 2026

Note

Source pull request: ggml-org/llama.cpp#19280

Overview

Fix for ggml-org/llama.cpp#19217

Currently we set the LoRA config for every token request, even when the configuration has not changed between batches. In this PR, at llama-context.cpp line 1078, scheduler reserving was added for when new LoRA configs are set; as a result, whenever a LoRA config is present, we now reserve the scheduler for every batch we decode.

This change adds a field to the server slot that holds the previous batch's LoRA config. For each batch, we check whether the incoming config matches the previous batch's, and only set it if it differs; otherwise we proceed without resetting it.

Testing

I was able to reproduce the issue relatively easily with some debug logs:

./llama-server -hf bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q8_0 --lora ~/Downloads/LoRA-Llama-3.1-8B-MultiReflection-f16.gguf -c 1024 --parallel 1 --host 0.0.0.0 --port 8080 --verbose

and was able to see this endlessly along with huge memory spikes:

Setting sched_need_reserve to TRUE in the set_adapter_lora function
set_embeddings: value = 0
Reserving memory during decoding

sched_reserve: reserving ...
sched_reserve: max_nodes = 3692
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" requests"}}],"created":1770077607,"id":"chatcmpl-1gqsKPcqCDHRRMZFNqfWqnl557ARXVc9","model":"bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q8_0","system_fingerprint":"b7916-0dfcd3b60","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":109,"prompt_ms":628.118,"prompt_per_token_ms":5.762550458715597,"prompt_per_second":173.5342722227352,"predicted_n":2,"predicted_ms":180.088,"predicted_per_token_ms":90.044,"predicted_per_second":11.105681666740704}}


sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:       MTL0 compute buffer size =   509.12 MiB
sched_reserve:        CPU compute buffer size =    18.01 MiB
sched_reserve: graph nodes  = 1903
sched_reserve: graph splits = 2

After these changes, the issue no longer appears and memory remains stable.


loci-review bot commented Feb 3, 2026

No meaningful performance changes were detected across 115472 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libllama.so, build.bin.libmtmd.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 4a5a4c2 to 45aacad Compare February 24, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 13648e6 to 1d064d0 Compare March 3, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 551dfb5 to 55a969e Compare March 11, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 945fa3a to 0e8e1d6 Compare March 20, 2026 02:17
