server: router fix model unload reload deadlock by 0cc4m · Pull Request #22284 · ggml-org/llama.cpp

0cc4m · 2026-04-23T11:42:30Z

Overview

When the server runs in router mode and receives simultaneous requests for two or more models that cannot run at the same time (e.g. because you only allow one model at a time, or because they don't fit into memory together), it gets stuck in a load/unload loop.

Assuming we have model A and model B:

A is loaded, A and B get requests
A is unloaded to make space for B
B is unloaded to make space for A
etc

This happens because the router doesn't track whether a model is in use. It aggressively terminates models to load a new one, then terminates the new one before it has handled even one request. This branch is one way to fix that. It's built on top of #21231 because I ran into this problem after solving the memory unloading case. Draft until that is merged.

Only a5355a0 is specific to this PR. I added a counter that tracks open requests per model and prevents unloading until the count reaches 0 or a timeout (DEFAULT_STOP_TIMEOUT in this case) runs out. I plan to look at more complex solutions later that load/unload more intelligently.
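For readers who want a picture of the idea, here is a minimal sketch of a per-model request counter with a timeout-bounded wait before unload. It is not the code from a5355a0: the names `model_refcount`, `request_guard`, and `wait_until_idle` are hypothetical, and the timeout is passed in by the caller because the value and units of DEFAULT_STOP_TIMEOUT are not stated here.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical per-model bookkeeping (illustration only, not the PR's types).
struct model_refcount {
    std::mutex              mtx;
    std::condition_variable cv;
    int                     active_requests = 0;

    // Called when a request is routed to this model.
    void acquire() {
        std::lock_guard<std::mutex> lock(mtx);
        ++active_requests;
    }

    // Called when the request has finished (successfully or not).
    void release() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            assert(active_requests > 0);
            --active_requests;
        }
        cv.notify_all(); // wake a router thread waiting to unload
    }

    // Block until no requests are open or the timeout expires.
    // Returns true if the model became idle, false if the wait timed out.
    bool wait_until_idle(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mtx);
        return cv.wait_for(lock, timeout, [this] { return active_requests == 0; });
    }
};

// RAII helper so the counter is decremented on every exit path.
struct request_guard {
    model_refcount & rc;
    explicit request_guard(model_refcount & r) : rc(r) { rc.acquire(); }
    ~request_guard() { rc.release(); }
    request_guard(const request_guard &) = delete;
    request_guard & operator=(const request_guard &) = delete;
};
```

In this sketch the router would call `wait_until_idle` before terminating a model and unload afterwards in either case, so a stuck request can delay an unload by at most the timeout but cannot block it forever.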

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, Claude was used for assistance and to write tests.

0cc4m and others added 15 commits April 21, 2026 14:33

server: add --models-memory-max parameter to allow dynamically unload…

8e8e200

…ing models when they exceed a memory size threshold

estimate with to-be-loaded model size included

777395f

use no_alloc to get memory requirements for model load

2603b4c

only set model memory_mb if not previously calculated

9b5af58

use memory margin instead of total size limit, apply to each device s…

56122b3

…eparately

add server memory debug logging

51538c1

move llama_context_device_memory function to llama-ext.h

ba2521c

fix model count exceeded check

7500063

improve memory_per_device map naming

173da43

improve variable naming, fix style

69e3086

also strip models memory margin from child processes

eb2cf73

cont : clean-up

1a8aec0

handle models that need to be downloaded before estimation

b1623a6

load directly from downloaded state

cf0ebc4

server: keep router model refcount to avoid unloading models that hav…

a5355a0

…e running requests this avoids a deadlock when models A and B don't fit together, but both have requests, so the server gets into a loop unloading A, loading B, unloading B, loading A again, and so on

github-actions Bot added examples python python script changes server labels Apr 23, 2026

0cc4m wants to merge 15 commits into master from 0cc4m/server-router-fix-reload-deadlock
