server: router fix model unload reload deadlock #22284
Draft
0cc4m wants to merge 15 commits into
…ing models when they exceed a memory size threshold
…e running requests this avoids a deadlock when models A and B don't fit together, but both have requests, so the server gets into a loop unloading A, loading B, unloading B, loading A again, and so on
Overview
When you run a server in router mode and it gets simultaneous requests for two or more models that cannot run at the same time (e.g. because you only allow one model at a time, or because they don't fit into memory together), the server gets into a loop.
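Assuming we have model A and model B, and both have pending requests: the server unloads A, loads B, unloads B, loads A again, and so on.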
This is because the router doesn't track whether a model is in use. It aggressively terminates models to load a new one, then terminates the new one before it has handled even a single request. This branch is one way to fix that. It's built on top of #21231 because I ran into this problem after solving the memory unloading case. Draft until that is merged.
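To illustrate the kind of in-use tracking that avoids this loop, here is a minimal sketch, not the code from this branch. The class name `model_usage_guard` and all of its methods are hypothetical. The assumption is that the router calls `acquire()` when it forwards a request to a model, calls `release()` when the request finishes, and waits on `wait_until_idle()` with some stop timeout before terminating the model.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical per-model guard (illustrative names, not the PR's actual code).
// It counts open requests and lets an unloader wait until the model is idle.
class model_usage_guard {
public:
    // Call when a request is forwarded to the model.
    void acquire() {
        std::lock_guard<std::mutex> lock(mtx);
        ++open_requests;
    }

    // Call when the request has finished, whether it succeeded or failed.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            assert(open_requests > 0);
            --open_requests;
        }
        cv.notify_all();
    }

    // Block until no requests are open or the timeout expires.
    // Returns true if the model went idle, false if the timeout ran out first.
    bool wait_until_idle(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mtx);
        return cv.wait_for(lock, timeout, [this] { return open_requests == 0; });
    }

    // Current number of open requests, mainly for logging and tests.
    int in_flight() {
        std::lock_guard<std::mutex> lock(mtx);
        return open_requests;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    int open_requests = 0;
};
```

In this sketch the caller decides what a `false` return means. Unloading anyway after the timeout keeps a stuck request from blocking every other model forever, at the cost of cutting that request off.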
Only a5355a0 is specific to this PR. Basically I added a counter that tracks open requests against a model and prevents unload until it reaches 0, or until a timeout runs out (DEFAULT_STOP_TIMEOUT in this case). I plan to look at more complex solutions later that load/unload more intelligently.
Requirements