studio: settle GPU VRAM after killing llama-server before the next reload#5693
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 45a172ef0d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if time.monotonic() >= deadline: | ||
| return None | ||
| try: | ||
| return LlamaCppBackend._get_gpu_free_memory() | ||
| except Exception: |
There was a problem hiding this comment.
Enforce wait deadline during VRAM probe call
_wait_for_vram_settle checks the deadline only before calling _get_gpu_free_memory, but that probe can block for up to 10s (the nvidia-smi timeout in _get_gpu_free_memory). If a reload happens after a kill and the probe is slow/hung, this path can exceed max_wait by several seconds, delaying /load and cancellation handling despite the new helper’s contract that wall-clock wait is bounded by max_wait.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to wait for VRAM stabilization after a process is killed, addressing issues where asynchronous VRAM reclamation by the NVIDIA driver leads to incorrect GPU selection during model reloads. Key changes include the addition of a _wait_for_vram_settle method and the tracking of kill timestamps in llama_cpp.py, along with a comprehensive new test suite. The reviewer suggested adding debug logging to a silent exception handler within the VRAM polling logic to improve observability.
| except Exception: | ||
| return None |
There was a problem hiding this comment.
Avoid using broad, silent exception handlers. While the underlying _get_gpu_free_memory method performs its own logging, it is best practice to log any unexpected exceptions caught here at a debug level to provide visibility into why a probe might have failed during the settle wait, especially if it leads to an early return.
| except Exception: | |
| return None | |
| except Exception as e: | |
| logger.debug(f"VRAM settle probe failed: {e}") | |
| return None |
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
45a172e to
3fb15c0
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a5cbe7247
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if time.monotonic() >= deadline: | ||
| return None | ||
| try: | ||
| return LlamaCppBackend._get_gpu_free_memory() |
There was a problem hiding this comment.
Include probe runtime in the settle wait deadline
_wait_for_vram_settle only checks deadline before invoking _get_gpu_free_memory, so if that probe blocks (e.g., nvidia-smi hangs until its 10s timeout in _get_gpu_free_memory) the function can exceed max_wait by several seconds. This breaks the helper’s bounded-wait contract and can noticeably delay /load completion and cancellation responsiveness when the GPU probe is slow or wedged.
Useful? React with 👍 / 👎.
…load The NVIDIA driver reclaims a dead process's CUDA allocations asynchronously after the kernel reaps the PID -- typically tens to hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that window reads artificially low, which propagates into `_select_gpus` / `_fit_context_to_vram` and flips the layer-split toward `--fit on` with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit (continues the lineage of #5161 / #5401 / #5427). Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of `_get_gpu_free_memory` that returns as soon as two consecutive samples agree per-GPU within `max(256 MiB, 2% of larger sample)`, or `max_wait` (default 2 s) wall-clock elapses with probe time included in the bound. Records `_last_kill_monotonic` inside `_kill_process`'s `finally` block so the wait engages on both in-process `load_model -> _kill_process -> load` and the frontend chat-settings Apply path (`/unload` then `/load`). The call site runs OUTSIDE the broad `self._lock` so concurrent `/unload`, `/cancel`, `/status` are not blocked during the wait. Short-circuits at zero cost on cold start (no kill recorded), stale kill (older than 15 s, driver has already settled), CPU-only host (probe returns empty), and probe exceptions (nvidia-smi gone away). 11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover: cold-start zero cost, stale-kill skip, slow-probe deadline bound, GPU index-set change, per-GPU stability with one draining card, the 2 % adaptive tolerance, _kill_process timestamp recording on real kill vs no-op, and an `inspect.getsource` contract that pins the call site to outside the Phase 3 lock and uses `_last_kill_monotonic` so a future refactor can't silently regress any of these properties.
for more information, see https://pre-commit.ci
2a5cbe7 to
63ae19b
Compare
…load (unslothai#5693) * studio: settle GPU VRAM after killing llama-server before the next reload The NVIDIA driver reclaims a dead process's CUDA allocations asynchronously after the kernel reaps the PID -- typically tens to hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that window reads artificially low, which propagates into `_select_gpus` / `_fit_context_to_vram` and flips the layer-split toward `--fit on` with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427). Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of `_get_gpu_free_memory` that returns as soon as two consecutive samples agree per-GPU within `max(256 MiB, 2% of larger sample)`, or `max_wait` (default 2 s) wall-clock elapses with probe time included in the bound. Records `_last_kill_monotonic` inside `_kill_process`'s `finally` block so the wait engages on both in-process `load_model -> _kill_process -> load` and the frontend chat-settings Apply path (`/unload` then `/load`). The call site runs OUTSIDE the broad `self._lock` so concurrent `/unload`, `/cancel`, `/status` are not blocked during the wait. Short-circuits at zero cost on cold start (no kill recorded), stale kill (older than 15 s, driver has already settled), CPU-only host (probe returns empty), and probe exceptions (nvidia-smi gone away). 11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover: cold-start zero cost, stale-kill skip, slow-probe deadline bound, GPU index-set change, per-GPU stability with one draining card, the 2 % adaptive tolerance, _kill_process timestamp recording on real kill vs no-op, and an `inspect.getsource` contract that pins the call site to outside the Phase 3 lock and uses `_last_kill_monotonic` so a future refactor can't silently regress any of these properties. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Michael Han <michaelhan2050@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Problem
After
_kill_processreaps the prior llama-server PID, the NVIDIA driver reclaims that process's CUDA allocations asynchronously — typically tens to hundreds of milliseconds. Sampling_get_gpu_free_memoryin that window reads artificially low, which then propagates into_select_gpus/_fit_context_to_vramand flips the layer-split toward--fit onwith more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit.This continues the lineage of #5161 / #5401 / #5427 — PR #5427's
_serial_load_lockalready eliminated the double-llama-server concurrent-kill race, but the single-process kill→probe→spawn window is still hot enough to misread free VRAM on the next reload and produce the same Apply-only OOM symptom on cards where the steady-state would have fit.Fix
LlamaCppBackend._wait_for_vram_settle: bounded poll of_get_gpu_free_memorythat returns as soon as two consecutive samples agree per-GPU withinmax(256 MiB, 2% of larger sample), ormax_wait(default 2 s) wall-clock elapses with probe time included in the bound (so a wedgednvidia-smicannot extend the reload by its 10 s subprocess timeout).The wait engages on both in-process
load_model -> _kill_process -> loadand the frontend chat-settings Apply path (/unloadthen/load) —_kill_processrecordsself._last_kill_monotonicin itsfinallyblock, andload_modelreads that timestamp before Phase 3 instead of an in-bandhad_live_processflag. The frontend Apply path is the common one for chat-settings changes and it would otherwise miss the wait entirely.The call site runs outside the broad
self._lockso concurrent/unload,/cancel, and/statusare not blocked during the up-to-2-s wait. Re-checks_cancel_eventafter lock re-acquisition (existing check already in Phase 3 head).Short-circuits at zero cost on:
_last_kill_monotonic == 0.0)[])nvidia-smigone away)Test plan
11 new unit tests in
test_llama_cpp_wait_for_vram_settle.py:max_waitrespected when probe never settlesmax_waitrespected when probe itself is slow (the deadline check runs before each probe, sleep is clipped tomin(interval, remaining))_kill_processtimestamp recording on real kill vs no-opinspect.getsourcecontract pinning_wait_for_vram_settle/since_kill/_last_kill_monotonicsubstrings inload_model, the call ordering (before Phase 3's broad lock), and thathad_live_processcannot regressAll 102 tests in the affected test modules pass:
test_llama_cpp_wait_for_vram_settle.py(new)test_llama_cpp_no_context_shift.pytest_gguf_reload_inheritance.pytest_llama_cpp_context_fit.pytest_native_context_length.pytest_windows_gpu_detection_mock.pytest_llama_cpp_max_context_threshold.pyEnd-to-end smoke against a live
unsloth studioinstance on the local HTTP API: cold/load→ identical re-/load(already_loadedshort-circuit) →/loadwith changedmax_seq_length(Apply path) →/unload→/loadwith changedcache_type_kv(the frontend Apply flow this fix specifically engages on). 7/7 requests returned 200 with correct status fields and no orphan llama-server PIDs.