studio: settle GPU VRAM after killing llama-server before the next reload by danielhanchen · Pull Request #5693 · unslothai/unsloth

danielhanchen · 2026-05-22T11:19:51Z

Problem

After _kill_process reaps the prior llama-server PID, the NVIDIA driver reclaims that process's CUDA allocations asynchronously — typically tens to hundreds of milliseconds. Sampling _get_gpu_free_memory in that window reads artificially low, which then propagates into _select_gpus / _fit_context_to_vram and flips the layer-split toward --fit on with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit.

This continues the lineage of #5161 / #5401 / #5427 — PR #5427's _serial_load_lock already eliminated the double-llama-server concurrent-kill race, but the single-process kill→probe→spawn window is still hot enough to misread free VRAM on the next reload and produce the same Apply-only OOM symptom on cards where the steady-state would have fit.

Fix

LlamaCppBackend._wait_for_vram_settle: bounded poll of _get_gpu_free_memory that returns as soon as two consecutive samples agree per-GPU within max(256 MiB, 2% of larger sample), or max_wait (default 2 s) wall-clock elapses with probe time included in the bound (so a wedged nvidia-smi cannot extend the reload by its 10 s subprocess timeout).

The wait engages on both in-process load_model -> _kill_process -> load and the frontend chat-settings Apply path (/unload then /load) — _kill_process records self._last_kill_monotonic in its finally block, and load_model reads that timestamp before Phase 3 instead of an in-band had_live_process flag. The frontend Apply path is the common one for chat-settings changes and it would otherwise miss the wait entirely.

The call site runs outside the broad self._lock so concurrent /unload, /cancel, and /status are not blocked during the up-to-2-s wait. Re-checks _cancel_event after lock re-acquisition (existing check already in Phase 3 head).

Short-circuits at zero cost on:

cold start (no kill recorded → _last_kill_monotonic == 0.0)
stale kill (older than 15 s → driver has already settled)
CPU-only host (first probe returns [])
probe exceptions (nvidia-smi gone away)

Test plan

11 new unit tests in test_llama_cpp_wait_for_vram_settle.py:
- Cold-start zero cost
- Stale-kill skip beyond the settle window
- Empty first sample short-circuit
- First probe raises → no polling
- Three-sample reclaim ramp settles
- Mid-loop probe failure → bail
- max_wait respected when probe never settles
- max_wait respected when probe itself is slow (the deadline check runs before each probe, sleep is clipped to min(interval, remaining))
- GPU index-set change between samples → return
- Per-GPU stability when one card is still draining and another is stable
- 2 % adaptive tolerance for large-VRAM cards
- _kill_process timestamp recording on real kill vs no-op
- inspect.getsource contract pinning _wait_for_vram_settle / since_kill / _last_kill_monotonic substrings in load_model, the call ordering (before Phase 3's broad lock), and that had_live_process cannot regress
All 102 tests in the affected test modules pass:
- test_llama_cpp_wait_for_vram_settle.py (new)
- test_llama_cpp_no_context_shift.py
- test_gguf_reload_inheritance.py
- test_llama_cpp_context_fit.py
- test_native_context_length.py
- test_windows_gpu_detection_mock.py
- test_llama_cpp_max_context_threshold.py
End-to-end smoke against a live unsloth studio instance on the local HTTP API: cold /load → identical re-/load (already_loaded short-circuit) → /load with changed max_seq_length (Apply path) → /unload → /load with changed cache_type_kv (the frontend Apply flow this fix specifically engages on). 7/7 requests returned 200 with correct status fields and no orphan llama-server PIDs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45a172ef0d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T11:21:37Z

+            if time.monotonic() >= deadline:
+                return None
+            try:
+                return LlamaCppBackend._get_gpu_free_memory()
+            except Exception:


Enforce wait deadline during VRAM probe call

_wait_for_vram_settle checks the deadline only before calling _get_gpu_free_memory, but that probe can block for up to 10s (the nvidia-smi timeout in _get_gpu_free_memory). If a reload happens after a kill and the probe is slow/hung, this path can exceed max_wait by several seconds, delaying /load and cancellation handling despite the new helper’s contract that wall-clock wait is bounded by max_wait.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request introduces a mechanism to wait for VRAM stabilization after a process is killed, addressing issues where asynchronous VRAM reclamation by the NVIDIA driver leads to incorrect GPU selection during model reloads. Key changes include the addition of a _wait_for_vram_settle method and the tracking of kill timestamps in llama_cpp.py, along with a comprehensive new test suite. The reviewer suggested adding debug logging to a silent exception handler within the VRAM polling logic to improve observability.

gemini-code-assist · 2026-05-22T11:22:43Z

+            except Exception:
+                return None


Avoid using broad, silent exception handlers. While the underlying _get_gpu_free_memory method performs its own logging, it is best practice to log any unexpected exceptions caught here at a debug level to provide visibility into why a probe might have failed during the settle wait, especially if it leads to an early return.

Suggested change

except Exception:

return None

except Exception as e:

logger.debug(f"VRAM settle probe failed: {e}")

return None

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a5cbe7247

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T11:53:46Z

+            if time.monotonic() >= deadline:
+                return None
+            try:
+                return LlamaCppBackend._get_gpu_free_memory()


Include probe runtime in the settle wait deadline

_wait_for_vram_settle only checks deadline before invoking _get_gpu_free_memory, so if that probe blocks (e.g., nvidia-smi hangs until its 10s timeout in _get_gpu_free_memory) the function can exceed max_wait by several seconds. This breaks the helper’s bounded-wait contract and can noticeably delay /load completion and cancellation responsiveness when the GPU probe is slow or wedged.

Useful? React with 👍 / 👎.

…load The NVIDIA driver reclaims a dead process's CUDA allocations asynchronously after the kernel reaps the PID -- typically tens to hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that window reads artificially low, which propagates into `_select_gpus` / `_fit_context_to_vram` and flips the layer-split toward `--fit on` with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit (continues the lineage of #5161 / #5401 / #5427). Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of `_get_gpu_free_memory` that returns as soon as two consecutive samples agree per-GPU within `max(256 MiB, 2% of larger sample)`, or `max_wait` (default 2 s) wall-clock elapses with probe time included in the bound. Records `_last_kill_monotonic` inside `_kill_process`'s `finally` block so the wait engages on both in-process `load_model -> _kill_process -> load` and the frontend chat-settings Apply path (`/unload` then `/load`). The call site runs OUTSIDE the broad `self._lock` so concurrent `/unload`, `/cancel`, `/status` are not blocked during the wait. Short-circuits at zero cost on cold start (no kill recorded), stale kill (older than 15 s, driver has already settled), CPU-only host (probe returns empty), and probe exceptions (nvidia-smi gone away). 11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover: cold-start zero cost, stale-kill skip, slow-probe deadline bound, GPU index-set change, per-GPU stability with one draining card, the 2 % adaptive tolerance, _kill_process timestamp recording on real kill vs no-op, and an `inspect.getsource` contract that pins the call site to outside the Phase 3 lock and uses `_last_kill_monotonic` so a future refactor can't silently regress any of these properties.

for more information, see https://pre-commit.ci

…load (unslothai#5693) * studio: settle GPU VRAM after killing llama-server before the next reload The NVIDIA driver reclaims a dead process's CUDA allocations asynchronously after the kernel reaps the PID -- typically tens to hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that window reads artificially low, which propagates into `_select_gpus` / `_fit_context_to_vram` and flips the layer-split toward `--fit on` with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427). Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of `_get_gpu_free_memory` that returns as soon as two consecutive samples agree per-GPU within `max(256 MiB, 2% of larger sample)`, or `max_wait` (default 2 s) wall-clock elapses with probe time included in the bound. Records `_last_kill_monotonic` inside `_kill_process`'s `finally` block so the wait engages on both in-process `load_model -> _kill_process -> load` and the frontend chat-settings Apply path (`/unload` then `/load`). The call site runs OUTSIDE the broad `self._lock` so concurrent `/unload`, `/cancel`, `/status` are not blocked during the wait. Short-circuits at zero cost on cold start (no kill recorded), stale kill (older than 15 s, driver has already settled), CPU-only host (probe returns empty), and probe exceptions (nvidia-smi gone away). 11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover: cold-start zero cost, stale-kill skip, slow-probe deadline bound, GPU index-set change, per-GPU stability with one draining card, the 2 % adaptive tolerance, _kill_process timestamp recording on real kill vs no-op, and an `inspect.getsource` contract that pins the call site to outside the Phase 3 lock and uses `_last_kill_monotonic` so a future refactor can't silently regress any of these properties. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Michael Han <michaelhan2050@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

danielhanchen requested a review from rolandtannous as a code owner May 22, 2026 11:19

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

danielhanchen force-pushed the studio-vram-settle-hotfix branch from 45a172e to 3fb15c0 Compare May 22, 2026 11:47

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

shimmyshimmer and others added 2 commits May 22, 2026 12:50

[pre-commit.ci] auto fixes from pre-commit.com hooks

63ae19b

for more information, see https://pre-commit.ci

danielhanchen force-pushed the studio-vram-settle-hotfix branch from 2a5cbe7 to 63ae19b Compare May 22, 2026 12:50

danielhanchen merged commit 2e1d0e2 into main May 22, 2026
6 of 7 checks passed

danielhanchen deleted the studio-vram-settle-hotfix branch May 22, 2026 12:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

studio: settle GPU VRAM after killing llama-server before the next reload#5693

studio: settle GPU VRAM after killing llama-server before the next reload#5693
danielhanchen merged 2 commits into
mainfrom
studio-vram-settle-hotfix

danielhanchen commented May 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

danielhanchen commented May 22, 2026

Problem

Fix

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants