Skip to content

studio: settle GPU VRAM after killing llama-server before the next reload#5693

Merged
danielhanchen merged 2 commits into
mainfrom
studio-vram-settle-hotfix
May 22, 2026
Merged

studio: settle GPU VRAM after killing llama-server before the next reload#5693
danielhanchen merged 2 commits into
mainfrom
studio-vram-settle-hotfix

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Problem

After _kill_process reaps the prior llama-server PID, the NVIDIA driver reclaims that process's CUDA allocations asynchronously — typically tens to hundreds of milliseconds. Sampling _get_gpu_free_memory in that window reads artificially low, which then propagates into _select_gpus / _fit_context_to_vram and flips the layer-split toward --fit on with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit.

This continues the lineage of #5161 / #5401 / #5427 — PR #5427's _serial_load_lock already eliminated the double-llama-server concurrent-kill race, but the single-process kill→probe→spawn window is still hot enough to misread free VRAM on the next reload and produce the same Apply-only OOM symptom on cards where the steady-state would have fit.

Fix

LlamaCppBackend._wait_for_vram_settle: bounded poll of _get_gpu_free_memory that returns as soon as two consecutive samples agree per-GPU within max(256 MiB, 2% of larger sample), or max_wait (default 2 s) wall-clock elapses with probe time included in the bound (so a wedged nvidia-smi cannot extend the reload by its 10 s subprocess timeout).

The wait engages on both in-process load_model -> _kill_process -> load and the frontend chat-settings Apply path (/unload then /load) — _kill_process records self._last_kill_monotonic in its finally block, and load_model reads that timestamp before Phase 3 instead of an in-band had_live_process flag. The frontend Apply path is the common one for chat-settings changes and it would otherwise miss the wait entirely.

The call site runs outside the broad self._lock so concurrent /unload, /cancel, and /status are not blocked during the up-to-2-s wait. Re-checks _cancel_event after lock re-acquisition (existing check already in Phase 3 head).

Short-circuits at zero cost on:

  • cold start (no kill recorded → _last_kill_monotonic == 0.0)
  • stale kill (older than 15 s → driver has already settled)
  • CPU-only host (first probe returns [])
  • probe exceptions (nvidia-smi gone away)

Test plan

  • 11 new unit tests in test_llama_cpp_wait_for_vram_settle.py:

    • Cold-start zero cost
    • Stale-kill skip beyond the settle window
    • Empty first sample short-circuit
    • First probe raises → no polling
    • Three-sample reclaim ramp settles
    • Mid-loop probe failure → bail
    • max_wait respected when probe never settles
    • max_wait respected when probe itself is slow (the deadline check runs before each probe, sleep is clipped to min(interval, remaining))
    • GPU index-set change between samples → return
    • Per-GPU stability when one card is still draining and another is stable
    • 2 % adaptive tolerance for large-VRAM cards
    • _kill_process timestamp recording on real kill vs no-op
    • inspect.getsource contract pinning _wait_for_vram_settle / since_kill / _last_kill_monotonic substrings in load_model, the call ordering (before Phase 3's broad lock), and that had_live_process cannot regress
  • All 102 tests in the affected test modules pass:

    • test_llama_cpp_wait_for_vram_settle.py (new)
    • test_llama_cpp_no_context_shift.py
    • test_gguf_reload_inheritance.py
    • test_llama_cpp_context_fit.py
    • test_native_context_length.py
    • test_windows_gpu_detection_mock.py
    • test_llama_cpp_max_context_threshold.py
  • End-to-end smoke against a live unsloth studio instance on the local HTTP API: cold /load → identical re-/load (already_loaded short-circuit) → /load with changed max_seq_length (Apply path) → /unload/load with changed cache_type_kv (the frontend Apply flow this fix specifically engages on). 7/7 requests returned 200 with correct status fields and no orphan llama-server PIDs.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45a172ef0d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1410 to +1414
if time.monotonic() >= deadline:
return None
try:
return LlamaCppBackend._get_gpu_free_memory()
except Exception:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce wait deadline during VRAM probe call

_wait_for_vram_settle checks the deadline only before calling _get_gpu_free_memory, but that probe can block for up to 10s (the nvidia-smi timeout in _get_gpu_free_memory). If a reload happens after a kill and the probe is slow/hung, this path can exceed max_wait by several seconds, delaying /load and cancellation handling despite the new helper’s contract that wall-clock wait is bounded by max_wait.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to wait for VRAM stabilization after a process is killed, addressing issues where asynchronous VRAM reclamation by the NVIDIA driver leads to incorrect GPU selection during model reloads. Key changes include the addition of a _wait_for_vram_settle method and the tracking of kill timestamps in llama_cpp.py, along with a comprehensive new test suite. The reviewer suggested adding debug logging to a silent exception handler within the VRAM polling logic to improve observability.

Comment on lines +1414 to +1415
except Exception:
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using broad, silent exception handlers. While the underlying _get_gpu_free_memory method performs its own logging, it is best practice to log any unexpected exceptions caught here at a debug level to provide visibility into why a probe might have failed during the settle wait, especially if it leads to an early return.

Suggested change
except Exception:
return None
except Exception as e:
logger.debug(f"VRAM settle probe failed: {e}")
return None
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

@danielhanchen danielhanchen force-pushed the studio-vram-settle-hotfix branch from 45a172e to 3fb15c0 Compare May 22, 2026 11:47

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a5cbe7247

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1387 to +1390
if time.monotonic() >= deadline:
return None
try:
return LlamaCppBackend._get_gpu_free_memory()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include probe runtime in the settle wait deadline

_wait_for_vram_settle only checks deadline before invoking _get_gpu_free_memory, so if that probe blocks (e.g., nvidia-smi hangs until its 10s timeout in _get_gpu_free_memory) the function can exceed max_wait by several seconds. This breaks the helper’s bounded-wait contract and can noticeably delay /load completion and cancellation responsiveness when the GPU probe is slow or wedged.

Useful? React with 👍 / 👎.

shimmyshimmer and others added 2 commits May 22, 2026 12:50
…load

The NVIDIA driver reclaims a dead process's CUDA allocations
asynchronously after the kernel reaps the PID -- typically tens to
hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that
window reads artificially low, which propagates into `_select_gpus`
/ `_fit_context_to_vram` and flips the layer-split toward `--fit on`
with more CPU-offloaded layers than steady-state would have required.
On a tight VRAM card the resulting mmap thrash + OOM matches the
Apply-reload kill path that bare-shell launches with the same flags
never hit (continues the lineage of #5161 / #5401 / #5427).

Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of
`_get_gpu_free_memory` that returns as soon as two consecutive
samples agree per-GPU within `max(256 MiB, 2% of larger sample)`,
or `max_wait` (default 2 s) wall-clock elapses with probe time
included in the bound. Records `_last_kill_monotonic` inside
`_kill_process`'s `finally` block so the wait engages on both
in-process `load_model -> _kill_process -> load` and the frontend
chat-settings Apply path (`/unload` then `/load`). The call site
runs OUTSIDE the broad `self._lock` so concurrent `/unload`,
`/cancel`, `/status` are not blocked during the wait.

Short-circuits at zero cost on cold start (no kill recorded), stale
kill (older than 15 s, driver has already settled), CPU-only host
(probe returns empty), and probe exceptions (nvidia-smi gone away).

11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover:
cold-start zero cost, stale-kill skip, slow-probe deadline bound,
GPU index-set change, per-GPU stability with one draining card, the
2 % adaptive tolerance, _kill_process timestamp recording on real
kill vs no-op, and an `inspect.getsource` contract that pins the
call site to outside the Phase 3 lock and uses `_last_kill_monotonic`
so a future refactor can't silently regress any of these properties.
@danielhanchen danielhanchen force-pushed the studio-vram-settle-hotfix branch from 2a5cbe7 to 63ae19b Compare May 22, 2026 12:50
@danielhanchen danielhanchen merged commit 2e1d0e2 into main May 22, 2026
6 of 7 checks passed
@danielhanchen danielhanchen deleted the studio-vram-settle-hotfix branch May 22, 2026 12:50
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…load (unslothai#5693)

* studio: settle GPU VRAM after killing llama-server before the next reload

The NVIDIA driver reclaims a dead process's CUDA allocations
asynchronously after the kernel reaps the PID -- typically tens to
hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that
window reads artificially low, which propagates into `_select_gpus`
/ `_fit_context_to_vram` and flips the layer-split toward `--fit on`
with more CPU-offloaded layers than steady-state would have required.
On a tight VRAM card the resulting mmap thrash + OOM matches the
Apply-reload kill path that bare-shell launches with the same flags
never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427).

Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of
`_get_gpu_free_memory` that returns as soon as two consecutive
samples agree per-GPU within `max(256 MiB, 2% of larger sample)`,
or `max_wait` (default 2 s) wall-clock elapses with probe time
included in the bound. Records `_last_kill_monotonic` inside
`_kill_process`'s `finally` block so the wait engages on both
in-process `load_model -> _kill_process -> load` and the frontend
chat-settings Apply path (`/unload` then `/load`). The call site
runs OUTSIDE the broad `self._lock` so concurrent `/unload`,
`/cancel`, `/status` are not blocked during the wait.

Short-circuits at zero cost on cold start (no kill recorded), stale
kill (older than 15 s, driver has already settled), CPU-only host
(probe returns empty), and probe exceptions (nvidia-smi gone away).

11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover:
cold-start zero cost, stale-kill skip, slow-probe deadline bound,
GPU index-set change, per-GPU stability with one draining card, the
2 % adaptive tolerance, _kill_process timestamp recording on real
kill vs no-op, and an `inspect.getsource` contract that pins the
call site to outside the Phase 3 lock and uses `_last_kill_monotonic`
so a future refactor can't silently regress any of these properties.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Michael Han <michaelhan2050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants