From be5ac1f139b54626d247691ff9997d85f408ded1 Mon Sep 17 00:00:00 2001
From: Aman Gupta <amangupta052@gmail.com>
Date: Thu, 21 May 2026 12:26:39 +0800
Subject: [PATCH] server : free draft/MTP resources on sleep to fix VRAM leak

The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).

For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.

Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, so the draft-side objects are released
while the main model and context they may reference are still alive.
This ordering avoids use-after-free.

ref: https://github.com/ggml-org/llama.cpp/issues/23395

Assisted-by: llama.cpp:local pi
---
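Note: the release order described above can be illustrated with a small
standalone sketch. It does not use llama.cpp; every type and function name
below (fake_model, fake_context, fake_spec, fake_init, fake_server,
run_sleep_cycle) is a hypothetical stand-in, and the assumption that the
speculative decoder holds non-owning pointers to both contexts is made
purely for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Records the order in which the stand-in resources are released.
static std::vector<std::string> g_release_log;

// Hypothetical stand-ins; these are NOT the llama.cpp types.
struct fake_model {
    std::string name;
    explicit fake_model(std::string n) : name(std::move(n)) {}
    ~fake_model() { g_release_log.push_back("model:" + name); }
};

struct fake_context {
    const fake_model * model; // non-owning: the model must outlive the context
    explicit fake_context(const fake_model * m) : model(m) {}
    ~fake_context() { g_release_log.push_back("ctx:" + model->name); }
};

struct fake_spec {
    const fake_context * ctx_tgt; // non-owning (assumed for illustration)
    const fake_context * ctx_dft; // non-owning (assumed for illustration)
    fake_spec(const fake_context * t, const fake_context * d) : ctx_tgt(t), ctx_dft(d) {}
    ~fake_spec() { g_release_log.push_back("spec"); }
};

// Owns the target model and context, loosely mirroring the role of llama_init.
struct fake_init {
    std::unique_ptr<fake_model>   model;
    std::unique_ptr<fake_context> context;
    fake_init()
        : model(std::make_unique<fake_model>("tgt")),
          context(std::make_unique<fake_context>(model.get())) {}
    ~fake_init() {
        context.reset(); // context first, then the model it points to
        model.reset();
    }
};

struct fake_server {
    std::unique_ptr<fake_init>    init;
    std::unique_ptr<fake_model>   model_dft;
    std::unique_ptr<fake_context> ctx_dft;
    std::unique_ptr<fake_spec>    spec;

    void load() {
        init      = std::make_unique<fake_init>();
        model_dft = std::make_unique<fake_model>("dft");
        ctx_dft   = std::make_unique<fake_context>(model_dft.get());
        spec      = std::make_unique<fake_spec>(init->context.get(), ctx_dft.get());
    }

    // Release dependents before the objects they point to.
    void destroy() {
        spec.reset();
        ctx_dft.reset();
        model_dft.reset();
        init.reset();
    }
};

// Runs one load/destroy cycle and returns the recorded release order.
std::vector<std::string> run_sleep_cycle() {
    g_release_log.clear();
    fake_server srv;
    srv.load();
    srv.destroy();
    return g_release_log;
}
```

Calling run_sleep_cycle() returns the release log for one cycle. With the
order used in destroy(), every object is released before anything it points
to. If the draft members were left untouched, they would stay allocated
until the server object itself went away.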
 tools/server/server-context.cpp | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index f517310266c0..80d77b0c06b9 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -701,6 +701,10 @@ struct server_context_impl {
     bool sleeping = false;
 
     void destroy() {
+        spec.reset();
+        ctx_dft.reset();
+        model_dft.reset();
+
         llama_init.reset();
 
         ctx_tgt = nullptr;