
TP: fix ggml context size calculation #22616

Merged
ggerganov merged 5 commits into ggml-org:master from JohannesGaessler:tp-fix-ctx-size
May 25, 2026


@JohannesGaessler
Contributor

Fixes #22404.

On master the meta backend has no way to tell how many tensors a buffer will be allocated for, so the corresponding ggml contexts for the simple backends are created with a constant 1 GB of memory. In principle the ggml backend API could be extended to track this, but long-term I don't think that is how it should be done. As a stopgap, this PR sets a maximum expected number of ggml tensors per simple backend in the context of the meta backend buffer type. This may result in overallocation but is safe in terms of multithreading.


@github-actions (bot) added the "ggml" label (changes relating to the ggml tensor library for machine learning) on May 2, 2026
@pwilkin
Member

pwilkin commented May 2, 2026

Crashes immediately for me now:

sched_reserve: reserve took 66.48 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_new_object: not enough space in the context's memory pool (needed 71024, available 70656)
/devel/tools/llama.cpp/ggml/src/ggml.c:1699: not enough space in the context's memory pool

stacktrace:

#7  0x0000723e3d556d90 in ggml_new_object (ctx=<optimized out>, type=<optimized out>, size=<optimized out>) at /devel/tools/llama.cpp/ggml/src/ggml.c:1699
1699	        GGML_ABORT("not enough space in the context's memory pool");
#8  0x0000723e3d558c51 in ggml_new_tensor_impl (ctx=0x5bacb174db40, type=GGML_TYPE_F32, n_dims=4, ne=0x7ffd787ebb70, view_src=0x0, view_offs=0) at /devel/tools/llama.cpp/ggml/src/ggml.c:1765
1765	    struct ggml_object * const obj_new = ggml_new_object(ctx, GGML_OBJECT_TYPE_TENSOR, GGML_TENSOR_SIZE + obj_alloc_size);
#9  ggml_new_tensor (ctx=0x5bacb174db40, type=GGML_TYPE_F32, n_dims=4, ne=0x7ffd787ebb70) at /devel/tools/llama.cpp/ggml/src/ggml.c:1810
1810	    return ggml_new_tensor_impl(ctx, type, n_dims, ne, NULL, 0);
#10 0x0000723e3d57fa74 in ggml_backend_meta_buffer_init_tensor (buffer=0x5bacb174c8a0, tensor=0x723a7969e840) at /devel/tools/llama.cpp/ggml/src/ggml-backend-meta.cpp:1137
1137	        ggml_tensor * t_ij = ggml_new_tensor(simple_ctx, tensor->type, GGML_MAX_DIMS, ne);
#11 0x0000723e3d56d666 in ggml_gallocr_alloc_graph (galloc=0x5bacb2b68800, graph=0x5bacb193b008) at /devel/tools/llama.cpp/ggml/src/ggml-alloc.c:1099
1099	        ggml_gallocr_init_tensor(galloc, node, &node_alloc->dst);
#12 0x0000723e3d573df1 in ggml_backend_sched_alloc_splits (sched=0x5bacb193aeb0) at /devel/tools/llama.cpp/ggml/src/ggml-backend.cpp:1505
1505	    if (backend_ids_changed || !ggml_gallocr_alloc_graph(sched->galloc, &sched->graph)) {
#13 ggml_backend_sched_alloc_graph (sched=0x5bacb193aeb0, graph=<optimized out>) at /devel/tools/llama.cpp/ggml/src/ggml-backend.cpp:1870
1870	    if (!ggml_backend_sched_alloc_splits(sched)) {
#14 0x0000723e3d2b9427 in llama_context::process_ubatch (this=this@entry=0x5bacb174f020, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x5bacaf04fd60, ret=@0x7ffd787f22d0: GGML_STATUS_SUCCESS) at /usr/include/c++/15/bits/unique_ptr.h:192
192	      pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#15 0x0000723e3d2c000f in llama_context::decode (this=0x5bacb174f020, batch_inp=...) at /devel/tools/llama.cpp/src/llama-context.cpp:1692
1692	        const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#16 0x0000723e3d2c1b52 in llama_decode (ctx=<optimized out>, batch=...) at /devel/tools/llama.cpp/src/llama-context.cpp:3454
3454	    const int ret = ctx->decode(batch);
#17 0x0000723e3d80acfd in common_init_from_params (params=...) at /devel/tools/llama.cpp/common/common.cpp:1370
1370	            llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
#18 0x00005bac7150ddfe in server_context_impl::load_model (this=0x5bacaf0b04d0, params=...) at /devel/tools/llama.cpp/tools/server/server-context.cpp:754
754	        llama_init = common_init_from_params(params_base);
#19 0x00005bac714e10cc in server_context::load_model (this=this@entry=0x7ffd787f4518, params=...) at /devel/tools/llama.cpp/tools/server/server-context.cpp:3090
3090	    return impl->load_model(params);

@heislera763

heislera763 commented May 2, 2026

It looks like I'm getting the same crash as pwilkin on this branch

Logs
alexander@alexander-main:~/.llama-server$ ./test-tp-debug.sh
ggml_cuda_init: found 6 CUDA devices (Total VRAM: 194964 MiB):
  Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32494 MiB
  Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32494 MiB
  Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32494 MiB
  Device 3: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32494 MiB
  Device 4: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32494 MiB
  Device 5: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32494 MiB
load_backend: failed to find ggml_backend_init in /home/alexander/.llama-server/llama.cpp-debug/build/bin/libggml-cuda.so
load_backend: failed to find ggml_backend_init in /home/alexander/.llama-server/llama.cpp-debug/build/bin/libggml-cpu.so
build_info: b9008-cd35c1466
system_info: n_threads = 44 (n_threads_batch = 44) / 88 | CUDA : ARCHS = 700 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 87 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'models/gemma-4-31b/google_gemma-4-31B-it-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
common_fit_params: fitting params to free memory took 0.00 seconds
llama_model_load_from_file_impl: skipping CPU (Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz) for tensor parallelism
llama_model_load_from_file_impl: creating a Meta device for tensor parallelism from 6 devices:
llama_model_load_from_file_impl: - device 0: CUDA0 (Tesla V100-SXM2-32GB)
llama_model_load_from_file_impl: - device 1: CUDA1 (Tesla V100-SXM2-32GB)
llama_model_load_from_file_impl: - device 2: CUDA2 (Tesla V100-SXM2-32GB)
llama_model_load_from_file_impl: - device 3: CUDA3 (Tesla V100-SXM2-32GB)
llama_model_load_from_file_impl: - device 4: CUDA4 (Tesla V100-SXM2-32GB)
llama_model_load_from_file_impl: - device 5: CUDA5 (Tesla V100-SXM2-32GB)
llama_model_load_from_file_impl: using device Meta() (Meta()) (unknown id) - 193101 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 833 tensors from models/gemma-4-31b/google_gemma-4-31B-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 64
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Gemma 4 31B It
llama_model_loader: - kv   6:                           general.finetune str              = it
llama_model_loader: - kv   7:                           general.basename str              = gemma-4
llama_model_loader: - kv   8:                         general.size_label str              = 31B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://ai.google.dev/gemma/docs/gemm...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                         gemma4.block_count u32              = 60
llama_model_loader: - kv  13:                      gemma4.context_length u32              = 262144
llama_model_loader: - kv  14:                    gemma4.embedding_length u32              = 5376
llama_model_loader: - kv  15:                 gemma4.feed_forward_length u32              = 21504
llama_model_loader: - kv  16:                gemma4.attention.head_count u32              = 32
llama_model_loader: - kv  17:             gemma4.attention.head_count_kv arr[i32,60]      = [16, 16, 16, 16, 16, 4, 16, 16, 16, 1...
llama_model_loader: - kv  18:                      gemma4.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:                  gemma4.rope.freq_base_swa f32              = 10000.000000
llama_model_loader: - kv  20:    gemma4.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                gemma4.attention.key_length u32              = 512
llama_model_loader: - kv  22:              gemma4.attention.value_length u32              = 512
llama_model_loader: - kv  23:             gemma4.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  24:            gemma4.attention.sliding_window u32              = 1024
llama_model_loader: - kv  25:          gemma4.attention.shared_kv_layers u32              = 0
llama_model_loader: - kv  26:    gemma4.embedding_length_per_layer_input u32              = 0
llama_model_loader: - kv  27:    gemma4.attention.sliding_window_pattern arr[bool,60]     = [true, true, true, true, true, false,...
llama_model_loader: - kv  28:            gemma4.attention.key_length_swa u32              = 256
llama_model_loader: - kv  29:          gemma4.attention.value_length_swa u32              = 256
llama_model_loader: - kv  30:                gemma4.rope.dimension_count u32              = 512
llama_model_loader: - kv  31:            gemma4.rope.dimension_count_swa u32              = 256
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gemma4
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  34:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,514906]  = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  41:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {%- macro format_parameters(propertie...
llama_model_loader: - kv  43:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  44:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 7
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = /models_out/gemma-4-31B-it-GGUF/googl...
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 410
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 822
llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type q8_0:  411 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.38 GiB (8.50 BPW)
load: 0 unused tokens
load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 50 ('<|tool_response>')
load:   - 106 ('<turn|>')
load:   - 212 ('</s>')
load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
load: special tokens cache size = 24
load: token to piece cache size = 1.9445 MB
print_info: arch                  = gemma4
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5376
print_info: n_embd_inp            = 5376
print_info: n_layer               = 60
print_info: n_head                = 32
print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
print_info: n_rot                 = 512
print_info: n_swa                 = 1024
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 512
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 1.0e+00
print_info: n_ff                  = 21504
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_embd_head_k_swa     = 256
print_info: n_embd_head_v_swa     = 256
print_info: n_rot_swa             = 256
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 31B
print_info: model params          = 30.70 B
print_info: general.name          = Gemma 4 31B It
print_info: vocab type            = BPE
print_info: n_vocab               = 262144
print_info: n_merges              = 514906
print_info: BOS token             = 2 '<bos>'
print_info: EOS token             = 1 '<eos>'
print_info: UNK token             = 3 '<unk>'
print_info: PAD token             = 0 '<pad>'
print_info: MASK token            = 4 '<mask>'
print_info: LF token              = 107 '
'
print_info: EOG token             = 1 '<eos>'
print_info: EOG token             = 50 '<|tool_response>'
print_info: EOG token             = 106 '<turn|>'
print_info: max token length      = 93
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 61/61 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1428.00 MiB
load_tensors:       Meta() model buffer size =  6400.51 MiB
..............................................................................................
common_init_result: added <eos> logit bias = -inf
common_init_result: added <|tool_response> logit bias = -inf
common_init_result: added <turn|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 262144 cells
llama_kv_cache:     Meta() KV buffer size =  3584.00 MiB
llama_kv_cache: size = 20480.00 MiB (262144 cells,  10 layers,  1/1 seqs), K (f16): 10240.00 MiB, V (f16): 10240.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:     Meta() KV buffer size =   201.00 MiB
llama_kv_cache: size = 1200.00 MiB (  1536 cells,  50 layers,  1/1 seqs), K (f16):  600.00 MiB, V (f16):  600.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:     Meta() compute buffer size =   857.52 MiB
sched_reserve:  CUDA_Host compute buffer size =   536.02 MiB
sched_reserve: graph nodes  = 2462
sched_reserve: graph splits = 2
sched_reserve: reserve took 302.88 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c:1699: not enough space in the context's memory pool
ggml_new_object: not enough space in the context's memory pool (needed 147568, available 147200)
[New LWP 27142]
[New LWP 27141]
[New LWP 27140]
[New LWP 27139]
[New LWP 27138]
[New LWP 27137]
[New LWP 27136]
[New LWP 27135]
[New LWP 27134]
[New LWP 27133]
[New LWP 27132]
[New LWP 27131]
[New LWP 27130]
[New LWP 27122]
[New LWP 27121]
[New LWP 27120]
[New LWP 27119]
[New LWP 27118]
[New LWP 27117]
[New LWP 27116]
[New LWP 27115]
[New LWP 27114]
[New LWP 27113]
[New LWP 27112]
[New LWP 27111]
[New LWP 27110]
[New LWP 27109]
[New LWP 27108]
[New LWP 27107]
[New LWP 27106]
[New LWP 27105]
[New LWP 27104]
[New LWP 27103]
[New LWP 27102]
[New LWP 27101]
[New LWP 27100]
[New LWP 27099]
[New LWP 27098]
[New LWP 27097]
[New LWP 27096]
[New LWP 27095]
[New LWP 27094]
[New LWP 27093]
[New LWP 27092]
[New LWP 27091]
[New LWP 27090]
[New LWP 27089]
[New LWP 27088]
[New LWP 27087]
[New LWP 27086]
[New LWP 27085]
[New LWP 27084]
[New LWP 27083]
[New LWP 27082]
[New LWP 27081]
[New LWP 27080]
[New LWP 27079]
[New LWP 27078]
[New LWP 27077]
[New LWP 27076]
[New LWP 27075]
[New LWP 27074]
[New LWP 27073]
[New LWP 27072]
[New LWP 27071]
[New LWP 27070]
[New LWP 27069]
[New LWP 27068]
[New LWP 27067]
[New LWP 27066]
[New LWP 27065]
[New LWP 27064]
[New LWP 27063]
[New LWP 27062]
[New LWP 27061]
[New LWP 27060]
[New LWP 27059]
[New LWP 27058]
[New LWP 27057]
[New LWP 27056]
[New LWP 27055]
[New LWP 27054]
[New LWP 27053]
[New LWP 27052]
[New LWP 27051]
[New LWP 27050]
[New LWP 27049]
[New LWP 27048]
[New LWP 27047]
[New LWP 27046]
[New LWP 27045]
[New LWP 27044]
[New LWP 27043]
[New LWP 27042]
[New LWP 27041]
[New LWP 27040]
[New LWP 27039]
[New LWP 27038]
[New LWP 27037]
[New LWP 27036]
[New LWP 27035]
[New LWP 27034]
[New LWP 27033]
[New LWP 27032]
[New LWP 27031]
[New LWP 27030]
[New LWP 27029]
[New LWP 27028]
[New LWP 27008]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56	../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56	in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fc405e9b668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49	./nptl/cancellation.c: No such file or directory
#2  0x00007fc405e9b6ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75	in ./nptl/cancellation.c
#3  0x00007fc405f067c7 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007fc410b263ec in ggml_print_backtrace () at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c:234
234	        waitpid(child_pid, NULL, 0);
#5  0x00007fc410b265bc in ggml_abort (file=0x7fc410bba7c1 "/home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c", line=1699, fmt=0x7fc410bbd284 "not enough space in the context's memory pool") at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c:268
268	        ggml_print_backtrace();
#6  0x00007fc410b28d62 in ggml_new_object (ctx=0x557381df9e80, type=GGML_OBJECT_TYPE_TENSOR, size=336) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c:1699
1699	        GGML_ABORT("not enough space in the context's memory pool");
#7  0x00007fc410b287cb in ggml_new_tensor_impl (ctx=0x557381df9e80, type=GGML_TYPE_F16, n_dims=4, ne=0x7fffa79853c0, view_src=0x0, view_offs=0) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c:1765
1765	    struct ggml_object * const obj_new = ggml_new_object(ctx, GGML_OBJECT_TYPE_TENSOR, GGML_TENSOR_SIZE + obj_alloc_size);
#8  0x00007fc410b285c1 in ggml_new_tensor (ctx=0x557381df9e80, type=GGML_TYPE_F16, n_dims=4, ne=0x7fffa79853c0) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml.c:1810
1810	    return ggml_new_tensor_impl(ctx, type, n_dims, ne, NULL, 0);
#9  0x00007fc410b4a21f in ggml_backend_meta_buffer_init_tensor (buffer=0x557388454740, tensor=0x5573897af330) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-backend-meta.cpp:1137
1137	        ggml_tensor * t_ij = ggml_new_tensor(simple_ctx, tensor->type, GGML_MAX_DIMS, ne);
#10 0x00007fc410b3ec65 in ggml_backend_buffer_init_tensor (buffer=0x557388454740, tensor=0x5573897af330) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-backend.cpp:147
147	        return buffer->iface.init_tensor(buffer, tensor);
#11 0x00007fc410b45c78 in ggml_backend_view_init (tensor=0x5573897af330) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-backend.cpp:1985
1985	    return ggml_backend_buffer_init_tensor(tensor->buffer, tensor);
#12 0x00007fc410b3c6a1 in ggml_gallocr_init_tensor (galloc=0x557381ead2d0, tensor=0x5573897af330, tensor_alloc=0x55738a329410) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-alloc.c:986
986	            ggml_backend_view_init(tensor);
#13 0x00007fc410b3c375 in ggml_gallocr_alloc_graph (galloc=0x557381ead2d0, graph=0x5573823db7f8) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-alloc.c:1099
1099	        ggml_gallocr_init_tensor(galloc, node, &node_alloc->dst);
#14 0x00007fc410b447b1 in ggml_backend_sched_alloc_splits (sched=0x5573823db6a0) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-backend.cpp:1505
1505	    if (backend_ids_changed || !ggml_gallocr_alloc_graph(sched->galloc, &sched->graph)) {
#15 0x00007fc410b4462d in ggml_backend_sched_alloc_graph (sched=0x5573823db6a0, graph=0x5573896d2700) at /home/alexander/.llama-server/llama.cpp-debug/ggml/src/ggml-backend.cpp:1870
1870	    if (!ggml_backend_sched_alloc_splits(sched)) {
#16 0x00007fc4103ab399 in llama_context::process_ubatch (this=0x557381d0ff90, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x5573823dd8f0, ret=@0x7fffa798ba74: GGML_STATUS_SUCCESS) at /home/alexander/.llama-server/llama.cpp-debug/src/llama-context.cpp:1214
1214	        if (!ggml_backend_sched_alloc_graph(sched.get(), gf)) {
#17 0x00007fc4103ace77 in llama_context::decode (this=0x557381d0ff90, batch_inp=...) at /home/alexander/.llama-server/llama.cpp-debug/src/llama-context.cpp:1692
1692	        const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#18 0x00007fc4103b2c09 in llama_decode (ctx=0x557381d0ff90, batch=...) at /home/alexander/.llama-server/llama.cpp-debug/src/llama-context.cpp:3454
3454	    const int ret = ctx->decode(batch);
#19 0x00007fc4112fcb06 in common_init_from_params (params=...) at /home/alexander/.llama-server/llama.cpp-debug/common/common.cpp:1370
1370	            llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
#20 0x0000557365e8111e in server_context_impl::load_model (this=0x557381d25e00, params=...) at /home/alexander/.llama-server/llama.cpp-debug/tools/server/server-context.cpp:754
754	        llama_init = common_init_from_params(params_base);
#21 0x0000557365e609c5 in server_context::load_model (this=0x7fffa7993d90, params=...) at /home/alexander/.llama-server/llama.cpp-debug/tools/server/server-context.cpp:3090
3090	    return impl->load_model(params);
#22 0x0000557365da410b in main (argc=11, argv=0x7fffa7995c28) at /home/alexander/.llama-server/llama.cpp-debug/tools/server/server.cpp:280
280	        if (!ctx_server.load_model(params)) {
[Inferior 1 (process 27007) detached]
./test-tp-debug.sh: line 9: 27007 Aborted                 ./llama.cpp-debug/build/bin/llama-server -m models/gemma-4-31b/google_gemma-4-31B-it-Q8_0.gguf -sm tensor -fa 1 -c 262144 -np 1
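
The figures in the two abort messages above can be cross-checked with a little arithmetic. The following is a minimal sketch, not something stated in the thread: it takes the 336-byte request size from the `size=336` argument in the `ggml_new_object` frame and assumes that the rest of the shortfall is a per-object header. The names `header_bytes`, `per_tensor` and `tensors_that_fit` are made up for illustration.

```cpp
#include <cstddef>

// Figures copied from the two abort messages in this thread.
constexpr std::size_t needed_a = 71024,  available_a = 70656;
constexpr std::size_t needed_b = 147568, available_b = 147200;

// size=336 is taken from the ggml_new_object frame in the backtrace.
constexpr std::size_t tensor_bytes = 336;
// Assumption: the remainder of the shortfall is a per-object header.
constexpr std::size_t header_bytes = (needed_a - available_a) - tensor_bytes;
constexpr std::size_t per_tensor   = tensor_bytes + header_bytes;

// Number of whole tensors that fit into a pool of the given size.
constexpr std::size_t tensors_that_fit(std::size_t pool_bytes) {
    return pool_bytes / per_tensor;
}

// Both runs fall short by exactly one tensor-sized object...
static_assert(needed_a - available_a == per_tensor, "run A shortfall");
static_assert(needed_b - available_b == per_tensor, "run B shortfall");
// ...and both pools are an exact multiple of that size.
static_assert(available_a % per_tensor == 0, "pool A holds a whole number of tensors");
static_assert(available_b % per_tensor == 0, "pool B holds a whole number of tensors");
```

Under these assumptions the first pool had room for exactly 192 tensors and the second for exactly 400, and each abort happened on the very next tensor. That would be consistent with a context sized from a tensor count that turned out to be too low.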

@JohannesGaessler
Contributor Author

It seems there were multiple problems vs. master. The meta backend needs both a way to calculate the context size that is actually needed and a way to temporarily store simple tensors for external views; otherwise the ggml contexts per backend are slowly depleted until the program crashes. Does it work now?
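
The depletion described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `fixed_pool_sketch` and `graphs_until_depleted` are hypothetical names, the pool counts slots instead of holding real tensors, and the reset step is a stand-in for whatever temporary storage the PR actually uses.

```cpp
#include <cstddef>

// Toy fixed-capacity pool: counts tensor slots instead of holding real tensors.
struct fixed_pool_sketch {
    std::size_t capacity;
    std::size_t used = 0;

    explicit fixed_pool_sketch(std::size_t cap) : capacity(cap) {}

    // Returns false where the real code would abort with "not enough space".
    bool new_tensor() {
        if (used == capacity) {
            return false;
        }
        ++used;
        return true;
    }

    // Discards all temporary entries so the slots can be reused.
    void reset() { used = 0; }
};

// Simulates up to n_graphs graph allocations that each create views_per_graph entries.
// Returns how many graph allocations completed before the pool ran out.
inline std::size_t graphs_until_depleted(fixed_pool_sketch pool, std::size_t views_per_graph,
                                         std::size_t n_graphs, bool reset_between_graphs) {
    for (std::size_t g = 0; g < n_graphs; ++g) {
        if (reset_between_graphs) {
            pool.reset();
        }
        for (std::size_t v = 0; v < views_per_graph; ++v) {
            if (!pool.new_tensor()) {
                return g;
            }
        }
    }
    return n_graphs;
}
```

With a capacity of 100 slots and 30 view entries per graph, the model without resets gets through three graphs and fails during the fourth, while the model with resets completes every graph. It fails only after several successful iterations, which may be why such a bug surfaces late in a long session rather than at startup.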

@heislera763

So far so good for me, no longer crashing at startup. I will run this branch for a bit to gauge the stability.

@pwilkin
Member

pwilkin commented May 2, 2026

For what it's worth, I ran a full session with OpenCode from zero up to compaction at 100k: no errors, no crashes, and a steady rate of around 1000 t/s prefill + 50 t/s generation on 2x 5070 with Qwen 3.6 dense (27B) and self-speculative decoding (--spec-default).

@jimneumann

This seems to have fixed it for me too. 5 instances of llama-eval have been running for 6 hours straight now without issue on different numbers of GPUs (2x and 3x V100 with Qwen3.6 35B MoE and 2x, 3x, and 4x V100 with the 27B dense). Before this change, even in the best case it would have crashed twice by now.

@Odel

Odel commented May 3, 2026

Seems fixed here as well on 2x 3090s

@ggerganov
Member

Try to keep the meta backend changes contained better:

diff --git a/ggml/src/ggml-backend-impl.h b/ggml/src/ggml-backend-impl.h
index 9c56ec30c..0630319fe 100644
--- a/ggml/src/ggml-backend-impl.h
+++ b/ggml/src/ggml-backend-impl.h
@@ -98,6 +98,9 @@ extern "C" {
     // temporary workaround to statically allocate tensors from a context in a deduplicated way:
     GGML_API struct ggml_backend_buffer * ggml_backend_meta_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft);
 
+    // another temporary workaround
+    GGML_API void ggml_backend_meta_buft_update_max_n_tensors(ggml_backend_buffer_type_t buft, size_t n_tensors);
+
     //
     // Backend (stream)
     //
diff --git a/ggml/src/ggml-backend-meta.cpp b/ggml/src/ggml-backend-meta.cpp
index 0f57af6e6..78d681edf 100644
--- a/ggml/src/ggml-backend-meta.cpp
+++ b/ggml/src/ggml-backend-meta.cpp
@@ -1478,12 +1478,12 @@ struct ggml_backend_buffer * ggml_backend_meta_alloc_ctx_tensors_from_buft(struc
     const size_t n_simple_bufts = ggml_backend_meta_buft_n_bufts(buft);
 
     ggml_init_params params_static = {
-        /*.mem_size   =*/ ctx->mem_size,
+        /*.mem_size   =*/ ggml_get_mem_size(ctx),
         /*.mem_buffer =*/ nullptr,
         /*.no_alloc   =*/ true,
     };
     ggml_init_params params_compute = {
-        /*.mem_size   =*/ 8*ctx->mem_size,
+        /*.mem_size   =*/ 8*ggml_get_mem_size(ctx),
         /*.mem_buffer =*/ nullptr,
         /*.no_alloc   =*/ true,
     };
diff --git a/ggml/src/ggml-impl.h b/ggml/src/ggml-impl.h
index b713705f9..62b76abbc 100644
--- a/ggml/src/ggml-impl.h
+++ b/ggml/src/ggml-impl.h
@@ -57,22 +57,6 @@ uint64_t ggml_graph_next_uid(void);
     #endif
 #endif
 
-//
-// ggml context
-//
-
-struct ggml_context {
-    size_t mem_size;
-    void * mem_buffer;
-    bool   mem_buffer_owned;
-    bool   no_alloc;
-
-    int    n_objects;
-
-    struct ggml_object * objects_begin;
-    struct ggml_object * objects_end;
-};
-
 static inline int ggml_up32(int n) {
     return (n + 31) & ~31;
 }
@@ -757,9 +741,6 @@ static inline bool ggml_can_fuse_subgraph(const struct ggml_cgraph * cgraph,
     return ggml_can_fuse_subgraph_ext(cgraph, idxs, count, ops, outputs, num_outputs);
 }
 
-struct ggml_backend_buffer_type;
-void ggml_backend_meta_buft_update_max_n_tensors(struct ggml_backend_buffer_type * buft, size_t n_tensors);
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 9b969ae16..81343eeb1 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -952,6 +952,22 @@ struct ggml_object {
 
 static const size_t GGML_OBJECT_SIZE = sizeof(struct ggml_object);
 
+//
+// ggml context
+//
+
+struct ggml_context {
+    size_t mem_size;
+    void * mem_buffer;
+    bool   mem_buffer_owned;
+    bool   no_alloc;
+
+    int    n_objects;
+
+    struct ggml_object * objects_begin;
+    struct ggml_object * objects_end;
+};
+
 //
 // data types
 //

Comment thread ggml/src/ggml-backend-meta.cpp Outdated

std::string name;

std::atomic<size_t> max_n_tensors = 1024; // FIXME replace with better handling in ggml-alloc.c
Member

Why does this have to be atomic?

Contributor Author

If multiple threads were to try to use the same meta ggml backend buffer type at the same time there could be a race condition on which maximum value gets set for it, leading to one of the threads allocating a ggml_context with not enough memory and resulting in spurious and hard to debug crashes.
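
To illustrate the race described above, here is a minimal sketch (not the code from this PR) of a "raise to maximum" update on a `std::atomic<size_t>`. The names `update_max` and `run_concurrent_updates`, and the proposed values, are hypothetical. With a plain `size_t` and a separate read followed by a write, two threads could both read the old value and the smaller write could land last.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of ggml): raise `current` to at least `candidate`.
static void update_max(std::atomic<size_t> & current, size_t candidate) {
    size_t observed = current.load();
    // compare_exchange_weak refreshes `observed` on failure, so the loop
    // re-checks against the latest value and never overwrites a larger one.
    while (observed < candidate && !current.compare_exchange_weak(observed, candidate)) {
    }
}

// Hypothetical driver: n_threads threads each propose a different tensor count.
static size_t run_concurrent_updates(size_t n_threads) {
    assert(n_threads > 0);
    std::atomic<size_t> max_n_tensors{1024}; // same starting value as in the snippet under review
    std::vector<std::thread> threads;
    for (size_t i = 1; i <= n_threads; i++) {
        threads.emplace_back([&max_n_tensors, i] { update_max(max_n_tensors, 512 * i); });
    }
    for (std::thread & t : threads) {
        t.join();
    }
    return max_n_tensors.load();
}
```

Whatever order the threads run in, the stored value ends up as the largest proposal (or stays at the initial value if no proposal exceeds it).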

Member

I think you can completely avoid the max_n_tensors changes. Can you try to dynamically resize the ggml_contexts when they run out of space? Any reason not to do it like this in the first place?

Comment thread ggml/src/ggml-backend.cpp Outdated
Comment on lines +1826 to +1843
    std::set<ggml_backend_buffer_t> seen_bufs;
    for (int i = 0; i < sched->graph.n_leafs; i++) {
        ggml_backend_buffer_t buf = sched->graph.leafs[i]->buffer;
        if (buf == nullptr || seen_bufs.find(buf) != seen_bufs.end()) {
            continue;
        }
        ggml_backend_buffer_reset(buf);
        seen_bufs.emplace(buf);
    }
    for (int i = 0; i < sched->graph.n_nodes; i++) {
        ggml_backend_buffer_t buf = sched->graph.nodes[i]->buffer;
        if (buf == nullptr || seen_bufs.find(buf) != seen_bufs.end()) {
            continue;
        }
        ggml_backend_buffer_reset(buf);
        seen_bufs.emplace(buf);
    }

Member

What is the reason to need these resets now?

Contributor Author

The meta backend has one simple tensor per simple backend and per original tensor. They are created when init_tensor is called and their lifetime is tied to the meta ggml backend buffer. However, when a view of one of those tensors is created by an external ggml_context, that view uses the statically allocated meta ggml backend buffer and thus has its simple tensors created there. In llama.cpp we are continuously creating views of statically allocated tensors, so this slowly depletes the memory of the meta backend buffer's ggml contexts. Resetting the buffers here frees the memory for external views.

But the way I have currently implemented it is problematic because it is vulnerable to dangling pointers if the ggml backend buffers are freed before ggml_sched_reset is called, which is why test-opt is segfaulting in the CI. I'm currently thinking about whether the mapping from an original tensor to the simple tensors per simple backend needs to be handled differently altogether.

@JohannesGaessler
Contributor Author

I revised the way simple tensors are stored in the meta backend. For now it is using a rotating set of temporary ggml contexts to avoid memory depletion; this should work correctly for the usage pattern in llama.cpp. Long-term my opinion is that we should change the backend scheduler to use GGML_OP_CUSTOM in order to move copies between backends into ggml_cgraph rather than to do it externally via API calls. Then it should be possible to tie the lifetime of the simple tensors to the graph execution by the meta backend which would I think be a more robust solution.
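
The rotating scheme can be pictured roughly as follows. This is a sketch under assumptions, not the PR's code: `arena` is a hypothetical stand-in for a fixed-size `ggml_context`, and `rotating_arenas` is a hypothetical pool. When the current arena is full, the pool advances to the next one and resets it, so memory use stays bounded, at the cost that anything previously allocated in the recycled arena becomes invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a fixed-size ggml_context: a bump allocator that cannot grow.
struct arena {
    std::vector<unsigned char> buf;
    size_t used = 0;

    explicit arena(size_t size) : buf(size) {}

    // Returns nullptr when the request does not fit into the remaining space.
    void * alloc(size_t n) {
        if (n > buf.size() - used) {
            return nullptr;
        }
        void * p = buf.data() + used;
        used += n;
        return p;
    }

    void reset() { used = 0; }
};

// Hypothetical pool: a ring of arenas that recycles the oldest one when space runs out.
struct rotating_arenas {
    std::vector<arena> arenas;
    size_t current = 0;

    rotating_arenas(size_t n_arenas, size_t arena_size) : arenas(n_arenas, arena(arena_size)) {
        assert(n_arenas > 0);
    }

    void * alloc(size_t n) {
        if (void * p = arenas[current].alloc(n)) {
            return p;
        }
        // Current arena is full: move on and recycle the next one.
        // Pointers handed out from that arena earlier must no longer be used.
        current = (current + 1) % arenas.size();
        arenas[current].reset();
        return arenas[current].alloc(n); // nullptr only if n exceeds a whole arena
    }
};
```

The trade-off is the one noted in the comment: this only works if the caller's usage pattern guarantees that objects from a recycled arena are no longer referenced.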

@mtone

mtone commented May 3, 2026

Cloned just now (commit 14a95f6 & previous commit 63d93d1). I'm no longer able to load this model on 2x RTX Pro 6000.

      -hf unsloth/Qwen3.5-397B-A17B-GGUF:Q3_K_XL
      --port ${PORT} --jinja
      --temp 1.0 --top-p 0.95 --top-k 20 --min_p 0 --presence-penalty 1.0 --repeat-penalty 1.0
      --split-mode tensor
     # Various combinations of the following:
      --no-mmproj
      --ctx-size 64000 # Usually 256000
      # --parallel 1
      # -b 2048 -ub 2048

In all cases I get ggml_new_object: not enough space in the context's memory pool (needed 44528, available 44160).

Logs
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 194489 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97249 MiB
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97240 MiB
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9008-14a95f649
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Running without SSL
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/mtone/.cache/huggingface/hub/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/da33c16fa4440f831149fcf53b98a22bc07785e5/UD-Q3_K_XL/Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
common_fit_params: fitting params to free memory took 0.00 seconds
llama_model_load_from_file_impl: skipping CPU (AMD Ryzen 9 7950X3D 16-Core Processor) for tensor parallelism
llama_model_load_from_file_impl: creating a Meta device for tensor parallelism from 2 devices:
llama_model_load_from_file_impl: - device 0: CUDA0 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition)
llama_model_load_from_file_impl: - device 1: CUDA1 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition)
llama_model_load_from_file_impl: using device Meta() (Meta()) (unknown id) - 193361 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 1098 tensors from /home/mtone/.cache/huggingface/hub/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/da33c16fa4440f831149fcf53b98a22bc07785e5/UD-Q3_K_XL/Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5-397B-A17B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5-397B-A17B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 397B-A17B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.5 397B A17B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                      qwen35moe.block_count u32              = 60
llama_model_loader: - kv  18:                   qwen35moe.context_length u32              = 262144
llama_model_loader: - kv  19:                 qwen35moe.embedding_length u32              = 4096
llama_model_loader: - kv  20:             qwen35moe.attention.head_count u32              = 32
llama_model_loader: - kv  21:          qwen35moe.attention.head_count_kv u32              = 2
llama_model_loader: - kv  22:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  23:                   qwen35moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                     qwen35moe.expert_count u32              = 512
llama_model_loader: - kv  26:                qwen35moe.expert_used_count u32              = 10
llama_model_loader: - kv  27:             qwen35moe.attention.key_length u32              = 256
llama_model_loader: - kv  28:           qwen35moe.attention.value_length u32              = 256
llama_model_loader: - kv  29:       qwen35moe.expert_feed_forward_length u32              = 1024
llama_model_loader: - kv  30: qwen35moe.expert_shared_feed_forward_length u32              = 1024
llama_model_loader: - kv  31:                  qwen35moe.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  32:                   qwen35moe.ssm.state_size u32              = 128
llama_model_loader: - kv  33:                  qwen35moe.ssm.group_count u32              = 16
llama_model_loader: - kv  34:               qwen35moe.ssm.time_step_rank u32              = 64
llama_model_loader: - kv  35:                   qwen35moe.ssm.inner_size u32              = 8192
llama_model_loader: - kv  36:          qwen35moe.full_attention_interval u32              = 4
llama_model_loader: - kv  37:             qwen35moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  38:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  39:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  40:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  41:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  42:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  43:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  45:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  46:               general.quantization_version u32              = 2
llama_model_loader: - kv  47:                          general.file_type u32              = 12
llama_model_loader: - kv  48:                      quantize.imatrix.file str              = Qwen3.5-397B-A17B-GGUF/imatrix_unslot...
llama_model_loader: - kv  49:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.5-397B-A17B...
llama_model_loader: - kv  50:             quantize.imatrix.entries_count u32              = 765
llama_model_loader: - kv  51:              quantize.imatrix.chunks_count u32              = 80
llama_model_loader: - kv  52:                                   split.no u16              = 0
llama_model_loader: - kv  53:                        split.tensors.count i32              = 1098
llama_model_loader: - kv  54:                                split.count u16              = 5
llama_model_loader: - type  f32:  451 tensors
llama_model_loader: - type q8_0:  459 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_xxs:  118 tensors
llama_model_loader: - type iq4_xs:   61 tensors
llama_model_loader: - type bf16:    7 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 166.45 GiB (3.61 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35moe
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 60
print_info: n_head                = 32
print_info: n_head_kv             = 2
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 16
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 0
print_info: n_expert              = 512
print_info: n_expert_used         = 10
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 8192
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 64
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 397B.A17B
print_info: model params          = 396.35 B
print_info: general.name          = Qwen3.5-397B-A17B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 61/61 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1030.62 MiB
load_tensors:       Meta() model buffer size = 85338.41 MiB
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_init_from_model: enabling flash_attn since it is required for SPLIT_MODE_TENSOR
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 64000
llama_context: n_ctx_seq     = 64000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (64000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     3.79 MiB
llama_kv_cache:     Meta() KV buffer size =   937.50 MiB
llama_kv_cache: size = 1875.00 MiB ( 64000 cells,  15 layers,  4/1 seqs), K (f16):  937.50 MiB, V (f16):  937.50 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_memory_recurrent:     Meta() RS buffer size =   372.66 MiB
llama_memory_recurrent: size =  745.31 MiB (     4 cells,  60 layers,  4 seqs), R (f32):   25.31 MiB, S (f32):  720.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:     Meta() compute buffer size =   501.00 MiB
sched_reserve:  CUDA_Host compute buffer size =   141.02 MiB
sched_reserve: graph nodes  = 5829
sched_reserve: graph splits = 2
sched_reserve: reserve took 72.54 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/mtone/bin/llama.cpp/ggml/src/ggml.c:1766: GGML_ASSERT(obj_new) failed
ggml_new_object: not enough space in the context's memory pool (needed 44528, available 44160)
[New LWP 365072]
[New LWP 365071]
[New LWP 365070]
[New LWP 365069]
[New LWP 365068]
[New LWP 364863]
[New LWP 364862]
[New LWP 364861]
[New LWP 364860]
[New LWP 364859]
[New LWP 364858]
[New LWP 364857]
[New LWP 364856]
[New LWP 364855]
[New LWP 364854]
[New LWP 364853]
[New LWP 364852]
[New LWP 364851]
[New LWP 364850]
[New LWP 364849]
[New LWP 364848]
[New LWP 364847]
[New LWP 364846]
[New LWP 364845]
[New LWP 364844]
[New LWP 364843]
[New LWP 364842]
[New LWP 364841]
[New LWP 364840]
[New LWP 364839]
[New LWP 364838]
[New LWP 364837]
[New LWP 364836]
[New LWP 364835]
[New LWP 364834]
[New LWP 364833]
[New LWP 364832]
[New LWP 364831]
[New LWP 364830]
[New LWP 364829]
[New LWP 364828]
[New LWP 364827]
[New LWP 364820]
[New LWP 364812]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000713f9bb10813 in __GI___wait4 (pid=365073, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x0000713f9bb10813 in __GI___wait4 (pid=365073, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000713f9ca86563 in ggml_print_backtrace () from /home/mtone/bin/llama.cpp/build/bin/libggml-base.so.0
#2  0x0000713f9ca8670b in ggml_abort () from /home/mtone/bin/llama.cpp/build/bin/libggml-base.so.0
#3  0x0000713f9ca88ab8 in ggml_new_tensor () from /home/mtone/bin/llama.cpp/build/bin/libggml-base.so.0
#4  0x0000713f9caab7dc in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) () from /home/mtone/bin/llama.cpp/build/bin/libggml-base.so.0
#5  0x0000713f9ca9c27b in ggml_gallocr_alloc_graph () from /home/mtone/bin/llama.cpp/build/bin/libggml-base.so.0
#6  0x0000713f9caa27b1 in ggml_backend_sched_alloc_graph () from /home/mtone/bin/llama.cpp/build/bin/libggml-base.so.0
#7  0x0000713f9c2b5747 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/mtone/bin/llama.cpp/build/bin/libllama.so.0
#8  0x0000713f9c2bcc10 in llama_context::decode(llama_batch const&) () from /home/mtone/bin/llama.cpp/build/bin/libllama.so.0
#9  0x0000713f9c2be6df in llama_decode () from /home/mtone/bin/llama.cpp/build/bin/libllama.so.0
#10 0x0000713f9c7fe13f in common_init_from_params(common_params&) () from /home/mtone/bin/llama.cpp/build/bin/libllama-common.so.0
#11 0x0000585388c79258 in server_context_impl::load_model(common_params&) ()
#12 0x0000585388bd342a in main ()
[Inferior 1 (process 364811) detached]

@JohannesGaessler
Contributor Author

@mtone does it work now?

@mtone

mtone commented May 3, 2026

> @mtone does it work now?

Yes! Performance is noticeably slower than on master, however (correctness trumps speed, of course).

Some benchmarks

./llama-bench -hf unsloth/Qwen3.5-397B-A17B-GGUF:Q3_K_XL -ngl 99 -fa 1 -sm tensor -n 512 -p 2048 -ub 512,2048 [-pg 64000,512]

master

| model | size | params | backend | ngl | n_ubatch | sm | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp2048 | 2034.22 ± 6.71 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | tg512 | 92.49 ± 3.87 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp2048 | 2960.71 ± 4.03 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | tg512 | 91.02 ± 0.38 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp64000+tg512 | CRASH |

PR

| model | size | params | backend | ngl | n_ubatch | sm | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp2048 | 424.82 ± 2.50 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | tg512 | 63.21 ± 5.25 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp64000+tg512 | 435.50 ± 1.36 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp2048 | 2965.03 ± 3.23 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | tg512 | 62.90 ± 5.69 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp64000+tg512 | 1206.01 ± 10.79 |

@gordan-bobic

It will take some time to see if the crash still occurs with this patch, but what is immediately obvious is that this massively slows things down on my setup (2x 22GB 2080Ti with NVLink)
prefill: ~550 tps -> 176 tps
decode: 35 tps -> 24 tps

This makes it slower than --split-mode layer, thus negating any benefit from using --split-mode tensor in the first place.

@jimneumann

Also seeing a major slowdown, including random 1-2 second pauses in token generation every 5-10 seconds, which also seems to be happening during prefill based on nvtop.

Dense prefill is impacted much more and the slowdown worsens with more GPUs, especially with NVLink.
With MoE the slowdown is worst with 2 GPUs and improves with more GPUs. PCIe and NVLink slowdowns are similar with MoE.

Qwen3.6 27B, NVLink:
2x V100 prefill: 1050 tps -> 359 tps (34%)
4x V100 prefill: 1505 tps -> 385 tps (26%)
2x V100 decode: 32 tps -> 28 tps (88%)
4x V100 decode: 45.1 tps -> 38.3 tps (85%)

Qwen3.6 35B, NVLink:
2x V100 prefill: 447 tps -> 254 tps (57%)
4x V100 prefill: 194 tps -> 147 tps (76%)
2x V100 decode: 94 tps -> 71 tps (76%)
4x V100 decode: 62 tps -> 51 tps (82%)

@JohannesGaessler
Contributor Author

It seems I underestimated the cost of rebuilding the split state cache; on my system I am also seeing a slowdown, but it was less severe, so I didn't notice it by just running test prompts on the server. I disabled the clearing of the split state cache again. In theory this will result in an unbounded amount of host memory being consumed over time as more and more views of statically allocated tensors are created, which needs to be addressed long-term. Does this restore the original performance?

@jimneumann

Yes, original performance is restored.

With -cram 0 --ctx-checkpoints 0 and llama-eval running for the past 30 minutes, I'm seeing around 9-21 MB more host memory used on 5 different llama-server instances vs. shortly after starting the runs, but I'll leave it running to get a better idea after several hours.

@gordan-bobic

Can split cache clearing be done after the response generation completes?

@gordan-bobic

gordan-bobic commented May 4, 2026

Both prefill and token generation are still 10-15% slower with the latest patch. :-( But it is better than it was with the original patch.

@JohannesGaessler
Contributor Author

> Can split cache clearing be done after the response generation completes?

Not without extra complexity because it would require the llama.cpp user code to explicitly call an API.

@JohannesGaessler
Contributor Author

@gordan-bobic please post the exact hardware and model you're using as well as a llama-bench command that reproduces the performance difference between 63d93d1 and 57266f2 .

@mtone

mtone commented May 4, 2026

Resolved for me - TG is even a bit faster in PR with commit 57266f2!

Updated benchmarks

2x RTX Pro 6000
./llama-bench -hf unsloth/Qwen3.5-397B-A17B-GGUF:Q3_K_XL -ngl 99 -fa 1 -sm tensor -n 512 -p 2048 -ub 512,2048 [-pg 64000,512]

master

| model | size | params | backend | ngl | n_ubatch | sm | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp2048 | 2037.34 ± 6.19 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | tg512 | 93.21 ± 4.24 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp2048 | 2960.85 ± 5.14 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | tg512 | 91.12 ± 0.39 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp64000+tg512 | CRASH |

PR

| model | size | params | backend | ngl | n_ubatch | sm | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp2048 | 2036.77 ± 7.43 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | tg512 | 99.92 ± 0.10 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 512 | tensor | 1 | pp64000+tg512 | 1681.21 ± 7.15 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp2048 | 2961.19 ± 2.99 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | tg512 | 100.34 ± 0.23 |
| qwen35moe 397B.A17B Q3_K - Medium | 166.45 GiB | 396.35 B | CUDA | 99 | 2048 | tensor | 1 | pp64000+tg512 | 2193.37 ± 0.90 |

My test repro also completed successfully: "Use subagents to summarize each package in this repo" in Cline, with 500k context divided in -parallel 4 -cache-ram -1. It slows down to a crawl (~12 tps, probably unrelated to this PR) but it works; on master this crashed quickly.

@gordan-bobic

I misspoke, I re-tested and it looks like throughput is back where it used to be with the latest patch. My bad.

@jimneumann

As a sanity check I retested 63d93d1, 57266f2, and current master and all had similar/expected performance.

Host memory usage of llama-server rose by between 30 MB (27B dense) and 65 MB (35B MoE) after running AIME2025 for 9 hours straight.

@spacemonkeydelivers

Hi @JohannesGaessler,

First of all, thank you for your work.
I've tried your fixes with Qwen 3.6 and -sm tensor and it worked.
The llama.cpp revision is 232f466.
However, with the recent MTP feature enabled and your patches, I almost immediately get another assert in ggml:

/local_disk/llm/llama.cpp/ggml/src/ggml.c:1766: GGML_ASSERT(obj_new) failed
0.25.462.361 W ggml_new_object: not enough space in the context's memory pool (needed 383088, available 382720)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f72832a69ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007f72832a69ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f728329b668 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f728329b6ad in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f72833067c7 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x000055af00ca9b6b in ggml_print_backtrace ()
#5  0x000055af00ca9cb8 in ggml_abort ()
#6  0x000055af00cac8eb in ggml_new_tensor ()
#7  0x000055af00cd57dd in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) [clone .isra.0] ()
#8  0x000055af00cd62ad in ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) ()
#9  0x000055af00cc1b83 in ggml_gallocr_alloc_graph ()
#10 0x000055af00cc94d7 in ggml_backend_sched_alloc_graph ()
#11 0x000055aefff05947 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
#12 0x000055aefff0b975 in llama_context::decode(llama_batch const&) ()
#13 0x000055aefff0d53b in llama_decode ()
#14 0x000055aeffb6e245 in server_context_impl::update_slots() ()
#15 0x000055aeffc03021 in server_queue::start_loop(long) ()
#16 0x000055aeffa5c75b in main ()

The config is:

  -m "${MODEL_PATH}" \
  -ngl 999 \
  -sm tensor \
  -c "${CTX}" \
  -fa on \
  --cache-type-k f16 \
  --cache-type-v f16 \
  -t 8 -tb 16 \
  -b 4096 -ub 1536 \
  --parallel 1 \
  --cache-ram 0 \
  --jinja \
  --no-mmap \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --presence-penalty 1.0 \
  --no-context-shift \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --metrics \
  --host 0.0.0.0 --port 8080
Full log
==> Qwen3.6-27B Q6_K_XL + MTP — (60000 context)
==> -sm tensor | -b 4096 -ub 1536 | -spec-type draft-mtp -spec-draft-n-max 2

0.00.018.013 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.018.015 I device_info:
0.00.056.896 I   - CUDA0   : NVIDIA GeForce RTX 3090 (24252 MiB, 23991 MiB free)
0.00.098.541 I   - CUDA1   : NVIDIA GeForce RTX 3090 (24252 MiB, 23991 MiB free)
0.00.098.548 I   - CPU     : AMD Ryzen 9 9950X 16-Core Processor (190848 MiB, 190848 MiB free)
0.00.098.594 I system_info: n_threads = 8 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.098.631 I srv          init: running without SSL
0.00.098.643 I srv          init: using 31 threads for HTTP server
0.00.098.687 I srv         start: binding port with default address family
0.00.099.833 I srv          main: loading model
0.00.099.836 I srv    load_model: loading model '/local_disk/llm/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q6_K_XL.gguf'
0.00.099.852 I common_init_result: fitting params to device memory ...
0.00.099.853 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.099.904 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.06.012.623 W llama_context: n_ctx_seq (60160) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.06.201.608 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.06.455.722 I srv    load_model: creating MTP draft context against the target model '/local_disk/llm/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q6_K_XL.gguf'
0.06.455.741 W llama_context: n_ctx_seq (60160) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.06.531.877 I srv    load_model: initializing slots, n_slots = 1
0.06.560.368 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.06.567.516 I common_speculative_init: adding speculative implementation 'draft-mtp'
0.06.567.609 I srv    load_model: speculative decoding context initialized
0.06.567.609 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 60160
0.06.567.675 I srv    load_model: prompt cache is disabled - use `--cache-ram N` to enable it
0.06.567.675 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.06.567.690 W srv          init: --cache-idle-slots requires --kv-unified, disabling
0.06.575.302 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.06.580.962 I srv          init: init: chat template, thinking = 1
0.06.580.968 I srv          main: model loaded
0.06.580.971 I srv          main: server is listening on http://0.0.0.0:8080
0.06.580.973 I srv  update_slots: all slots are idle
0.07.004.187 I srv  params_from_: Chat format: peg-native
0.07.004.291 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.07.004.354 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.12.008.581 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   8192, progress = 0.44, t =   5.00 s / 1637.02 tokens per second
0.12.008.658 I slot update_slots: id  0 | task 0 | 8192 tokens since last checkpoint at 0, creating new checkpoint during processing at position 12288
0.12.057.040 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 8191, pos_max = 8191, n_tokens = 8192, size = 181.782 MiB)
0.14.629.794 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  12288, progress = 0.66, t =   7.63 s / 1611.45 tokens per second
0.17.283.933 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  16384, progress = 0.88, t =  10.28 s / 1593.84 tokens per second
0.17.283.978 I slot update_slots: id  0 | task 0 | 8192 tokens since last checkpoint at 8192, creating new checkpoint during processing at position 17139
0.17.321.121 I slot create_check: id  0 | task 0 | created context checkpoint 2 of 32 (pos_min = 16383, pos_max = 16383, n_tokens = 16384, size = 213.939 MiB)
0.17.831.892 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  17139, progress = 0.92, t =  10.83 s / 1582.91 tokens per second
0.17.869.720 I slot create_check: id  0 | task 0 | created context checkpoint 3 of 32 (pos_min = 17138, pos_max = 17138, n_tokens = 17139, size = 216.902 MiB)
0.18.867.372 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  18675, progress = 1.00, t =  11.86 s / 1574.22 tokens per second
0.18.909.141 I slot create_check: id  0 | task 0 | created context checkpoint 4 of 32 (pos_min = 18674, pos_max = 18674, n_tokens = 18675, size = 222.932 MiB)
/local_disk/llm/llama.cpp/ggml/src/ggml.c:1766: GGML_ASSERT(obj_new) failed
0.19.001.320 W ggml_new_object: not enough space in the context's memory pool (needed 383088, available 382720)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fc1d10a69ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007fc1d10a69ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fc1d109b668 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fc1d109b6ad in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fc1d11067c7 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x000055b261621b6b in ggml_print_backtrace ()
#5  0x000055b261621cb8 in ggml_abort ()
#6  0x000055b2616248eb in ggml_new_tensor ()
#7  0x000055b26164d7dd in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) [clone .isra.0] ()
#8  0x000055b26164e2ad in ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) ()
#9  0x000055b261639b83 in ggml_gallocr_alloc_graph ()
#10 0x000055b2616414d7 in ggml_backend_sched_alloc_graph ()
#11 0x000055b26087d947 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
#12 0x000055b260883975 in llama_context::decode(llama_batch const&) ()
#13 0x000055b26088553b in llama_decode ()
#14 0x000055b2604e6245 in server_context_impl::update_slots() ()
#15 0x000055b26057b021 in server_queue::start_loop(long) ()
#16 0x000055b2603d475b in main ()
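
The two numbers in the warning above can be read as plain arithmetic. The following is a minimal sketch, assuming that each tensor created in this context costs a fixed 368 bytes of bookkeeping. That figure is inferred from `needed - available` in the log and is not taken from the ggml source, and the helper names are hypothetical. Under that assumption the 382720-byte pool holds exactly 1040 tensors, and the failing request is for the 1041st.

```cpp
#include <cstddef>

// Assumption: fixed per-tensor bookkeeping cost, inferred from the log as
// needed - available = 383088 - 382720. Not read from the ggml source.
constexpr std::size_t kAssumedTensorCost = 368;

// Number of whole tensors that fit into a pool of pool_bytes (hypothetical helper).
constexpr std::size_t tensors_that_fit(std::size_t pool_bytes, std::size_t cost) {
    return pool_bytes / cost;
}

// Bytes by which a request overshoots the pool, or 0 if it fits (hypothetical helper).
constexpr std::size_t shortfall(std::size_t needed, std::size_t available) {
    return needed > available ? needed - available : 0;
}

// The request overshoots the pool by exactly one assumed tensor.
static_assert(shortfall(383088, 382720) == kAssumedTensorCost,
              "overshoot equals one assumed tensor cost");

// The pool size is an exact multiple of the assumed cost: 1040 tensors, no remainder.
static_assert(382720 % kAssumedTensorCost == 0, "pool is a whole number of tensors");
static_assert(tensors_that_fit(382720, kAssumedTensorCost) == 1040,
              "pool holds 1040 tensors");
```

If the assumption holds, the pool was completely full when one more tensor was requested, which fits a fixed-size context receiving more tensors than it was sized for.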

@vektorprime

vektorprime commented May 20, 2026

I just tested the fix in this PR and I'm getting similar results to @spacemonkeydelivers

My command:

MTMD_BACKEND_DEVICE=CUDA2 /home/user/llm/llama.cpp/build/bin/llama-server  -m /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-MTP-BIGBOY-V2.gguf  \
 --port 8000 --host 0.0.0.0 --webui-mcp-proxy -a Qwen3.6-27B  \
 --no-mmap --threads 8 --jinja --cache-ram 65536 -ctxcp 64  -c 200000 \
 --chat-template-kwargs "{\"preserve_thinking\":true}"   --cache-type-k bf16 --cache-type-v bf16 --flash-attn on -kvu  -ngl 99 -np 1 -sm tensor  \
 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
 --spec-type draft-mtp --spec-draft-n-max 4 -ub 1024 -ts 39,39,24 \
 -mm  /home/user/llm/models/Qwen3.6-27B/mmproj-BF16.gguf \
 -lv 3
/home/user/llm/llama.cpp/ggml/src/ggml.c:1766: GGML_ASSERT(obj_new) failed
0.39.407.448 W ggml_new_object: not enough space in the context's memory pool (needed 383088, available 382720)
[New LWP 167255]
[New LWP 167254]
[New LWP 167253]
[New LWP 167252]
[New LWP 167251]
[New LWP 167250]
[New LWP 167249]
[New LWP 167102]
[New LWP 167101]
[New LWP 167100]
[New LWP 167099]
[New LWP 167098]
[New LWP 167097]
[New LWP 167096]
[New LWP 167095]
[New LWP 167094]
[New LWP 167093]
[New LWP 167092]
[New LWP 167091]
[New LWP 167090]
[New LWP 167089]
[New LWP 167088]
[New LWP 167087]
[New LWP 167083]
[New LWP 167082]
[New LWP 167081]
[New LWP 167080]
[New LWP 167079]
[New LWP 167074]
[New LWP 167073]
[New LWP 167063]
[New LWP 167062]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000070d757710813 in __GI___wait4 (pid=167264, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x000070d757710813 in __GI___wait4 (pid=167264, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000070d75823a5c3 in ggml_print_backtrace () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#2  0x000070d75823a76b in ggml_abort () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#3  0x000070d75823cb18 in ggml_new_tensor () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#4  0x000070d75825fb5c in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#5  0x000070d75825029b in ggml_gallocr_alloc_graph () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#6  0x000070d7582567d1 in ggml_backend_sched_alloc_graph () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#7  0x000070d757ed9a87 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/user/llm/llama.cpp/build/bin/libllama.so.0
#8  0x000070d757ee1361 in llama_context::decode(llama_batch const&) () from /home/user/llm/llama.cpp/build/bin/libllama.so.0
#9  0x000070d757ee2fbf in llama_decode () from /home/user/llm/llama.cpp/build/bin/libllama.so.0
#10 0x000070d758600d4c in common_init_from_params(common_params&, bool) () from /home/user/llm/llama.cpp/build/bin/libllama-common.so.0
#11 0x00005df43244ba74 in server_context_impl::load_model(common_params&) ()
#12 0x00005df43239fd1e in llama_server(int, char**) ()
#13 0x000070d75762a1ca in __libc_start_call_main (main=main@entry=0x5df43239a0b0 <main>, argc=argc@entry=59, argv=argv@entry=0x7ffdbe496c98) at ../sysdeps/nptl/libc_start_call_main.h:58
warning: 58     ../sysdeps/nptl/libc_start_call_main.h: No such file or directory
#14 0x000070d75762a28b in __libc_start_main_impl (main=0x5df43239a0b0 <main>, argc=59, argv=0x7ffdbe496c98, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdbe496c88) at ../csu/libc-start.c:360
warning: 360    ../csu/libc-start.c: No such file or directory
#15 0x00005df43239a205 in _start ()
[Inferior 1 (process 167060) detached]

Aborted (core dumped)

@vektorprime

WARNING AI SLOP 3000 INCOMING:

Seemingly, Codex 5.5 was able to resolve this. It built upon the original PR. It's working for me well enough that I can wait for the official non-ai slop fix.

@@
-#include <atomic>
 #include <cassert>

@@
-    constexpr size_t compute_headroom = 8;
+    constexpr size_t compute_headroom      = 8;
+    constexpr size_t compute_extra_tensors = 1024;
+    const size_t compute_mem_size =
+        compute_headroom*ggml_get_mem_size(ctx) + compute_extra_tensors*ggml_tensor_overhead();

@@
     const ggml_init_params params_compute = {
-        /*.mem_size   =*/ compute_headroom*ggml_get_mem_size(ctx),
+        /*.mem_size   =*/ compute_mem_size,
         /*.mem_buffer =*/ nullptr,
         /*.no_alloc   =*/ true,
     };
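
The effect of the patch above on the pool size can be sketched numerically. This is an illustration only: the two constants are copied from the diff, but the inputs are hypothetical stand-ins for `ggml_get_mem_size(ctx)` and `ggml_tensor_overhead()`. They are chosen so that 8 times the base size equals the 382720-byte pool from the crash log, and whether that is where the logged number really comes from is an assumption.

```cpp
#include <cstddef>

// Constants as they appear in the diff.
constexpr std::size_t compute_headroom      = 8;
constexpr std::size_t compute_extra_tensors = 1024;

// Same arithmetic as the patched compute_mem_size, with the two ggml calls
// replaced by plain parameters so the sketch compiles without ggml.
constexpr std::size_t compute_mem_size(std::size_t base_mem_size,
                                       std::size_t tensor_overhead) {
    return compute_headroom*base_mem_size + compute_extra_tensors*tensor_overhead;
}

// Hypothetical inputs (assumptions, not values read from ggml):
constexpr std::size_t example_base_mem_size   = 47840; // stand-in for ggml_get_mem_size(ctx)
constexpr std::size_t example_tensor_overhead = 368;   // stand-in for ggml_tensor_overhead()

// Before the patch: headroom only.
static_assert(compute_headroom*example_base_mem_size == 382720,
              "old pool size under the assumed inputs");

// After the patch: headroom plus room for 1024 extra tensors.
static_assert(compute_mem_size(example_base_mem_size, example_tensor_overhead) == 759552,
              "new pool size under the assumed inputs");
```

With these inputs the pool grows by 1024 times the assumed overhead, 376832 bytes. This is a fixed margin rather than an exact count, so it could still be exceeded by a sufficiently large graph.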

@ggerganov
Member

@JohannesGaessler I see there are a couple of reports with the latest version of the PR. Will take a look after these are resolved/dismissed. Please ping me again when ready.

@gopinath87607

gopinath87607 commented May 21, 2026

I just tested the fix in this PR and I'm getting similar results to @spacemonkeydelivers

My command:

MTMD_BACKEND_DEVICE=CUDA2 /home/user/llm/llama.cpp/build/bin/llama-server  -m /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-MTP-BIGBOY-V2.gguf  \
 --port 8000 --host 0.0.0.0 --webui-mcp-proxy -a Qwen3.6-27B  \
 --no-mmap --threads 8 --jinja --cache-ram 65536 -ctxcp 64  -c 200000 \
 --chat-template-kwargs "{\"preserve_thinking\":true}"   --cache-type-k bf16 --cache-type-v bf16 --flash-attn on -kvu  -ngl 99 -np 1 -sm tensor  \
 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
 --spec-type draft-mtp --spec-draft-n-max 4 -ub 1024 -ts 39,39,24 \
 -mm  /home/user/llm/models/Qwen3.6-27B/mmproj-BF16.gguf \
 -lv 3
/home/user/llm/llama.cpp/ggml/src/ggml.c:1766: GGML_ASSERT(obj_new) failed
0.39.407.448 W ggml_new_object: not enough space in the context's memory pool (needed 383088, available 382720)
[New LWP 167255]
[New LWP 167254]
[New LWP 167253]
[New LWP 167252]
[New LWP 167251]
[New LWP 167250]
[New LWP 167249]
[New LWP 167102]
[New LWP 167101]
[New LWP 167100]
[New LWP 167099]
[New LWP 167098]
[New LWP 167097]
[New LWP 167096]
[New LWP 167095]
[New LWP 167094]
[New LWP 167093]
[New LWP 167092]
[New LWP 167091]
[New LWP 167090]
[New LWP 167089]
[New LWP 167088]
[New LWP 167087]
[New LWP 167083]
[New LWP 167082]
[New LWP 167081]
[New LWP 167080]
[New LWP 167079]
[New LWP 167074]
[New LWP 167073]
[New LWP 167063]
[New LWP 167062]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000070d757710813 in __GI___wait4 (pid=167264, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x000070d757710813 in __GI___wait4 (pid=167264, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000070d75823a5c3 in ggml_print_backtrace () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#2  0x000070d75823a76b in ggml_abort () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#3  0x000070d75823cb18 in ggml_new_tensor () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#4  0x000070d75825fb5c in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#5  0x000070d75825029b in ggml_gallocr_alloc_graph () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#6  0x000070d7582567d1 in ggml_backend_sched_alloc_graph () from /home/user/llm/llama.cpp/build/bin/libggml-base.so.0
#7  0x000070d757ed9a87 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/user/llm/llama.cpp/build/bin/libllama.so.0
#8  0x000070d757ee1361 in llama_context::decode(llama_batch const&) () from /home/user/llm/llama.cpp/build/bin/libllama.so.0
#9  0x000070d757ee2fbf in llama_decode () from /home/user/llm/llama.cpp/build/bin/libllama.so.0
#10 0x000070d758600d4c in common_init_from_params(common_params&, bool) () from /home/user/llm/llama.cpp/build/bin/libllama-common.so.0
#11 0x00005df43244ba74 in server_context_impl::load_model(common_params&) ()
#12 0x00005df43239fd1e in llama_server(int, char**) ()
#13 0x000070d75762a1ca in __libc_start_call_main (main=main@entry=0x5df43239a0b0 <main>, argc=argc@entry=59, argv=argv@entry=0x7ffdbe496c98) at ../sysdeps/nptl/libc_start_call_main.h:58
warning: 58     ../sysdeps/nptl/libc_start_call_main.h: No such file or directory
#14 0x000070d75762a28b in __libc_start_main_impl (main=0x5df43239a0b0 <main>, argc=59, argv=0x7ffdbe496c98, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdbe496c88) at ../csu/libc-start.c:360
warning: 360    ../csu/libc-start.c: No such file or directory
#15 0x00005df43239a205 in _start ()
[Inferior 1 (process 167060) detached]

Aborted (core dumped)

I am also getting the same error.

Btw, -sm layer has no issues, but it's dead slow compared to tensor.

Build command:

cd ~/llama.cpp
rm -rf build

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_VULKAN=ON \
  -DGGML_RPC=ON \
  -DGGML_NATIVE=ON \
  -DGGML_AVX=ON \
  -DGGML_AVX2=ON \
  -DGGML_AVX512=ON \
  -DGGML_AVX512_VBMI=ON \
  -DGGML_AVX512_VNNI=ON \
  -DGGML_AVX512_BF16=ON \
  -DGGML_AVX_VNNI=ON \
  -DGGML_FMA=ON \
  -DGGML_F16C=ON \
  -DGGML_BMI2=ON \
  -DGGML_OPENMP=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_NCCL=ON \
  -DGGML_LLAMAFILE=ON \
  -DGGML_CPU_REPACK=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DGGML_SCHED_MAX_COPIES=8 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="86;120" \
  -DCMAKE_C_COMPILER=gcc-13 \
  -DCMAKE_CXX_COMPILER=g++-13 \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-13 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.2/bin/nvcc

cmake --build build --config Release -j$(nproc)

@JohannesGaessler
Contributor Author

@ggerganov from my end I would now be ready for review.

@jptissot

@JohannesGaessler I tested your latest changes with my configuration and I no longer see the exception. Thank you!

@ggerganov ggerganov self-assigned this May 25, 2026
@JohannesGaessler
Contributor Author

@ggerganov can you press the merge button please, I can't do it with only one approving review.

@ggerganov ggerganov merged commit ae251b5 into ggml-org:master May 25, 2026
50 checks passed
@wx33398-ctrl

I tried it and it works, thanks a lot.

@biturbo9

Confirming the fix works on a 2× RTX 5090 rig running b9320.

Config: Qwen3.6-27B (UD-Q8_K_XL + MTP self-spec, --spec-type draft-mtp --spec-draft-n-max 4), 262K ctx,
--split-mode tensor --tensor-split 0.50,0.50, --cache-ram 2048, f16 KV.

Soak result (8 scenarios × 5 turns, ~70K-token doc per scenario, ctx 262K):

| Split mode | Pre-fix | b9320 |
| --- | --- | --- |
| tensor | crash @ scen 6–7 (ggml.c:1766) | 40/40 PASS |
| layer | silent process death @ scen 3–5 | 40/40 PASS |

Perf: ~150 t/s sustained tg @ 4096-token decode (tg_medium 153, tg_short 147, tg_code 138). No regression vs. b9095.

Huge thanks @JohannesGaessler🙏

@rbestuar

rbestuar commented May 26, 2026

Looking good so far on 2x 5060 Ti at 130 t/s. I'll see if it crashes overnight with Frigate GenAI also hitting the server. Thanks for all your hard work!

Edit: just about 12 hours straight, not a single crash.

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* TP: fix ggml context size calculation, memory leak

* move split state cache back into the context

* revert to constant ggml context size for cgraphs

* increase headroom for statically allocated tensors

* remove obsolete include
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
* TP: fix ggml context size calculation, memory leak

* move split state cache back into the context

* revert to constant ggml context size for cgraphs

* increase headroom for statically allocated tensors

* remove obsolete include

Development

Successfully merging this pull request may close these issues.

Misc. bug: Meta backend ggml_context pool exhaustion with --split-mode tensor