graph : utilize ggml_build_forward_select() to avoid reallocations #18898
Merged
ngxson reviewed on Jan 18, 2026
src/llama-graph.cpp (outdated)

```cpp
inp->embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, ubatch.n_tokens);
ggml_set_input(inp->embd);

if (hparams.n_deepstack_layers > 0) {
```
Contributor
I think the more generic condition here is `n_embd_inp != n_embd`
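To illustrate the suggestion, here is a standalone sketch (not llama.cpp code) of the two guards. It assumes, as a labeled assumption, that for deepstack models the input embedding width is the base width times `(n_deepstack_layers + 1)`; the generic check also covers any other model whose input width differs from the base width.

```cpp
#include <cassert>
#include <cstdint>

// Standalone sketch of the two guards discussed above (not llama.cpp code).
// Assumption: for deepstack models the input embedding width is the base
// width times (n_deepstack_layers + 1); other models may widen the input
// for different reasons, which is why the generic check is preferred.
struct hparams_sketch {
    int64_t n_embd;             // base embedding width
    int64_t n_deepstack_layers; // extra stacked feature layers (deepstack)

    int64_t n_embd_inp() const {
        return n_embd * (n_deepstack_layers + 1);
    }
};

// Specific guard: only fires for deepstack models.
bool needs_wide_input_deepstack(const hparams_sketch & hp) {
    return hp.n_deepstack_layers > 0;
}

// Generic guard: fires whenever the input width differs from the base width.
bool needs_wide_input_generic(const hparams_sketch & hp) {
    return hp.n_embd_inp() != hp.n_embd;
}
```

Both guards agree for deepstack models, but only the generic one would also catch a non-deepstack model with a widened input.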
src/llama-graph.cpp (outdated), lines +1331 to +1332

```cpp
cur = ggml_view_2d(ctx0, cur, hparams.n_embd, n_tokens, cur->nb[1], 0);
cur = ggml_cont (ctx0, cur); // makes the shape of this node the same as the ubatch.token path
```
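For readers unfamiliar with these ops: the `ggml_view_2d` + `ggml_cont` pair keeps only the first `n_embd` values of each `n_embd_inp`-wide row and materializes the result contiguously, giving the node the same shape as the token path. A plain-C++ sketch of that shape arithmetic (not the actual ggml kernels):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-C++ sketch of what the ggml_view_2d + ggml_cont pair above does:
// from a row-major [n_tokens x n_embd_inp] buffer, keep only the first
// n_embd values of every row and copy them into a contiguous
// [n_tokens x n_embd] buffer (the shape the token path produces).
std::vector<float> trim_rows(const std::vector<float> & src,
                             size_t n_tokens, size_t n_embd_inp, size_t n_embd) {
    assert(n_embd <= n_embd_inp);
    assert(src.size() == n_tokens * n_embd_inp);

    std::vector<float> dst(n_tokens * n_embd);
    for (size_t t = 0; t < n_tokens; ++t) {
        for (size_t i = 0; i < n_embd; ++i) {
            // the row stride of the view (cur->nb[1] in ggml terms) stays
            // n_embd_inp; ggml_cont then copies into the tighter layout
            dst[t * n_embd + i] = src[t * n_embd_inp + i];
        }
    }
    return dst;
}
```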
Contributor
Instead of resizing the input embeddings' row size down to `n_embd`, I'm wondering if we can/should do the reverse: pad the input token embeddings up to `n_embd_inp` using `ggml_pad`
Contributor
Note that doing it this way would allow us to remove the `ggml_build_forward_select` in the model cgraph, since the `ggml_add(ctx0, cur, ds)` path would always be taken. However, I'm a bit worried that `ggml_pad` could have a negative impact on performance
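The suggested reverse approach can be sketched in plain C++ as zero-padding each token-embedding row from `n_embd` up to `n_embd_inp`, so both input types share the wider shape and the `ggml_add(ctx0, cur, ds)` path always type-checks. How this would be wired up with `ggml_pad` in the real graph is an assumption; this only models the data movement:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-C++ sketch of the suggested alternative: instead of trimming the
// embedding input down to n_embd, zero-pad the token embeddings up to
// n_embd_inp so both input types share the wider shape. In ggml this
// would roughly be a ggml_pad along dim 0 (an assumption about wiring).
std::vector<float> pad_rows(const std::vector<float> & src,
                            size_t n_tokens, size_t n_embd, size_t n_embd_inp) {
    assert(n_embd_inp >= n_embd);
    assert(src.size() == n_tokens * n_embd);

    std::vector<float> dst(n_tokens * n_embd_inp, 0.0f); // padded region stays zero
    for (size_t t = 0; t < n_tokens; ++t) {
        for (size_t i = 0; i < n_embd; ++i) {
            dst[t * n_embd_inp + i] = src[t * n_embd + i];
        }
    }
    return dst;
}
```

The performance concern in the comment is that this copy (and the zeros it adds) happens on every batch, whereas the trim-down approach only pays when embeddings are the input.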
Force-pushed from 9a066d0 to 3daf8e3 (compare)
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request on Jan 24, 2026:

…gml-org#18898)
* graph : avoid branches between embedding and token inputs
* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology
* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI
* cont : pad token embeddings to n_embd_inp
Anhelor pushed a commit with the same message to Anhelor/llama.cpp that referenced this pull request on Jan 24, 2026.
shaofeiqi pushed a commit with the same message to qualcomm/llama.cpp that referenced this pull request on Feb 6, 2026.
target #18550
cont #17617

Extracted the usage of the new ggml_build_forward_select() from #18550 into a separate PR in order to more clearly demonstrate how it can be applied to avoid graph reallocations. Here we utilize it to avoid reallocations when switching between different types of inputs (tokens or embeddings) for most models.

Also enable the server CI to report errors if unexpected reallocations occur during the server tests.
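The motivation can be sketched conceptually: if the compute graph sometimes contains the token-input branch and sometimes the embedding-input branch, its node list changes between batches and the backend scheduler must reallocate; always building both branches and merely selecting one keeps the topology constant. The sketch below models only this idea in standalone C++; it does not use the real `ggml_build_forward_select()` API (which is defined in #18550):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Conceptual sketch of why a constant-topology graph avoids reallocations.
// This does NOT use the real ggml_build_forward_select() API; it only
// models the idea: always build both input branches, then mark one as
// selected, so the node list seen by the scheduler is identical for
// every batch.
struct node { std::string name; bool selected; };

std::vector<node> build_graph(bool use_token_input) {
    // both branches are always present -> constant topology
    return {
        { "inp_tokens", use_token_input  },
        { "inp_embd",   !use_token_input },
        { "result",     true             },
    };
}

// A scheduler only needs to reallocate when the node list changes shape.
bool needs_realloc(const std::vector<node> & prev, const std::vector<node> & next) {
    if (prev.size() != next.size()) return true;
    for (size_t i = 0; i < prev.size(); ++i) {
        if (prev[i].name != next[i].name) return true;
    }
    return false;
}
```

Switching between token and embedding inputs flips only the `selected` flags, never the node list, so `needs_realloc` stays false across batches.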