
graph : utilize ggml_build_forward_select() to avoid reallocations#18898

Merged
ggerganov merged 4 commits into master from gg/graph-use-select on Jan 23, 2026

Conversation


@ggerganov ggerganov commented Jan 17, 2026

target #18550
cont #17617

Extracted the usage of the new ggml_build_forward_select() from #18550 into a separate PR in order to more clearly demonstrate how it can be applied to avoid graph reallocations.

Here we utilize it to avoid reallocations when switching between different types of inputs (tokens or embeddings) for most models.

Also enable the server CI to report errors if unexpected reallocations occur during the server tests.

@ggerganov ggerganov requested a review from CISC as a code owner January 17, 2026 14:09
@ggerganov ggerganov changed the title from "graph : utlize ggml_build_forward_select() to avoid reallocations" to "graph : utliize ggml_build_forward_select() to avoid reallocations" Jan 17, 2026
@ggerganov ggerganov changed the title from "graph : utliize ggml_build_forward_select() to avoid reallocations" to "graph : utilize ggml_build_forward_select() to avoid reallocations" Jan 17, 2026
@github-actions github-actions bot added labels: model (Model specific), devops (improvements to build systems and github actions) Jan 17, 2026
inp->embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, ubatch.n_tokens);
ggml_set_input(inp->embd);

if (hparams.n_deepstack_layers > 0) {
Contributor

I think the more generic condition here is n_embd_inp != n_embd

Comment on lines +1331 to +1332
cur = ggml_view_2d(ctx0, cur, hparams.n_embd, n_tokens, cur->nb[1], 0);
cur = ggml_cont (ctx0, cur); // makes the shape of this node the same as the ubatch.token path
@ngxson ngxson (Contributor) commented Jan 18, 2026

Instead of resizing the input embeddings' row size down to n_embd, I'm wondering if we can/should do the reverse: pad the input token embedding to n_embd_inp using ggml_pad

@ngxson ngxson (Contributor) commented Jan 18, 2026

Note that doing it this way will allow us to remove the ggml_build_forward_select in the model cgraph, since the ggml_add(ctx0, cur, ds) path will always be taken. Although, I'm a bit worried that ggml_pad may have a negative impact on performance

@ggerganov (Member, Author)

Good idea - applied in c84637d

Base automatically changed from gg/graph-avoid-branches-3 to master January 19, 2026 18:03
@ggerganov ggerganov force-pushed the gg/graph-use-select branch from 9a066d0 to 3daf8e3 on January 23, 2026 12:49
@ggerganov ggerganov merged commit 557515b into master Jan 23, 2026
77 of 78 checks passed
@ggerganov ggerganov deleted the gg/graph-use-select branch January 23, 2026 16:22
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Jan 24, 2026
…gml-org#18898)

* graph : avoid branches between embedding and token inputs

* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology

* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI

* cont : pad token embeddings to n_embd_inp
Anhelor pushed a commit to Anhelor/llama.cpp that referenced this pull request Jan 24, 2026
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026