
Conversation

am17an (Collaborator) commented Oct 19, 2025

This PR adds ggml_can_fuse_subgraph, a less strict extension of ggml_can_fuse. Given the inputs/outputs of a subgraph, it checks whether all the intermediate tensors can be fused. Putting this up as a draft to iterate on the correct API.
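
For intuition, here is a minimal sketch of the check being described. The function name, signature, and the use of the ggml_node_get_use_count helper are assumptions for illustration, not the API as merged:

```c
#include "ggml.h"
#include "ggml-impl.h" // assumed location of ggml_node_get_use_count

// Minimal sketch, not the merged implementation.
// nodes[start_idx .. start_idx+count) form the candidate subgraph; every node
// not listed in `outputs` (indices relative to start_idx) is an intermediate
// and must not be consumed outside the subgraph.
static bool can_fuse_subgraph_sketch(const struct ggml_cgraph * cgraph,
                                     int start_idx, int count,
                                     const int * outputs, int num_outputs) {
    for (int i = 0; i < count; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[start_idx + i];

        // nodes listed as outputs are allowed to have external consumers
        bool is_output = false;
        for (int o = 0; o < num_outputs; ++o) {
            if (outputs[o] == i) {
                is_output = true;
                break;
            }
        }
        if (is_output) {
            continue;
        }

        // count the uses of this node by later nodes *inside* the subgraph ...
        int internal_uses = 0;
        for (int j = i + 1; j < count; ++j) {
            const struct ggml_tensor * consumer = cgraph->nodes[start_idx + j];
            for (int s = 0; s < GGML_MAX_SRC; ++s) {
                if (consumer->src[s] == node) {
                    internal_uses++;
                }
            }
        }

        // ... and reject if the graph records any additional (external) use
        if (ggml_node_get_use_count(cgraph, start_idx + i) != internal_uses) {
            return false;
        }
    }
    return true;
}
```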

am17an marked this pull request as draft October 19, 2025 06:43
github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Oct 19, 2025
am17an changed the title from "Ggml can fuse subgraph" to "ggml: add ggml_can_fuse_subgraph" Oct 19, 2025
am17an requested a review from jeffbolznv October 19, 2025 06:45
jeffbolznv (Collaborator)

While this does check that the internal nodes aren't used outside of the fusion region, we also need to check that the internal connectivity of the graph is what we expect. IMO this necessarily has to be verbose, because we have to check that all the node->src[] values are what we expect them to be.

The first idea that comes to mind would be to pass in a list of triples, where each triple is a { dst_node, src_idx, src_node }, and verify that nodes[start + dst_node]->src[src_idx] == nodes[start + src_node].
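
A hypothetical sketch of that triple-based check (the struct and helper names are illustrative, not proposed API):

```c
// Illustrative only: one entry encodes an expected edge inside the fusion
// window starting at `start`: node `dst_node`'s src[src_idx] must be node
// `src_node` (both indices relative to `start`).
struct edge_check {
    int dst_node; // index of the consumer, relative to start
    int src_idx;  // which src slot of the consumer to verify
    int src_node; // index of the expected producer, relative to start
};

static bool check_edges(const struct ggml_cgraph * cgraph, int start,
                        const struct edge_check * edges, int n_edges) {
    for (int i = 0; i < n_edges; ++i) {
        const struct edge_check * e = &edges[i];
        if (cgraph->nodes[start + e->dst_node]->src[e->src_idx] !=
            cgraph->nodes[start + e->src_node]) {
            return false;
        }
    }
    return true;
}
```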

am17an (Collaborator, Author) commented Oct 20, 2025

> While this does check that the internal nodes aren't used outside of the fusion region, we also need to check that the internal connectivity of the graph is what we expect. IMO this necessarily has to be verbose, because we have to check that all the node->src[] values are what we expect them to be.

I think this is better done at the caller site, which has more context about the fusion. This function is just meant to avoid cases where a write is needed elsewhere but we end up fusing it anyway, and to provide a common place for other sanity checks. Maybe a better name for this function would be is_fusion_candidate. Passing triples like you mentioned is equivalent to building the subgraph at the caller site, just without doing the equality checks.
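
As a hypothetical caller-site sketch (loosely modeled on the CUDA topk-moe fusion this PR touches; the 3-op sequence, indices, and the sketch function from the PR description above are all illustrative):

```c
// Hypothetical backend-side check, reusing can_fuse_subgraph_sketch from
// above. The soft_max -> argsort -> view sequence is illustrative; the real
// topk-moe pattern is longer.
static bool try_fuse_example(const struct ggml_cgraph * cgraph, int i) {
    if (i + 3 > cgraph->n_nodes) {
        return false;
    }
    const struct ggml_tensor * softmax = cgraph->nodes[i];
    const struct ggml_tensor * argsort = cgraph->nodes[i + 1];
    const struct ggml_tensor * view    = cgraph->nodes[i + 2];

    if (softmax->op != GGML_OP_SOFT_MAX || argsort->op != GGML_OP_ARGSORT ||
        view->op != GGML_OP_VIEW) {
        return false;
    }

    // generic check: intermediates must not escape; only the view is an output
    const int outputs[] = { 2 };
    if (!can_fuse_subgraph_sketch(cgraph, i, 3, outputs, 1)) {
        return false;
    }

    // caller-site check: the exact src[] wiring the fused kernel assumes
    return argsort->src[0] == softmax && view->src[0] == argsort;
}
```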

jeffbolznv (Collaborator)

> I think this is better done at the caller site

It's probably fine to have it as a separate function, but IMO it can still be common code (as much as any of this can be common code - there will always be special cases we want to handle differently).

1. remove inputs from signature as they are transient nodes
2. add check for views: view_src should be part of the subgraph
am17an force-pushed the ggml_can_fuse_subgraph branch from 3059ed3 to d853036 on October 20, 2025 14:52
ggml/src/ggml.c (outdated)

```c
// if node is a view, check if the view src is within the subgraph
if (node->view_src) {
    const struct ggml_tensor * view_src = node->view_src;
```
am17an (Collaborator, Author):

Maybe we need to walk the tree until we get to the non-view parent, instead of what I did here.

Collaborator:

I guess the most conservative thing would be to check all parents. It seems plausible we'll want to fuse a view of a view in the future.

Member:

I thought this logic guarantees that view_src is always the "root" source:

llama.cpp/ggml/src/ggml.c, lines 1630 to 1646 at ad6af8f:

```c
static struct ggml_tensor * ggml_new_tensor_impl(
        struct ggml_context * ctx,
        enum   ggml_type      type,
        int                   n_dims,
        const int64_t       * ne,
        struct ggml_tensor  * view_src,
        size_t                view_offs) {

    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
    GGML_ASSERT(n_dims >= 1 && n_dims <= GGML_MAX_DIMS);

    // find the base tensor and absolute offset
    if (view_src != NULL && view_src->view_src != NULL) {
        view_offs += view_src->view_offs;
        view_src   = view_src->view_src;
    }
```

So I don't think we can ever get node->view_src->view_src != NULL. But I guess it does not hurt to have this traversal.
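
For reference, a conservative "check all view parents" walk along the lines agreed on here might look like this minimal sketch (an assumed helper, not the merged code):

```c
// Assumed helper, not the merged code: follow the node's view_src chain and
// require every ancestor to be one of the subgraph's own nodes.
static bool view_parents_in_subgraph(const struct ggml_cgraph * cgraph,
                                     int start_idx, int count,
                                     const struct ggml_tensor * node) {
    for (const struct ggml_tensor * vs = node->view_src; vs != NULL; vs = vs->view_src) {
        bool found = false;
        for (int j = 0; j < count; ++j) {
            if (cgraph->nodes[start_idx + j] == vs) {
                found = true;
                break;
            }
        }
        if (!found) {
            return false; // a view ancestor lives outside the subgraph
        }
    }
    return true;
}
```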

- check all view_src parents
- other minor review comments
jeffbolznv (Collaborator)

I implemented logic to check the graph edges in master...jeffbolznv:llama.cpp:pull_16662_ps4. We can do this as a separate change, after these other things land.

am17an marked this pull request as ready for review October 21, 2025 01:00
am17an (Collaborator, Author) commented Oct 21, 2025

Will merge once CI passes

am17an merged commit 4926419 into ggml-org:master Oct 21, 2025 (125 of 126 checks passed)
am17an deleted the ggml_can_fuse_subgraph branch October 21, 2025 08:43
ye-NX pushed a commit to ye-NX/llama.cpp that referenced this pull request Oct 21, 2025
* ggml: add ggml_can_fuse_subgraph

* ggml-cuda: use ggml_can_fuse_subgraph for topk-moe

* format

* 1. remove inputs from signature as they are transient nodes
2. add check for views: view_src should be part of the subgraph

* - combine check into one loop
- check all view_src parents
- other minor review comments

* remove redundant if test

* - rename and other minor review comments

* add assert about count < 32
FMayran pushed a commit to FMayran/llama.cpp that referenced this pull request Oct 23, 2025 (same commit message as above)
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025 (same commit message as above)