Merged
63 commits
18b5d2b
Introduce annotation query function
yuslepukhin Jan 21, 2026
1eeba1a
Wire annotations to partitioning interface.
yuslepukhin Jan 29, 2026
debc8dd
Fix up annotations with Transpose Optimizer
yuslepukhin Jan 29, 2026
8f0ff86
Add ORT_EXTENDED_MINIMAL build
yuslepukhin Jan 29, 2026
f7a422e
Move rules and matcher inside the index
yuslepukhin Jan 30, 2026
1ef4078
Add Update with tests
yuslepukhin Jan 30, 2026
828eca3
TODO: Consider not removing annotations
yuslepukhin Jan 30, 2026
1626d5c
Clear annotations after partitioning
yuslepukhin Feb 7, 2026
f33bab1
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Feb 7, 2026
b3ecb39
Address accountant bug
yuslepukhin Feb 7, 2026
b5ea1c6
Annotate tiny_gpt2_beamsearch by layers
yuslepukhin Feb 9, 2026
9a422ba
Refactor Graph_GetGraphView to make it a utility
yuslepukhin Feb 10, 2026
a1caf93
Introduce a graph utility to create an IndexedSubgraph
yuslepukhin Feb 10, 2026
e1b1c4f
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Feb 10, 2026
acec402
Fix lint in python script.
yuslepukhin Feb 11, 2026
31dd7a8
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 2, 2026
9fa4849
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 5, 2026
50c58c9
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 9, 2026
e445b60
Fix build errors and address Copilt comments
yuslepukhin Mar 9, 2026
358f7df
Reject duplicate rules
yuslepukhin Mar 9, 2026
653fb8b
Move methods to .cc
yuslepukhin Mar 9, 2026
23a8ecf
Remove code duplication
yuslepukhin Mar 10, 2026
ef1227e
Add missing include
yuslepukhin Mar 10, 2026
b0b2396
Fix matching bug
yuslepukhin Mar 10, 2026
b9e13cf
Change index parsing
yuslepukhin Mar 10, 2026
add0227
Remove wrong comment
yuslepukhin Mar 10, 2026
17e3525
Address minimal build issues
yuslepukhin Mar 10, 2026
1b1a7db
Fix unused arg
yuslepukhin Mar 10, 2026
88c2c47
Add logging
yuslepukhin Mar 18, 2026
9b0b529
Make sure the annotation is copied on node copy
yuslepukhin Mar 19, 2026
dab76bc
Adjust error message
yuslepukhin Mar 19, 2026
b39a487
Copy Annotations when copying nodes and inlining functions
yuslepukhin Mar 19, 2026
4e260bc
Update LayeringIndex after function inlining
yuslepukhin Mar 20, 2026
5214350
Add intermediate buffers accounting + temp coefficient
yuslepukhin Mar 23, 2026
3fb1d1e
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 23, 2026
cd73b56
Address MakeNodeUnassigned feedback
yuslepukhin Mar 25, 2026
dfe4d13
Address InlineNodes feedback
yuslepukhin Mar 25, 2026
62f3d11
Fix underaccounting for shared weights in fused nodes
yuslepukhin Mar 25, 2026
3cab988
Update onnxruntime/python/tools/layering/layer_annotate.py
yuslepukhin Mar 26, 2026
e6cb75f
Lint
yuslepukhin Mar 26, 2026
b2ef9a2
Flip = prefix to exact match
yuslepukhin Mar 26, 2026
fde6300
Adjust comments for duplicate annotations
yuslepukhin Mar 26, 2026
7871afa
Remove bad comment
yuslepukhin Mar 26, 2026
16ec921
Adjust EpWithNoLayeringRulesSeesAllUnassignedNodes
yuslepukhin Mar 26, 2026
4da5c3b
Throw on multiple annotations
yuslepukhin Mar 26, 2026
01e4506
Make sure annotations are propagated on function inlining
yuslepukhin Mar 26, 2026
3e52f14
Update include/onnxruntime/core/session/onnxruntime_session_options_c…
yuslepukhin Mar 26, 2026
24a46e8
Update onnxruntime/core/framework/graph_partitioner.cc
yuslepukhin Mar 26, 2026
e745e02
Update onnxruntime/core/graph/graph_utils.h
yuslepukhin Mar 26, 2026
59b5ccd
Fix issues in python
yuslepukhin Mar 26, 2026
054f894
Address undercounting problem
yuslepukhin Mar 27, 2026
09967c3
Add copyright header
yuslepukhin Mar 27, 2026
d23ee08
Update onnxruntime/core/framework/graph_partitioner.cc
yuslepukhin Mar 27, 2026
9e8be7a
Adjust doc and implementaton for fetching layering ann
yuslepukhin Mar 27, 2026
f636138
Make GetContainingGraph public
yuslepukhin Mar 27, 2026
fcda524
Adjust accounting for fused node and remove stray local var
yuslepukhin Mar 27, 2026
2da1394
Address flaky test
yuslepukhin Mar 27, 2026
c0c5e51
Update onnxruntime/core/providers/cuda/cuda_execution_provider.cc
yuslepukhin Mar 27, 2026
c55bb6e
Update onnxruntime/core/graph/graph_utils.cc
yuslepukhin Mar 27, 2026
cdd9faa
Address review issues
yuslepukhin Mar 27, 2026
67a947b
Fix potential perf issue
yuslepukhin Mar 27, 2026
44c6904
Address review comments
yuslepukhin Mar 27, 2026
927a0ef
Add documentation for ann and ep propagation. Fix L1 optimizers, add …
yuslepukhin Mar 27, 2026
130 changes: 130 additions & 0 deletions docs/Optimizer_Layering_Annotations.md
@@ -0,0 +1,130 @@
# Optimizer Layering Annotations

## Overview

Layering annotations are per-node metadata strings that guide graph partitioning by indicating which execution provider (EP) layer a node belongs to. They are loaded from the ONNX model's `NodeProto` metadata (key `"layer_ann"`) and consumed during the partitioning phase to influence EP assignment.

## Execution Pipeline

Graph optimizers run in ordered levels:

```
Level 0 (Basic) ─► Level 1 (Extended) ─► Partitioning ─► Level 2+ (Layout, etc.)
```

1. **Level 0 and Level 1** optimizers run **before** partitioning. At this point, layering annotations are present on nodes and must be preserved through any graph transformations.
2. **Partitioning** reads the annotations to assign nodes to execution providers.
3. After partitioning, `Graph::RemoveAllLayeringAnnotations()` clears all annotations.
4. **Level 2, 3, and 4** optimizers run **after** annotations have been cleared. They do not need to handle annotations.

**Key rule: Only Level 1 (and Level 0) optimizers need to propagate layering annotations.**

## Why Propagation Matters

When an optimizer replaces, fuses, or decomposes nodes, the original annotated node is removed and new nodes are created. If the new nodes do not carry the original annotation, partitioning loses the assignment hint for that subgraph, potentially causing incorrect EP placement.
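
To make the failure mode concrete, the toy sketch below (simplified stand-in structs, not the real ORT `Graph`/`Node` API) contrasts a fusion that drops the hint with one that carries it forward from the root of the matched pattern:

```cpp
#include <string>
#include <vector>

// Toy stand-in for illustration only; the real Node lives in
// include/onnxruntime/core/graph/graph.h.
struct ToyNode {
  std::string op_type;
  std::string layering_annotation;  // partitioning hint, e.g. "layer_3"
};

// Fusion that forgets the hint: the replacement node carries an empty
// annotation, so partitioning loses the EP placement for this subgraph.
ToyNode FuseWithoutPropagation(const std::vector<ToyNode>& /*matched*/) {
  return ToyNode{"Gelu", ""};
}

// Fusion that propagates the hint from the root of the matched pattern,
// mirroring what the annotation_source AddNode overload does.
ToyNode FuseWithPropagation(const std::vector<ToyNode>& matched) {
  return ToyNode{"Gelu", matched.front().layering_annotation};
}
```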

## How to Propagate Annotations

### Preferred: Use the `AddNode` Overload with `annotation_source`

`Graph::AddNode` provides overloads that accept a `const Node& annotation_source` parameter. The new node automatically inherits the layering annotation from the source node.

```cpp
// Instead of:
Node& new_node = graph.AddNode(name, op_type, description, inputs, outputs);
// Missing annotation propagation!

// Use:
Node& new_node = graph.AddNode(name, op_type, description, inputs, outputs,
original_node); // annotation_source
```

All standard `AddNode` signatures have a corresponding `annotation_source` variant:

```cpp
// With const NodeAttributes*
Node& AddNode(name, op_type, description,
gsl::span<NodeArg* const> inputs,
gsl::span<NodeArg* const> outputs,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain);

// With NodeAttributes&&
Node& AddNode(name, op_type, description,
gsl::span<NodeArg* const> inputs,
gsl::span<NodeArg* const> outputs,
const Node& annotation_source,
NodeAttributes&& attributes,
const std::string& domain = kOnnxDomain);

// initializer_list variants also available
```

### Legacy: `DuplicateNodeAnnotation`

The utility function `optimizer_utils::DuplicateNodeAnnotation(src, dst)` copies annotations between existing nodes. This is still used when the annotation source is conditional (e.g., when the source node pointer may be null). Prefer the `AddNode` overload for unconditional propagation.

### Automatic Propagation

`Graph::AddNode(const Node& other)` — the copy overload used for duplicating nodes — automatically copies annotations. No additional action is needed when duplicating a node via this overload.
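
In toy form (a plain struct standing in for the real `Node`), the copy overload's behavior is simply that of a member-wise copy, which includes the annotation field:

```cpp
#include <string>

// Toy stand-in; the real copy path is Graph::AddNode(const Node& other).
struct ToyNode {
  std::string name;
  std::string op_type;
  std::string layering_annotation;
};

// A member-wise copy carries layering_annotation along with everything
// else, so no explicit propagation call is needed.
ToyNode DuplicateNode(const ToyNode& other) { return other; }
```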

## Post-Partitioning: Propagating EP Assignments

Although Level 2+ optimizers do not deal with layering annotations directly (the annotations have already been cleared), they must still propagate **execution provider (EP) assignments**. EP assignments are the downstream result of the annotation-driven partitioning step. After partitioning, each node carries an EP assignment (e.g., `CUDAExecutionProvider`, `CPUExecutionProvider`) that determines where the node's kernel runs.

When a Level 2+ optimizer creates new nodes that replace or derive from existing ones, it must copy the EP assignment from the source node:

```cpp
Node& new_node = graph.AddNode(name, op_type, description, inputs, outputs);
new_node.SetExecutionProviderType(original_node.GetExecutionProviderType());
```

Failing to propagate the EP assignment causes the new node to fall back to the default provider (typically CPU), silently breaking the intended placement and potentially degrading performance or correctness. This requirement predates the layering annotation feature and applies to all optimizers that run after partitioning.

> **Note:** The `AddNode` overload with `annotation_source` propagates only the layering annotation; the EP assignment must still be set separately. Layering annotations and EP assignments serve different stages of the pipeline and are managed independently.

## When You Do NOT Need to Propagate Annotations

- **Level 2+ optimizers** — annotations have already been consumed and cleared (but EP assignments must still be propagated, see above).
- **Training optimizers** — training runs after partitioning.
- **Optimizers that only remove nodes** (e.g., identity elimination) — no new nodes are created.
- **Optimizers that modify nodes in-place** — the annotation remains on the existing node.

## Examples

### Fusion (replacing multiple nodes with one)

```cpp
// GeluFusion: fusing Div + Erf + Add + Mul + Mul into a single Gelu
Node& gelu_node = graph.AddNode(
graph.GenerateNodeName("Gelu"),
"Gelu", "fused Gelu subgraphs",
{gelu_input}, {gelu_output},
div_node); // propagate annotation from the root matched node
```

### Decomposition (replacing one node with many)

```cpp
// STFT decomposition: each new node inherits from the original STFT node
auto [reshape_node, reshape_out] = AddNode(graph, "Reshape", ep, inputs, &stft);
auto [conv_node, conv_out] = AddNode(graph, "Conv", ep, conv_inputs, &stft);
auto [concat_node, concat_out] = AddNode(graph, "Concat", ep, concat_inputs, &stft);
```

### Conditional source (use DuplicateNodeAnnotation)

```cpp
Node& q_node = graph.AddNode(...);
if (src_node) {
optimizer_utils::DuplicateNodeAnnotation(*src_node, q_node);
}
```

## Checklist for New Level 1 Optimizers

1. Identify the "source" node whose annotation should propagate (typically the root of the matched pattern).
2. For every `graph.AddNode(...)` call that creates a replacement node, use the `annotation_source` overload.
3. If the source is conditional (may be null), use `optimizer_utils::DuplicateNodeAnnotation` after the `AddNode` call.
4. Test with an annotated model to verify annotations survive the transformation.
25 changes: 19 additions & 6 deletions include/onnxruntime/core/framework/resource_accountant.h
@@ -45,18 +45,31 @@ class IResourceAccountant {
virtual ResourceCount GetConsumedAmount() const = 0;
virtual void AddConsumedAmount(const ResourceCount& amount) = 0;
virtual void RemoveConsumedAmount(const ResourceCount& amount) = 0;
virtual ResourceCount ComputeResourceCount(const Node& node) const = 0;
virtual ResourceCount ComputeResourceCount(const Node& node) = 0;

std::optional<ResourceCount> GetThreshold() const {
return threshold_;
}

void SetThreshold(const ResourceCount& threshold) {
threshold_ = threshold;
}

void SetStopAssignment() noexcept {
stop_assignment_ = true;
}

bool IsStopIssued() const noexcept { return stop_assignment_; }

// Called before each GetCapability pass to discard pending weight tracking
// from a previous (discarded) pass. Default no-op for stats-based accountants.
virtual void ResetPendingWeights() {}

// Called when a node's cost is committed (AccountForNode/AccountForAllNodes).
// Moves the node's pending weights into the committed set so they persist
// across GetCapability passes. Default no-op for stats-based accountants.
virtual void CommitWeightsForNode(size_t /*node_index*/) {}

static std::string MakeUniqueNodeName(const Node& node);

private:
@@ -114,16 +127,16 @@ class NodeStatsRecorder {

void DumpStats(const std::filesystem::path& model_path) const;

[[nodiscard]] static Status CreateAccountants(
const ConfigOptions& config_options,
const std::filesystem::path& model_path,
std::optional<ResourceAccountantMap>& acc_map);

private:
void DumpStats(std::ostream& os) const;

struct Impl;
std::unique_ptr<Impl> impl_;
};

Status CreateAccountants(
const ConfigOptions& config_options,
const std::filesystem::path& model_path,
std::optional<ResourceAccountantMap>& acc_map);

} // namespace onnxruntime
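
The pending/committed split above can be sketched with a toy byte-counting accountant (the class name, the `size_t` resource unit, and the method bodies are assumptions for illustration; the real interface is `IResourceAccountant`). Weights recorded during a discarded `GetCapability` pass are dropped by `ResetPendingWeights()`, while `CommitWeightsForNode()` folds a node's weights into a committed set:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy accountant tracking weight bytes. Shared weights must be counted once,
// so per-node pending weights are only folded into the committed total when
// the node's assignment is actually kept.
class ToyWeightAccountant {
 public:
  // Record that node `node_index` references weight `name` of `bytes` bytes
  // during a GetCapability pass.
  void AddPendingWeight(size_t node_index, const std::string& name, size_t bytes) {
    pending_[node_index].push_back({name, bytes});
  }

  // A pass was discarded: drop everything that was not committed.
  void ResetPendingWeights() { pending_.clear(); }

  // The node's cost is committed: move its weights into the committed set.
  // The set keys on the weight name, so a weight shared by two committed
  // nodes contributes to the total only once.
  void CommitWeightsForNode(size_t node_index) {
    for (const auto& [name, bytes] : pending_[node_index]) {
      if (committed_names_.insert(name).second) committed_bytes_ += bytes;
    }
    pending_.erase(node_index);
  }

  size_t CommittedBytes() const { return committed_bytes_; }

 private:
  std::map<size_t, std::vector<std::pair<std::string, size_t>>> pending_;
  std::set<std::string> committed_names_;
  size_t committed_bytes_ = 0;
};
```

Keying the committed set on weight names ensures a weight shared by several committed nodes is counted once.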
105 changes: 104 additions & 1 deletion include/onnxruntime/core/graph/graph.h
@@ -174,7 +174,14 @@ class Node {
*/
void SetSinceVersion(int since_version) noexcept { since_version_ = since_version; }

void SetLayeringAnnotation(std::string annotation) { layering_annotation_ = std::move(annotation); }

const std::string& GetLayeringAnnotation() const noexcept { return layering_annotation_; }

const Graph* GetContainingGraph() const noexcept { return graph_; }

#if !defined(ORT_MINIMAL_BUILD)

/** Gets the Node's OpSchema.
@remarks The graph containing this node must be resolved, otherwise nullptr will be returned. */
const ONNX_NAMESPACE::OpSchema* Op() const noexcept { return op_; }
@@ -256,6 +263,13 @@ class Node {
#endif // !defined(ORT_MINIMAL_BUILD)

#if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

// Make sure that the annotation does not occupy memory after partitioning is done.
void ClearLayeringAnnotation() {
std::string t;
layering_annotation_.swap(t);
}

/** Gets a modifiable count of arguments for each of the Node's explicit inputs.
@todo This should be removed in favor of a method that updates the input args and the count.
Currently these operations are separate which is not a good setup. */
@@ -685,6 +699,8 @@ class Node {
// Graph instances for subgraphs that are owned by this Node
std::vector<std::unique_ptr<Graph>> subgraphs_;

std::string layering_annotation_;

// Can be saved? The node cannot be saved anymore if removable attributes have been cleared.
bool can_be_saved_;
};
@@ -1044,6 +1060,41 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
gsl::span<NodeArg* const> output_args,
NodeAttributes&& attributes,
const std::string& domain = kOnnxDomain);

/** Add a Node to this Graph, propagating the layering annotation from an existing node.
This is the preferred way to create new nodes in Level 1 (pre-partitioning) graph optimizers.
The new node automatically inherits the layering annotation from @p annotation_source, which
ensures correct layer-based partitioning when annotations are present.
@param name The Node name. Must be unique in this Graph.
@param op_type The operator type. e.g. ONNX operator name.
@param description Arbitrary description of the Node.
@param input_args The explicit inputs to this Node.
@param output_args The outputs from this Node.
@param annotation_source The node from which to inherit the layering annotation.
@param attributes Optional NodeAttributes to add.
@param domain The domain for the op_type.
@returns Reference to the new Node.
@remarks Use this overload in Level 1 optimizers that create nodes replacing or derived from
existing annotated nodes. See docs/Optimizer_Layering_Annotations.md for details.
*/
Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
gsl::span<NodeArg* const> input_args,
gsl::span<NodeArg* const> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain);

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
gsl::span<NodeArg* const> input_args,
gsl::span<NodeArg* const> output_args,
const Node& annotation_source,
NodeAttributes&& attributes,
const std::string& domain = kOnnxDomain);

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
@@ -1057,6 +1108,21 @@
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
std::initializer_list<NodeArg*> input_args,
std::initializer_list<NodeArg*> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
AsSpan(input_args),
AsSpan(output_args),
annotation_source,
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
@@ -1070,16 +1136,46 @@
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
gsl::span<NodeArg* const> input_args,
std::initializer_list<NodeArg*> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
input_args,
AsSpan(output_args),
annotation_source,
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
std::initializer_list<NodeArg*> input_args,
gsl::span<NodeArg* const> output_args,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
AsSpan(input_args),
output_args,
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
std::initializer_list<NodeArg*> input_args,
gsl::span<NodeArg* const> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
AsSpan(input_args),
output_args,
annotation_source,
attributes, domain);
}

@@ -1322,10 +1418,12 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi

The Graph needs to be Resolve()d after this call.
@param func_to_inline The FunctionProto to inline.
@param parent_annotation Annotation inherited from the parent node that is being inlined.
@returns Status indicating success or providing an error message.
*/

Status InlineFunctionProto(const ONNX_NAMESPACE::FunctionProto& func_to_inline);
Status InlineFunctionProto(const ONNX_NAMESPACE::FunctionProto& func_to_inline,
const std::string& parent_annotation);

/** Mark a NodeArg name as coming from the outer scope when programmatically constructing a Graph that will
be used as a GraphProto attribute in another Node.
@@ -1569,6 +1667,11 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
// compiled model during partitioning, leaving them unused in the ORT Graph. To allow the memory to be freed
// we need to manually run the cleanup that would usually happen as part of Graph::Resolve.
Status RemovedUnusedInitializersOrtFormat();

// This examines all the nodes and removes any annotations that are only used for layering.
// This potentially saves memory.
Status RemoveAllLayeringAnnotations();

#endif // !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

// This friendship relationship should only be used to call Graph::Graph and
24 changes: 19 additions & 5 deletions include/onnxruntime/core/graph/indexed_sub_graph.h
@@ -86,18 +86,32 @@ struct IndexedSubGraph {

// Should call IsAccountingEnabled() first
// Takes the previously computed ResourceCount for the node
// (usually during GetCapabiilty())
// (usually during GetCapability())
// if present and adds it to the consumed amount
void AccountForNode(size_t cost_index) const {
assert(cost_index < nodes_costs.size());
resource_accountant->AddConsumedAmount(nodes_costs[cost_index]);
resource_accountant->CommitWeightsForNode(nodes[cost_index]);
}

// This computes and accounts for the resource cost for the node that just
// been fused from other nodes, and the EP did not had a chance to compute the costs.
void ComputeAndAccountForNode(const Node& node) const {
// Accounts for all constituent nodes by summing their pre-stored costs.
// Use this when fusing nodes into a single node so the total cost
// reflects what was computed during GetCapability() (with correct
// cross-node weight deduplication already applied).
void AccountForAllNodes() const {
assert(resource_accountant != nullptr);
resource_accountant->AddConsumedAmount(resource_accountant->ComputeResourceCount(node));
for (size_t i = 0; i < nodes_costs.size(); ++i) {
resource_accountant->AddConsumedAmount(nodes_costs[i]);
resource_accountant->CommitWeightsForNode(nodes[i]);
}
}

// Accounts for a node given its index and a pre-computed resource cost.
// Use this when the cost was computed externally (e.g. for a fused node).
void AccountForNode(NodeIndex node_index, const ResourceCount& resource_count) const {
assert(resource_accountant != nullptr);
resource_accountant->AddConsumedAmount(resource_count);
resource_accountant->CommitWeightsForNode(node_index);
}

void SetAccountant(IResourceAccountant* res_accountant) {
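
A simplified sketch of how the accounting and threshold machinery above fits together during assignment (the toy types and loop structure are assumptions; the real logic lives in the graph partitioner and EP `GetCapability` implementations):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy mirror of the IResourceAccountant threshold/stop machinery.
class ToyAccountant {
 public:
  void SetThreshold(size_t t) { threshold_ = t; }
  std::optional<size_t> GetThreshold() const { return threshold_; }
  void AddConsumedAmount(size_t amount) { consumed_ += amount; }
  size_t GetConsumedAmount() const { return consumed_; }
  void SetStopAssignment() noexcept { stop_ = true; }
  bool IsStopIssued() const noexcept { return stop_; }

 private:
  std::optional<size_t> threshold_;
  size_t consumed_ = 0;
  bool stop_ = false;
};

// Assign node costs in order until the threshold would be exceeded,
// then stop handing further nodes to this EP. Returns the number of
// nodes assigned.
size_t AssignUntilThreshold(ToyAccountant& acc, const std::vector<size_t>& node_costs) {
  size_t assigned = 0;
  for (size_t cost : node_costs) {
    if (acc.IsStopIssued()) break;
    if (acc.GetThreshold() && acc.GetConsumedAmount() + cost > *acc.GetThreshold()) {
      acc.SetStopAssignment();
      break;
    }
    acc.AddConsumedAmount(cost);
    ++assigned;
  }
  return assigned;
}
```

In the real flow, a stop request similarly prevents further nodes from being offered to the EP whose budget is exhausted.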