Merged
63 commits
18b5d2b
Introduce annotation query function
yuslepukhin Jan 21, 2026
1eeba1a
Wire annotations to partitioning interface.
yuslepukhin Jan 29, 2026
debc8dd
Fix up annotations with Transpose Optimizer
yuslepukhin Jan 29, 2026
8f0ff86
Add ORT_EXTENDED_MINIMAL build
yuslepukhin Jan 29, 2026
f7a422e
Move rules and matcher inside the index
yuslepukhin Jan 30, 2026
1ef4078
Add Update with tests
yuslepukhin Jan 30, 2026
828eca3
TODO: Consider not removing annotations
yuslepukhin Jan 30, 2026
1626d5c
Clear annotations after partitioning
yuslepukhin Feb 7, 2026
f33bab1
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Feb 7, 2026
b3ecb39
Address accountant bug
yuslepukhin Feb 7, 2026
b5ea1c6
Annotate tiny_gpt2_beamsearch by layers
yuslepukhin Feb 9, 2026
9a422ba
Refactor Graph_GetGraphView to make it a utility
yuslepukhin Feb 10, 2026
a1caf93
Introduce a graph utility to create an IndexedSubgraph
yuslepukhin Feb 10, 2026
e1b1c4f
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Feb 10, 2026
acec402
Fix lint in python script.
yuslepukhin Feb 11, 2026
31dd7a8
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 2, 2026
9fa4849
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 5, 2026
50c58c9
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 9, 2026
e445b60
Fix build errors and address Copilt comments
yuslepukhin Mar 9, 2026
358f7df
Reject duplicate rules
yuslepukhin Mar 9, 2026
653fb8b
Move methods to .cc
yuslepukhin Mar 9, 2026
23a8ecf
Remove code duplication
yuslepukhin Mar 10, 2026
ef1227e
Add missing include
yuslepukhin Mar 10, 2026
b0b2396
Fix matching bug
yuslepukhin Mar 10, 2026
b9e13cf
Change index parsing
yuslepukhin Mar 10, 2026
add0227
Remove wrong comment
yuslepukhin Mar 10, 2026
17e3525
Address minimal build issues
yuslepukhin Mar 10, 2026
1b1a7db
Fix unused arg
yuslepukhin Mar 10, 2026
88c2c47
Add logging
yuslepukhin Mar 18, 2026
9b0b529
Make sure the annotation is copied on node copy
yuslepukhin Mar 19, 2026
dab76bc
Adjust error message
yuslepukhin Mar 19, 2026
b39a487
Copy Annotations when copying nodes and inlining functions
yuslepukhin Mar 19, 2026
4e260bc
Update LayeringIndex after function inlining
yuslepukhin Mar 20, 2026
5214350
Add intermediate buffers accounting + temp coefficient
yuslepukhin Mar 23, 2026
3fb1d1e
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 23, 2026
cd73b56
Address MakeNodeUnassigned feedback
yuslepukhin Mar 25, 2026
dfe4d13
Address InlineNodes feedback
yuslepukhin Mar 25, 2026
62f3d11
Fix underaccounting for shared weights in fused nodes
yuslepukhin Mar 25, 2026
3cab988
Update onnxruntime/python/tools/layering/layer_annotate.py
yuslepukhin Mar 26, 2026
e6cb75f
Lint
yuslepukhin Mar 26, 2026
b2ef9a2
Flip = prefix to exact match
yuslepukhin Mar 26, 2026
fde6300
Adjust comments for duplicate annotations
yuslepukhin Mar 26, 2026
7871afa
Remove bad comment
yuslepukhin Mar 26, 2026
16ec921
Adjust EpWithNoLayeringRulesSeesAllUnassignedNodes
yuslepukhin Mar 26, 2026
4da5c3b
Throw on multiple annotations
yuslepukhin Mar 26, 2026
01e4506
Make sure annotations are propagated on function inlining
yuslepukhin Mar 26, 2026
3e52f14
Update include/onnxruntime/core/session/onnxruntime_session_options_c…
yuslepukhin Mar 26, 2026
24a46e8
Update onnxruntime/core/framework/graph_partitioner.cc
yuslepukhin Mar 26, 2026
e745e02
Update onnxruntime/core/graph/graph_utils.h
yuslepukhin Mar 26, 2026
59b5ccd
Fix issues in python
yuslepukhin Mar 26, 2026
054f894
Address undercounting problem
yuslepukhin Mar 27, 2026
09967c3
Add copyright header
yuslepukhin Mar 27, 2026
d23ee08
Update onnxruntime/core/framework/graph_partitioner.cc
yuslepukhin Mar 27, 2026
9e8be7a
Adjust doc and implementaton for fetching layering ann
yuslepukhin Mar 27, 2026
f636138
Make GetContainingGraph public
yuslepukhin Mar 27, 2026
fcda524
Adjust accounting for fused node and remove stray local var
yuslepukhin Mar 27, 2026
2da1394
Address flaky test
yuslepukhin Mar 27, 2026
c0c5e51
Update onnxruntime/core/providers/cuda/cuda_execution_provider.cc
yuslepukhin Mar 27, 2026
c55bb6e
Update onnxruntime/core/graph/graph_utils.cc
yuslepukhin Mar 27, 2026
cdd9faa
Address review issues
yuslepukhin Mar 27, 2026
67a947b
Fix potential perf issue
yuslepukhin Mar 27, 2026
44c6904
Address review comments
yuslepukhin Mar 27, 2026
927a0ef
Add documentation for ann and ep propagation. Fix L1 optimizers, add …
yuslepukhin Mar 27, 2026
130 changes: 130 additions & 0 deletions docs/Optimizer_Layering_Annotations.md
@@ -0,0 +1,130 @@
# Optimizer Layering Annotations

## Overview

Layering annotations are per-node metadata strings that guide graph partitioning by indicating which execution provider (EP) layer a node belongs to. They are loaded from the ONNX model's `NodeProto` metadata (key `"layer_ann"`) and consumed during the partitioning phase to influence EP assignment.

## Execution Pipeline

Graph optimizers run in ordered levels:

```
Level 0 (Basic) ─► Level 1 (Extended) ─► Partitioning ─► Level 2+ (Layout, etc.)
```

1. **Level 0 and Level 1** optimizers run **before** partitioning. At this point, layering annotations are present on nodes and must be preserved through any graph transformations.
2. **Partitioning** reads the annotations to assign nodes to execution providers.
3. After partitioning, `Graph::RemoveAllLayeringAnnotations()` clears all annotations.
4. **Level 2, 3, and 4** optimizers run **after** annotations have been cleared. They do not need to handle annotations.

**Key rule: Only Level 1 (and Level 0) optimizers need to propagate layering annotations.**

## Why Propagation Matters

When an optimizer replaces, fuses, or decomposes nodes, the original annotated node is removed and new nodes are created. If the new nodes do not carry the original annotation, partitioning loses the assignment hint for that subgraph, potentially causing incorrect EP placement.
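
To make the failure mode concrete, the toy sketch below (simplified stand-in structs, not the real ORT `Graph`/`Node` API) contrasts a fusion that drops the hint with one that carries it forward from the root of the matched pattern:

```cpp
#include <string>
#include <vector>

// Toy stand-in for illustration only; the real Node lives in
// include/onnxruntime/core/graph/graph.h.
struct ToyNode {
  std::string op_type;
  std::string layering_annotation;  // partitioning hint, e.g. "layer_3"
};

// Fusion that forgets the hint: the replacement node carries an empty
// annotation, so partitioning loses the EP placement for this subgraph.
ToyNode FuseWithoutPropagation(const std::vector<ToyNode>& /*matched*/) {
  return ToyNode{"Gelu", ""};
}

// Fusion that propagates the hint from the root of the matched pattern,
// mirroring what the annotation_source AddNode overload does.
ToyNode FuseWithPropagation(const std::vector<ToyNode>& matched) {
  return ToyNode{"Gelu", matched.front().layering_annotation};
}
```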

## How to Propagate Annotations

### Preferred: Use the `AddNode` Overload with `annotation_source`

`Graph::AddNode` provides overloads that accept a `const Node& annotation_source` parameter. The new node automatically inherits the layering annotation from the source node.

```cpp
// Instead of:
Node& new_node = graph.AddNode(name, op_type, description, inputs, outputs);
// Missing annotation propagation!

// Use:
Node& new_node = graph.AddNode(name, op_type, description, inputs, outputs,
original_node); // annotation_source
```

All standard `AddNode` signatures have a corresponding `annotation_source` variant:

```cpp
// With const NodeAttributes*
Node& AddNode(name, op_type, description,
gsl::span<NodeArg* const> inputs,
gsl::span<NodeArg* const> outputs,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain);

// With NodeAttributes&&
Node& AddNode(name, op_type, description,
gsl::span<NodeArg* const> inputs,
gsl::span<NodeArg* const> outputs,
const Node& annotation_source,
NodeAttributes&& attributes,
const std::string& domain = kOnnxDomain);

// initializer_list variants also available
```

### Legacy: `DuplicateNodeAnnotation`

The utility function `optimizer_utils::DuplicateNodeAnnotation(src, dst)` copies annotations between existing nodes. This is still used when the annotation source is conditional (e.g., when the source node pointer may be null). Prefer the `AddNode` overload for unconditional propagation.

### Automatic Propagation

`Graph::AddNode(const Node& other)` — the copy overload used for duplicating nodes — automatically copies annotations. No additional action is needed when duplicating a node via this overload.
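
In toy form (a plain struct standing in for the real `Node`), the copy overload's behavior is simply that of a member-wise copy, which includes the annotation field:

```cpp
#include <string>

// Toy stand-in; the real copy path is Graph::AddNode(const Node& other).
struct ToyNode {
  std::string name;
  std::string op_type;
  std::string layering_annotation;
};

// A member-wise copy carries layering_annotation along with everything
// else, so no explicit propagation call is needed.
ToyNode DuplicateNode(const ToyNode& other) { return other; }
```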

## Post-Partitioning: Propagating EP Assignments

Although Level 2+ optimizers do not deal with layering annotations directly (the annotations have already been cleared), they must still propagate **execution provider (EP) assignments**. EP assignments are the downstream result of the annotation-driven partitioning step. After partitioning, each node carries an EP assignment (e.g., `CUDAExecutionProvider`, `CPUExecutionProvider`) that determines where the node's kernel runs.

When a Level 2+ optimizer creates new nodes that replace or derive from existing ones, it must copy the EP assignment from the source node:

```cpp
Node& new_node = graph.AddNode(name, op_type, description, inputs, outputs);
new_node.SetExecutionProviderType(original_node.GetExecutionProviderType());
```

Failing to propagate the EP assignment causes the new node to fall back to the default provider (typically CPU), silently breaking the intended placement and potentially degrading performance or correctness. This requirement predates the layering annotation feature and applies to all optimizers that run after partitioning.

> **Note:** The `AddNode` overload with `annotation_source` propagates only the layering annotation; the EP assignment must still be set separately. Layering annotations and EP assignments serve different stages of the pipeline and are managed independently.

## When You Do NOT Need to Propagate Annotations

- **Level 2+ optimizers** — annotations have already been consumed and cleared (but EP assignments must still be propagated, see above).
- **Training optimizers** — training runs after partitioning.
- **Optimizers that only remove nodes** (e.g., identity elimination) — no new nodes are created.
- **Optimizers that modify nodes in-place** — the annotation remains on the existing node.

## Examples

### Fusion (replacing multiple nodes with one)

```cpp
// GeluFusion: fusing Div + Erf + Add + Mul + Mul into a single Gelu
Node& gelu_node = graph.AddNode(
graph.GenerateNodeName("Gelu"),
"Gelu", "fused Gelu subgraphs",
{gelu_input}, {gelu_output},
div_node); // propagate annotation from the root matched node
```

### Decomposition (replacing one node with many)

```cpp
// STFT decomposition: each new node inherits from the original STFT node
auto [reshape_node, reshape_out] = AddNode(graph, "Reshape", ep, inputs, &stft);
auto [conv_node, conv_out] = AddNode(graph, "Conv", ep, conv_inputs, &stft);
auto [concat_node, concat_out] = AddNode(graph, "Concat", ep, concat_inputs, &stft);
```

### Conditional source (use DuplicateNodeAnnotation)

```cpp
Node& q_node = graph.AddNode(...);
if (src_node) {
optimizer_utils::DuplicateNodeAnnotation(*src_node, q_node);
}
```

## Checklist for New Level 1 Optimizers

1. Identify the "source" node whose annotation should propagate (typically the root of the matched pattern).
2. For every `graph.AddNode(...)` call that creates a replacement node, use the `annotation_source` overload.
3. If the source is conditional (may be null), use `optimizer_utils::DuplicateNodeAnnotation` after the `AddNode` call.
4. Test with an annotated model to verify annotations survive the transformation.
25 changes: 19 additions & 6 deletions include/onnxruntime/core/framework/resource_accountant.h
@@ -45,18 +45,31 @@ class IResourceAccountant {
virtual ResourceCount GetConsumedAmount() const = 0;
virtual void AddConsumedAmount(const ResourceCount& amount) = 0;
virtual void RemoveConsumedAmount(const ResourceCount& amount) = 0;
virtual ResourceCount ComputeResourceCount(const Node& node) const = 0;
virtual ResourceCount ComputeResourceCount(const Node& node) = 0;

std::optional<ResourceCount> GetThreshold() const {
return threshold_;
}

void SetThreshold(const ResourceCount& threshold) {
threshold_ = threshold;
}

void SetStopAssignment() noexcept {
stop_assignment_ = true;
}

bool IsStopIssued() const noexcept { return stop_assignment_; }

// Called before each GetCapability pass to discard pending weight tracking
// from a previous (discarded) pass. Default no-op for stats-based accountants.
virtual void ResetPendingWeights() {}

// Called when a node's cost is committed (AccountForNode/AccountForAllNodes).
// Moves the node's pending weights into the committed set so they persist
// across GetCapability passes. Default no-op for stats-based accountants.
virtual void CommitWeightsForNode(size_t /*node_index*/) {}

static std::string MakeUniqueNodeName(const Node& node);

private:
@@ -114,16 +127,16 @@ class NodeStatsRecorder {

void DumpStats(const std::filesystem::path& model_path) const;

[[nodiscard]] static Status CreateAccountants(
const ConfigOptions& config_options,
const std::filesystem::path& model_path,
std::optional<ResourceAccountantMap>& acc_map);

private:
void DumpStats(std::ostream& os) const;

struct Impl;
std::unique_ptr<Impl> impl_;
};

Status CreateAccountants(
const ConfigOptions& config_options,
const std::filesystem::path& model_path,
std::optional<ResourceAccountantMap>& acc_map);

} // namespace onnxruntime
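
The pending/committed split above can be sketched with a toy byte-counting accountant (the class name, the `size_t` resource unit, and the method bodies are assumptions for illustration; the real interface is `IResourceAccountant`). Weights recorded during a discarded `GetCapability` pass are dropped by `ResetPendingWeights()`, while `CommitWeightsForNode()` folds a node's weights into a committed set:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy accountant tracking weight bytes. Shared weights must be counted once,
// so per-node pending weights are only folded into the committed total when
// the node's assignment is actually kept.
class ToyWeightAccountant {
 public:
  // Record that node `node_index` references weight `name` of `bytes` bytes
  // during a GetCapability pass.
  void AddPendingWeight(size_t node_index, const std::string& name, size_t bytes) {
    pending_[node_index].push_back({name, bytes});
  }

  // A pass was discarded: drop everything that was not committed.
  void ResetPendingWeights() { pending_.clear(); }

  // The node's cost is committed: move its weights into the committed set.
  // The set keys on the weight name, so a weight shared by two committed
  // nodes contributes to the total only once.
  void CommitWeightsForNode(size_t node_index) {
    for (const auto& [name, bytes] : pending_[node_index]) {
      if (committed_names_.insert(name).second) committed_bytes_ += bytes;
    }
    pending_.erase(node_index);
  }

  size_t CommittedBytes() const { return committed_bytes_; }

 private:
  std::map<size_t, std::vector<std::pair<std::string, size_t>>> pending_;
  std::set<std::string> committed_names_;
  size_t committed_bytes_ = 0;
};
```

Keying the committed set on weight names ensures a weight shared by several committed nodes is counted once.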
105 changes: 104 additions & 1 deletion include/onnxruntime/core/graph/graph.h
@@ -174,7 +174,14 @@ class Node {
*/
void SetSinceVersion(int since_version) noexcept { since_version_ = since_version; }

void SetLayeringAnnotation(std::string annotation) { layering_annotation_ = std::move(annotation); }

const std::string& GetLayeringAnnotation() const noexcept { return layering_annotation_; }

const Graph* GetContainingGraph() const noexcept { return graph_; }

#if !defined(ORT_MINIMAL_BUILD)

/** Gets the Node's OpSchema.
@remarks The graph containing this node must be resolved, otherwise nullptr will be returned. */
const ONNX_NAMESPACE::OpSchema* Op() const noexcept { return op_; }
@@ -256,6 +263,13 @@ class Node {
#endif // !defined(ORT_MINIMAL_BUILD)

#if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

// Make sure that the annotation does not occupy memory after partitioning is done.
void ClearLayeringAnnotation() {
std::string t;
layering_annotation_.swap(t);
}

/** Gets a modifiable count of arguments for each of the Node's explicit inputs.
@todo This should be removed in favor of a method that updates the input args and the count.
Currently these operations are separate which is not a good setup. */
@@ -685,6 +699,8 @@ class Node {
// Graph instances for subgraphs that are owned by this Node
std::vector<std::unique_ptr<Graph>> subgraphs_;

std::string layering_annotation_;

// Can be saved? The node cannot be saved anymore if removable attributes have been cleared.
bool can_be_saved_;
};
@@ -1044,6 +1060,41 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
gsl::span<NodeArg* const> output_args,
NodeAttributes&& attributes,
const std::string& domain = kOnnxDomain);

/** Add a Node to this Graph, propagating the layering annotation from an existing node.
This is the preferred way to create new nodes in Level 1 (pre-partitioning) graph optimizers.
The new node automatically inherits the layering annotation from @p annotation_source, which
ensures correct layer-based partitioning when annotations are present.
@param name The Node name. Must be unique in this Graph.
@param op_type The operator type. e.g. ONNX operator name.
@param description Arbitrary description of the Node.
@param input_args The explicit inputs to this Node.
@param output_args The outputs from this Node.
@param annotation_source The node from which to inherit the layering annotation.
@param attributes Optional NodeAttributes to add.
@param domain The domain for the op_type.
@returns Reference to the new Node.
@remarks Use this overload in Level 1 optimizers that create nodes replacing or derived from
existing annotated nodes. See docs/Optimizer_Layering_Annotations.md for details.
*/
Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
gsl::span<NodeArg* const> input_args,
gsl::span<NodeArg* const> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain);

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
gsl::span<NodeArg* const> input_args,
gsl::span<NodeArg* const> output_args,
const Node& annotation_source,
NodeAttributes&& attributes,
const std::string& domain = kOnnxDomain);

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
@@ -1057,6 +1108,21 @@
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
std::initializer_list<NodeArg*> input_args,
std::initializer_list<NodeArg*> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
AsSpan(input_args),
AsSpan(output_args),
annotation_source,
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
@@ -1070,16 +1136,46 @@
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
gsl::span<NodeArg* const> input_args,
std::initializer_list<NodeArg*> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
input_args,
AsSpan(output_args),
annotation_source,
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
std::initializer_list<NodeArg*> input_args,
gsl::span<NodeArg* const> output_args,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
AsSpan(input_args),
output_args,
attributes, domain);
}

Node& AddNode(const std::string& name,
const std::string& op_type,
const std::string& description,
std::initializer_list<NodeArg*> input_args,
gsl::span<NodeArg* const> output_args,
const Node& annotation_source,
const NodeAttributes* attributes = nullptr,
const std::string& domain = kOnnxDomain) {
return AddNode(name, op_type, description,
AsSpan(input_args),
output_args,
annotation_source,
attributes, domain);
}

@@ -1322,10 +1418,12 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi

The Graph needs to be Resolve()d after this call.
@param func_to_inline The FunctionProto to inline.
@param parent_annotation Annotation inherited from the parent node that is being inlined.
@returns Status indicating success or providing an error message.
*/

Status InlineFunctionProto(const ONNX_NAMESPACE::FunctionProto& func_to_inline);
Status InlineFunctionProto(const ONNX_NAMESPACE::FunctionProto& func_to_inline,
const std::string& parent_annotation);

/** Mark a NodeArg name as coming from the outer scope when programmatically constructing a Graph that will
be used as a GraphProto attribute in another Node.
@@ -1569,6 +1667,11 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
// compiled model during partitioning, leaving them unused in the ORT Graph. To allow the memory to be freed
// we need to manually run the cleanup that would usually happen as part of Graph::Resolve.
Status RemovedUnusedInitializersOrtFormat();

// This examines all the nodes and removes any annotations that are only used for layering.
// This potentially saves memory.
Status RemoveAllLayeringAnnotations();

#endif // !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

// This friendship relationship should only be used to call Graph::Graph and
24 changes: 19 additions & 5 deletions include/onnxruntime/core/graph/indexed_sub_graph.h
@@ -86,18 +86,32 @@ struct IndexedSubGraph {

// Should call IsAccountingEnabled() first
// Takes the previously computed ResourceCount for the node
// (usually during GetCapabiilty())
// (usually during GetCapability())
// if present and adds it to the consumed amount
void AccountForNode(size_t cost_index) const {
assert(cost_index < nodes_costs.size());
resource_accountant->AddConsumedAmount(nodes_costs[cost_index]);
resource_accountant->CommitWeightsForNode(nodes[cost_index]);
}

// This computes and accounts for the resource cost for the node that just
// been fused from other nodes, and the EP did not had a chance to compute the costs.
void ComputeAndAccountForNode(const Node& node) const {
// Accounts for all constituent nodes by summing their pre-stored costs.
// Use this when fusing nodes into a single node so the total cost
// reflects what was computed during GetCapability() (with correct
// cross-node weight deduplication already applied).
void AccountForAllNodes() const {
assert(resource_accountant != nullptr);
resource_accountant->AddConsumedAmount(resource_accountant->ComputeResourceCount(node));
for (size_t i = 0; i < nodes_costs.size(); ++i) {
resource_accountant->AddConsumedAmount(nodes_costs[i]);
resource_accountant->CommitWeightsForNode(nodes[i]);
}
}

// Accounts for a node given its index and a pre-computed resource cost.
// Use this when the cost was computed externally (e.g. for a fused node).
void AccountForNode(NodeIndex node_index, const ResourceCount& resource_count) const {
assert(resource_accountant != nullptr);
resource_accountant->AddConsumedAmount(resource_count);
resource_accountant->CommitWeightsForNode(node_index);
}

void SetAccountant(IResourceAccountant* res_accountant) {
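
A simplified sketch of how the accounting and threshold machinery above fits together during assignment (the toy types and loop structure are assumptions; the real logic lives in the graph partitioner and EP `GetCapability` implementations):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy mirror of the IResourceAccountant threshold/stop machinery.
class ToyAccountant {
 public:
  void SetThreshold(size_t t) { threshold_ = t; }
  std::optional<size_t> GetThreshold() const { return threshold_; }
  void AddConsumedAmount(size_t amount) { consumed_ += amount; }
  size_t GetConsumedAmount() const { return consumed_; }
  void SetStopAssignment() noexcept { stop_ = true; }
  bool IsStopIssued() const noexcept { return stop_; }

 private:
  std::optional<size_t> threshold_;
  size_t consumed_ = 0;
  bool stop_ = false;
};

// Assign node costs in order until the threshold would be exceeded,
// then stop handing further nodes to this EP. Returns the number of
// nodes assigned.
size_t AssignUntilThreshold(ToyAccountant& acc, const std::vector<size_t>& node_costs) {
  size_t assigned = 0;
  for (size_t cost : node_costs) {
    if (acc.IsStopIssued()) break;
    if (acc.GetThreshold() && acc.GetConsumedAmount() + cost > *acc.GetThreshold()) {
      acc.SetStopAssignment();
      break;
    }
    acc.AddConsumedAmount(cost);
    ++assigned;
  }
  return assigned;
}
```

In the real flow, a stop request similarly prevents further nodes from being offered to the EP whose budget is exhausted.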