Merged
Changes from 52 commits
Commits
63 commits
18b5d2b
Introduce annotation query function
yuslepukhin Jan 21, 2026
1eeba1a
Wire annotations to partitioning interface.
yuslepukhin Jan 29, 2026
debc8dd
Fix up annotations with Transpose Optimizer
yuslepukhin Jan 29, 2026
8f0ff86
Add ORT_EXTENDED_MINIMAL build
yuslepukhin Jan 29, 2026
f7a422e
Move rules and matcher inside the index
yuslepukhin Jan 30, 2026
1ef4078
Add Update with tests
yuslepukhin Jan 30, 2026
828eca3
TODO: Consider not removing annotations
yuslepukhin Jan 30, 2026
1626d5c
Clear annotations after partitioning
yuslepukhin Feb 7, 2026
f33bab1
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Feb 7, 2026
b3ecb39
Address accountant bug
yuslepukhin Feb 7, 2026
b5ea1c6
Annotate tiny_gpt2_beamsearch by layers
yuslepukhin Feb 9, 2026
9a422ba
Refactor Graph_GetGraphView to make it a utility
yuslepukhin Feb 10, 2026
a1caf93
Introduce a graph utility to create an IndexedSubgraph
yuslepukhin Feb 10, 2026
e1b1c4f
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Feb 10, 2026
acec402
Fix lint in python script.
yuslepukhin Feb 11, 2026
31dd7a8
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 2, 2026
9fa4849
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 5, 2026
50c58c9
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 9, 2026
e445b60
Fix build errors and address Copilot comments
yuslepukhin Mar 9, 2026
358f7df
Reject duplicate rules
yuslepukhin Mar 9, 2026
653fb8b
Move methods to .cc
yuslepukhin Mar 9, 2026
23a8ecf
Remove code duplication
yuslepukhin Mar 10, 2026
ef1227e
Add missing include
yuslepukhin Mar 10, 2026
b0b2396
Fix matching bug
yuslepukhin Mar 10, 2026
b9e13cf
Change index parsing
yuslepukhin Mar 10, 2026
add0227
Remove wrong comment
yuslepukhin Mar 10, 2026
17e3525
Address minimal build issues
yuslepukhin Mar 10, 2026
1b1a7db
Fix unused arg
yuslepukhin Mar 10, 2026
88c2c47
Add logging
yuslepukhin Mar 18, 2026
9b0b529
Make sure the annotation is copied on node copy
yuslepukhin Mar 19, 2026
dab76bc
Adjust error message
yuslepukhin Mar 19, 2026
b39a487
Copy Annotations when copying nodes and inlining functions
yuslepukhin Mar 19, 2026
4e260bc
Update LayeringIndex after function inlining
yuslepukhin Mar 20, 2026
5214350
Add intermediate buffers accounting + temp coefficient
yuslepukhin Mar 23, 2026
3fb1d1e
Merge branch 'main' into yuslepukhin/layering
yuslepukhin Mar 23, 2026
cd73b56
Address MakeNodeUnassigned feedback
yuslepukhin Mar 25, 2026
dfe4d13
Address InlineNodes feedback
yuslepukhin Mar 25, 2026
62f3d11
Fix underaccounting for shared weights in fused nodes
yuslepukhin Mar 25, 2026
3cab988
Update onnxruntime/python/tools/layering/layer_annotate.py
yuslepukhin Mar 26, 2026
e6cb75f
Lint
yuslepukhin Mar 26, 2026
b2ef9a2
Flip = prefix to exact match
yuslepukhin Mar 26, 2026
fde6300
Adjust comments for duplicate annotations
yuslepukhin Mar 26, 2026
7871afa
Remove bad comment
yuslepukhin Mar 26, 2026
16ec921
Adjust EpWithNoLayeringRulesSeesAllUnassignedNodes
yuslepukhin Mar 26, 2026
4da5c3b
Throw on multiple annotations
yuslepukhin Mar 26, 2026
01e4506
Make sure annotations are propagated on function inlining
yuslepukhin Mar 26, 2026
3e52f14
Update include/onnxruntime/core/session/onnxruntime_session_options_c…
yuslepukhin Mar 26, 2026
24a46e8
Update onnxruntime/core/framework/graph_partitioner.cc
yuslepukhin Mar 26, 2026
e745e02
Update onnxruntime/core/graph/graph_utils.h
yuslepukhin Mar 26, 2026
59b5ccd
Fix issues in python
yuslepukhin Mar 26, 2026
054f894
Address undercounting problem
yuslepukhin Mar 27, 2026
09967c3
Add copyright header
yuslepukhin Mar 27, 2026
d23ee08
Update onnxruntime/core/framework/graph_partitioner.cc
yuslepukhin Mar 27, 2026
9e8be7a
Adjust doc and implementation for fetching layering ann
yuslepukhin Mar 27, 2026
f636138
Make GetContainingGraph public
yuslepukhin Mar 27, 2026
fcda524
Adjust accounting for fused node and remove stray local var
yuslepukhin Mar 27, 2026
2da1394
Address flaky test
yuslepukhin Mar 27, 2026
c0c5e51
Update onnxruntime/core/providers/cuda/cuda_execution_provider.cc
yuslepukhin Mar 27, 2026
c55bb6e
Update onnxruntime/core/graph/graph_utils.cc
yuslepukhin Mar 27, 2026
cdd9faa
Address review issues
yuslepukhin Mar 27, 2026
67a947b
Fix potential perf issue
yuslepukhin Mar 27, 2026
44c6904
Address review comments
yuslepukhin Mar 27, 2026
927a0ef
Add documentation for ann and ep propagation. Fix L1 optimizers, add …
yuslepukhin Mar 27, 2026
25 changes: 19 additions & 6 deletions include/onnxruntime/core/framework/resource_accountant.h
@@ -45,18 +45,31 @@ class IResourceAccountant {
virtual ResourceCount GetConsumedAmount() const = 0;
virtual void AddConsumedAmount(const ResourceCount& amount) = 0;
virtual void RemoveConsumedAmount(const ResourceCount& amount) = 0;
virtual ResourceCount ComputeResourceCount(const Node& node) const = 0;
virtual ResourceCount ComputeResourceCount(const Node& node) = 0;

std::optional<ResourceCount> GetThreshold() const {
return threshold_;
}

void SetThreshold(const ResourceCount& threshold) {
threshold_ = threshold;
}

void SetStopAssignment() noexcept {
stop_assignment_ = true;
}

bool IsStopIssued() const noexcept { return stop_assignment_; }

// Called before each GetCapability pass to discard pending weight tracking
// from a previous (discarded) pass. Default no-op for stats-based accountants.
virtual void ResetPendingWeights() {}

// Called when a node's cost is committed (AccountForNode/AccountForAllNodes).
// Moves the node's pending weights into the committed set so they persist
// across GetCapability passes. Default no-op for stats-based accountants.
virtual void CommitWeightsForNode(size_t /*node_index*/) {}

static std::string MakeUniqueNodeName(const Node& node);

private:
@@ -114,16 +127,16 @@ class NodeStatsRecorder {

void DumpStats(const std::filesystem::path& model_path) const;

[[nodiscard]] static Status CreateAccountants(
const ConfigOptions& config_options,
const std::filesystem::path& model_path,
std::optional<ResourceAccountantMap>& acc_map);

private:
void DumpStats(std::ostream& os) const;

struct Impl;
std::unique_ptr<Impl> impl_;
};

Status CreateAccountants(
const ConfigOptions& config_options,
const std::filesystem::path& model_path,
std::optional<ResourceAccountantMap>& acc_map);

} // namespace onnxruntime
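The comments on `ResetPendingWeights()` and `CommitWeightsForNode()` describe a two-stage bookkeeping scheme: weights seen during a `GetCapability` pass are only pending, and become permanent once the node's cost is committed. The sketch below illustrates that idea under stated assumptions. `ToyWeightAccountant`, its `Weights` alias, and the use of plain byte counts in place of `ResourceCount` are all hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a size-based accountant; not the ORT class.
class ToyWeightAccountant {
 public:
  using Weights = std::vector<std::pair<std::string, std::size_t>>;  // (name, bytes)

  // Charges a weight only the first time it is seen, whether it was committed
  // earlier or is pending in the current pass. Mutates state, hence non-const.
  std::size_t ComputeResourceCount(std::size_t node_index, const Weights& weights) {
    std::size_t cost = 0;
    for (const auto& [name, bytes] : weights) {
      if (committed_.count(name) != 0 || pending_.count(name) != 0) continue;
      pending_.insert(name);
      pending_by_node_[node_index].push_back(name);
      cost += bytes;
    }
    return cost;
  }

  // Forget everything seen in a pass whose result was discarded.
  void ResetPendingWeights() {
    pending_.clear();
    pending_by_node_.clear();
  }

  // Promote this node's pending weights so later passes do not charge them again.
  void CommitWeightsForNode(std::size_t node_index) {
    const auto it = pending_by_node_.find(node_index);
    if (it == pending_by_node_.end()) return;
    for (const std::string& name : it->second) {
      assert(committed_.count(name) == 0);  // pending and committed sets are disjoint
      pending_.erase(name);
      committed_.insert(name);
    }
    pending_by_node_.erase(it);
  }

 private:
  std::unordered_set<std::string> committed_;
  std::unordered_set<std::string> pending_;
  std::unordered_map<std::size_t, std::vector<std::string>> pending_by_node_;
};
```

In this sketch a weight shared by two nodes is charged to whichever node is costed first in a pass, and a reset without a commit makes it chargeable again. That also suggests why `ComputeResourceCount` would need to drop its `const` qualifier once it records pending state.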
25 changes: 24 additions & 1 deletion include/onnxruntime/core/graph/graph.h
@@ -174,7 +174,12 @@ class Node {
*/
void SetSinceVersion(int since_version) noexcept { since_version_ = since_version; }

void SetLayeringAnnotation(std::string annotation) { layering_annotation_ = std::move(annotation); }

const std::string& GetLayeringAnnotation() const noexcept { return layering_annotation_; }

#if !defined(ORT_MINIMAL_BUILD)

/** Gets the Node's OpSchema.
@remarks The graph containing this node must be resolved, otherwise nullptr will be returned. */
const ONNX_NAMESPACE::OpSchema* Op() const noexcept { return op_; }
@@ -256,6 +261,13 @@ class Node {
#endif // !defined(ORT_MINIMAL_BUILD)

#if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

// Make sure that the annotation does not occupy memory after partitioning is done.
void ClearLayeringAnnotation() {
std::string t;
layering_annotation_.swap(t);
}

/** Gets a modifiable count of arguments for each of the Node's explicit inputs.
@todo This should be removed in favor of a method that updates the input args and the count.
Currently these operations are separate which is not a good setup. */
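`ClearLayeringAnnotation()` swaps the member with an empty temporary instead of calling `clear()`. A plausible reason, consistent with the comment about not occupying memory, is that `std::string::clear()` usually keeps the allocated capacity, while the swap hands the buffer to a temporary that is destroyed on scope exit. The following is a minimal sketch of that idiom. `ToyNode` and `AnnotationCapacity()` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical miniature of a node that carries a layering annotation.
class ToyNode {
 public:
  void SetLayeringAnnotation(std::string annotation) { annotation_ = std::move(annotation); }
  const std::string& GetLayeringAnnotation() const noexcept { return annotation_; }

  // Swap with an empty temporary; the old buffer is freed when `empty` is destroyed.
  void ClearLayeringAnnotation() {
    std::string empty;
    annotation_.swap(empty);
    assert(annotation_.empty());
  }

  // Exposed only so the effect on the allocation can be observed.
  std::size_t AnnotationCapacity() const noexcept { return annotation_.capacity(); }

 private:
  std::string annotation_;
};
```

On common standard library implementations the capacity after the swap falls back to the small-string buffer size, so a long annotation no longer holds a heap allocation.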
@@ -568,6 +580,8 @@ class Node {
friend class Graph;
Node(NodeIndex index, Graph& graph) : index_(index), graph_(&graph), can_be_saved_(true) {}

const Graph* GetContainingGraph() const noexcept { return graph_; }

protected:
#if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)
// internal only method to allow selected classes to directly alter the input/output definitions and arg counts
@@ -685,6 +699,8 @@ class Node {
// Graph instances for subgraphs that are owned by this Node
std::vector<std::unique_ptr<Graph>> subgraphs_;

std::string layering_annotation_;

// Can be saved? The node cannot be saved anymore if removable attributes have been cleared.
bool can_be_saved_;
};
@@ -1322,10 +1338,12 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi

The Graph needs to be Resolve()d after this call.
@param func_to_inline
@param parent_annotation Annotation inherited from the parent node that is being inlined.
@returns Status indicating success or providing an error message.
*/

Status InlineFunctionProto(const ONNX_NAMESPACE::FunctionProto& func_to_inline);
Status InlineFunctionProto(const ONNX_NAMESPACE::FunctionProto& func_to_inline,
const std::string& parent_annotation);

/** Mark a NodeArg name as coming from the outer scope when programmatically constructing a Graph that will
be used as a GraphProto attribute in another Node.
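The new `parent_annotation` parameter of `InlineFunctionProto` lets the nodes produced by inlining keep the layering annotation of the function node they replace. The sketch below shows one way such propagation could look. It assumes that every inlined node receives the parent's annotation whenever that annotation is non-empty, which is an assumption rather than something the diff states. `ToyNode`, `InlineBody`, and `InlineCall` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature node: a name plus its layering annotation.
struct ToyNode {
  std::string name;
  std::string layering_annotation;
};

// Appends copies of the function body to the graph, stamping each copy
// with the annotation of the node being inlined (assumed policy).
void InlineBody(std::vector<ToyNode>& graph, const std::vector<ToyNode>& body,
                const std::string& parent_annotation) {
  for (const ToyNode& body_node : body) {
    ToyNode copy = body_node;
    if (!parent_annotation.empty()) copy.layering_annotation = parent_annotation;
    graph.push_back(std::move(copy));
  }
}

// Replaces the call node at call_index with the function body.
void InlineCall(std::vector<ToyNode>& graph, std::size_t call_index,
                const std::vector<ToyNode>& body) {
  assert(call_index < graph.size());
  // Copy the annotation first: erasing the call node would invalidate a reference.
  const std::string parent_annotation = graph[call_index].layering_annotation;
  graph.erase(graph.begin() + static_cast<std::ptrdiff_t>(call_index));
  InlineBody(graph, body, parent_annotation);
}
```

Passing the annotation explicitly, as the new overload does, avoids depending on the call node still being alive while its replacement nodes are created.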
@@ -1569,6 +1587,11 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
// compiled model during partitioning, leaving them unused in the ORT Graph. To allow the memory to be freed
// we need to manually run the cleanup that would usually happen as part of Graph::Resolve.
Status RemovedUnusedInitializersOrtFormat();

// This examines all the nodes and removes any annotations that are only used for layering.
// This potentially saves memory.
Status RemoveAllLayeringAnnotations();

#endif // !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

// This friendship relationship should only be used to call Graph::Graph and
28 changes: 23 additions & 5 deletions include/onnxruntime/core/graph/indexed_sub_graph.h
@@ -86,24 +86,42 @@ struct IndexedSubGraph {

// Should call IsAccountingEnabled() first
// Takes the previously computed ResourceCount for the node
// (usually during GetCapabiilty())
// (usually during GetCapability())
// if present and adds it to the consumed amount
void AccountForNode(size_t cost_index) const {
assert(cost_index < nodes_costs.size());
resource_accountant->AddConsumedAmount(nodes_costs[cost_index]);
resource_accountant->CommitWeightsForNode(nodes[cost_index]);
}

// This computes and accounts for the resource cost for the node that just
// been fused from other nodes, and the EP did not had a chance to compute the costs.
void ComputeAndAccountForNode(const Node& node) const {
// Accounts for all constituent nodes by summing their pre-stored costs.
// Use this when fusing nodes into a single node so the total cost
// reflects what was computed during GetCapability() (with correct
// cross-node weight deduplication already applied).
void AccountForAllNodes() const {
assert(resource_accountant != nullptr);
resource_accountant->AddConsumedAmount(resource_accountant->ComputeResourceCount(node));
for (size_t i = 0; i < nodes_costs.size(); ++i) {
resource_accountant->AddConsumedAmount(nodes_costs[i]);
resource_accountant->CommitWeightsForNode(nodes[i]);
}
}

// Accounts for a node given its index and a pre-computed resource cost.
// Use this when the cost was computed externally (e.g. for a fused node).
void AccountForNode(NodeIndex node_index, const ResourceCount& resource_count) const {
assert(resource_accountant != nullptr);
resource_accountant->AddConsumedAmount(resource_count);
resource_accountant->CommitWeightsForNode(node_index);
}

void SetAccountant(IResourceAccountant* res_accountant) {
resource_accountant = res_accountant;
}

IResourceAccountant* GetAccountant() const noexcept {
return resource_accountant;
}

// Append resource count to the list of costs for the nodes.
void AppendNodeCost(const ResourceCount& cost) {
assert(resource_accountant != nullptr);
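The hunk above replaces recomputing the cost of a fused node with summing the per-node costs already stored during `GetCapability()`. The sketch below mirrors that flow with toy types. `ToyAccountant` and `ToySubGraph` are hypothetical, costs are plain integers rather than `ResourceCount`, and the check that `nodes` and `nodes_costs` have equal length is an added assumption that the original code does not assert.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accountant that records what it was asked to do.
struct ToyAccountant {
  std::size_t consumed = 0;
  std::vector<std::size_t> committed_nodes;
  void AddConsumedAmount(std::size_t amount) { consumed += amount; }
  void CommitWeightsForNode(std::size_t node_index) { committed_nodes.push_back(node_index); }
};

// Hypothetical miniature of a subgraph with per-node pre-computed costs.
struct ToySubGraph {
  std::vector<std::size_t> nodes;        // node indices
  std::vector<std::size_t> nodes_costs;  // cost of nodes[i], stored earlier
  ToyAccountant* resource_accountant = nullptr;

  // Account for one node using its stored cost.
  void AccountForNode(std::size_t cost_index) const {
    assert(resource_accountant != nullptr);
    assert(cost_index < nodes_costs.size());
    resource_accountant->AddConsumedAmount(nodes_costs[cost_index]);
    resource_accountant->CommitWeightsForNode(nodes[cost_index]);
  }

  // Account for every constituent node, e.g. right before fusing them.
  void AccountForAllNodes() const {
    assert(resource_accountant != nullptr);
    assert(nodes.size() == nodes_costs.size());  // assumption made by this sketch
    for (std::size_t i = 0; i < nodes_costs.size(); ++i) {
      resource_accountant->AddConsumedAmount(nodes_costs[i]);
      resource_accountant->CommitWeightsForNode(nodes[i]);
    }
  }
};
```

Because the stored costs were computed with cross-node weight deduplication already applied, a node that shares all of its weights with an earlier node can legitimately carry a cost of zero and still be committed.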
@@ -325,13 +325,33 @@ static const char* const kOrtSessionOptionsCollectNodeMemoryStatsToFile = "sessi
/// This is a composite CSV setting formatted as "memory limit in kb,file name for collected stats"
/// "limit > 0": enables Capacity Aware Partitioning for Cuda EP. `limit` is optional and when absent
/// the provider may attempt to figure out the memory available automatically.
/// The setting with no pre-recorded stats is expected to look like: "limit > 0,".
/// In this case, the EP will calculate memory using the initializers referenced by the node.
/// This enables ad-hoc and flexible scenarios with no pre-recorded stats, but may be less accurate.
/// The setting with no limit is expected to look like: ",file name for collected stats"
/// The EP will place nodes on device "file name" :
/// Finally a setting with both limit and pre-recorded stats absent can contain a single comma: ",".
/// The EP will attempt to place nodes on device (currently only CUDA is supported) :
/// this file is expected to be found in the same folder as the model. The file contains
/// pre-recorded stats collected when running with kOrtSessionOptionsCollectNodeMemoryStatsToFile enabled (see above)
static const char* const kOrtSessionOptionsResourceCudaPartitioningSettings =
"session.resource_cuda_partitioning_settings";

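The comment for `session.resource_cuda_partitioning_settings` allows four shapes of the value: `limit,file`, `limit,`, `,file`, and a lone `,`. The sketch below is a minimal parser for those shapes, not the parser used by ONNX Runtime. `ToyPartitioningSettings` and `ParseToyPartitioningSettings` are hypothetical names, and the sketch assumes that the comma is mandatory, that the value is split at the first comma, and that no whitespace trimming is needed.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical parsed form of the "limit in kb,stats file" setting.
struct ToyPartitioningSettings {
  std::optional<std::uint64_t> limit_kb;  // absent: the EP determines the limit itself
  std::optional<std::string> stats_file;  // absent: no pre-recorded stats
};

// Returns std::nullopt when the value is malformed.
std::optional<ToyPartitioningSettings> ParseToyPartitioningSettings(std::string_view value) {
  const auto comma = value.find(',');
  if (comma == std::string_view::npos) return std::nullopt;  // assumed: comma is required
  const std::string_view limit_part = value.substr(0, comma);
  const std::string_view file_part = value.substr(comma + 1);
  assert(limit_part.size() + 1 + file_part.size() == value.size());

  ToyPartitioningSettings out;
  if (!limit_part.empty()) {
    std::uint64_t limit = 0;
    const char* const end = limit_part.data() + limit_part.size();
    const auto [ptr, ec] = std::from_chars(limit_part.data(), end, limit);
    // Reject non-numeric text, trailing characters, and a zero limit ("limit > 0").
    if (ec != std::errc{} || ptr != end || limit == 0) return std::nullopt;
    out.limit_kb = limit;
  }
  if (!file_part.empty()) out.stats_file = std::string(file_part);
  return out;
}
```

With this representation, an absent `stats_file` corresponds to the case where the EP estimates memory from the initializers a node references.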
/// <summary>
/// This is a setting that contains string annotations or annotation prefixes to be matched
/// against individual nodes metadata entry 'layer_ann' to guide layer assignment during partitioning.
/// The value is a semicolon separated list of strings or string prefixes per device.
/// Format: device1(annotation1, annotation2, ...); device2(annotation1, =annotation3, ...);...
/// Where:
/// - device1, device2, ... are the recognized device names to be matched against EPs configured in
/// the given session.
/// - annotation1, annotation2, ... are annotation prefixes to be matched against node annotations. Any
/// node annotation that starts with one of these prefixes will be matched.
/// - =annotation3 indicates an exact match for annotation3. Only node annotations that are exactly
/// equal to 'annotation3' will be matched.
/// TODO: add a list of recognized devices here.
/// </summary>
static const char* const kOrtSessionOptionsLayerAssignmentSettings = "session.layer_assignment_settings";

// Enable EP context feature to dump the partitioned graph which includes the EP context into Onnx file.
// The dumped Onnx model with EP context can be used for future inference to avoid the EP graph partitioning/compile overhead.
// "0": disable. (default)
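The `session.layer_assignment_settings` format documented above (`device(annotation, =annotation, ...); ...`) can be illustrated with a small parser and matcher. This is a sketch under stated assumptions, not the PR's implementation. The names `ToyRule`, `ParseToyLayerAssignment`, and `FindToyDevice` are hypothetical, as are the device names used in the usage note, and the first-matching-rule-wins policy is an assumption because the diff does not state how overlapping rules are resolved.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

struct ToyRule {
  std::string device;
  std::string text;  // annotation or annotation prefix, without the leading '='
  bool exact;        // true when the rule was written as "=text"
};

std::string_view Trim(std::string_view s) {
  const auto first = s.find_first_not_of(" \t");
  if (first == std::string_view::npos) return {};
  const auto last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Parses "dev1(a, =b); dev2(c)". Returns std::nullopt on malformed input.
std::optional<std::vector<ToyRule>> ParseToyLayerAssignment(std::string_view settings) {
  std::vector<ToyRule> rules;
  while (!settings.empty()) {
    const auto semi = settings.find(';');
    const std::string_view entry = Trim(settings.substr(0, semi));
    settings = (semi == std::string_view::npos) ? std::string_view{} : settings.substr(semi + 1);
    if (entry.empty()) continue;  // tolerate a trailing ';'
    const auto open = entry.find('(');
    if (open == std::string_view::npos || entry.back() != ')') return std::nullopt;
    const std::string_view device = Trim(entry.substr(0, open));
    if (device.empty()) return std::nullopt;
    std::string_view list = entry.substr(open + 1, entry.size() - open - 2);
    while (true) {
      const auto comma = list.find(',');
      std::string_view item = Trim(list.substr(0, comma));
      const bool exact = !item.empty() && item.front() == '=';
      if (exact) item.remove_prefix(1);
      if (item.empty()) return std::nullopt;
      rules.push_back({std::string(device), std::string(item), exact});
      if (comma == std::string_view::npos) break;
      list.remove_prefix(comma + 1);
    }
  }
  return rules;
}

bool Matches(const ToyRule& rule, std::string_view annotation) {
  assert(!rule.text.empty());  // an empty prefix would match every annotation
  return rule.exact ? annotation == rule.text
                    : annotation.substr(0, rule.text.size()) == rule.text;
}

// Assumed policy: the first rule that matches, in declaration order, wins.
std::optional<std::string> FindToyDevice(const std::vector<ToyRule>& rules,
                                         std::string_view annotation) {
  for (const ToyRule& rule : rules) {
    if (Matches(rule, annotation)) return rule.device;
  }
  return std::nullopt;
}
```

For the value `cpu(layer0, =embed); cuda(layer1);`, an annotation `layer0.attn` maps to `cpu` through the prefix rule, `embed` maps to `cpu` through the exact rule, `embedding` matches nothing, and `layer10` maps to `cuda` because `layer1` is a prefix of it. The last case shows why prefixes may need a delimiter when layer numbers share leading digits.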