[DOCS] Add annotated partitioning documentation #27972
Merged
Commits (6):
93ab698 Add annotated partitioning (yuslepukhin)
0e38eff Address Copilot review comments (yuslepukhin)
79be28c Address review round 2 (yuslepukhin)
2fe2614 Address Copilot review round 3 (yuslepukhin)
d37a3e9 Address review comments (yuslepukhin)
8527f1a Address comments (yuslepukhin)
269 changes: 269 additions & 0 deletions
docs/annotated_partitioning/PartitioningWithAnnotationsAndMemoryConstraints.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,269 @@ | ||
| # Graph Partitioning with Annotations and Memory Constraints | ||
|
|
||
| ONNX Runtime automatically partitions a model graph across the execution providers (EPs) registered with a session. This page describes the advanced partitioning features introduced for controlling how nodes are assigned to devices and how GPU memory consumption is managed during partitioning. | ||
|
|
||
| ## Overview | ||
|
|
||
| Large models may exceed the memory capacity of a single accelerator (e.g., a CUDA GPU). These features allow you to: | ||
| 1. Annotate model layers so that specific parts of the model are directed to specific devices (CPU, GPU, NPU). | ||
| 2. Collect per-node memory statistics during a profiling run. | ||
| 3. Set a memory budget for an EP so that ONNX Runtime only places nodes on the accelerator until the budget is exhausted; remaining nodes are then eligible for assignment by the subsequent EPs in the session's provider list (often CPU, but not necessarily). | ||
|
|
||
| Together, these form a two-phase workflow: profile the model once to collect memory data, then partition it in production using that data and a memory limit. | ||
|
|
||
| ## Layer Assignment Annotations | ||
|
|
||
| ### Concept | ||
|
|
||
| Each node in an ONNX model can carry a metadata property called `layer_ann` (layering annotation). This is a free-form string that identifies which logical layer or group the node belongs to. At session creation time, you provide a configuration that maps annotation patterns to target devices. ONNX Runtime then uses these mappings to pre-assign nodes to the corresponding EPs before the normal capability-based partitioning runs. | ||
|
|
||
| ### Annotating the Model | ||
|
|
||
| Annotations are stored in each ONNX `NodeProto`'s `metadata_props` field with the key `layer_ann`. You can add them manually using the ONNX Python API: | ||
|
|
||
| ```python | ||
| import onnx | ||
| model = onnx.load("model.onnx") | ||
| for node in model.graph.node: | ||
|     # Assign a layer annotation based on your own logic | ||
|     entry = next((prop for prop in node.metadata_props if prop.key == "layer_ann"), None) | ||
|     if entry is None: | ||
|         entry = node.metadata_props.add() | ||
|         entry.key = "layer_ann" | ||
|     entry.value = "encoder_layer_0"  # your annotation string | ||
|
|
||
| onnx.save(model, "model_annotated.onnx") | ||
| ``` | ||
|
|
||
| ### Annotating with Olive | ||
|
|
||
| Olive provides built-in support for adding layer annotations during ONNX conversion via the `CaptureLayerAnnotations` pass (added in [PR #2361](https://github.com/microsoft/Olive/pull/2361)). You supply a `layer_annotations` dictionary where each **key** is the annotation string to write into `layer_ann`, and each **value** is a list of node-name substrings. During ONNX export, every node whose name contains one of the substrings receives the corresponding `layer_ann` metadata property. If multiple substrings match, the first one in iteration order wins. | ||
|
|
||
| Models that originate in MS Foundry typically follow a standard naming scheme. Node names in such models contain recurring patterns (e.g., `embed_tokens`, `self_attn`, `mlp`, `norm`) that can be used to identify layers and annotate them accordingly. | ||
|
|
||
| #### Step 1 — Create the workflow config file | ||
|
|
||
| Save the following as `annotate_model.json`: | ||
|
|
||
| ```json | ||
| { | ||
|   "input_model": { | ||
|     "type": "HfModel", | ||
|     "model_path": "microsoft/Phi-3.5-mini-instruct" | ||
|   }, | ||
|   "passes": { | ||
|     "capture_annotations": { | ||
|       "type": "CaptureLayerAnnotations", | ||
|       "layer_annotations": { | ||
|         "embedding_layer": ["embed_tokens"], | ||
|         "attention_layer": ["self_attn", "q_proj", "k_proj", "v_proj", "o_proj"], | ||
|         "mlp_layer": ["mlp", "gate_proj", "up_proj", "down_proj"], | ||
|         "norm_layer": ["norm", "layernorm"] | ||
|       } | ||
|     }, | ||
|     "conversion": { | ||
|       "type": "OnnxConversion", | ||
|       "target_opset": 16, | ||
|       "save_as_external_data": true, | ||
|       "all_tensors_to_one_file": true | ||
|     } | ||
|   }, | ||
|   "log_severity_level": 1, | ||
|   "output_dir": "models/annotated_phi3" | ||
| } | ||
| ``` | ||
|
|
||
| The `layer_annotations` dictionary maps annotation names to node-name substring patterns: | ||
| - Any node whose name contains `"embed_tokens"` → annotated as `"embedding_layer"` | ||
| - Any node whose name contains `"self_attn"`, `"q_proj"`, etc. → annotated as `"attention_layer"` | ||
| - And so on for `"mlp_layer"` and `"norm_layer"`. | ||
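The substring matching described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Olive's implementation: the helper name `annotate_by_substring` is hypothetical, the sample node names are made up, and "first match wins" is interpreted here as the first annotation key in dictionary order.

```python
def annotate_by_substring(node_name, layer_annotations):
    """Return the first annotation whose substring list matches node_name, else None.

    Hypothetical helper for illustration; iteration follows dict insertion order.
    """
    for annotation, substrings in layer_annotations.items():
        if any(substring in node_name for substring in substrings):
            return annotation
    return None


# Same mapping as in the workflow config above.
layer_annotations = {
    "embedding_layer": ["embed_tokens"],
    "attention_layer": ["self_attn", "q_proj", "k_proj", "v_proj", "o_proj"],
    "mlp_layer": ["mlp", "gate_proj", "up_proj", "down_proj"],
    "norm_layer": ["norm", "layernorm"],
}

# Made-up node names, used only to exercise the helper.
for name in ["/model/embed_tokens/Gather",
             "/model/layers.0/self_attn/q_proj/MatMul",
             "/lm_head/MatMul"]:
    print(name, "->", annotate_by_substring(name, layer_annotations))
```

Nodes whose names match none of the substrings get no annotation in this sketch, so they would be left to normal partitioning.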
|
|
||
| You can also use `ModelBuilder` instead of `OnnxConversion` — both paths apply the annotations automatically: | ||
|
|
||
| ```json | ||
| "conversion": { | ||
|   "type": "ModelBuilder", | ||
|   "precision": "int4" | ||
| } | ||
| ``` | ||
|
|
||
| #### Step 2 — Run the workflow | ||
|
|
||
| ```bash | ||
| pip install 'olive-ai[auto-opt]' | ||
| olive run --config annotate_model.json | ||
| ``` | ||
|
|
||
| This will: | ||
| 1. Download the model from Hugging Face. | ||
| 2. Store the `layer_annotations` mapping inside `model_attributes` (via `CaptureLayerAnnotations`). | ||
| 3. Convert to ONNX and stamp every matching node with `metadata_props["layer_ann"]` set to the corresponding annotation name. | ||
| 4. Write the annotated ONNX model to `models/annotated_phi3/`. | ||
|
|
||
| #### Verifying the annotations | ||
|
|
||
| You can verify that the annotations were applied: | ||
|
|
||
| ```python | ||
| import onnx | ||
|
|
||
| model = onnx.load("models/annotated_phi3/model.onnx", load_external_data=False) | ||
| for node in model.graph.node: | ||
|     for prop in node.metadata_props: | ||
|         if prop.key == "layer_ann": | ||
|             print(f"{node.name}: {prop.value}") | ||
| ``` | ||
|
|
||
| The annotated model is now ready for use with the ORT session options described below. | ||
|
|
||
| ### Configuring Layer Assignment at Runtime | ||
|
|
||
| Use the session option `session.layer_assignment_settings` to tell ONNX Runtime how to map annotations to devices. The value is a `;`-separated list of device rules in the following form: | ||
|
|
||
| ``` | ||
| device1(annotation1, annotation2, ...); device2(=annotation3, annotation4, ...) | ||
| ``` | ||
|
|
||
| - `device`: a recognized device designator such as `cpu`, `gpu`, or `npu`, matched against the execution providers registered in the session. | ||
| - `annotation`: string to match against the `layer_ann` value on each node. | ||
| - `=` prefix: denotes an exact match. Without `=`, the annotation is treated as a prefix match (any node whose `layer_ann` starts with the string will match). | ||
| - Prefix rules have higher priority than exact-match rules. Within the same match type, priority is left-to-right. | ||
| - Multiple device rules are separated by `;`. | ||
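To make the rule syntax concrete, here is a minimal Python sketch that parses a settings string and resolves an annotation to a device using the priority order stated in the list above (prefix rules first, then exact-match rules, each left to right). The function names `parse_layer_assignment` and `resolve_device` are hypothetical, and this is not ONNX Runtime's actual parser; it only mirrors the documented format.

```python
def parse_layer_assignment(settings):
    """Parse 'dev1(a, =b); dev2(c)' into a list of (device, pattern, is_exact) rules."""
    rules = []
    for group in settings.split(";"):
        group = group.strip()
        if not group:
            continue
        device, _, rest = group.partition("(")
        if not rest.endswith(")"):
            raise ValueError(f"malformed rule group: {group!r}")
        for item in rest[:-1].split(","):
            item = item.strip()
            if not item:
                continue
            is_exact = item.startswith("=")
            pattern = item[1:] if is_exact else item
            rules.append((device.strip(), pattern, is_exact))
    return rules


def resolve_device(layer_ann, rules):
    """Return the device for a layer_ann value, or None if no rule matches."""
    # Prefix rules are consulted first, as stated in the list above.
    for device, pattern, is_exact in rules:
        if not is_exact and layer_ann.startswith(pattern):
            return device
    # Exact-match rules are consulted next, left to right.
    for device, pattern, is_exact in rules:
        if is_exact and layer_ann == pattern:
            return device
    return None


rules = parse_layer_assignment("gpu(encoder); cpu(=final_output)")
print(resolve_device("encoder_layer_0", rules))
print(resolve_device("final_output", rules))
```

A `None` result in this sketch corresponds to a node that matches no rule and is left to capability-based assignment.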
|
|
||
| ```python | ||
| import onnxruntime as ort | ||
|
|
||
| opts = ort.SessionOptions() | ||
|
|
||
| # Nodes annotated with layer_ann starting with "encoder" go to GPU, | ||
| # nodes with exact annotation "final_output" go to CPU. | ||
| opts.add_session_config_entry( | ||
|     "session.layer_assignment_settings", | ||
|     "gpu(encoder); cpu(=final_output)" | ||
| ) | ||
|
|
||
| session = ort.InferenceSession("model_annotated.onnx", opts, | ||
|                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"]) | ||
| ``` | ||
|
|
||
| Nodes that do not match any rule fall through to the normal EP capability-based assignment. | ||
|
|
||
| ## Capacity-Aware Partitioning (implemented for CUDA) | ||
|
|
||
| When running models on a CUDA GPU with limited memory, you can set a memory budget so ONNX Runtime stops assigning nodes to the CUDA EP once the estimated memory consumption reaches the limit. Remaining nodes are then eligible for assignment by the subsequent EPs in the session's provider list (often CPU, but not necessarily). | ||
|
|
||
| ### Step 1: Collect Memory Statistics (Profiling Run) | ||
|
|
||
| Run the model once with memory statistics collection enabled. This records per-node allocation data to a CSV file. | ||
|
|
||
| ```python | ||
| import onnxruntime as ort | ||
| import numpy as np | ||
|
|
||
| opts = ort.SessionOptions() | ||
| # Disable memory patterns for accurate per-node measurement | ||
| opts.enable_mem_pattern = False | ||
| opts.add_session_config_entry( | ||
|     "session.collect_node_memory_stats_to_file", | ||
|     "node_memory_stats.csv" | ||
| ) | ||
|
|
||
| session = ort.InferenceSession("model.onnx", opts, | ||
|                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"]) | ||
|
|
||
| def make_concrete_shape(shape, default_dim=1): | ||
|     """ORT input shapes may contain symbolic dims or None (e.g. batch size). | ||
|     Replace those with a small concrete value for profiling.""" | ||
|     return tuple( | ||
|         dim if isinstance(dim, int) and dim > 0 else default_dim | ||
|         for dim in shape | ||
|     ) | ||
|
|
||
| # Run inference at least once to collect statistics. | ||
| # For models with dynamic inputs, prefer real sample inputs or model-appropriate | ||
| # concrete shapes instead of relying on the declared ORT input shape directly. | ||
| # This sketch assumes float32 inputs; use the dtype matching inp.type for other models. | ||
| input_data = { | ||
|     inp.name: np.zeros(make_concrete_shape(inp.shape), dtype=np.float32) | ||
|     for inp in session.get_inputs() | ||
| } | ||
| session.run(None, input_data) | ||
| ``` | ||
|
|
||
| This produces a CSV file with columns: | ||
| `#name,input_sizes,initializers_sizes,total_dynamic_sizes,total_temp_allocations` | ||
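A quick way to inspect the collected statistics is to load the CSV and look at per-node values. The sketch below uses only the Python standard library and the column names listed above. The helper name `load_node_memory_stats` and the sample rows are made up for illustration, the values are assumed to be plain integers, and summing all four columns is only a rough approximation chosen for this example, not the formula ONNX Runtime uses.

```python
import csv
import io


def load_node_memory_stats(text):
    """Parse stats CSV text into {node_name: {column: int}} (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # The first header field is "#name"; strip the leading '#'.
    columns = [header[0].lstrip("#")] + header[1:]
    stats = {}
    for row in reader:
        if not row:
            continue
        stats[row[0]] = {col: int(val) for col, val in zip(columns[1:], row[1:])}
    return stats


# Made-up sample content; a real file comes from the profiling run.
sample = """#name,input_sizes,initializers_sizes,total_dynamic_sizes,total_temp_allocations
node_a,1024,4096,2048,0
node_b,2048,0,512,256
"""

stats = load_node_memory_stats(sample)
totals = {name: sum(cols.values()) for name, cols in stats.items()}
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(name, total)
```

To read a real file, pass the file contents instead of `sample`, for example `load_node_memory_stats(open("node_memory_stats.csv").read())`.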
|
|
||
| In this example, `node_memory_stats.csv` is a relative path. Relative paths are resolved against the model's directory when the model was loaded from a filesystem path. If you provide an absolute path, that path is used as-is. If the model was not loaded from a filesystem path (for example, it was loaded from bytes), the output file is written relative to the current working directory. | ||
|
|
||
| Multiple `session.run()` calls update the stats with the maximum values observed per node. | ||
|
|
||
| ### Step 2: Partition with a Memory Budget | ||
|
|
||
| In a subsequent session, provide the memory limit and the stats file to enable capacity-aware partitioning. | ||
|
|
||
| ```python | ||
| import onnxruntime as ort | ||
|
|
||
| opts = ort.SessionOptions() | ||
|
|
||
| # Format: "memory_limit_in_kb,stats_filename" | ||
| # Set a 4 GB limit and use the stats from the profiling run | ||
| opts.add_session_config_entry( | ||
|     "session.resource_cuda_partitioning_settings", | ||
|     "4194304,node_memory_stats.csv" | ||
| ) | ||
|
|
||
| session = ort.InferenceSession("model.onnx", opts, | ||
|                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"]) | ||
| ``` | ||
|
|
||
| ONNX Runtime processes nodes in priority order, accumulating estimated memory. When the cumulative cost exceeds the budget, remaining nodes are not assigned to the CUDA EP and are eligible for assignment by the subsequent EPs in the session's provider list. | ||
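The accumulation described above can be pictured with a small greedy sketch. It is a simplified model, not ONNX Runtime's partitioner: the function name `assign_within_budget` is hypothetical, node costs are assumed to be given in bytes, and the sketch assumes that once one node no longer fits, every later node is also left for the next EP.

```python
def assign_within_budget(node_costs_bytes, limit_kb):
    """Split ordered (name, cost_in_bytes) pairs into (on_accelerator, spilled) lists."""
    budget = limit_kb * 1024  # the setting is expressed in KB
    used = 0
    exhausted = False
    on_accelerator, spilled = [], []
    for name, cost in node_costs_bytes:
        if not exhausted and used + cost <= budget:
            used += cost
            on_accelerator.append(name)
        else:
            # Assumption: after the budget is exhausted, later nodes are not retried.
            exhausted = True
            spilled.append(name)
    return on_accelerator, spilled


# Made-up costs: with a 2 KB budget only the first node fits.
print(assign_within_budget([("a", 1024), ("b", 2048), ("c", 512)], limit_kb=2))
```

In this model the order of nodes matters: a large node early in the list can exhaust the budget before smaller nodes are considered.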
|
|
||
| ### Ad-Hoc Mode (No Stats File) | ||
|
|
||
| If you do not have pre-recorded statistics, you can specify only a memory limit. ONNX Runtime will estimate per-node cost from initializer sizes and static output shapes, applying a 1.5x safety multiplier. | ||
|
|
||
| This mode is less accurate than using pre-recorded stats but provides a quick way to constrain GPU memory without a profiling run. | ||
|
|
||
| ```python | ||
| # Memory limit only, no stats file (note the trailing comma) | ||
| opts.add_session_config_entry( | ||
|     "session.resource_cuda_partitioning_settings", | ||
|     "4194304," | ||
| ) | ||
| ``` | ||
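As a rough illustration of the ad-hoc estimate described in this section, the sketch below computes a per-node cost from an initializer size and static output shapes, then applies the 1.5x multiplier. The function name `adhoc_node_cost`, the 4-byte element size, and applying the multiplier to the sum of both terms are assumptions made for this example; the exact formula used by ONNX Runtime may differ.

```python
import math


def adhoc_node_cost(initializer_bytes, output_shapes, elem_size=4, multiplier=1.5):
    """Estimate a node's memory cost in bytes (hypothetical helper).

    output_shapes: iterable of fully static shapes, e.g. [(1, 128, 64)].
    elem_size: bytes per output element (4 assumes float32).
    """
    output_bytes = sum(math.prod(shape) * elem_size for shape in output_shapes)
    return int(multiplier * (initializer_bytes + output_bytes))


# 4096 bytes of initializers plus one float32 output of shape (1, 128, 64).
print(adhoc_node_cost(4096, [(1, 128, 64)]))  # 55296
```

Shapes with symbolic dimensions cannot be sized this way, which is one reason this mode is less accurate than recorded statistics.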
|
|
||
| ### Setting Format Summary | ||
| The value of `session.resource_cuda_partitioning_settings` is a comma-separated pair: | ||
|
|
||
| | Format | Meaning | | ||
| |:------|:-------| | ||
| | `<limit_kb>,<stats_file>` | Use both memory limit and pre-recorded stats | | ||
| | `<limit_kb>,` | Memory limit only (ad-hoc estimation) | | ||
| | `,<stats_file>` | Stats only (no explicit limit) | | ||
| | `,` | Neither (EP attempts auto-detection) | | ||
|
|
||
| The stats file path follows the same resolution rules described above: relative paths are resolved against the model's directory, absolute paths are used as-is. | ||
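If you build the setting value programmatically, a small helper can cover the four forms in the table above. The function name `cuda_partitioning_setting` is hypothetical; the conversion follows the convention used in the examples on this page, where 4 GB corresponds to 4194304 KB.

```python
def cuda_partitioning_setting(limit_gb=None, stats_file=None):
    """Build a '<limit_kb>,<stats_file>' value; either part may be omitted."""
    limit_kb = "" if limit_gb is None else str(int(limit_gb * 1024 * 1024))
    return f"{limit_kb},{stats_file or ''}"


print(cuda_partitioning_setting(4, "node_memory_stats.csv"))  # 4194304,node_memory_stats.csv
print(cuda_partitioning_setting(4))                           # 4194304,
print(cuda_partitioning_setting(stats_file="node_memory_stats.csv"))  # ,node_memory_stats.csv
print(cuda_partitioning_setting())                            # ,
```

The returned string can be passed as the second argument to `opts.add_session_config_entry("session.resource_cuda_partitioning_settings", ...)`.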
|
|
||
| ## Combining Both Features | ||
| Layer annotations and capacity-aware partitioning can be used together. When both are configured: | ||
| - Layer annotations provide the initial node-to-device mapping. | ||
| - The capacity-aware partitioner enforces the memory budget, potentially overriding assignments that would exceed the GPU memory limit. | ||
|
|
||
| This combination gives you fine-grained control: use annotations to express logical model structure, and let the memory budget act as a safety net. | ||
|
|
||
| ```python | ||
| opts = ort.SessionOptions() | ||
|
|
||
| opts.add_session_config_entry( | ||
|     "session.layer_assignment_settings", | ||
|     "gpu(encoder, decoder); cpu(=postprocess)" | ||
| ) | ||
|
|
||
| opts.add_session_config_entry( | ||
|     "session.resource_cuda_partitioning_settings", | ||
|     "4194304,node_memory_stats.csv" | ||
| ) | ||
|
|
||
| session = ort.InferenceSession("model_annotated.onnx", opts, | ||
|                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"]) | ||
| ``` | ||