Add EP-specific weight layout transformation framework#26554

Closed
jchen10 wants to merge 1 commit into microsoft:main from jchen10:weight

Conversation

Contributor

@jchen10 jchen10 commented Nov 12, 2025

This infrastructure enables execution providers to optimize operator weights with custom memory layouts (such as blocked formats) during session initialization, dramatically improving inference performance through better cache utilization and memory access patterns.

Current Implementation:

  • HWIO Transpose (WebGPU EP): Transposes Conv weights from OIHW to HWIO layout as the first application of the framework
  • ABcd16a4b Blocking (Proof-of-Concept): OneDNN-style blocked format with 16×4 tiles demonstrates the framework's primary purpose

The framework is generic and extensible, allowing any EP to implement custom weight transformations optimized for their target hardware.

Contributor Author

jchen10 commented Nov 12, 2025

Weight Layout Transformation Design

Overview

Design Goal: Support EP-specific blocked and optimized weight layouts that match hardware characteristics, enabling significant performance gains without runtime overhead.

Current Implementation:

  • HWIO Transpose (WebGPU EP): Transposes Conv weights from OIHW to HWIO layout - actively used as the first application of the framework
  • ABcd16a4b Blocking (Proof-of-Concept): OneDNN-style blocked format with 16×4 tiles - demonstrates the framework's primary purpose

Motivation

The Challenge: Hardware-Optimized Memory Layouts

Modern compute accelerators (GPUs, NPUs, specialized AI chips) achieve peak performance when data is laid out in memory patterns that match their hardware architecture. However, ONNX models use standardized layouts (e.g., OIHW for Conv weights) that may not be optimal for specific hardware.

The Problem:

  • Standard ONNX layouts don't match hardware-optimized blocked/tiled formats
  • Converting at runtime wastes compute cycles on every inference
  • Poor memory access patterns lead to cache misses and bandwidth bottlenecks

The Solution: EP-Specific Weight Layout Transformation

This framework allows each EP to transform weights once during session initialization to hardware-optimized layouts, then use them directly at inference time.

Architecture

Core Design Principle

Separation of Concerns: The framework separates EP-specific format decisions from the core transformation infrastructure:

  1. Execution Provider decides WHAT to transform and WHEN
  2. Framework handles HOW to transform and WHERE to store results
  3. Operator Implementation uses transformed weights at runtime

This design allows each EP to implement custom transformations without modifying core ONNX Runtime code.

Two-Phase EP API

Execution providers implement two virtual methods to participate in weight transformation:

class IExecutionProvider {
 public:
  // Phase 1: Query (lightweight) - "Do you want this weight transformed?"
  // Called during graph optimization/partitioning
  virtual Status GetPreferredInitializerFormat(
      const Node& node,           // Which node needs the initializer
      int input_index,            // Which input (e.g., Conv weight is index 1)
      std::string& format_descriptor) const;  // OUT: format name (e.g., "hwio")

  // Phase 2: Transform (heavyweight) - "Transform this weight to requested format"
  // Called once during session initialization per unique (initializer, format) pair
  virtual Status TransformInitializerFormat(
      const Tensor& original_tensor,              // Input: original weight
      const std::string& format_descriptor,       // Which format to transform to
      std::unique_ptr<Tensor>& transformed_tensor) const;  // OUT: transformed result
};

Design Benefits:

  • Phase 1 is fast: Can be called multiple times during graph analysis without performance penalty
  • Phase 2 is heavyweight: Called exactly once per transformed initializer
  • Clear contract: Query returns format name, transform produces result
  • EP autonomy: Each EP decides its own formats and transformation logic
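To make the contract concrete, here is a minimal, self-contained sketch of a hypothetical EP implementing the two phases. The `Status`, `Node`, and `Tensor` types below are simplified stand-ins for illustration only, not the real ONNX Runtime classes or signatures:

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins so the sketch compiles on its own;
// these are NOT the real onnxruntime types.
struct Status { bool ok; static Status OK() { return {true}; } static Status Fail() { return {false}; } };
struct Node { std::string op_type; };
struct Tensor { std::vector<float> data; };

// A hypothetical EP that requests "hwio" for Conv weight inputs (index 1).
class ToyExecutionProvider {
 public:
  // Phase 1: cheap query; may be called many times during partitioning.
  Status GetPreferredInitializerFormat(const Node& node, int input_index,
                                       std::string& format_descriptor) const {
    if (node.op_type == "Conv" && input_index == 1) {
      format_descriptor = "hwio";
      return Status::OK();
    }
    format_descriptor.clear();  // empty => no transformation wanted
    return Status::OK();
  }

  // Phase 2: heavyweight transform; called once per (initializer, format).
  Status TransformInitializerFormat(const Tensor& original,
                                    const std::string& format_descriptor,
                                    std::unique_ptr<Tensor>& transformed) const {
    if (format_descriptor != "hwio") return Status::Fail();
    // Placeholder: a real EP would permute the data here.
    transformed = std::make_unique<Tensor>(original);
    return Status::OK();
  }
};
```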

Format Descriptor String

Formats are identified by string descriptors that encode transformation details:

| Format String | Meaning | Primary Use | Status |
| --- | --- | --- | --- |
| `"ABcd16a4b"` | Blocked format with 16×4 tiles on O,I dims | Hardware-optimized layouts | Proof-of-concept |
| `"hwio"` | OIHW → HWIO transpose (permutation {2,3,1,0}) | Simple layout reordering | WebGPU EP (active) |
| `"oihw"` | Original/standard format (no transform) | N/A | Default |
| `""` (empty) | No transformation needed | Returned when the EP doesn't need a transform | N/A |
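As an illustration of the `"hwio"` descriptor, here is a standalone sketch of the {2,3,1,0} permutation on a flat buffer (names and signature are illustrative, not the actual WebGPU EP code):

```cpp
#include <cstddef>
#include <vector>

// Permute an OIHW-ordered Conv weight (O, I, H, W) into HWIO order.
// This corresponds to the permutation {2,3,1,0} named by the "hwio"
// format descriptor. Illustrative sketch only.
std::vector<float> TransposeOIHWToHWIO(const std::vector<float>& src,
                                       size_t O, size_t I, size_t H, size_t W) {
  std::vector<float> dst(src.size());
  for (size_t o = 0; o < O; ++o)
    for (size_t i = 0; i < I; ++i)
      for (size_t h = 0; h < H; ++h)
        for (size_t w = 0; w < W; ++w) {
          // Source index in OIHW order; destination index in HWIO order.
          size_t src_idx = ((o * I + i) * H + h) * W + w;
          size_t dst_idx = ((h * W + w) * I + i) * O + o;
          dst[dst_idx] = src[src_idx];
        }
  return dst;
}
```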

Benefits:

  • Self-documenting: Format name describes the transformation
  • Extensible: New formats added without API changes (e.g., "ABcd8a8b", "ABcd32a8b")
  • Simple: Just a string - easy to serialize, compare, debug
  • Flexible: Can encode block sizes, permutations, etc. in the name (e.g., "16a4b" = 16×4 blocks)

Block Size Encoding: The "ABcd16a4b" notation from OneDNN encodes:

  • A, B, c, d: Dimension order (uppercase = blocked, lowercase = standard)
  • 16a: 16-element blocks on the 'A' dimension (output channels)
  • 4b: 4-element blocks on the 'B' dimension (input channels)
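The blocking above can be made concrete with a sketch of the offset computation for element (a, b, h, w) in an ABcd16a4b buffer. Dimension names follow the oneDNN notation; this is an illustrative reimplementation, not oneDNN's code:

```cpp
#include <cstddef>

// Sketch of the ABcd16a4b offset computation. a = output channel,
// b = input channel, h/w = spatial dims; padded_IC is the input-channel
// count padded up to a multiple of the block size. Illustrative only.
constexpr size_t kBlockA = 16;  // 16-element blocks on output channels ('16a')
constexpr size_t kBlockB = 4;   // 4-element blocks on input channels ('4b')

size_t BlockedOffsetABcd16a4b(size_t a, size_t b, size_t h, size_t w,
                              size_t padded_IC, size_t H, size_t W) {
  size_t b_blocks = padded_IC / kBlockB;
  size_t a_blk = a / kBlockA, a_in = a % kBlockA;
  size_t b_blk = b / kBlockB, b_in = b % kBlockB;
  // Outer dims iterate over whole blocks; the innermost 16x4 tile is contiguous.
  return ((((a_blk * b_blocks + b_blk) * H + h) * W + w) * kBlockA + a_in) * kBlockB + b_in;
}
```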

Complete Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│ Session Initialization (One-Time)                                       │
└─────────────────────────────────────────────────────────────────────────┘
     │
     ├─> SessionState::TransformInitializersToPreferredFormat()
     │        │
     │        ├─> For each initializer (weight tensor):
     │        │    │
     │        │    ├─> For each node consuming this initializer:
     │        │    │    │
     │        │    │    └─> EP->GetPreferredInitializerFormat(node, input_idx, format)
     │        │    │         │
     │        │    │         ├─> WebGPU EP: Delegates to ConvGetPreferredKernelFormat()
     │        │    │         │    └─> Analyzes Conv execution paths
     │        │    │         │         └─> Returns "hwio" if transpose needed, else FAIL
     │        │    │         │
     │        │    │         └─> Collects all nodes requesting this format
     │        │    │
     │        │    ├─> If any nodes want transformation:
     │        │    │    │
     │        │    │    ├─> Load original: TensorProtoToTensor(original_proto, cpu_tensor)
     │        │    │    │
     │        │    │    ├─> Transform: EP->TransformInitializerFormat(cpu_tensor, "hwio", transformed)
     │        │    │    │    │
     │        │    │    │    └─> WebGPU EP: Delegates to WeightLayoutTransformer
     │        │    │    │         └─> TransposeOIHWToHWIO() performs actual permutation
     │        │    │    │
     │        │    │    ├─> Attach metadata: transformed->SetFormatDescriptor("hwio")
     │        │    │    │
     │        │    │    ├─> Serialize: TensorToTensorProto(transformed, new_proto)
     │        │    │    │    └─> new_proto.add_string_data("onnxruntime_format:hwio")
     │        │    │    │
     │        │    │    ├─> Add to graph: graph.AddInitializedTensor(new_proto)
     │        │    │    │
     │        │    │    └─> Rewire nodes: node->InputDefs()[i] = transformed_node_arg
     │        │    │
     │        │    └─> Original initializer remains for non-requesting nodes
     │        │
     │        └─> All transformations complete
     │
     ├─> SaveInitializedTensors() - Persist to session state
     │
     └─> For each initializer:
          │
          ├─> DeserializeTensorProto() - Load to memory
          │    │
          │    ├─> TensorProtoToTensor() creates CPU tensor
          │    │    └─> Reads "onnxruntime_format:hwio" from string_data
          │    │         └─> cpu_tensor.SetFormatDescriptor("hwio")
          │    │
          │    └─> For non-CPU devices:
          │         │
          │         └─> CopyTensorFromCPUToDevice()
          │              ├─> Copy data buffer to device memory
          │              ├─> device_tensor.SetFormatDescriptor(cpu_tensor.GetFormatDescriptor())
          │              └─> InitOrtValue(std::move(device_tensor))
          │                   └─> Move constructor preserves format_descriptor_
          │
          └─> Initializer ready with format metadata intact

┌─────────────────────────────────────────────────────────────────────────┐
│ Inference Runtime (Every Inference)                                     │
└─────────────────────────────────────────────────────────────────────────┘
     │
     └─> Conv::ComputeInternal()
          │
          ├─> kernel = context.Input<Tensor>(1)
          ├─> is_kernel_hwio = (kernel->GetFormatDescriptor() == "hwio")
          │
          ├─> if (is_kernel_hwio):
          │    └─> ✅ Use kernel directly, NO runtime transpose
          │
          └─> else:
               └─> ❌ TransposeKernel() at runtime (fallback for backward compat)

Key Points:

  1. Phase separation: Query (fast, multiple calls) vs Transform (slow, once per initializer)
  2. Metadata persistence: Format survives serialization, deserialization, CPU→device copy
  3. Runtime efficiency: Transformed weights used directly, no repeated work
  4. Backward compatibility: Non-transformed weights fall back to runtime transpose

Key Design Decisions and Rationale

1. Why EP-Specific Instead of Global Transformations?

Decision: Let each EP decide its own transformations via virtual methods

Rationale:

  • Different hardware has different optimal layouts (GPU vs NPU vs CPU)
  • Different operators may need different formats even on same EP
  • EP has domain knowledge about its kernels and performance characteristics
  • Avoids one-size-fits-all solutions that may not be optimal

Example: WebGPU needs HWIO for channels-last Conv, but another EP might prefer a different layout or no transformation at all.

2. Why CPU-Based Transformation?

Decision: Transform weights on CPU during session init, before device loading

Rationale:

  • ✅ Simpler implementation (no GPU kernels needed for transformation)
  • ✅ Works for all device types uniformly
  • ✅ One-time cost during init (not performance-critical path)
  • ✅ Device memory only sees final optimized format

Trade-off: Slightly slower session initialization for much faster inference.

3. Why Tensor Member Instead of External Map?

Decision: Store format_descriptor_ directly in Tensor class

Rationale:

  • Automatic propagation: Metadata travels with tensor through copy/move
  • No bookkeeping: No need for separate maps or registries
  • Minimal overhead: Empty string for 99% of tensors (standard format)
  • Simple API: tensor->GetFormatDescriptor() is intuitive
  • Thread-safe: No shared state to synchronize

Alternative rejected: External map would require synchronization and lifetime management.
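A toy illustration of this design choice (not the real ORT Tensor class): because the descriptor is an ordinary member, default copy/move semantics carry it along with no external bookkeeping.

```cpp
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in showing that an in-object format descriptor survives
// moves automatically. This is NOT the real onnxruntime::Tensor.
class ToyTensor {
 public:
  explicit ToyTensor(std::vector<float> data) : data_(std::move(data)) {}
  void SetFormatDescriptor(std::string fmt) { format_descriptor_ = std::move(fmt); }
  const std::string& GetFormatDescriptor() const { return format_descriptor_; }

 private:
  std::vector<float> data_;
  std::string format_descriptor_;  // empty => standard layout, negligible overhead
};
```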

4. Why Both Original and Transformed Initializers?

Decision: Keep both versions in graph when nodes have different needs

Rationale:

  • Flexibility: Some nodes may want original, others transformed
  • Backward compatibility: Non-EP nodes use original
  • Partial optimization: Transform only what's needed

Memory trade-off: Small duplication for large inference speedup.

Contributor Author

jchen10 commented Nov 12, 2025

The performance improvement from this PR on LNL, thanks to eliminating the kernel transpose on every Conv inference run:

| model | variance |
| --- | --- |
| sd-turbo-unet-fp16-demo-layernorm | -53.12% |
| sdunet-v1.5-demo-layernorm | -32.53% |
| jina-clip-v1-version | -27.02% |
| moondream2-embed-tokens-fp16 | -23.68% |
| moondream2-vision-encoder-fp16 | -22.68% |
| detr-resnet-50 | -22.55% |
| resnet50-v1-f16-demo | -12.41% |
| gazenet | -11.17% |
| mobileclip_s0_text_fp32 | -10.00% |
| efficientnet-lite-f16-demo | -8.51% |
| jina-clip-v1-text-fp16 | -8.32% |
| mobilenetv2-12-f16-demo | -7.21% |
| florence-2-base-vision-encoder-fp16 | -5.41% |
| detr-resnet-50-fp16 | -5.16% |
| whisper-base-encoder-lm-fp16-layernorm | -5.15% |

I assume the gains are so large for the SD models simply because of the regression that #26501 is trying to fix.

@xhcao @JianhuiD @Jiawei-Shao PTAL

@yuslepukhin
Member

General question: does this transformation change the shape of the weight? Also, have you considered using the PrePack mechanism?

Contributor Author

jchen10 commented Nov 13, 2025

@yuslepukhin Thanks for your attention.

  1. It's possible for the transformation to change the shape of the weight, as blocked formats may require padding.
  2. I had looked into PrePack, but stopped there because OpKernelContext is not available for PrePack during session initialization. For the WebGPU EP, without the context, we can't use WebGPU shaders to do the layout transformation. Another consideration was that we may need input/output shapes and attributes to determine the optimal format; they are also not available in PrePack.
    Maybe we can extend PrePack to suit these needs. One approach is to make OpKernelContext available in PrePack. Another is to still do the format transformation on the CPU in PrePack and then copy to GPU buffers. I can work on another PR to try PrePack if you'd like. @guschmue @fs-eire @qjia7 I'd appreciate your perspective, thanks!

Contributor

fs-eire commented Nov 14, 2025

First, I think this is a regression in the WebGPU EP: JSEP has code to do the kernel transpose only once when the kernel is an initializer, but the same logic is missing in the WebGPU EP.

Then the PrePack feature:

  • For WebGPU EP, without the context, we can't use WebGPU shaders to do the layout transformation.

    This is true, but it's not difficult to construct an instance of ComputeContext in WebGpuKernel::PrePack.

    The real problem is that doing any GPU calculation during initialization will fail because of the BufferManager::CreateUMA feature: it assumes all storage buffer creations are for uploading, so it always creates buffers with mapAtCreation == true. This causes problems for a program's output.

  • Another consideration was that we may need input, output shapes and attributes to determine the optimal format.

    This is risky because, by ORT design, inputs to the same kernel can have different shapes; only attributes are guaranteed not to change. If you optimize the kernel based on input_shape_a, it will later have problems with input_shape_b, because this layout optimization is expected to be done only once.

    Sometimes shape inference will provide the input shape at initialization (when the graph is static), so it is possible to get the input tensor shape, but this is not guaranteed. If the optimization does require runtime data (i.e., input shape), it should happen at runtime, not at initialization time.

So the summary:

  • If this optimization only applies to a very limited set of operators, maybe using the JSEP implementation is a good approach.
    • check if input[1] is an initializer, and only cache the optimized tensor when it is
    • doing this at runtime may also help when you want to use the input shape
  • If using the CPU to do the optimization is acceptable, we can use the existing PrePack feature.

Contributor Author

jchen10 commented Nov 14, 2025

Thank you @fs-eire for the insightful comments.
I slightly prefer the first one, as the input/output shapes would be naturally available and it would require no core framework changes. I would cache all transformed weight buffers in the WebGPU context rather than the op context, so they could potentially be shared by multiple nodes. Since the format transformations would happen during the first inference run, my worry is whether that could impact graph capture. @qjia7

Contributor

qjia7 commented Nov 14, 2025

Since the format transformations would happen during the first inference run, my worry is whether that could impact graph capture. @qjia7

Currently, graph capture is recorded when the regular run count > min_num_runs_before_graph_capture_ (1), so correctness won't be affected if the first run differs from the second run. See https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc#L1041-L1043.

Member

yuslepukhin commented Nov 14, 2025

There are many concerns with this PR.

  • We certainly do not want to interrogate every node when an EP does not even need any layout modifications.
  • We also want to address the case where an optimized model is saved and then reloaded (including ORT format).
  • Graph transformations (including weight manipulations) are usually performed in optimizers, and they create new initializers when transforming. This addresses the case where weights are consumed by multiple EPs, which is pretty frequent.
  • We do not usually modify the original weights. They are removed on Resolve() if they are no longer referenced by any node. Resolve() also performs shape inferencing and many other things important for keeping the graph valid after each optimization iteration.
  • EP interfaces should stay generic. While we do have the NCHWc layout interface query, the actual transformation is performed by inserting converter nodes into the graph, not by modifying the weights in place. The same applies to casts and QDQ.
  • string_data is used for string data in tensors and is mutually exclusive with other types of data. NodeProto has metadata_props for metadata.

I have not looked in-depth yet, so I do not have a good proposal yet. One clue may be taken from compiling EPs that perform weight transformations internally and retain them while removing the reference to the original weight from the model.

Contributor

@fs-eire fs-eire left a comment


According to the discussion, let's redo this optimization:

  • use the JSEP approach in a new PR.
  • do we already have a model where multiple Conv nodes share the same kernel?
    • if no, then let's start with a simpler version that uses a map<input_shape, tensor> inside the Conv node for the weight cache.
    • if yes, then we need to think about how to put a map<initializer_tensor, map<runtime_data, tensor>> inside the EP object for the weight cache.
  • only cache when the weight is an initializer, and cap the cache size (for example, cache at most N different input shapes; otherwise you may run into OOM. This is unlikely in real model usage, I just mention it for information)
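A sketch of the proposed per-node cache under these constraints (all names are illustrative; `transform` stands in for the actual layout transformation, and the eviction policy here is a crude placeholder):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Sketch: a Conv kernel keeps a map from input shape to the transformed
// weight, transforming lazily on first use and capping the entry count
// to avoid unbounded growth. Illustrative only.
using Shape = std::vector<size_t>;
using Weights = std::vector<float>;

class ConvWeightCache {
 public:
  explicit ConvWeightCache(size_t max_entries) : max_entries_(max_entries) {}

  // Returns the cached transformed weights, computing them on first use.
  template <typename TransformFn>
  const Weights& GetOrTransform(const Shape& input_shape, TransformFn transform) {
    auto it = cache_.find(input_shape);
    if (it != cache_.end()) return it->second;          // cache hit: no work
    if (cache_.size() >= max_entries_) cache_.clear();  // crude OOM guard
    return cache_.emplace(input_shape, transform()).first->second;
  }

  size_t size() const { return cache_.size(); }

 private:
  size_t max_entries_;
  std::map<Shape, Weights> cache_;
};
```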

Contributor Author

jchen10 commented Nov 17, 2025

Thank you all for the comments. I will work on a new PR as you proposed.
