Add EP-specific weight layout transformation framework#26554

Closed
jchen10 wants to merge 1 commit into microsoft:main from jchen10:weight

Conversation

Contributor

@jchen10 jchen10 commented Nov 12, 2025

This infrastructure enables execution providers to optimize operator weights with custom memory layouts (such as blocked formats) during session initialization, dramatically improving inference performance through better cache utilization and memory access patterns.

Current Implementation:

  • HWIO Transpose (WebGPU EP): Transposes Conv weights from OIHW to HWIO layout as the first application of the framework
  • ABcd16a4b Blocking (Proof-of-Concept): OneDNN-style blocked format with 16×4 tiles demonstrates the framework's primary purpose

The framework is generic and extensible, allowing any EP to implement custom weight transformations optimized for their target hardware.

Contributor Author

jchen10 commented Nov 12, 2025

Weight Layout Transformation Design

Overview

Design Goal: Support EP-specific blocked and optimized weight layouts that match hardware characteristics, enabling significant performance gains without runtime overhead.

Current Implementation:

  • HWIO Transpose (WebGPU EP): Transposes Conv weights from OIHW to HWIO layout - actively used as the first application of the framework
  • ABcd16a4b Blocking (Proof-of-Concept): OneDNN-style blocked format with 16×4 tiles - demonstrates the framework's primary purpose

Motivation

The Challenge: Hardware-Optimized Memory Layouts

Modern compute accelerators (GPUs, NPUs, specialized AI chips) achieve peak performance when data is laid out in memory patterns that match their hardware architecture. However, ONNX models use standardized layouts (e.g., OIHW for Conv weights) that may not be optimal for specific hardware.

The Problem:

  • Standard ONNX layouts don't match hardware-optimized blocked/tiled formats
  • Converting at runtime wastes compute cycles on every inference
  • Poor memory access patterns lead to cache misses and bandwidth bottlenecks

The Solution: EP-Specific Weight Layout Transformation

This framework allows each EP to transform weights once during session initialization to hardware-optimized layouts, then use them directly at inference time.

Architecture

Core Design Principle

Separation of Concerns: The framework separates EP-specific format decisions from the core transformation infrastructure:

  1. Execution Provider decides WHAT to transform and WHEN
  2. Framework handles HOW to transform and WHERE to store results
  3. Operator Implementation uses transformed weights at runtime

This design allows each EP to implement custom transformations without modifying core ONNX Runtime code.

Two-Phase EP API

Execution providers implement two virtual methods to participate in weight transformation:

class IExecutionProvider {
 public:
  // Phase 1: Query (lightweight) - "Do you want this weight transformed?"
  // Called during graph optimization/partitioning
  virtual Status GetPreferredInitializerFormat(
      const Node& node,           // Which node needs the initializer
      int input_index,            // Which input (e.g., Conv weight is index 1)
      std::string& format_descriptor) const;  // OUT: format name (e.g., "hwio")

  // Phase 2: Transform (heavyweight) - "Transform this weight to requested format"
  // Called once during session initialization per unique (initializer, format) pair
  virtual Status TransformInitializerFormat(
      const Tensor& original_tensor,              // Input: original weight
      const std::string& format_descriptor,       // Which format to transform to
      std::unique_ptr<Tensor>& transformed_tensor) const;  // OUT: transformed result
};

Design Benefits:

  • Phase 1 is fast: Can be called multiple times during graph analysis without performance penalty
  • Phase 2 is heavyweight: Called exactly once per transformed initializer
  • Clear contract: Query returns format name, transform produces result
  • EP autonomy: Each EP decides its own formats and transformation logic
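To make the contract concrete, here is a minimal, self-contained sketch of a hypothetical EP implementing the two phases. The `Status`, `Node`, and `Tensor` types below are simplified stand-ins for illustration only, not the real ONNX Runtime classes or signatures:

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins so the sketch compiles on its own;
// these are NOT the real onnxruntime types.
struct Status { bool ok; static Status OK() { return {true}; } static Status Fail() { return {false}; } };
struct Node { std::string op_type; };
struct Tensor { std::vector<float> data; };

// A hypothetical EP that requests "hwio" for Conv weight inputs (index 1).
class ToyExecutionProvider {
 public:
  // Phase 1: cheap query; may be called many times during partitioning.
  Status GetPreferredInitializerFormat(const Node& node, int input_index,
                                       std::string& format_descriptor) const {
    if (node.op_type == "Conv" && input_index == 1) {
      format_descriptor = "hwio";
      return Status::OK();
    }
    format_descriptor.clear();  // empty => no transformation wanted
    return Status::OK();
  }

  // Phase 2: heavyweight transform; called once per (initializer, format).
  Status TransformInitializerFormat(const Tensor& original,
                                    const std::string& format_descriptor,
                                    std::unique_ptr<Tensor>& transformed) const {
    if (format_descriptor != "hwio") return Status::Fail();
    // Placeholder: a real EP would permute the data here.
    transformed = std::make_unique<Tensor>(original);
    return Status::OK();
  }
};
```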

Format Descriptor String

Formats are identified by string descriptors that encode transformation details:

| Format String | Meaning | Primary Use | Status |
| --- | --- | --- | --- |
| `"ABcd16a4b"` | Blocked format with 16×4 tiles on O,I dims | Hardware-optimized layouts | Proof-of-concept |
| `"hwio"` | OIHW → HWIO transpose (permutation {2,3,1,0}) | Simple layout reordering | WebGPU EP (active) |
| `"oihw"` | Original/standard format (no transform) | N/A | Default |
| `""` (empty) | No transformation needed | Returned when the EP doesn't need a transform | N/A |
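As an illustration of the `"hwio"` descriptor, here is a standalone sketch of the {2,3,1,0} permutation on a flat buffer (names and signature are illustrative, not the actual WebGPU EP code):

```cpp
#include <cstddef>
#include <vector>

// Permute an OIHW-ordered Conv weight (O, I, H, W) into HWIO order.
// This corresponds to the permutation {2,3,1,0} named by the "hwio"
// format descriptor. Illustrative sketch only.
std::vector<float> TransposeOIHWToHWIO(const std::vector<float>& src,
                                       size_t O, size_t I, size_t H, size_t W) {
  std::vector<float> dst(src.size());
  for (size_t o = 0; o < O; ++o)
    for (size_t i = 0; i < I; ++i)
      for (size_t h = 0; h < H; ++h)
        for (size_t w = 0; w < W; ++w) {
          // Source index in OIHW order; destination index in HWIO order.
          size_t src_idx = ((o * I + i) * H + h) * W + w;
          size_t dst_idx = ((h * W + w) * I + i) * O + o;
          dst[dst_idx] = src[src_idx];
        }
  return dst;
}
```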

Benefits:

  • Self-documenting: Format name describes the transformation
  • Extensible: New formats added without API changes (e.g., "ABcd8a8b", "ABcd32a8b")
  • Simple: Just a string - easy to serialize, compare, debug
  • Flexible: Can encode block sizes, permutations, etc. in the name (e.g., "16a4b" = 16×4 blocks)

Block Size Encoding: The "ABcd16a4b" notation from OneDNN encodes:

  • A, B, c, d: Dimension order (uppercase = blocked, lowercase = standard)
  • 16a: 16-element blocks on the 'A' dimension (output channels)
  • 4b: 4-element blocks on the 'B' dimension (input channels)
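The blocking above can be made concrete with a sketch of the offset computation for element (a, b, h, w) in an ABcd16a4b buffer. Dimension names follow the oneDNN notation; this is an illustrative reimplementation, not oneDNN's code:

```cpp
#include <cstddef>

// Sketch of the ABcd16a4b offset computation. a = output channel,
// b = input channel, h/w = spatial dims; padded_IC is the input-channel
// count padded up to a multiple of the block size. Illustrative only.
constexpr size_t kBlockA = 16;  // 16-element blocks on output channels ('16a')
constexpr size_t kBlockB = 4;   // 4-element blocks on input channels ('4b')

size_t BlockedOffsetABcd16a4b(size_t a, size_t b, size_t h, size_t w,
                              size_t padded_IC, size_t H, size_t W) {
  size_t b_blocks = padded_IC / kBlockB;
  size_t a_blk = a / kBlockA, a_in = a % kBlockA;
  size_t b_blk = b / kBlockB, b_in = b % kBlockB;
  // Outer dims iterate over whole blocks; the innermost 16x4 tile is contiguous.
  return ((((a_blk * b_blocks + b_blk) * H + h) * W + w) * kBlockA + a_in) * kBlockB + b_in;
}
```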

Complete Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│ Session Initialization (One-Time)                                       │
└─────────────────────────────────────────────────────────────────────────┘
     │
     ├─> SessionState::TransformInitializersToPreferredFormat()
     │        │
     │        ├─> For each initializer (weight tensor):
     │        │    │
     │        │    ├─> For each node consuming this initializer:
     │        │    │    │
     │        │    │    └─> EP->GetPreferredInitializerFormat(node, input_idx, format)
     │        │    │         │
     │        │    │         ├─> WebGPU EP: Delegates to ConvGetPreferredKernelFormat()
     │        │    │         │    └─> Analyzes Conv execution paths
     │        │    │         │         └─> Returns "hwio" if transpose needed, else FAIL
     │        │    │         │
     │        │    │         └─> Collects all nodes requesting this format
     │        │    │
     │        │    ├─> If any nodes want transformation:
     │        │    │    │
     │        │    │    ├─> Load original: TensorProtoToTensor(original_proto, cpu_tensor)
     │        │    │    │
     │        │    │    ├─> Transform: EP->TransformInitializerFormat(cpu_tensor, "hwio", transformed)
     │        │    │    │    │
     │        │    │    │    └─> WebGPU EP: Delegates to WeightLayoutTransformer
     │        │    │    │         └─> TransposeOIHWToHWIO() performs actual permutation
     │        │    │    │
     │        │    │    ├─> Attach metadata: transformed->SetFormatDescriptor("hwio")
     │        │    │    │
     │        │    │    ├─> Serialize: TensorToTensorProto(transformed, new_proto)
     │        │    │    │    └─> new_proto.add_string_data("onnxruntime_format:hwio")
     │        │    │    │
     │        │    │    ├─> Add to graph: graph.AddInitializedTensor(new_proto)
     │        │    │    │
     │        │    │    └─> Rewire nodes: node->InputDefs()[i] = transformed_node_arg
     │        │    │
     │        │    └─> Original initializer remains for non-requesting nodes
     │        │
     │        └─> All transformations complete
     │
     ├─> SaveInitializedTensors() - Persist to session state
     │
     └─> For each initializer:
          │
          ├─> DeserializeTensorProto() - Load to memory
          │    │
          │    ├─> TensorProtoToTensor() creates CPU tensor
          │    │    └─> Reads "onnxruntime_format:hwio" from string_data
          │    │         └─> cpu_tensor.SetFormatDescriptor("hwio")
          │    │
          │    └─> For non-CPU devices:
          │         │
          │         └─> CopyTensorFromCPUToDevice()
          │              ├─> Copy data buffer to device memory
          │              ├─> device_tensor.SetFormatDescriptor(cpu_tensor.GetFormatDescriptor())
          │              └─> InitOrtValue(std::move(device_tensor))
          │                   └─> Move constructor preserves format_descriptor_
          │
          └─> Initializer ready with format metadata intact

┌─────────────────────────────────────────────────────────────────────────┐
│ Inference Runtime (Every Inference)                                     │
└─────────────────────────────────────────────────────────────────────────┘
     │
     └─> Conv::ComputeInternal()
          │
          ├─> kernel = context.Input<Tensor>(1)
          ├─> is_kernel_hwio = (kernel->GetFormatDescriptor() == "hwio")
          │
          ├─> if (is_kernel_hwio):
          │    └─> ✅ Use kernel directly, NO runtime transpose
          │
          └─> else:
               └─> ❌ TransposeKernel() at runtime (fallback for backward compat)

Key Points:

  1. Phase separation: Query (fast, multiple calls) vs Transform (slow, once per initializer)
  2. Metadata persistence: Format survives serialization, deserialization, CPU→device copy
  3. Runtime efficiency: Transformed weights used directly, no repeated work
  4. Backward compatibility: Non-transformed weights fall back to runtime transpose

Key Design Decisions and Rationale

1. Why EP-Specific Instead of Global Transformations?

Decision: Let each EP decide its own transformations via virtual methods

Rationale:

  • Different hardware has different optimal layouts (GPU vs NPU vs CPU)
  • Different operators may need different formats even on same EP
  • EP has domain knowledge about its kernels and performance characteristics
  • Avoids one-size-fits-all solutions that may not be optimal

Example: WebGPU needs HWIO for channels-last Conv, but another EP might prefer a different layout or no transformation at all.

2. Why CPU-Based Transformation?

Decision: Transform weights on CPU during session init, before device loading

Rationale:

  • ✅ Simpler implementation (no GPU kernels needed for transformation)
  • ✅ Works for all device types uniformly
  • ✅ One-time cost during init (not performance-critical path)
  • ✅ Device memory only sees final optimized format

Trade-off: Slightly slower session initialization for much faster inference.

3. Why Tensor Member Instead of External Map?

Decision: Store format_descriptor_ directly in Tensor class

Rationale:

  • Automatic propagation: Metadata travels with tensor through copy/move
  • No bookkeeping: No need for separate maps or registries
  • Minimal overhead: Empty string for 99% of tensors (standard format)
  • Simple API: tensor->GetFormatDescriptor() is intuitive
  • Thread-safe: No shared state to synchronize

Alternative rejected: External map would require synchronization and lifetime management.
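A toy illustration of this design choice (not the real ORT Tensor class): because the descriptor is an ordinary member, default copy/move semantics carry it along with no external bookkeeping.

```cpp
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in showing that an in-object format descriptor survives
// moves automatically. This is NOT the real onnxruntime::Tensor.
class ToyTensor {
 public:
  explicit ToyTensor(std::vector<float> data) : data_(std::move(data)) {}
  void SetFormatDescriptor(std::string fmt) { format_descriptor_ = std::move(fmt); }
  const std::string& GetFormatDescriptor() const { return format_descriptor_; }

 private:
  std::vector<float> data_;
  std::string format_descriptor_;  // empty => standard layout, negligible overhead
};
```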

4. Why Both Original and Transformed Initializers?

Decision: Keep both versions in graph when nodes have different needs

Rationale:

  • Flexibility: Some nodes may want original, others transformed
  • Backward compatibility: Non-EP nodes use original
  • Partial optimization: Transform only what's needed

Memory trade-off: Small duplication for large inference speedup.

Contributor Author

jchen10 commented Nov 12, 2025

The performance improvement from this PR on LNL, thanks to eliminating the kernel transpose on every Conv inference run:

| model | variance |
| --- | --- |
| sd-turbo-unet-fp16-demo-layernorm | -53.12% |
| sdunet-v1.5-demo-layernorm | -32.53% |
| jina-clip-v1-version | -27.02% |
| moondream2-embed-tokens-fp16 | -23.68% |
| moondream2-vision-encoder-fp16 | -22.68% |
| detr-resnet-50 | -22.55% |
| resnet50-v1-f16-demo | -12.41% |
| gazenet | -11.17% |
| mobileclip_s0_text_fp32 | -10.00% |
| efficientnet-lite-f16-demo | -8.51% |
| jina-clip-v1-text-fp16 | -8.32% |
| mobilenetv2-12-f16-demo | -7.21% |
| florence-2-base-vision-encoder-fp16 | -5.41% |
| detr-resnet-50-fp16 | -5.16% |
| whisper-base-encoder-lm-fp16-layernorm | -5.15% |

I assume the gains are so large for the SD models simply because of the regression that #26501 is trying to fix.

@xhcao @JianhuiD @Jiawei-Shao PTAL

@yuslepukhin
Member

General question: does this transformation change the shape of the weight? Also, have you considered using the PrePack mechanism?

Contributor Author

jchen10 commented Nov 13, 2025

@yuslepukhin Thanks for your attention.

  1. It's possible for the transformation to change the shape of the weight, as blocked formats may require padding.
  2. I had looked into PrePack, but stopped there because OpKernelContext is not available for PrePack during session initialization. For the WebGPU EP, without the context, we can't use WebGPU shaders to do the layout transformation. Another consideration was that we may need input/output shapes and attributes to determine the optimal format; they are also not available in PrePack.
    Maybe we can extend PrePack to suit these needs. One approach is to make OpKernelContext available in PrePack. Another is to still do the format transformation on the CPU in PrePack and then copy to GPU buffers. I can work on another PR to try PrePack if you'd like. @guschmue @fs-eire @qjia7 I'd appreciate your perspective, thanks!

Contributor

fs-eire commented Nov 14, 2025

First, I think this is a regression in the WebGPU EP: JSEP has code to do the kernel transpose only once when the kernel is an initializer, but the same logic is missing in the WebGPU EP.

Then the PrePack feature:

  • For WebGPU EP, without the context, we can't use WebGPU shaders to do the layout transformation.

    This is true, but it's not difficult to construct an instance of ComputeContext in WebGpuKernel::PrePack.

    The real problem is that doing any GPU calculation during initialization will fail because of the BufferManager::CreateUMA feature: it assumes all storage buffer creations are for uploading, so it always creates buffers with mapAtCreation == true. This causes problems for a program's output.

  • Another consideration was that we may need input, output shapes and attributes to determine the optimal format.

    This is risky because, by ORT design, inputs to the same kernel can have different shapes; only attributes are guaranteed not to change. If you optimize the kernel based on input_shape_a, it will later have problems with input_shape_b, because this layout optimization is expected to be done only once.

    Sometimes shape inference will provide the input shape at initialization (when the graph is static), so it is possible to get the input tensor shape, but this is not guaranteed. If the optimization does require runtime data (i.e., input shape), it should happen at runtime, not at initialization time.

So the summary:

  • If this optimization only applies to a very limited set of operators, maybe using the JSEP implementation is a good approach.
    • check if input[1] is an initializer, and only cache the optimized tensor when it is
    • doing this at runtime may also help when you want to use the input shape
  • If using the CPU to do the optimization is acceptable, we can use the existing PrePack feature.

Contributor Author

jchen10 commented Nov 14, 2025

Thank you @fs-eire for the insightful comments.
I slightly prefer the first one, as the input/output shapes would be naturally available and it would require no core framework changes. I would cache all transformed weight buffers in the WebGPU context rather than the op context, so they could potentially be shared by multiple nodes. Since the format transformations would happen during the first inference run, my worry is whether that could impact graph capture. @qjia7

Contributor

qjia7 commented Nov 14, 2025

Since the format transformations would happen during the first inference run, my worry is whether that could impact graph capture. @qjia7

Currently, graph capture is recorded when the regular run count > min_num_runs_before_graph_capture_ (1), so correctness won't be affected if the first run differs from the second run. See https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc#L1041-L1043.

Member

yuslepukhin commented Nov 14, 2025

There are many concerns with this PR.

  • We certainly do not want to interrogate every node when an EP does not even need any layout modifications.
  • We also want to address the case where an optimized model is saved and then reloaded (including ORT format).
  • Graph transformations (including weight manipulations) are usually performed in optimizers, and they create new initializers when transforming. This addresses the case where weights are consumed by multiple EPs, which is pretty frequent.
  • We do not usually modify the original weights. They are removed on Resolve() if they are no longer referenced by any node. Resolve() also performs shape inferencing and many other things important for keeping the graph valid after each optimization iteration.
  • EP interfaces should stay generic. While we do have the NCHWc layout interface query, the actual transformation is performed by inserting converter nodes into the graph, not by modifying the weights in place. The same applies to casts and QDQ.
  • string_data is used for string data in tensors and is mutually exclusive with other types of data. NodeProto has metadata_props for metadata.

I have not looked in-depth yet, so I do not have a good proposal yet. One clue may be taken from compiling EPs that perform weight transformations internally and retain them while removing the reference to the original weight from the model.

Contributor

@fs-eire fs-eire left a comment


According to the discussion, let's redo this optimization:

  • use the JSEP approach in a new PR.
  • do we already have a model where multiple Conv nodes share the same kernel?
    • if no, then let's start with a simpler version that uses a map<input_shape, tensor> inside the Conv node for the weight cache.
    • if yes, then we need to think about how to put a map<initializer_tensor, map<runtime_data, tensor>> inside the EP object for the weight cache.
  • only cache when the weight is an initializer, and cap the cache size (for example, cache at most N different input shapes; otherwise you may run into OOM. This is unlikely in real model usage, I just mention it for information)
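A sketch of the proposed per-node cache under these constraints (all names are illustrative; `transform` stands in for the actual layout transformation, and the eviction policy here is a crude placeholder):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Sketch: a Conv kernel keeps a map from input shape to the transformed
// weight, transforming lazily on first use and capping the entry count
// to avoid unbounded growth. Illustrative only.
using Shape = std::vector<size_t>;
using Weights = std::vector<float>;

class ConvWeightCache {
 public:
  explicit ConvWeightCache(size_t max_entries) : max_entries_(max_entries) {}

  // Returns the cached transformed weights, computing them on first use.
  template <typename TransformFn>
  const Weights& GetOrTransform(const Shape& input_shape, TransformFn transform) {
    auto it = cache_.find(input_shape);
    if (it != cache_.end()) return it->second;          // cache hit: no work
    if (cache_.size() >= max_entries_) cache_.clear();  // crude OOM guard
    return cache_.emplace(input_shape, transform()).first->second;
  }

  size_t size() const { return cache_.size(); }

 private:
  size_t max_entries_;
  std::map<Shape, Weights> cache_;
};
```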

Contributor Author

jchen10 commented Nov 17, 2025

Thank you all for the comments. I will work on a new PR as you proposed.
