
[TRITON] Conv Kernels First Commit to AITER #2886

Draft

saeid-rostami wants to merge 3 commits into ROCm:main from saeid-rostami:conv2d-initial

Conversation

@saeid-rostami
Contributor

Adds a Triton conv2d library targeted at AMD RDNA GPUs, plus a
correctness + benchmark harness that compares against PyTorch / MIOpen.

Motivation

PyTorch on AMD goes through MIOpen, whose hand-tuned solvers cover some
dtype × layout × architecture combinations well and others poorly — bf16 in
particular falls back to direct/GEMM solvers on RDNA4 that are noticeably
slower at large channel counts. Most modern checkpoints (LLMs, diffusion VAEs)
ship in bf16, so the gap matters.

This op takes the opposite approach: a single set of Triton kernels that runs
fp16 and bf16 through the same code path, supports NCHW and NHWC, and gets reasonable performance across
the full matrix without per-architecture hand tuning.

What's added

Library (aiter/ops/triton/conv/):

  • conv2d.py — public API + shape-driven router
  • _launch.py — grid setup + _select_3x3_method heuristic
  • _prepack.py — LRU-cached weight/input repack (see the sketch after this list)
  • _utils.py — shape math, dtype/activation enums, tolerance model
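
For a feel of what the cached repack does, here is a minimal sketch (the function name, cache key, and eviction policy are illustrative, not the PR's actual code):

```python
from collections import OrderedDict

import torch

_CACHE_MAX = 64
_repack_cache: "OrderedDict[tuple, torch.Tensor]" = OrderedDict()

def repack_weight(w: torch.Tensor, layout: str) -> torch.Tensor:
    # Key on the storage pointer plus metadata so a reused weight tensor
    # is only repacked once per layout.
    key = (w.data_ptr(), tuple(w.shape), w.dtype, layout)
    if key in _repack_cache:
        _repack_cache.move_to_end(key)  # mark as most recently used
        return _repack_cache[key]
    # Illustrative "repack": materialize the requested memory format.
    fmt = torch.channels_last if layout == "nhwc" else torch.contiguous_format
    out = w.contiguous(memory_format=fmt)
    _repack_cache[key] = out
    if len(_repack_cache) > _CACHE_MAX:
        _repack_cache.popitem(last=False)  # evict least recently used
    return out
```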

Kernels (aiter/ops/triton/_triton_kernels/conv/), five families:

| Family | When it runs |
| --- | --- |
| 1×1 GEMM | R==1, S==1 |
| 3×3 cblocked (NCHW) | 3×3, channel-blocked input for coalesced loads |
| 3×3 NHWC | 3×3 with channels-last input — no input repack |
| Winograd F(4×4, 3×3) | 3×3, stride=1, dilation=1, C ≥ 512, K ≥ 512, enough output tiles |
| General | anything else (5×5, 7×7, dilated, strided) |
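
A hedged sketch of the routing this table implies (thresholds are from the table; select_kernel and enough_output_tiles are hypothetical names, and the real heuristic lives in _launch.py's _select_3x3_method):

```python
def select_kernel(x, w, stride, dilation, layout):
    # Shape-driven dispatch mirroring the table above (sketch only).
    K, C, R, S = w.shape
    if R == 1 and S == 1:
        return "1x1_gemm"
    if (R, S) == (3, 3):
        if (stride == (1, 1) and dilation == (1, 1)
                and C >= 512 and K >= 512
                and enough_output_tiles(x, w)):  # hypothetical helper
            return "winograd_f4x3"
        return "3x3_nhwc" if layout == "nhwc" else "3x3_cblocked"
    return "general"  # 5x5, 7x7, dilated, strided, ...
```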

Test/bench harness (op_tests/triton_tests/conv/):

  • cli.py — --test-mode {edge,random,stability,activations,models,all}

  • suite.py — correctness checking + bench accumulation + result tables

  • bench.py — timing + precompute_miopen_solvers (subprocess + MIOPEN_LOG_LEVEL=6
    to label each PyTorch baseline row with the MIOpen solver it picked; see the
    sketch after this list)

  • test_edge.py / test_fuzz.py / test_models.py — shape sources

  • test_pytest.py — parametrized over fp16/bf16 × nchw/nhwc

  • _registry.py — single source of truth for kernel methods (used by CLI,
    suite, comparison tables, tolerance dispatch)

  • Bench shim (op_tests/op_benchmarks/triton/bench_conv2d.py) — convenience
    entry that injects --benchmark --test-mode models.
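
The solver-labeling trick in bench.py roughly amounts to the following (the child script name and the log-scraping regex are assumptions on my part; MIOPEN_LOG_LEVEL=6 is the env var the harness actually sets, but the exact log format varies by MIOpen version):

```python
import os
import re
import subprocess
import sys

def miopen_solver_for(shape_args: list[str]) -> str:
    # Run one conv in a fresh subprocess so MIOpen's solver choice for
    # exactly this shape shows up in the log.
    env = dict(os.environ, MIOPEN_LOG_LEVEL="6")
    proc = subprocess.run(
        [sys.executable, "run_one_conv.py", *shape_args],  # hypothetical script
        env=env, capture_output=True, text=True,
    )
    # Adjust the pattern to your MIOpen version's log output.
    match = re.search(r"[Ss]olver:?\s+(\S+)", proc.stderr)
    return match.group(1) if match else "unknown"
```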

Docs:

  • aiter/ops/triton/conv/README.md — quick start, headline results, constraints,
    reproduction instructions
  • aiter/ops/triton/conv/DESIGN.md — architecture, per-kernel deep-dive, full
    Winograd F(4,3) derivation (G/Bᵀ/Aᵀ matrices, 361× amplification analysis,
    why Winograd is disabled for C < 4), the routing heuristic, memory layouts,
    numerical model, extension guide

Performance

See aiter/ops/triton/conv/README.md#headline-results
for the full chart set (resnet50 / SD3.5 VAE / FLUX.2 VAE × fp16/bf16 ×
nchw/nhwc × multiple batch sizes, on RDNA4).

Note on TFLOPS: numbers are direct-convolution-equivalent throughput, applied identically to
both backends. Winograd kernels execute fewer literal hardware MACs than this
denominator counts (~4× fewer for F(4,3)). The comparison is apples-to-apples.
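
Concretely, the common denominator is the standard direct-convolution FLOP count (for batch $N$, output channels $K$, input channels $C$, filter $R \times S$, output $P \times Q$):

$$\mathrm{FLOPs} = 2 \cdot N \cdot K \cdot P \cdot Q \cdot C \cdot R \cdot S$$

Dividing this by measured time yields the reported TFLOPS, regardless of how many MACs the kernel actually executes.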

Constraints

  • groups must equal 1 — depthwise / grouped not yet implemented.
    Test harness skips grouped layers and prints a banner showing how many were
    skipped (so coverage % is visible).
  • padding_mode must be "zeros". Pad amount is unrestricted; only the padding
    modes "reflect", "replicate", "circular" are out of scope.
  • Inputs must be fp16 or bf16.
  • Forward only (no backward / training).

Testing

All tests were run on ROCm 7.2 / PyTorch 2.9.1 / Triton 3.7 (commit 23f4e522d).

| Run | Result |
| --- | --- |
| cli --test-mode all --layout both --dtype fp16 | 484 / 484 passed |
| cli --test-mode all --layout both --dtype bf16 | 484 / 484 passed |
| pytest test_pytest.py -k test_no_bias (× fp16/bf16 × nchw/nhwc) | 4 / 4 passed |
| bench_conv2d --model-name resnet50 --num-layers 5 --layout both (fp16, bf16) | exit 0 |

Per-method correctness: each kernel family is exercised across 12 edge-case
shapes, 200 random shapes, 4 fused activations (none/relu/relu6/gelu), and the
real per-layer shapes captured by hooking ResNet-50 / SD3.5 VAE / FLUX.2 VAE
forwards.

How to use

```python
import torch

from aiter.ops.triton.conv.conv2d import conv2d

y = conv2d(
    x, w, bias=None,
    stride=(1, 1), padding=(1, 1), dilation=(1, 1),
    activation="relu",          # "none" | "relu" | "relu6" | "gelu"
    out_dtype=torch.float16,
    layout="nchw",              # "nchw" or "nhwc"
)
```

Drop-in replacement: walk an nn.Module, swap each nn.Conv2d.forward for one
that calls conv2d(...). Numerical agreement on FLUX.2 VAE end-to-end:
max pixel diff 6/255, mean 0.17/255.
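
A minimal sketch of that walk, assuming the conv2d signature shown above (swap_conv2d_forwards is illustrative, not the PR's API; the skip conditions mirror the Constraints section):

```python
import torch.nn as nn

from aiter.ops.triton.conv.conv2d import conv2d

def swap_conv2d_forwards(model: nn.Module) -> None:
    for m in model.modules():
        # Skip what the kernels don't support: grouped convs, non-zero
        # padding modes, and string paddings like "same".
        if not (isinstance(m, nn.Conv2d) and m.groups == 1
                and m.padding_mode == "zeros"
                and not isinstance(m.padding, str)):
            continue
        def _fwd(x, _m=m):  # bind the module at definition time
            return conv2d(
                x, _m.weight, bias=_m.bias,
                stride=_m.stride, padding=_m.padding, dilation=_m.dilation,
                out_dtype=x.dtype,
            )
        m.forward = _fwd
```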

Reproducing the benchmarks

From repo root:

```bash
# Correctness (full matrix)
python -m op_tests.triton_tests.conv.cli --test-mode all --layout both --dtype fp16
python -m op_tests.triton_tests.conv.cli --test-mode all --layout both --dtype bf16

# Per-layer TFLOPS table vs PyTorch / MIOpen
python -m op_tests.op_benchmarks.triton.bench_conv2d --model-name resnet50 --num-layers 53
python -m op_tests.op_benchmarks.triton.bench_conv2d --model-name sd35_vae \
    --model-path <path to model>/stable-diffusion-3.5-medium

# 3×3 method comparison (cblocked vs Winograd vs nhwc, side-by-side)
python -m op_tests.op_benchmarks.triton.bench_conv2d --method all --model-name resnet50

# Pytest matrix
pytest op_tests/triton_tests/conv/test_pytest.py
```

Files

```
aiter/ops/triton/conv/                        # wrapper + docs
  conv2d.py, _launch.py, _prepack.py, _utils.py, __init__.py
  README.md, DESIGN.md, images/

aiter/ops/triton/_triton_kernels/conv/        # @triton.jit kernels
  conv_1x1.py, conv_3x3.py, conv_3x3_winograd_f4x3.py,
  conv_general.py, helpers.py, __init__.py

op_tests/triton_tests/conv/                   # correctness + bench harness
  cli.py, suite.py, bench.py, _registry.py,
  test_edge.py, test_fuzz.py, test_models.py, test_pytest.py, __init__.py

op_tests/op_benchmarks/triton/bench_conv2d.py # convenience bench shim
```

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2886 --add-label <label>

@cagrikymk cagrikymk changed the title Triton Conv Kernels First Commit to AITER [TRITON] Conv Kernels First Commit to AITER Apr 27, 2026
Comment thread on aiter/ops/triton/conv/conv2d.py (outdated):
```python
    padding=(0, 0),
    dilation=(1, 1),
    activation="none",
    out_dtype=torch.float16,
```
Contributor


Maybe we can have a None default?

If None, the out dtype can be the same as the input dtype?

@cagrikymk
Contributor

Others can chime in here, but typically we don't do auto tuning at runtime. It adds a certain randomness and can behave unexpectedly when part of a CUDA graph. The same inputs can lead to a different kernel config being picked if the time delta between them is relatively small.

However, assuming it gets handled correctly by the kernel consumer, I don't see much of a problem given the config size is small (<8 configs per kernel).

I assume the config list is selected for RDNA4, based on what the README contains? Maybe we can note that in the helper.

Also, we have certain artifacts like README.md and DESIGN.md that also come with PNGs embedded in them. I don't have strong opinions about where to store that info.

@rahulbatra85 @brunomazzottiamd @vgokhale @azaidy

@brunomazzottiamd
Contributor

> Others can chime in here, but typically we don't do auto tuning at runtime. It adds a certain randomness and can behave unexpectedly when part of a CUDA graph. The same inputs can lead to a different kernel config being picked if the time delta between them is relatively small.
>
> However, assuming it gets handled correctly by the kernel consumer, I don't see much of a problem given the config size is small (<8 configs per kernel).
>
> I assume the config list is selected for RDNA4, based on what the README contains? Maybe we can note that in the helper.
>
> Also, we have certain artifacts like README.md and DESIGN.md that also come with PNGs embedded in them. I don't have strong opinions about where to store that info.
>
> @rahulbatra85 @brunomazzottiamd @vgokhale @azaidy

Hi @cagrikymk. To be honest with you, I didn't review this PR in depth, but I see some red flags at first glance:

  • Triton auto tuning. We should rely on config files.
  • Why do we need images in AITER repo? This will add unnecessary fat to repo cloning and CI in general. Why can't we host a PDF paper somewhere and just add a link to it in AITER repo?
  • I don't like the idea of having artifacts in the repo that make claims about performance metrics. Performance can always change over time. For instance, a Triton compiler update can change performance, and the proper way to address that would be to update the performance artifacts. This is cumbersome... IMHO, you should run a benchmark script every time you need performance data. That is replicable and more lightweight compared to charts stored as raster images.

@brunomazzottiamd
Contributor

Added some labels to this PR, to make sure CI runs on all supported CDNA architectures.

@saeid-rostami
Contributor Author

saeid-rostami commented May 8, 2026

@cagrikymk @brunomazzottiamd

Thanks for taking the time to review this PR.

Both points are addressed in the latest revision:

  • Images in the repo: removed all 5 PNGs from aiter/ops/triton/conv/images/ (3.5 MB total — the FLUX.2 generated images were 1.6 MB each, the perf charts 91–93 KB each) and deleted the directory.
  • DESIGN.md is unchanged — it had no embedded images (the diagrams in there are Mermaid, which renders client-side from text).

Let me know if there's anything else you'd like adjusted.

On runtime autotune — I'd appreciate a bit more guidance here. Looking at the existing Triton kernels in AITER, several already use @triton.autotune for runtime config selection, so the conv kernels followed that pattern.
The conv kernels already pass cache_results=True to @triton.autotune for autotune-result persistence across runs (sketch below). If there's a different convention you'd like Triton kernels to move toward (config files for Triton ops, a particular autotune mode, etc.), could you point me at an example I can use as the template? Happy to refactor once I understand the target pattern.
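
For concreteness, the pattern in question looks roughly like this (block sizes and the tuning key are illustrative, not the actual conv kernel's configs):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_warps=4),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64}, num_warps=8),
    ],
    key=["N", "C", "K"],   # retune when these change
    cache_results=True,    # persist tuning results across runs
)
@triton.jit
def _conv_kernel(x_ptr, w_ptr, y_ptr, N, C, K,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pass  # kernel body elided
```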

@saeid-rostami
Contributor Author

saeid-rostami commented May 8, 2026

> Added some labels to this PR, to make sure CI runs on all supported CDNA architectures.

Thanks for helping review this PR.
On platform targeting: these kernels are written and tuned specifically for AMD RDNA GPUs, with RDNA4 (gfx1201) as the primary target. The design and optimization choices reflect current Triton compiler behavior/limitations on RDNA — software pipelining isn't generated, pingpong / warp-specialization scheduling isn't available, and matrix instructions are emitted as WMMA. Loop ordering, tile sizes, and the explicit kernel-template structure compensate for these.

@brunomazzottiamd
Contributor

@saeid-rostami

Regarding Triton auto tuning:

  • I'm sure that the vast majority of Triton kernels in AITER rely on config files instead of auto tuning.
  • I did a quick search for @triton.autotune decorators and, as far as I can see, the cases we have in AITER aren't the textbook ones presented in Triton tutorials. Sometimes we have a config list with just one config, or a slightly larger list of configs that's gated behind an environment variable, so the user has to actively set something to trigger autotuning. (Warning: I may be wrong here, my search was a very quick one.)
  • As Cagri has pointed out before, auto tuning can introduce unpredictable performance behavior. I'd add that auto tuning imposes a toll on the first kernel run.
  • Our CI pipeline always runs without a kernel cache because we want to test the latest Triton compiler with our kernels and always compile every kernel. Auto tuning would add to CI runtime.
  • Please check the aiter/ops/triton/configs directory. It contains JSON files with kernel configs, keyed by GPU arch. For instance, we can have gfx942-MY-KERNEL.json (CDNA3) and gfx950-MY-KERNEL.json (CDNA4). In my opinion, there's no problem at all with adding kernel config files for RDNA archs (see the sketch after this list).
  • Triton GEMMs rely on the get_gemm_config function from aiter/ops/triton/utils/gemm_config_utils.py to read the JSON files. There's logic there to split the best configs by the size of the $M$ dimension. You don't need to do such a thing for your convolution kernels, but it's a great source of inspiration for how to organize and deal with config files.
  • Please don't get me wrong, I love Triton auto tuning, but it's good for development and it isn't that good for production.
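
A sketch of what a conv analogue of that lookup could be (load_conv_config and the file-name scheme are hypothetical; the arch query uses Triton's runtime driver API):

```python
import functools
import json
from pathlib import Path

import triton

CONFIG_DIR = Path("aiter/ops/triton/configs")

@functools.lru_cache(maxsize=None)
def load_conv_config(kernel_name: str) -> dict:
    # Pick the JSON file matching the current GPU arch, e.g. a
    # gfx1201-CONV-3X3.json for RDNA4 alongside gfx942-*.json for CDNA3.
    arch = triton.runtime.driver.active.get_current_target().arch
    path = CONFIG_DIR / f"{arch}-{kernel_name}.json"
    with open(path) as f:
        return json.load(f)
```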

Regarding tests:

  • Now I see a lot of files in op_tests/triton_tests/conv! Can you please explain what's going on, with a brief description of each test file? Are you doing fuzz testing too?
  • How much time does each test file take to run? Be mindful of test execution time, prioritizing maximum coverage with the fewest possible test cases. We recently made a significant effort to optimize the CI running time of Triton unit tests.
  • Our CDNA3 and CDNA4 CI runners will pick up everything under op_tests/triton_tests and run it as part of our test job. You should only put unit tests in the op_tests/triton_tests directory.
  • I know you're targeting RDNA, but please don't break anything CDNA related (unit tests, CI, whatever...). You don't need to tune for CDNA, just add fallbacks and be defensive.

There's a conv reorg going on in #3048; please take a look at that PR and try to follow a similar structure.
