
[TRITON] Conv Kernels First Commit to AITER #2886

Draft

saeid-rostami wants to merge 3 commits into ROCm:main from saeid-rostami:conv2d-initial

Conversation

@saeid-rostami
Contributor

Adds a Triton conv2d library targeted at AMD RDNA GPUs, plus a
correctness + benchmark harness that compares against PyTorch / MIOpen.

Motivation

PyTorch on AMD goes through MIOpen, whose hand-tuned solvers cover some
dtype × layout × architecture combinations well and others poorly — bf16 in
particular falls back to direct/GEMM solvers on RDNA4 that are noticeably
slower at large channel counts. Most modern checkpoints (LLMs, diffusion VAEs)
ship in bf16, so the gap matters.

This op takes the opposite approach: a single set of Triton kernels that runs
fp16 and bf16 through the same code path, supports NCHW and NHWC, and gets reasonable performance across
the full matrix without per-architecture hand tuning.

What's added

Library (aiter/ops/triton/conv/):

  • conv2d.py — public API + shape-driven router
  • _launch.py — grid setup + _select_3x3_method heuristic
  • _prepack.py — LRU-cached weight/input repack (see the sketch after this list)
  • _utils.py — shape math, dtype/activation enums, tolerance model
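
For a feel of what the cached repack does, here is a minimal sketch (the function name, cache key, and eviction policy are illustrative, not the PR's actual code):

```python
from collections import OrderedDict

import torch

_CACHE_MAX = 64
_repack_cache: "OrderedDict[tuple, torch.Tensor]" = OrderedDict()

def repack_weight(w: torch.Tensor, layout: str) -> torch.Tensor:
    # Key on the storage pointer plus metadata so a reused weight tensor
    # is only repacked once per layout.
    key = (w.data_ptr(), tuple(w.shape), w.dtype, layout)
    if key in _repack_cache:
        _repack_cache.move_to_end(key)  # mark as most recently used
        return _repack_cache[key]
    # Illustrative "repack": materialize the requested memory format.
    fmt = torch.channels_last if layout == "nhwc" else torch.contiguous_format
    out = w.contiguous(memory_format=fmt)
    _repack_cache[key] = out
    if len(_repack_cache) > _CACHE_MAX:
        _repack_cache.popitem(last=False)  # evict least recently used
    return out
```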

Kernels (aiter/ops/triton/_triton_kernels/conv/), five families:

| Family | When it runs |
| --- | --- |
| 1×1 GEMM | R==1, S==1 |
| 3×3 cblocked (NCHW) | 3×3, channel-blocked input for coalesced loads |
| 3×3 NHWC | 3×3 with channels-last input — no input repack |
| Winograd F(4×4, 3×3) | 3×3, stride=1, dilation=1, C ≥ 512, K ≥ 512, enough output tiles |
| General | anything else (5×5, 7×7, dilated, strided) |
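
A hedged sketch of the routing this table implies (thresholds are from the table; select_kernel and enough_output_tiles are hypothetical names, and the real heuristic lives in _launch.py's _select_3x3_method):

```python
def select_kernel(x, w, stride, dilation, layout):
    # Shape-driven dispatch mirroring the table above (sketch only).
    K, C, R, S = w.shape
    if R == 1 and S == 1:
        return "1x1_gemm"
    if (R, S) == (3, 3):
        if (stride == (1, 1) and dilation == (1, 1)
                and C >= 512 and K >= 512
                and enough_output_tiles(x, w)):  # hypothetical helper
            return "winograd_f4x3"
        return "3x3_nhwc" if layout == "nhwc" else "3x3_cblocked"
    return "general"  # 5x5, 7x7, dilated, strided, ...
```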

Test/bench harness (op_tests/triton_tests/conv/):

  • cli.py — --test-mode {edge,random,stability,activations,models,all}

  • suite.py — correctness checking + bench accumulation + result tables

  • bench.py — timing + precompute_miopen_solvers (subprocess + MIOPEN_LOG_LEVEL=6
    to label each PyTorch baseline row with the MIOpen solver it picked; see the
    sketch after this list)

  • test_edge.py / test_fuzz.py / test_models.py — shape sources

  • test_pytest.py — parametrized over fp16/bf16 × nchw/nhwc

  • _registry.py — single source of truth for kernel methods (used by CLI,
    suite, comparison tables, tolerance dispatch)

  • Bench shim (op_tests/op_benchmarks/triton/bench_conv2d.py) — convenience
    entry that injects --benchmark --test-mode models.
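
The solver-labeling trick in bench.py roughly amounts to the following (the child script name and the log-scraping regex are assumptions on my part; MIOPEN_LOG_LEVEL=6 is the env var the harness actually sets, but the exact log format varies by MIOpen version):

```python
import os
import re
import subprocess
import sys

def miopen_solver_for(shape_args: list[str]) -> str:
    # Run one conv in a fresh subprocess so MIOpen's solver choice for
    # exactly this shape shows up in the log.
    env = dict(os.environ, MIOPEN_LOG_LEVEL="6")
    proc = subprocess.run(
        [sys.executable, "run_one_conv.py", *shape_args],  # hypothetical script
        env=env, capture_output=True, text=True,
    )
    # Adjust the pattern to your MIOpen version's log output.
    match = re.search(r"[Ss]olver:?\s+(\S+)", proc.stderr)
    return match.group(1) if match else "unknown"
```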

Docs:

  • aiter/ops/triton/conv/README.md — quick start, headline results, constraints,
    reproduction instructions
  • aiter/ops/triton/conv/DESIGN.md — architecture, per-kernel deep-dive, full
    Winograd F(4,3) derivation (G/Bᵀ/Aᵀ matrices, 361× amplification analysis,
    why Winograd is disabled for C < 4), the routing heuristic, memory layouts,
    numerical model, extension guide

Performance

See aiter/ops/triton/conv/README.md#headline-results
for the full chart set (resnet50 / SD3.5 VAE / FLUX.2 VAE × fp16/bf16 ×
nchw/nhwc × multiple batch sizes, on RDNA4).

Note on TFLOPS: numbers are direct-convolution-equivalent throughput, applied identically to
both backends. Winograd kernels execute fewer literal hardware MACs than this
denominator counts (~4× fewer for F(4,3)). The comparison is apples-to-apples.
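
Concretely, the common denominator is the standard direct-convolution FLOP count (for batch $N$, output channels $K$, input channels $C$, filter $R \times S$, output $P \times Q$):

$$\mathrm{FLOPs} = 2 \cdot N \cdot K \cdot P \cdot Q \cdot C \cdot R \cdot S$$

Dividing this by measured time yields the reported TFLOPS, regardless of how many MACs the kernel actually executes.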

Constraints

  • groups must equal 1 — depthwise / grouped not yet implemented.
    Test harness skips grouped layers and prints a banner showing how many were
    skipped (so coverage % is visible).
  • padding_mode must be "zeros". Pad amount is unrestricted; only the padding
    modes "reflect", "replicate", "circular" are out of scope.
  • Inputs must be fp16 or bf16.
  • Forward only (no backward / training).

Testing

All tests were run on ROCm 7.2 / PyTorch 2.9.1 / Triton 3.7 (commit 23f4e522d).

| Run | Result |
| --- | --- |
| cli --test-mode all --layout both --dtype fp16 | 484 / 484 passed |
| cli --test-mode all --layout both --dtype bf16 | 484 / 484 passed |
| pytest test_pytest.py -k test_no_bias (× fp16/bf16 × nchw/nhwc) | 4 / 4 passed |
| bench_conv2d --model-name resnet50 --num-layers 5 --layout both (fp16, bf16) | exit 0 |

Per-method correctness: each kernel family is exercised across 12 edge-case
shapes, 200 random shapes, 4 fused activations (none/relu/relu6/gelu), and the
real per-layer shapes captured by hooking ResNet-50 / SD3.5 VAE / FLUX.2 VAE
forwards.

How to use

```python
import torch

from aiter.ops.triton.conv.conv2d import conv2d

y = conv2d(
    x, w, bias=None,
    stride=(1, 1), padding=(1, 1), dilation=(1, 1),
    activation="relu",          # "none" | "relu" | "relu6" | "gelu"
    out_dtype=torch.float16,
    layout="nchw",              # "nchw" or "nhwc"
)
```

Drop-in replacement: walk an nn.Module, swap each nn.Conv2d.forward for one
that calls conv2d(...). Numerical agreement on FLUX.2 VAE end-to-end:
max pixel diff 6/255, mean 0.17/255.
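
A minimal sketch of that walk, assuming the conv2d signature shown above (swap_conv2d_forwards is illustrative, not the PR's API; the skip conditions mirror the Constraints section):

```python
import torch.nn as nn

from aiter.ops.triton.conv.conv2d import conv2d

def swap_conv2d_forwards(model: nn.Module) -> None:
    for m in model.modules():
        # Skip what the kernels don't support: grouped convs, non-zero
        # padding modes, and string paddings like "same".
        if not (isinstance(m, nn.Conv2d) and m.groups == 1
                and m.padding_mode == "zeros"
                and not isinstance(m.padding, str)):
            continue
        def _fwd(x, _m=m):  # bind the module at definition time
            return conv2d(
                x, _m.weight, bias=_m.bias,
                stride=_m.stride, padding=_m.padding, dilation=_m.dilation,
                out_dtype=x.dtype,
            )
        m.forward = _fwd
```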

Reproducing the benchmarks

From repo root:

```bash
# Correctness (full matrix)
python -m op_tests.triton_tests.conv.cli --test-mode all --layout both --dtype fp16
python -m op_tests.triton_tests.conv.cli --test-mode all --layout both --dtype bf16

# Per-layer TFLOPS table vs PyTorch / MIOpen
python -m op_tests.op_benchmarks.triton.bench_conv2d --model-name resnet50 --num-layers 53
python -m op_tests.op_benchmarks.triton.bench_conv2d --model-name sd35_vae \
    --model-path <path to model>/stable-diffusion-3.5-medium

# 3×3 method comparison (cblocked vs Winograd vs nhwc, side-by-side)
python -m op_tests.op_benchmarks.triton.bench_conv2d --method all --model-name resnet50

# Pytest matrix
pytest op_tests/triton_tests/conv/test_pytest.py
```

Files

```
aiter/ops/triton/conv/                        # wrapper + docs
  conv2d.py, _launch.py, _prepack.py, _utils.py, __init__.py
  README.md, DESIGN.md, images/

aiter/ops/triton/_triton_kernels/conv/        # @triton.jit kernels
  conv_1x1.py, conv_3x3.py, conv_3x3_winograd_f4x3.py,
  conv_general.py, helpers.py, __init__.py

op_tests/triton_tests/conv/                   # correctness + bench harness
  cli.py, suite.py, bench.py, _registry.py,
  test_edge.py, test_fuzz.py, test_models.py, test_pytest.py, __init__.py

op_tests/op_benchmarks/triton/bench_conv2d.py # convenience bench shim
```

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2886 --add-label <label>

@cagrikymk cagrikymk changed the title Triton Conv Kernels First Commit to AITER [TRITON] Conv Kernels First Commit to AITER Apr 27, 2026
Comment thread on aiter/ops/triton/conv/conv2d.py (outdated):
```python
    padding=(0, 0),
    dilation=(1, 1),
    activation="none",
    out_dtype=torch.float16,
```
Contributor


Maybe we can have a None default?

If None, the out dtype can be the same as the input dtype?

@cagrikymk
Contributor

Others can chime in here, but typically we don't do auto tuning at runtime. It adds a certain randomness and can behave unexpectedly when part of a CUDA graph. The same inputs can lead to a different kernel config being picked if the time delta between them is relatively small.

However, assuming it gets handled correctly by the kernel consumer, I don't see much of a problem given the config size is small (<8 configs per kernel).

I assume the config list is selected for RDNA4, based on what the README contains? Maybe we can note that in the helper.

Also, we have certain artifacts like README.md and DESIGN.md that also come with PNGs embedded in them. I don't have strong opinions about where to store that info.

@rahulbatra85 @brunomazzottiamd @vgokhale @azaidy

@brunomazzottiamd
Contributor

> Others can chime in here, but typically we don't do auto tuning at runtime. It adds a certain randomness and can behave unexpectedly when part of a CUDA graph. The same inputs can lead to a different kernel config being picked if the time delta between them is relatively small.
>
> However, assuming it gets handled correctly by the kernel consumer, I don't see much of a problem given the config size is small (<8 configs per kernel).
>
> I assume the config list is selected for RDNA4, based on what the README contains? Maybe we can note that in the helper.
>
> Also, we have certain artifacts like README.md and DESIGN.md that also come with PNGs embedded in them. I don't have strong opinions about where to store that info.
>
> @rahulbatra85 @brunomazzottiamd @vgokhale @azaidy

Hi @cagrikymk. To be honest with you, I didn't review this PR in depth, but I see some red flags at first glance:

  • Triton auto tuning. We should rely on config files.
  • Why do we need images in AITER repo? This will add unnecessary fat to repo cloning and CI in general. Why can't we host a PDF paper somewhere and just add a link to it in AITER repo?
  • I don't like the idea of having artifacts in the repo that make claims about performance metrics. Performance can always change over time. For instance, a Triton compiler update can change performance, and the proper way to address that would be to update the performance artifacts. This is cumbersome... IMHO, you should run a benchmark script every time you need performance data. That is replicable and more lightweight compared to charts stored as raster images.

@brunomazzottiamd
Contributor

Added some labels to this PR, to make sure CI runs on all supported CDNA architectures.

@saeid-rostami
Contributor Author

saeid-rostami commented May 8, 2026

@cagrikymk @brunomazzottiamd

Thanks for taking the time to review this PR.

Both points are addressed in the latest revision:

  • Images in the repo: removed all 5 PNGs from aiter/ops/triton/conv/images/ (3.5 MB total — the FLUX.2 generated images were 1.6 MB each, the perf charts 91–93 KB each) and deleted the directory.
  • DESIGN.md is unchanged — it had no embedded images (the diagrams in there are Mermaid, which renders client-side from text).

Let me know if there's anything else you'd like adjusted.

On runtime autotune — I'd appreciate a bit more guidance here. Looking at the existing Triton kernels in AITER, several already use @triton.autotune for runtime config selection, so the conv kernels followed that pattern.
The conv kernels already pass cache_results=True to @triton.autotune for autotune-result persistence across runs (sketch below). If there's a different convention you'd like Triton kernels to move toward (config files for Triton ops, a particular autotune mode, etc.), could you point me at an example I can use as the template? Happy to refactor once I understand the target pattern.
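
For concreteness, the pattern in question looks roughly like this (block sizes and the tuning key are illustrative, not the actual conv kernel's configs):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_warps=4),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64}, num_warps=8),
    ],
    key=["N", "C", "K"],   # retune when these change
    cache_results=True,    # persist tuning results across runs
)
@triton.jit
def _conv_kernel(x_ptr, w_ptr, y_ptr, N, C, K,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pass  # kernel body elided
```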

@saeid-rostami
Contributor Author

saeid-rostami commented May 8, 2026

> Added some labels to this PR, to make sure CI runs on all supported CDNA architectures.

Thanks for helping review this PR.
On platform targeting: these kernels are written and tuned specifically for AMD RDNA GPUs, with RDNA4 (gfx1201) as the primary target. The design and optimization choices reflect current Triton compiler behavior/limitations on RDNA — software pipelining isn't generated, pingpong / warp-specialization scheduling isn't available, and matrix instructions are emitted as WMMA. Loop ordering, tile sizes, and the explicit kernel-template structure compensate for these.

@brunomazzottiamd
Contributor

@saeid-rostami

Regarding Triton auto tuning:

  • I'm sure that the vast majority of Triton kernels in AITER rely on config files instead of auto tuning.
  • I did a quick search for @triton.autotune decorators and, as far as I can see, the cases we have in AITER aren't the textbook ones presented in Triton tutorials. Sometimes we have a config list with just one config, or a slightly larger list of configs that's gated behind an environment variable, so the user has to actively set something to trigger autotuning. (Warning: I may be wrong here, my search was a very quick one.)
  • As Cagri has pointed out before, auto tuning can introduce unpredictable performance behavior. I'd add that auto tuning imposes a toll on the first kernel run.
  • Our CI pipeline always runs without a kernel cache because we want to test the latest Triton compiler with our kernels and always compile every kernel. Auto tuning would add to CI runtime.
  • Please check the aiter/ops/triton/configs directory. It contains JSON files with kernel configs, keyed by GPU arch. For instance, we can have gfx942-MY-KERNEL.json (CDNA3) and gfx950-MY-KERNEL.json (CDNA4). In my opinion, there's no problem at all with adding kernel config files for RDNA archs (see the sketch after this list).
  • Triton GEMMs rely on the get_gemm_config function from aiter/ops/triton/utils/gemm_config_utils.py to read the JSON files. There's logic there to split the best configs by the size of the $M$ dimension. You don't need to do such a thing for your convolution kernels, but it's a great source of inspiration for how to organize and deal with config files.
  • Please don't get me wrong, I love Triton auto tuning, but it's good for development and it isn't that good for production.
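
A sketch of what a conv analogue of that lookup could be (load_conv_config and the file-name scheme are hypothetical; the arch query uses Triton's runtime driver API):

```python
import functools
import json
from pathlib import Path

import triton

CONFIG_DIR = Path("aiter/ops/triton/configs")

@functools.lru_cache(maxsize=None)
def load_conv_config(kernel_name: str) -> dict:
    # Pick the JSON file matching the current GPU arch, e.g. a
    # gfx1201-CONV-3X3.json for RDNA4 alongside gfx942-*.json for CDNA3.
    arch = triton.runtime.driver.active.get_current_target().arch
    path = CONFIG_DIR / f"{arch}-{kernel_name}.json"
    with open(path) as f:
        return json.load(f)
```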

Regarding tests:

  • Now I see a lot of files in op_tests/triton_tests/conv! Can you please explain what's going on, with a brief description of each test file? Are you doing fuzz testing too?
  • How much time does each test file take to run? Be mindful of test execution time, prioritizing maximum coverage with the fewest possible test cases. We recently made a significant effort to optimize the CI running time of Triton unit tests.
  • Our CDNA3 and CDNA4 CI runners will pick up everything under op_tests/triton_tests and run it as part of our test job. You should only put unit tests in the op_tests/triton_tests directory.
  • I know you're targeting RDNA, but please don't break anything CDNA related (unit tests, CI, whatever...). You don't need to tune for CDNA, just add fallbacks and be defensive.

There's a conv reorg going on in #3048; please take a look at that PR and try to follow a similar structure.
