Skip to content

refactor(triton): reorganize conv modules and unify gated FP8 quant path#3048

Open
hellozhuo-amd wants to merge 8 commits intomainfrom
zhuo/triton-pr2423-reorg
Open

refactor(triton): reorganize conv modules and unify gated FP8 quant path#3048
hellozhuo-amd wants to merge 8 commits intomainfrom
zhuo/triton-pr2423-reorg

Conversation

@hellozhuo-amd
Copy link
Copy Markdown
Contributor

@hellozhuo-amd hellozhuo-amd commented May 6, 2026

Summary

This PR reorganizes Triton conv1d code into dedicated conv modules and folds gated RMS+FP8 quantization into the existing fused FP8 kernel path.
Related comments: #3005 (comment)

What changed

  • Conv module reorganization

    • Moved conv1d Triton kernels under:
      • aiter/ops/triton/_triton_kernels/conv/
    • Moved conv1d Python wrappers under:
      • aiter/ops/triton/conv/
    • Added __init__.py exports for the new conv packages.
  • FP8 quant integration

    • Unified gated RMS+FP8 flow as a feature path in the existing fused FP8 implementation instead of maintaining a separate standalone wrapper path.
    • Kept both:
      • classic path (GATED_RMS_FP8=False), and
      • gated path (GATED_RMS_FP8=True),
        in the same fused kernel flow.
    • Preserved gated launch behavior with dynamic ROWS_PER_BLOCK (calc_rows_per_block) for gated execution.
  • Test layout + script updates

    • Moved conv tests to:
      • op_tests/triton_tests/conv/test_causal_conv1d.py
      • op_tests/triton_tests/conv/test_causal_conv1d_update_single_token.py
    • Updated split test mapping:
      • .github/scripts/split_tests.sh

Motivation

  • Align conv1d structure with existing Triton package organization patterns (attention, quant, etc.) for easier discovery and maintenance.
  • Reduce duplicate code paths in FP8 quantization by treating gating as an additional feature on top of the existing fused kernel path.
  • Keep test coverage explicit for both gated and classic quant paths after refactor.

Technical Details

  • Introduced new conv package roots in both kernel and wrapper layers.
  • Updated conv test paths accordingly.
  • Consolidated gated/classic quant control flow under the fused FP8 kernel path with feature flags.
  • Retained behavior-sensitive launch configuration for gated quant via dynamic rows-per-block heuristic.

Test Plan

Ran:

python3 -m pytest \
  op_tests/triton_tests/test_fused_rearrange_sigmoid_gdr.py \
  op_tests/triton_tests/conv/test_causal_conv1d_update_single_token.py \
  op_tests/triton_tests/quant/test_fused_rms_gated_fp8_group_quant.py \
  op_tests/triton_tests/quant/test_fused_fp8_quant.py::test_fused_rms_fp8_group_quant \
  -v

Additional validation:

  • black --check and ruff check on touched Triton conv/quant files.

Test Result

  • Targeted test bundle passed.
  • Gated quant sweep + classic fused FP8 quant tests passed.
  • Formatting/lint checks on touched files passed.

From the vllm side, the speed of the gated fp8 kernel was maintained.

Notes

  • No functional regressions observed in covered paths.
  • Includes path/layout refactor, so downstream imports should use the new conv module paths.

Submission Checklist

Move causal conv1d Triton code into dedicated conv subpackages by relocating
kernels and wrappers under aiter.ops.triton._triton_kernels.conv and
aiter.ops.triton.conv, and move related Triton tests into
op_tests/triton_tests/conv for consistent structure.

Integrate gated RMS+FP8 quantization as an additional feature path in the
existing fused FP8 kernel flow instead of maintaining a separate standalone
gated kernel wrapper path. Keep both gated and classic test coverage in the
test commands and update split-tests mapping to the new conv test path.
@hellozhuo-amd hellozhuo-amd requested review from a team and Copilot May 6, 2026 08:51
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests
ci:atom ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm vLLM benchmark
ci:all All of the above

Add labels via the sidebar or gh pr edit 3048 --add-label <label>

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reorganizes the Triton causal conv1d implementation into dedicated conv/ subpackages and folds the previously separate gated RMSNorm+FP8 quantization path into the existing fused FP8 kernel flow.

Changes:

  • Relocate causal conv1d wrappers/kernels under aiter.ops.triton.conv and aiter.ops.triton._triton_kernels.conv, updating tests and split-test mapping accordingly.
  • Unify gated RMSNorm+FP8 group quantization by adding a gated mode to _fused_rms_fp8_group_quant_kernel and routing the gated wrapper through it.
  • Minor repo hygiene updates (ignore vim swap files; normalize .ipynb_checkpoints ignore pattern).

Reviewed changes

Copilot reviewed 9 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
op_tests/triton_tests/conv/test_causal_conv1d.py Update imports to new aiter.ops.triton.conv / kernel conv subpackage paths.
op_tests/triton_tests/conv/test_causal_conv1d_update_single_token.py Update imports to new conv subpackage paths.
aiter/ops/triton/quant/fused_fp8_quant.py Route gated RMS+FP8 quant through unified fused kernel path; minor compatibility tweaks for fp8 dtype bounds + heuristics.
aiter/ops/triton/conv/causal_conv1d.py Update kernel import path to _triton_kernels.conv.
aiter/ops/triton/conv/causal_conv1d_update_single_token.py Update kernel import paths to _triton_kernels.conv.
aiter/ops/triton/conv/init.py New conv package exports for causal conv1d APIs.
aiter/ops/triton/_triton_kernels/quant/fused_fp8_quant.py Extend fused RMS+FP8 group quant kernel to support gated mode; remove standalone gated kernel.
aiter/ops/triton/_triton_kernels/conv/causal_conv1d.py New conv kernel module location for causal conv1d fwd/update kernels.
aiter/ops/triton/_triton_kernels/conv/causal_conv1d_update_single_token.py New conv kernel module location for single-token update kernels.
aiter/ops/triton/_triton_kernels/conv/init.py New conv kernel subpackage init/export.
.gitignore Add vim swap ignores; fix .ipynb_checkpoints glob indentation.
.github/scripts/split_tests.sh Update split-test mapping to new conv test path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread aiter/ops/triton/_triton_kernels/quant/fused_fp8_quant.py
Comment thread aiter/ops/triton/quant/fused_fp8_quant.py Outdated
Agent-Logs-Url: https://github.com/ROCm/aiter/sessions/68e13a64-8717-4b09-b862-c8bb8f4eb642

Co-authored-by: hellozhuo-amd <225919697+hellozhuo-amd@users.noreply.github.com>
…d FP8 launch

Agent-Logs-Url: https://github.com/ROCm/aiter/sessions/85686171-3024-472f-818f-6ed9d52ee761

Co-authored-by: hellozhuo-amd <225919697+hellozhuo-amd@users.noreply.github.com>
hellozhuo-amd and others added 3 commits May 6, 2026 12:57
- Map causal_conv1d and causal_conv1d_update_single_token in
  _BACKWARD_COMPAT_MAP for legacy flat imports (e.g. vLLM).
- Import conv APIs in op_tests via flat aiter.ops.triton.* paths.
- Silence ruff F401 on intentional comms re-exports in triton __init__.

Co-authored-by: Cursor <cursoragent@cursor.com>
@azaidy azaidy requested review from Boss2002n and k50112113 May 7, 2026 14:39
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep this file empty

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can empty this file too, unless theres a reason as to why we need this.
It will be consistent with the rest of the repo if we empty it :)

Comment on lines 457 to +490
x,
weight,
bias,
z,
dummy,
dummy,
dummy,
x_quant,
scales,
dummy,
dummy,
dummy,
eps,
0.0,
M,
N,
0,
x.stride(0),
z.stride(0),
1,
x.stride(1),
1,
1,
1,
x_quant.stride(0),
x_quant.stride(1),
stride_s_row,
stride_s_g,
M,
N,
eps,
1,
1,
1,
1,
1,
1,
z,
bias_ptr,
z.stride(0),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not do positional args for this, keyword args make it easy to read, it is very hard to follow given we have so many args being passed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we can split these into 2 different kernels instead of having an if-else.
would help making it easier to read

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants