docs: add ISA-level kernel optimization guide #2707

Closed
sunway513 wants to merge 5 commits into ROCm:main from sunway513:docs/isa-kernel-optimization

@sunway513
Collaborator

Summary

Add a comprehensive how-to guide for ISA-level kernel optimization using the ROCm LLVM toolchain.

The guide covers the full workflow:

  • Disassemble .co kernel objects with llvm-objdump
  • Analyze instruction mix (MFMA, buffer loads, LDS, DPP)
  • Extract reassemblable .s files with correct branch label resolution (word-offset addressing)
  • Recompile with clang++ and verify binary-identical .text sections
  • Inject modified .text back into original .co via llvm-objcopy --update-section (preserves AMDHSA metadata)
  • Profile with rocprofv3 --kernel-trace (SQLite query examples, output format options)
  • Build rocprof-trace-decoder from source for ATT instruction-level tracing
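The "word-offset addressing" point in the third bullet can be made concrete. On AMD GCN/CDNA, SOPP branch instructions carry a signed 16-bit immediate counted in 4-byte words and relative to the instruction that follows the branch, so the target is `pc + 4 + simm16 * 4`. The sketch below only illustrates that arithmetic; it is not the extraction script from this PR, and the function names, label scheme, and sample addresses are hypothetical.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, simm16: int) -> int:
    """Byte address targeted by a SOPP branch located at byte address pc.

    The immediate counts 4-byte words and is relative to the next
    instruction (pc + 4), not to the branch itself.
    """
    return pc + 4 + sign_extend_16(simm16) * 4


def label_for(address: int) -> str:
    """Hypothetical label naming scheme for a reassemblable .s file."""
    return f"label_{address:04X}"


if __name__ == "__main__":
    # Forward branch: 3 words past the next instruction.
    print(label_for(branch_target(0x100, 0x0003)))  # label_0110
    # Backward branch: 0xFFFE is -2 words.
    print(label_for(branch_target(0x100, 0xFFFE)))  # label_00FC
```

Treating the immediate as a byte offset, or forgetting the `+ 4`, gives labels that still assemble but point at the wrong instruction, which is why a binary comparison of `.text` after the round trip is a useful check.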

All tools used are open-source ROCm LLVM components — no proprietary tools required.

Verified on a PA decode kernel: round-trip recompile produces binary-identical .text with zero performance regression (±3% over 3 runs × 500 iterations).
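A tolerance check like the ±3% figure above could be scripted along the following lines. This is an assumption about how such a comparison might be set up, not the benchmark used for this PR; the timings are made-up placeholders and `within_tolerance` is a hypothetical helper.

```python
from statistics import mean


def within_tolerance(baseline_us, candidate_us, tolerance=0.03):
    """Return (ok, relative_delta) comparing mean run times.

    relative_delta > 0 means the candidate is slower than the baseline.
    """
    base = mean(baseline_us)
    cand = mean(candidate_us)
    delta = (cand - base) / base
    return abs(delta) <= tolerance, delta


if __name__ == "__main__":
    # Placeholder numbers: mean kernel time per run, in microseconds.
    original = [100.0, 101.0, 99.0]
    round_trip = [101.0, 100.0, 102.0]
    ok, delta = within_tolerance(original, round_trip)
    print(f"delta={delta:+.2%} ok={ok}")
```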

Test plan

  • Verify llvm-objdump, clang++, llvm-objcopy commands work on gfx90a and gfx942
  • Verify Python extraction script handles branch labels correctly
  • Verify rocprofv3 --kernel-trace SQLite query examples work
  • Confirm doc renders correctly on GitHub

sunway513 and others added 5 commits April 9, 2026 14:35
…ocker fix

- Create release notes with categorized changelog (83 features, 53 perf,
  88 fixes, 55 refactors, 61 CI across 334 commits since v0.1.11.post1)
- Add changelog generation script (scripts/generate_changelog.sh)
- Add release validation checklist (scripts/release_checklist.md)
- Update 5 CI workflows to trigger on release/** branches
- Revert problematic GEMM config for Issue ROCm#2656 (DSR1-MXFP4 accuracy
  regression from PR ROCm#2434)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Release workflow improvements (Gaps 1+2+3+7C from #58):
- Auto-trigger on tag push (v*) in addition to manual dispatch
- Add smoke test step: pip install + import validation after build
- Add create_release job: auto-generate changelog, create GitHub
  Release, attach .whl assets
- Add skip_github_release option for test builds
- Provide defaults for all inputs so tag-push path works without
  manual parameters

Release process documentation (Gaps 6+8 from #58):
- RELEASE_PROCESS.md: full release lifecycle (branch → RC → release)
- .postX policy: hotfix only, max 3, must have release notes
- RC tag flow: vX.Y.Z-rc1 for downstream validation before final tag
- Downstream test coverage matrix with release gate designations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…mage, S3

Key changes to aiter-release.yaml, modeled after aiter-test.yaml nightly:

1. Prebuild kernel validation: verify >= 10 .so files in installed wheel,
   fail the build if prebuild didn't work (prevents shipping 37MB wheels
   without precompiled kernels vs expected 200MB+)

2. Docker image push on tag: commit the build container as
   rocm/aiter-ci:{tag}-py{ver} and push to registry, matching the
   pre-build-{SHA} pattern from CI nightly

3. S3 upload on tag: upload wheels to both
   s3://framework-whls-nightlies/whl-releases/{tag}/ (permanent) and
   whl-staging/ (downstream CI compat)

4. Smoke test: pip install + import aiter + import triton gemm

5. Wheel size reporting: log wheel size so prebuild issues are visible

These changes close the gap where release wheels could ship without
precompiled kernels (37MB vs 243MB) and without a reproducible Docker
image, unlike the nightly CI path which has both.
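The prebuild check in item 1 amounts to counting compiled extension files and failing below a threshold. A minimal sketch of that idea follows. It inspects the wheel archive directly, whereas the commit message describes checking the installed wheel, so treat it as a variation; the helper names are hypothetical and only the threshold of 10 comes from the text above.

```python
import zipfile


def count_shared_objects(wheel_path: str) -> int:
    """Count .so entries inside a wheel (a wheel is a zip archive)."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sum(1 for name in wheel.namelist() if name.endswith(".so"))


def validate_prebuild(wheel_path: str, minimum: int = 10) -> None:
    """Raise if the wheel contains fewer than `minimum` .so files."""
    found = count_shared_objects(wheel_path)
    if found < minimum:
        raise RuntimeError(
            f"{wheel_path}: found {found} .so files, expected >= {minimum}"
        )
```

Raising an exception makes a CI step exit non-zero, which gives the "fail the build" behavior described in item 1.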

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

GitHub Actions does not support top-level 'description:' in workflow
YAML. This caused the entire workflow to fail parsing, preventing
workflow_dispatch triggers from being recognized.

Moved the description to a comment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sunway513 requested review from a team and Copilot April 13, 2026 01:39
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2707 --add-label <label>`

Contributor

Copilot AI left a comment

Pull request overview

Adds documentation and automation to support both low-level kernel/ISA investigations and a more repeatable release process (release notes, CI gates, and tag-driven packaging).

Changes:

  • Add an ISA-level kernel optimization workflow guide using ROCm LLVM tools and rocprofv3.
  • Introduce release process docs + checklist and a changelog generation script.
  • Update CI/release workflows to run on release/** branches and trigger release packaging on v* tag pushes; tweak one gfx950 AFP4WFP4 GEMM config.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| docs/isa_kernel_optimization.md | New ISA optimization/how-to guide (LLVM disassembly → round-trip rebuild → profiling). |
| RELEASE_PROCESS.md | Documents end-to-end release procedure, tagging, and hotfix policy. |
| scripts/release_checklist.md | Adds a concrete pre/post-release validation checklist and downstream matrix. |
| scripts/generate_changelog.sh | Generates categorized release notes from git history. |
| RELEASE_NOTES_v0.1.12.md | Adds a hand-written release notes file for v0.1.12. |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-AFP4WFP4-N=7168-K=2304.json | Updates the tuning params for the M_LEQ_64 bucket. |
| .github/workflows/aiter-release.yaml | Tag-triggered release workflow: build wheels, validate prebuild, upload to S3, optionally create GitHub Release, and push Docker tags. |
| .github/workflows/aiter-test.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/triton-test.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/atom-test.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/sglang_downstream.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/vllm_benchmark.yaml | Expands workflow triggers to include release/** branches. |

Comment on lines +15 to +18
# Get all commit subjects with PR numbers
COMMITS=$(git log "${FROM_REF}..${TO_REF}" --format="%s" --reverse)
TOTAL=$(echo "$COMMITS" | wc -l)
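One detail in the excerpt above: `echo "$COMMITS" | wc -l` prints 1 even when `COMMITS` is empty, because `echo` always emits a trailing newline. The sketch below shows the same counting step, plus a prefix-based categorization, without that off-by-one. It is not a translation of `scripts/generate_changelog.sh`; the category prefixes are assumed from conventional-commit style and `categorize` is a hypothetical name.

```python
import re
from collections import Counter

# Assumed conventional-commit prefixes; the real script may use others.
CATEGORIES = ("feat", "perf", "fix", "refactor", "ci")
PREFIX_RE = re.compile(r"^(\w+)(\([^)]*\))?!?:")


def categorize(subjects: str) -> Counter:
    """Count commit subjects per category from newline-separated text."""
    counts = Counter()
    # splitlines() yields [] for empty input, so the total is 0, not 1.
    for subject in subjects.splitlines():
        subject = subject.strip()
        if not subject:
            continue
        match = PREFIX_RE.match(subject)
        prefix = match.group(1).lower() if match else ""
        counts[prefix if prefix in CATEGORIES else "other"] += 1
    return counts


if __name__ == "__main__":
    log = "feat: add kernel (#1)\nfix(gemm): revert config (#2)\nbump version"
    result = categorize(log)
    print(dict(result), "total:", sum(result.values()))
```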

Comment on lines +107 to +114
- Issue #2656: DeepSeek-R1-MXFP4 accuracy regression from Triton GEMM config retune (PR #2434). Partial fix applied on release branch.

## Compatibility

- **GPU Architectures:** gfx942 (MI300X), gfx950 (MI355X)
- **Python:** 3.10, 3.12
- **ROCm:** 7.0+
- **Triton:** 3.6.0

set -e
TAG="${GITHUB_REF#refs/tags/}"
# Find previous tag
PREV_TAG=$(git tag --sort=-version:refname | grep -v "$TAG" | head -1)
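A caveat about the `PREV_TAG` line: `grep -v "$TAG"` treats the tag as a regular expression and matches substrings, so with `TAG=v0.1.12` it also drops `v0.1.12-rc1`, and the unescaped dots match any character. In shell, `grep -Fxv "$TAG"` compares whole lines as fixed strings. The same exact-match filtering is sketched below, assuming the tag list arrives already sorted newest-first; `previous_tag` is a hypothetical helper, not part of the workflow, and the tag list is a placeholder whose order is only illustrative.

```python
from typing import Optional, Sequence


def previous_tag(sorted_tags: Sequence[str], current: str) -> Optional[str]:
    """Return the first tag that is not exactly `current`.

    Compares whole strings, so 'v0.1.12' does not exclude 'v0.1.12-rc1'.
    """
    for tag in sorted_tags:
        if tag != current:
            return tag
    return None


if __name__ == "__main__":
    tags = ["v0.1.12", "v0.1.12-rc1", "v0.1.11.post1", "v0.1.11"]
    print(previous_tag(tags, "v0.1.12"))  # v0.1.12-rc1
```

Whether a release-candidate tag should count as the previous tag for changelog purposes is a policy choice; if not, it can be filtered out explicitly rather than as a side effect of substring matching.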
Comment on lines 1 to +8
name: Aiter Release Package

description: This workflow builds the Aiter Python package as .whl files for Python 3.10 and 3.12, and uploads them as artifacts.
# Build Aiter precompiled kernel wheels (.whl) for Python 3.10/3.12,
# push a Docker image with prebaked kernels, upload wheels to S3 and
# GitHub Release.

on:
  push:
    tags: ['v*']
Comment on lines +224 to +233
After swapping, benchmark both versions to confirm identical performance:

```bash
# Benchmark original
cp original_kernel.co $INSTALL_PATH/kernel.co
python benchmark.py # record time

# Benchmark modified
cp kernel_modified.co $INSTALL_PATH/kernel.co
python benchmark.py # compare time
```
Comment on lines +1 to +15
# ISA-Level Kernel Optimization with LLVM Tools

A guide to inspecting, analyzing, modifying, and recompiling AITER GPU kernel ISA using the ROCm LLVM toolchain.

## Overview

AITER ships optimized GPU kernels as compiled code objects (`.co` files). Sometimes you need to go deeper than source-level optimization. This guide shows how to:

1. Disassemble a `.co` kernel to human-readable ISA
2. Analyze instruction mix (MFMA, memory, LDS, DPP)
3. Extract a reassemblable `.s` file
4. Modify ISA instructions and recompile
5. Profile kernel performance with `rocprofv3`

All tools used are open-source ROCm components. No proprietary tools required.
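The instruction-mix analysis in step 2 of the excerpt above can be approximated by counting mnemonics in the disassembly text. The sketch below assumes `llvm-objdump -d` style lines (indented mnemonic followed by operands) and a hand-picked set of mnemonic prefixes; it is not the analysis code from the guide, the sample disassembly is invented for illustration, and `instruction_mix` is a hypothetical name.

```python
import re
from collections import Counter

# Mnemonic prefixes assumed for each category; adjust for the target ISA.
PREFIX_CATEGORIES = (
    ("v_mfma_", "mfma"),
    ("buffer_load_", "buffer_load"),
    ("ds_", "lds"),
)
# Instruction lines are assumed to be indented; labels start in column 0.
INSTR_RE = re.compile(r"^\s+([a-z][a-z0-9_]*)\b(.*)$")

# Invented sample, not taken from a real kernel.
SAMPLE = """\
kernel_entry:
\ts_load_dwordx4 s[0:3], s[4:5], 0x0
\tbuffer_load_dwordx4 v[0:3], v4, s[0:3], 0 offen
\tds_read_b128 v[8:11], v5
\tv_mfma_f32_32x32x8_f16 a[0:15], v[0:1], v[2:3], a[0:15]
\tv_mov_b32_dpp v6, v7 quad_perm:[1,0,3,2] row_mask:0xf bank_mask:0xf
"""


def instruction_mix(disassembly: str) -> Counter:
    """Count instructions per category in disassembly text."""
    mix = Counter()
    for line in disassembly.splitlines():
        match = INSTR_RE.match(line)
        if not match:
            continue  # labels, blank lines, section headers
        mnemonic, rest = match.groups()
        mix["total"] += 1
        for prefix, category in PREFIX_CATEGORIES:
            if mnemonic.startswith(prefix):
                mix[category] += 1
                break
        # DPP shows up as a mnemonic suffix or as operand modifiers.
        if mnemonic.endswith("_dpp") or "row_" in rest or "quad_perm" in rest:
            mix["dpp"] += 1
    return mix


if __name__ == "__main__":
    for category, count in sorted(instruction_mix(SAMPLE).items()):
        print(f"{category:12s} {count}")
```

DPP is counted independently of the prefix categories because it is a modifier on an instruction rather than an instruction class of its own.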
sunway513 closed this Apr 13, 2026
sunway513 deleted the docs/isa-kernel-optimization branch April 13, 2026 02:02
2 participants