docs: add ISA-level kernel optimization guide #2707
Closed
sunway513 wants to merge 5 commits into
…ocker fix
- Create release notes with categorized changelog (83 features, 53 perf, 88 fixes, 55 refactors, 61 CI across 334 commits since v0.1.11.post1)
- Add changelog generation script (scripts/generate_changelog.sh)
- Add release validation checklist (scripts/release_checklist.md)
- Update 5 CI workflows to trigger on release/** branches
- Revert problematic GEMM config for Issue ROCm#2656 (DSR1-MXFP4 accuracy regression from PR ROCm#2434)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Release workflow improvements (Gaps 1+2+3+7C from #58):
- Auto-trigger on tag push (v*) in addition to manual dispatch
- Add smoke test step: pip install + import validation after build
- Add create_release job: auto-generate changelog, create GitHub Release, attach .whl assets
- Add skip_github_release option for test builds
- Provide defaults for all inputs so tag-push path works without manual parameters
Release process documentation (Gaps 6+8 from #58):
- RELEASE_PROCESS.md: full release lifecycle (branch → RC → release)
- .postX policy: hotfix only, max 3, must have release notes
- RC tag flow: vX.Y.Z-rc1 for downstream validation before final tag
- Downstream test coverage matrix with release gate designations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mage, S3
Key changes to aiter-release.yaml, modeled after aiter-test.yaml nightly:
1. Prebuild kernel validation: verify >= 10 .so files in installed wheel,
fail the build if prebuild didn't work (prevents shipping 37MB wheels
without precompiled kernels vs expected 200MB+)
2. Docker image push on tag: commit the build container as
rocm/aiter-ci:{tag}-py{ver} and push to registry, matching the
pre-build-{SHA} pattern from CI nightly
3. S3 upload on tag: upload wheels to both
s3://framework-whls-nightlies/whl-releases/{tag}/ (permanent) and
whl-staging/ (downstream CI compat)
4. Smoke test: pip install + import aiter + import triton gemm
5. Wheel size reporting: log wheel size so prebuild issues are visible
These changes close the gap where release wheels could ship without
precompiled kernels (37MB vs 243MB) and without a reproducible Docker
image, unlike the nightly CI path which has both.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
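The prebuild validation described in this commit could look roughly like the sketch below. It assumes the wheel has already been installed and that the caller passes the install directory. The function names `count_so_files` and `check_prebuild` are hypothetical and are not taken from the workflow; only the threshold of 10 comes from the commit message.

```bash
# Sketch only: count compiled extension modules under an install directory
# and fail when fewer than a minimum are present.

# count_so_files DIR -> prints the number of *.so files found under DIR.
count_so_files() {
    find "$1" -type f -name '*.so' | wc -l | tr -d '[:space:]'
}

# check_prebuild DIR [MIN] -> returns 0 when DIR holds at least MIN .so files.
check_prebuild() {
    local dir="$1"
    local min="${2:-10}"   # default of 10 mirrors the commit message
    local found
    found=$(count_so_files "$dir")
    if [ "$found" -lt "$min" ]; then
        echo "prebuild check failed: found ${found} .so files, expected >= ${min}" >&2
        return 1
    fi
    echo "prebuild check passed: ${found} .so files"
}
```

In a workflow step this might be called as `check_prebuild "$SITE_PACKAGES/aiter"`, where `SITE_PACKAGES` is a placeholder for wherever the wheel was installed.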
GitHub Actions does not support top-level 'description:' in workflow YAML. This caused the entire workflow to fail parsing, preventing workflow_dispatch triggers from being recognized. Moved the description to a comment.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds documentation and automation to support both low-level kernel/ISA investigations and a more repeatable release process (release notes, CI gates, and tag-driven packaging).
Changes:
- Add an ISA-level kernel optimization workflow guide using ROCm LLVM tools and rocprofv3.
- Introduce release process docs + checklist and a changelog generation script.
- Update CI/release workflows to run on `release/**` branches and trigger release packaging on `v*` tag pushes; tweak one gfx950 AFP4WFP4 GEMM config.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/isa_kernel_optimization.md | New ISA optimization/how-to guide (LLVM disassembly → round-trip rebuild → profiling). |
| RELEASE_PROCESS.md | Documents end-to-end release procedure, tagging, and hotfix policy. |
| scripts/release_checklist.md | Adds a concrete pre/post-release validation checklist and downstream matrix. |
| scripts/generate_changelog.sh | Generates categorized release notes from git history. |
| RELEASE_NOTES_v0.1.12.md | Adds a hand-written release notes file for v0.1.12. |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-AFP4WFP4-N=7168-K=2304.json | Updates the tuning params for the M_LEQ_64 bucket. |
| .github/workflows/aiter-release.yaml | Tag-triggered release workflow: build wheels, validate prebuild, upload to S3, optionally create GitHub Release, and push Docker tags. |
| .github/workflows/aiter-test.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/triton-test.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/atom-test.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/sglang_downstream.yaml | Expands workflow triggers to include release/** branches. |
| .github/workflows/vllm_benchmark.yaml | Expands workflow triggers to include release/** branches. |
Comment on lines +15 to +18

    # Get all commit subjects with PR numbers
    COMMITS=$(git log "${FROM_REF}..${TO_REF}" --format="%s" --reverse)
    TOTAL=$(echo "$COMMITS" | wc -l)
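One detail worth noting about the excerpt above: `echo "$COMMITS" | wc -l` reports 1 even when the commit range is empty, because `echo` always emits a trailing newline. The sketch below shows one way to count non-empty lines and to bucket a subject by its prefix. The helper names and the prefix-to-category mapping are assumptions for illustration (a conventional-commit style), not the rules used by scripts/generate_changelog.sh.

```bash
# count_lines TEXT -> prints the number of non-empty lines in TEXT
# (prints 0 for empty input, unlike `echo "$TEXT" | wc -l`).
count_lines() {
    printf '%s' "$1" | grep -c . || true
}

# category_of SUBJECT -> prints a changelog bucket based on the subject prefix.
# The prefixes here are assumed, not the project's confirmed convention.
category_of() {
    case "$1" in
        feat*)     echo "features" ;;
        perf*)     echo "perf" ;;
        fix*)      echo "fixes" ;;
        refactor*) echo "refactors" ;;
        ci*)       echo "ci" ;;
        *)         echo "other" ;;
    esac
}

# Possible use inside the script (not executed here):
#   COMMITS=$(git log "${FROM_REF}..${TO_REF}" --format="%s" --reverse)
#   TOTAL=$(count_lines "$COMMITS")
```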
Comment on lines +107 to +114

    - Issue #2656: DeepSeek-R1-MXFP4 accuracy regression from Triton GEMM config retune (PR #2434). Partial fix applied on release branch.

    ## Compatibility

    - **GPU Architectures:** gfx942 (MI300X), gfx950 (MI355X)
    - **Python:** 3.10, 3.12
    - **ROCm:** 7.0+
    - **Triton:** 3.6.0
    set -e
    TAG="${GITHUB_REF#refs/tags/}"
    # Find previous tag
    PREV_TAG=$(git tag --sort=-version:refname | grep -v "$TAG" | head -1)
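In the excerpt above, `grep -v "$TAG"` treats the tag as a regular expression and matches substrings, so a tag such as `v0.1.12` also filters out `v0.1.12-rc1` or `v0.1.12.post1`. A minimal sketch of an exact-match variant follows. The function name `previous_tag` is hypothetical, and whether RC or post tags should count as the "previous" tag is a policy question this sketch does not settle.

```bash
# previous_tag CURRENT -> reads tags from stdin (newest first) and prints the
# first one that is not exactly CURRENT.
#   -F  treat CURRENT as a fixed string, so "." is not a wildcard
#   -x  match whole lines only, so "v0.1.12" does not hide "v0.1.12-rc1"
#   -v  invert the match
previous_tag() {
    grep -Fxv -- "$1" | head -n 1
}

# Possible use in the workflow (not executed here):
#   PREV_TAG=$(git tag --sort=-version:refname | previous_tag "$TAG")
```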
Comment on lines 1 to +8

    name: Aiter Release Package

    description: This workflow builds the Aiter Python package as .whl files for Python 3.10 and 3.12, and uploads them as artifacts.
    # Build Aiter precompiled kernel wheels (.whl) for Python 3.10/3.12,
    # push a Docker image with prebaked kernels, upload wheels to S3 and
    # GitHub Release.

    on:
      push:
        tags: ['v*']
Comment on lines +224 to +233

    After swapping, benchmark both versions to confirm identical performance:

    ```bash
    # Benchmark original
    cp original_kernel.co $INSTALL_PATH/kernel.co
    python benchmark.py  # record time

    # Benchmark modified
    cp kernel_modified.co $INSTALL_PATH/kernel.co
    python benchmark.py  # compare time
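To turn "compare time" into a pass/fail check, the two recorded timings can be compared against a relative tolerance. The sketch below assumes the timings have already been extracted as plain numbers. The helper `within_tolerance` is hypothetical, and the 3% figure in the usage comment only echoes the ±3% quoted in the PR summary.

```bash
# within_tolerance BASE NEW PCT -> returns 0 when NEW differs from BASE by at
# most PCT percent (relative to BASE), 1 when it does not, 2 on a bad BASE.
within_tolerance() {
    awk -v base="$1" -v new="$2" -v pct="$3" 'BEGIN {
        if (base <= 0) exit 2              # avoid division by zero
        diff = (new - base) / base * 100   # signed relative change in percent
        if (diff < 0) diff = -diff         # absolute value
        exit (diff <= pct ? 0 : 1)
    }'
}

# Possible use after both benchmark runs (values are placeholders):
#   within_tolerance "$ORIGINAL_MS" "$MODIFIED_MS" 3 || echo "timing drifted"
```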
Comment on lines +1 to +15

    # ISA-Level Kernel Optimization with LLVM Tools

    A guide to inspecting, analyzing, modifying, and recompiling AITER GPU kernel ISA using the ROCm LLVM toolchain.

    ## Overview

    AITER ships optimized GPU kernels as compiled code objects (`.co` files). Sometimes you need to go deeper than source-level optimization. This guide shows how to:

    1. Disassemble a `.co` kernel to human-readable ISA
    2. Analyze instruction mix (MFMA, memory, LDS, DPP)
    3. Extract a reassemblable `.s` file
    4. Modify ISA instructions and recompile
    5. Profile kernel performance with `rocprofv3`

    All tools used are open-source ROCm components. No proprietary tools required.
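Step 2 of the overview (analyzing the instruction mix) can be approximated by counting mnemonics in a saved disassembly listing. The sketch below is illustrative only: it assumes the listing was produced by something like `llvm-objdump -d kernel.co > kernel.dis`, and the regular expressions are rough guesses at the four instruction classes named above, not the guide's own patterns.

```bash
# count_matching PATTERN -> counts stdin lines matching the extended regex.
count_matching() {
    grep -Ec -- "$1" || true
}

# instruction_mix FILE -> prints per-class instruction counts for a
# disassembly listing. Patterns are approximate and may need tuning.
instruction_mix() {
    local file="$1"
    printf 'mfma=%s\n'   "$(count_matching 'v_mfma_' < "$file")"
    printf 'global=%s\n' "$(count_matching '(global|buffer)_(load|store)' < "$file")"
    printf 'lds=%s\n'    "$(count_matching '(^|[[:space:]])ds_' < "$file")"
    printf 'dpp=%s\n'    "$(count_matching '_dpp' < "$file")"
}
```

Counting lines rather than parsing operands keeps the sketch small; it cannot tell how often each instruction executes, which is what the profiling step is for.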
Summary
Add a comprehensive how-to guide for ISA-level kernel optimization using the ROCm LLVM toolchain.
The guide covers the full workflow:
- Disassemble `.co` kernel objects with `llvm-objdump`
- Extract reassemblable `.s` files with correct branch label resolution (word-offset addressing)
- Recompile with `clang++` and verify binary-identical `.text` sections
- Swap `.text` back into the original `.co` via `llvm-objcopy --update-section` (preserves AMDHSA metadata)
- Profile with `rocprofv3 --kernel-trace` (SQLite query examples, output format options)
- Build `rocprof-trace-decoder` from source for ATT instruction-level tracing

All tools used are open-source ROCm LLVM components; no proprietary tools are required.
Verified on a PA decode kernel: the round-trip recompile produces a binary-identical `.text` section with no measurable performance regression (within ±3% over 3 runs × 500 iterations).

Test plan
- `llvm-objdump`, `clang++`, `llvm-objcopy` commands work on gfx90a and gfx942
- `rocprofv3 --kernel-trace` SQLite query examples work
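The "binary-identical `.text`" check from the summary could be scripted roughly as follows. This is a sketch under assumptions: the tool path default is a typical ROCm install location rather than something stated in the PR, the function names are hypothetical, and `--dump-section` is used as the counterpart to the `--update-section` flag the summary mentions.

```bash
# dump_text CO OUT -> writes the raw .text bytes of code object CO to OUT.
# LLVM_OBJCOPY may be set by the caller; the default path is an assumption.
dump_text() {
    local objcopy="${LLVM_OBJCOPY:-/opt/rocm/llvm/bin/llvm-objcopy}"
    local scratch status=0
    scratch=$(mktemp) || return 2
    # Writing the copy to a scratch file keeps the input code object untouched.
    "$objcopy" --dump-section .text="$2" "$1" "$scratch" || status=$?
    rm -f "$scratch"
    return "$status"
}

# same_bytes A B -> returns 0 when the two files are byte-for-byte identical.
same_bytes() {
    cmp -s "$1" "$2"
}

# text_identical CO_A CO_B -> returns 0 when both code objects carry the
# same .text bytes.
text_identical() {
    local a b rc=0
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    { dump_text "$1" "$a" && dump_text "$2" "$b" && same_bytes "$a" "$b"; } || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Possible use after a round-trip rebuild (file names are placeholders):
#   text_identical original_kernel.co kernel_modified.co && echo "identical"
```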