assert when found duplicated tuned shape by yzhou103 · Pull Request #2376 · ROCm/aiter

yzhou103 · 2026-03-20T06:52:57Z

Motivation

Raise an error when the configs contain duplicate tuned shapes.

Technical Details

1. Duplicate shape detection and auto-dedup in config merge

When update_config_files() merges tuned config CSVs from configs/ and model_configs/, it detects and handles
duplicate shape entries as follows:

1). Config loading: All matching CSV files are loaded into source_pairs (a list of (path, DataFrame) tuples) and
concatenated into a single merge_df with ignore_index=True.
2). Dedup key derivation: The dedup keys are derived from the corresponding untuned config file's columns (plus
cu_num if absent, and _tag if present). These keys represent the shape dimensions that uniquely identify a tuning
entry.
3). Duplicate detection: merge_df.duplicated(subset=dedup_keys, keep=False) marks all rows that share the same
shape.
4). Handling duplicates:
- If the merged config has no us column, a RuntimeError is raised asking the user to remove
duplicates manually.
- If us is available, auto-dedup is performed: the merged DataFrame is sorted by us (stable sort) and
drop_duplicates(keep="first") retains only the best-performing (lowest latency) row for each shape. The resulting
index set (best_row_index) is used to filter each source file back to only its winning rows. Updated source files
are saved in-place, and a RuntimeError is raised listing the changes and asking the user to re-run.
5). Source file write-back: Each source file's row range in merge_df is tracked via a running offset. Rows not in
best_row_index are dropped, and if the filtered DataFrame is shorter than the original, the source CSV is
overwritten.
6). Re-run: The duplicated shapes were removed in the previous step, so re-run the test.
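The steps above can be sketched roughly as follows. This is a minimal illustration built only from the description in this PR, not the actual code in aiter/jit/core.py: the function names derive_dedup_keys and auto_dedup, the sample columns M/N/K, and the file names are hypothetical, and the sketch returns the filtered frames instead of overwriting CSVs and raising RuntimeError afterwards.

```python
import pandas as pd


def derive_dedup_keys(untuned_columns, merged_columns):
    """Untuned columns, plus cu_num if absent, plus _tag if the merged data has it."""
    keys = list(untuned_columns)
    if "cu_num" not in keys:
        keys.append("cu_num")
    if "_tag" in merged_columns and "_tag" not in keys:
        keys.append("_tag")
    return keys


def auto_dedup(source_pairs, dedup_keys):
    """Keep the lowest-`us` row per shape; return (filtered_pairs, changed_paths)."""
    merge_df = pd.concat([df for _, df in source_pairs], ignore_index=True)
    dup_mask = merge_df.duplicated(subset=dedup_keys, keep=False)
    if not dup_mask.any():
        return source_pairs, []
    if "us" not in merge_df.columns:
        raise RuntimeError("duplicate shapes found and no 'us' column; remove them manually")

    # Stable sort by latency, then keep the first (fastest) row of each shape.
    best = merge_df.sort_values("us", kind="stable").drop_duplicates(
        subset=dedup_keys, keep="first"
    )
    best_row_index = set(best.index)

    filtered_pairs, changed_paths, offset = [], [], 0
    for path, df in source_pairs:
        # Rows offset .. offset + len(df) - 1 of merge_df came from this file.
        keep_pos = [i for i in range(len(df)) if offset + i in best_row_index]
        kept_df = df.iloc[keep_pos]
        if len(kept_df) < len(df):
            changed_paths.append(path)  # the real code would overwrite the CSV here
        filtered_pairs.append((path, kept_df))
        offset += len(df)
    return filtered_pairs, changed_paths


# Hypothetical sample data: shape (1, 2, 3) appears in both files.
a = pd.DataFrame({"M": [1, 4], "N": [2, 5], "K": [3, 6], "cu_num": [80, 80], "us": [10.0, 7.0]})
b = pd.DataFrame({"M": [1, 7], "N": [2, 8], "K": [3, 9], "cu_num": [80, 80], "us": [8.0, 5.0]})
keys = derive_dedup_keys(["M", "N", "K"], a.columns)
filtered, changed = auto_dedup([("a.csv", a), ("b.csv", b)], keys)
print(changed)  # ['a.csv']: its slower copy of shape (1, 2, 3) was dropped
```

In this example the duplicate in a.csv (us = 10.0) loses to the copy in b.csv (us = 8.0), so only a.csv would need to be rewritten.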
2. Add untuned shapes for re-tuning
When retuning with "-i untuned.csv -o tuned.csv -all", all shapes in untuned.csv are retuned and their results in tuned.csv are replaced.
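One way to picture the "replace results in tuned.csv" behaviour is the sketch below. It is an assumption about the semantics, not the tuner's actual code: replace_tuned_results, the shape keys M/N, and the sample values are hypothetical. Freshly retuned rows override existing rows with the same shape, and shapes that were not retuned keep their old entries.

```python
import pandas as pd


def replace_tuned_results(tuned_df, retuned_df, shape_keys):
    """Append retuned rows and keep the last row per shape, so new results win."""
    combined = pd.concat([tuned_df, retuned_df], ignore_index=True)
    return combined.drop_duplicates(subset=shape_keys, keep="last").reset_index(drop=True)


# Hypothetical data: shape (1, 2) is retuned, shape (3, 4) is left alone.
tuned = pd.DataFrame({"M": [1, 3], "N": [2, 4], "us": [10.0, 20.0]})
retuned = pd.DataFrame({"M": [1], "N": [2], "us": [8.0]})
updated = replace_tuned_results(tuned, retuned, ["M", "N"])
print(updated.sort_values("M")["us"].tolist())  # [8.0, 20.0]
```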

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-03-20T06:53:24Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2376 --add-label <label>`

Copilot

Pull request overview

This PR tightens config-merge correctness by making duplicate tuned shapes a hard failure during config merges, and adds model-specific untuned shape CSVs to support retuning workflows (with some tuned CSV cleanup to remove duplicate shapes).

Changes:

- Update update_config_files() to detect duplicate shapes during merges and fail fast instead of auto-dropping duplicates.
- Add several new model-specific *_untuned_*.csv shape lists under aiter/configs/model_configs/ to enable retuning coverage for specific models.
- Remove some duplicate tuned GEMM entries from model-specific tuned config CSVs to avoid collisions during merge.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| `aiter/jit/core.py` | Changes merge-time duplicate handling to error out on duplicate shapes (rather than sorting/deduping). |
| `aiter/configs/model_configs/kimik2_fp4_untuned_fmoe.csv` | Adds untuned FMoE shapes for kimik2 fp4 retuning. |
| `aiter/configs/model_configs/kimik2_bf16_untuned_gemm.csv` | Adds untuned GEMM shapes for kimik2 bf16 retuning. |
| `aiter/configs/model_configs/kimik2_bf16_tuned_gemm.csv` | Removes a tuned GEMM entry to eliminate duplicate-shape overlap during merges. |
| `aiter/configs/model_configs/gptoss_bf16_untuned_gemm.csv` | Adds untuned GEMM shapes for gptoss bf16 retuning. |
| `aiter/configs/model_configs/dsv3_fp4_untuned_fmoe.csv` | Adds untuned FMoE shapes for dsv3 fp4 retuning. |
| `aiter/configs/model_configs/dsv3_bf16_untuned_gemm.csv` | Adds untuned GEMM shapes for dsv3 bf16 retuning. |
| `aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv` | Removes tuned GEMM entries to eliminate duplicate-shape overlap during merges. |

Comments suppressed due to low confidence (1)

aiter/jit/core.py:237

The else branch logs: "Using all columns for deduplication.", but after this change there is no deduplication or duplicate detection performed when the untuned file is missing (and no fallback keys are computed). This makes the log message misleading and can allow duplicate entries to be written without any error. Either implement the intended fallback (e.g., detect/handle duplicates using merge_df.columns or similar) or update the message/behavior so it matches what the code actually does.

        else:
            logger.warning(
                f"Untuned config file not found: {untuned_path}. Using all columns for deduplication."
            )
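The fallback this comment asks for could look something like the sketch below. It is only an illustration of the suggestion ("detect/handle duplicates using merge_df.columns or similar"), not code from the PR; find_duplicates_with_fallback and the sample columns are hypothetical. Note that falling back to all columns only catches fully identical rows, so two entries for the same shape with different us values would not be flagged.

```python
import pandas as pd


def find_duplicates_with_fallback(merge_df, untuned_columns=None):
    """Use untuned columns as keys when known, else fall back to every column."""
    if untuned_columns is not None:
        keys = list(untuned_columns)
    else:
        keys = list(merge_df.columns)
    dup_mask = merge_df.duplicated(subset=keys, keep=False)
    return keys, merge_df[dup_mask]


# Hypothetical data: three rows share shape (1, 2); two of them are identical.
merge_df = pd.DataFrame({"M": [1, 1, 1], "N": [2, 2, 2], "us": [3.0, 3.0, 4.0]})

_, dups_all_cols = find_duplicates_with_fallback(merge_df)
_, dups_by_shape = find_duplicates_with_fallback(merge_df, ["M", "N"])
print(len(dups_all_cols), len(dups_by_shape))  # 2 3
```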


yzhou103 · 2026-03-25T10:19:26Z

* add fused_qknorm hip kernel (#2442)
* base impl
* improve big M
* optimize v2
* fallback split norm for bigger M
* fix rebase failed
* code format
* fix include error
* support outplace works for torch compile
* add output args
---------
Co-authored-by: Guanbao Yu <gyu@amd.com>

* feat: add fast gelu (#2220)
* add fast gelu
* fix ut
* refactor
* add more log
* change to empty to avoid perf downgrade
* make black lint happy
* make ruff happy
* fix import error

* Fix build CK pipeline (#2399)
* Fix renamed CK pipeline
* Fix CK version
* Rename eight_warps to eight_waves for consistency
---------
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* [OPUS] gfx1250 support for opus wmma scale and moe_sorting kernel (#2449)
* [OPUS] Add gfx1250 support for opus moe_sorting kernel
Add gfx1250 (MI450) to the GPU architecture maps and fix the opus moe_sorting kernel to use the correct wait counter instruction on gfx12 ISA (s_wait_dscnt instead of s_waitcnt_lgkmcnt).
* [OPUS] Add WMMA scale instruction support for gfx1250
Add scaled WMMA dispatch (BX32 int / BX16 long) to struct wmma for:
- wmma_scale[16]_f32_16x16x128_f8f6f4 (fp8 fmt=0, fp4 fmt=4)
- wmma_scale[16]_f32_32x16x128_f4 (dedicated fp4)
Key changes in opus.hpp:
- Add fmt_a/fmt_b format codes and per-lane E8M0 scale operator() overloads with compile-time scale_sel (OPSEL) for lane group selection
- Fix vtype for packed fp4: use reg_bytes_a/b to compute correct hardware register size (elem * bits / 8 bytes, not elem * sizeof)
- Add zero-pad helper for fp4-via-f8f6f4 (i32x8 -> i32x16)
- Add scaled overloads to wmma_adaptor_swap_ab
Device tests (op_tests/opus/device/test_wmma_scale.cu):
- Raw warp-level: 6 variants (fp8/fp4 x BX32/BX16 + 32x16 fp4)
- Tiled MMA: 1x1, 2x2, 4x1 wave configurations via make_tiled_mma
- Per-lane random scale: E8M0 exponents in [122..133], bitwise exact
- All tests PASS on gfx1250, SKIP on gfx942/gfx950
* [OPUS] Fix black/ruff lint: rename ambiguous var `l` to `lane`

* reduce wasted get_module overhead for module with custom module name (#2455)

* fix moe gemm tuned config (#2463)
* fix
* add e=256 k=8 tuned config
* revert t=1024,2048 back to ck
* update
* update

* Skip `test_metadata_redirect.py` on archs other than `gfx942` (#2456)
This test was designed to run only on `gfx942`. We want to run all Triton tests on `gfx950` as well - this is the reason behind the proposed changes.

* [Triton] Fix bench_mha (#2317)
* fix bench_mha
* fix bench_mha parsing logic of bench_attn_models
* fix bench_mla_decode parsing logic of bench_attn_models
---------
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>

* test_mla_persistent.py split kv reference fix max_seq_q != 1 error (#2363)

* CI: fix dubious ownership for sglang checkout (#2477)
* CI: mark sglang checkout paths as safe directories
Add safe.directory entries before dependency installation for both host and container checkout paths to avoid git dubious ownership failures on /sglang-checkout.
* CI: align sglang safe.directory fix with upstream
Move safe.directory configuration to run right after container startup and only mark /sglang-checkout, matching upstream sglang handling for dubious ownership.

* CI: use pip editable install and safe.directory in runtime CI (#2474)

* Fix CKTile blockscale GEMM to read strides from tensor metadata (#2466)
The CKTile blockscale GEMM wrapper hardcoded leading-dimension strides as stride_A = K, stride_B = K, stride_C = N, assuming fully contiguous row-major layout. This produced silently wrong results when input tensors had non-standard strides.
In vLLM on ROCm, _maybe_pad_fp8_weight pads FP8 weight tensors for alignment and then creates a narrowed view, producing tensors whose logical shape is [N, K] but whose physical stride is [K+pad, 1]. The hardcoded stride_B = K caused the kernel to read from wrong memory offsets, leading to garbage output.
Fix: read leading-dimension strides from the PyTorch tensor metadata (XQ.stride(0), WQ.stride(0), etc.) instead of assuming dense layout. Add TORCH_CHECK assertions to verify inner-dimension contiguity (stride == 1), which is required by the CKTile kernel.
The old CK backend (gemm_a8w8_blockscale_common.cuh) already reads strides from tensor metadata, which is why it was unaffected.
Made-with: Cursor

* Add LSE output support for MLA decode qseqlen=1 persistent kernel (gf… (#2440)
* Add LSE output support for MLA decode qseqlen=1 persistent kernel (gfx950)
* Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16
* Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16
---------
Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>

* tuned qwen3.5 gemm (#2485)
Signed-off-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>

* [Triton] Flash Attention Triton Windows build support (#2433)
* Initial FA-2 Triton Windows build support
* Continue Work lint minimize diff windows smoke test address copilot fixes remove is_windows error improve windows message
---------
Co-authored-by: 0xDELUXA <djernovevo@gmail.com>

* fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm(#2482)
* fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_a16w16
The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel recompilation when M/N change, but removes tt.divisibility=16 from M/N, causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1).
Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on offs_am/offs_bn and use unmasked tl.store, restoring contiguity for vectorized memory operations without sacrificing the recompilation benefit.
Made-with: Cursor
* fix(gemm): add EVEN_MN heuristic to batched_gemm_a8w8 kernel
Same fix as gemm_a16w16: the do_not_specialize=["M","N"] removes tt.divisibility=16, breaking vectorized store/load. Add EVEN_MN constexpr heuristic to conditionally skip modulo and mask when M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0, restoring contiguity.
Made-with: Cursor
---------
Co-authored-by: jianlian <jianlian@amd.com>

* rm gemm_common bind (#2425)
* rm gemm_commona= and quant type bind
* retune failed shape in a8w8_bpreshuffle_tuned_gemm.csv
* recover enum

* assert when found duplicated tuned shape (#2376)
* assert when found duplicated tuned shape
* rm duplicated tuned shape and update tuned file in model_configs
* fix lint
---------
Signed-off-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: ChenYou <youchen@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: 0xDELUXA <djernovevo@gmail.com>
Co-authored-by: jianhao <Jianhao.Liang@amd.com>
Co-authored-by: jianlian <jianlian@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>

This reverts commit d3ed648.

This reverts commit 46afd48.

* Reapply "assert when found duplicated tuned shape (#2376)" (#2502)
This reverts commit 46afd48.
* update blockscale_bpreshuffle_gemm
* Update aiter/configs/model_configs/dsv3_bf16_untuned_gemm.csv
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add fused_qk_norm_group_quant kernel
* update untuned bf16 gemm csv
* remove unrelated non-core changes from duplicate-shape PR
Keep PR #2503 focused on duplicate-shape handling in aiter/jit/core.py and model-config CSV updates by reverting unintentionally included fused_qk/allreduce/test changes.
Made-with: Cursor
* revert remaining non-csv files from PR #2503
Keep the PR scope limited to aiter/jit/core.py and model config CSV changes by removing the remaining communication and multigpu test file deltas.
Made-with: Cursor
* revert hook-only non-csv churn from PR #2503
Drop remaining non-core, non-csv diffs introduced by hook normalization so the PR scope stays limited to aiter/jit/core.py and model config CSV changes.
Made-with: Cursor
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

assert when found duplicated tuned shape

c91cb64

yzhou103 requested review from a team and Copilot March 20, 2026 06:52

Copilot started reviewing on behalf of yzhou103 March 20, 2026 06:54

Copilot AI reviewed Mar 20, 2026

Comment thread aiter/jit/core.py Outdated

yzhou103 added the ci:all label Mar 24, 2026

yzhou103 added 3 commits March 24, 2026 14:48

Merge branch 'main' into assert_duplicate_tuned_shape

bc62b6a

rm duplicated tuned shape and update tuned file in model_configs

5f8cc1b

fix lint

0eb74fb

Merge branch 'main' into assert_duplicate_tuned_shape

08295a4

valarLip approved these changes Mar 27, 2026

valarLip merged commit d3ed648 into main Mar 27, 2026
40 of 43 checks passed

valarLip deleted the assert_duplicate_tuned_shape branch March 27, 2026 02:09

gyohuangxin added a commit that referenced this pull request Mar 27, 2026

Revert "assert when found duplicated tuned shape (#2376)"

3e7930f

This reverts commit d3ed648.

gyohuangxin mentioned this pull request Mar 27, 2026

Revert "assert when found duplicated tuned shape" #2502

Merged

gyohuangxin added a commit that referenced this pull request Mar 27, 2026

Revert "assert when found duplicated tuned shape (#2376)" (#2502)

46afd48

This reverts commit d3ed648.

yzhou103 added a commit that referenced this pull request Mar 27, 2026

Reapply "assert when found duplicated tuned shape (#2376)" (#2502)

a9e8fff

This reverts commit 46afd48.

valarLip merged 5 commits into main from assert_duplicate_tuned_shape

3 participants
