
[Update] update dev/perf branch to main branch commit 9c2df5 and cherry-pick mla_fix, mmd/fix/fwd_v3 #1467

Merged
wuhuikx merged 115 commits into dev/perf from yuhua/dev/perf on Nov 24, 2025
Conversation

@zhuyuhua-v (Contributor) commented Nov 22, 2025

Motivation

  1. [Update] update dev/perf branch to main branch commit 9c2df5a
     Author: TennyWang1223
     Date: Fri Nov 21 09:46:03 2025 +0800
     [fix]: add ar switch (#1376)
  2. Cherry-pick mla_fix ("remove maxsize for mla lru cache", #1445) and mmd/fix/fwd_v3 ("fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error", #1442) to fix the DeepSeek-R1 accuracy issue.

Test Plan

gsm8k

Test Result

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9477 | ± 0.0061 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.9477 | ± 0.0061 |

lalala-sh and others added 29 commits November 23, 2025 22:24
* integrate m grouped gemm

* update ck

* add limit for 950

* rename deepgemm
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* Enable large batch size and optimization of non-Ragged batching

* Add RAGGED_BATCH to test_la.py and bench_la.py
* CI: Optimize autotuning pipeline and initialize the docs
* initial commit for topk per row kernel

* topk per row kernel initial commit

* Fix the typo issue

* Add the topk per row kernel

* optimizations for topk_per_row kernel

* fix overflow

* add unit test for topk_per_row_decode

* update test for decode

* apply vector dispatch from carlus

---------

Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>
* fix aot

* remove other kernels path

* fix aot

* format code

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
* Fix fp8 issue in torch compile

* use less code
* CI: Optimize autotuning pipeline docs

* Update docs/autotuning_pipeline.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix fwd perf calc error

* black aiter/ops/triton/_triton_kernels/gemm_afp4wfp4.py
* add asm kernel performance data for the attention forward and attention backward passes

* modify perf data

* fix perf data

* add a16 perf data
* Add topk softmax

* Add test for topk sigmoid

* register the op properly

* apply black

* don't use constexpr with std::string

* bump ck to include topk sigmoid commit

* hipify

* add argparse to the topk sigmoid test, also add pytest

* use own module instead of asm moe

* black formatting

* add missing file

* revert changes to module_moe_asm
* [fea]: add fused allreduce rmsnorm kernel

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ck branch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* update ar interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
* add indexer_k_quant_and_cache & cp_gather_indexer_k_quant_cache

* indexer_k_quant_and_cache opt kernel and add test

* update

* update2

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Currently it will look like this:
```log
File "TransformerEngine/3rdparty/aiter/aiter/jit/utils/chip_info.py", line 77, in get_gfx_custom_op_core
  raise KeyError(
KeyError: 'Unknown GPU architecture: . Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
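
The improvement here is presumably to make the architecture lookup fail with a readable message even when the detected arch string is empty (the original message printed table indices instead of names). A minimal sketch of that kind of guarded lookup; the table contents and return values are hypothetical, not aiter's actual code:

```python
# Hypothetical sketch of the lookup in chip_info.py's get_gfx_custom_op_core;
# the real table and return values may differ.
SUPPORTED = {"gfx942": 0, "gfx950": 1}

def get_gfx_custom_op_core(arch: str) -> int:
    try:
        return SUPPORTED[arch]
    except KeyError:
        # List the supported architecture *names*, not dict values, so the
        # message stays readable even when `arch` is an empty string.
        raise KeyError(
            f"Unknown GPU architecture: {arch!r}. "
            f"Supported architectures: {sorted(SUPPORTED)}"
        ) from None
```
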
* using standalone pybind

* fix

* update
* update codegen.py

* update kernels & kernel launch

* fix fa bwd dq_acc shape

* remove mask in python api
Creates a unified pre-checks.yaml workflow that runs Black, Ruff, and dependency checks, uploading success/failure signal artifacts
The other heavy jobs download and verify the signal artifacts: if verification succeeds, the heavy job continues running; if it fails, the heavy job exits immediately.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
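
The gating step described above can be sketched as follows; the artifact directory and signal file names are assumptions, not the workflow's actual names:

```python
from pathlib import Path

def prechecks_passed(artifact_dir: str) -> bool:
    """Gate for heavy CI jobs.

    The pre-checks workflow is assumed to upload a file named 'success'
    as its signal artifact; both names here are hypothetical.
    """
    return (Path(artifact_dir) / "success").is_file()

# A heavy job would bail out immediately when the signal is absent:
# if not prechecks_passed("pre-checks-signal"):
#     raise SystemExit("pre-checks failed or missing; skipping heavy job")
```
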
* [CK_TILE] fmha: Add backward pass support for padded inputs

Introduces support for padded sequence lengths in the backward pass of the variable-length flash attention (fmha_v3_varlen_bwd).
- Updated Python and C++ function signatures to accept optional `cu_seqlens_q_padded` and `cu_seqlens_k_padded` arguments.
- Modified the underlying CUDA kernels and code generation scripts to pass padding information via the new `seqlen_q_ptr` and `seqlen_k_ptr` fields in
     the CK `fmha_bwd_args` struct.
- Modified the underlying kernels and code generation scripts to correctly handle pointers for both padded and unpadded sequence data.
- Added comprehensive gradient verification to the test suite (`test_mha_varlen.py`) to ensure the correctness of the backward pass with various
     padding scenarios.

* [CK_TILE] fmha: Adapt to composable_kernel padding API changes

Refactor the FMHA forward and backward pass to align with the updated padding API in `composable_kernel`.

- Argument Simplification: Removed the manual calculation of `seqlen_q` and `seqlen_k` from `cu_seqlens` in the `mha.cu` interface. The underlying kernels now handle this logic.
- API Alignment: Updated the arguments passed to `aiter::mha_fwd` and `aiter::mha_bwd` to match the new `composable_kernel` API. This involves passing `cu_seqlen` pointers directly.
- Kernel Interface Update: Modified the `codegen.py` scripts for `gfx942` and `gfx950` to reflect the changes in the kernel's function signatures and argument handling for padded and unpadded sequence lengths.

* fix build error in op_tests/cpp/mha/benchmark_mha_*.cpp
* add num_kv_splits_indptr to mla for mtp<=4 case for now

* update

* update new kernel

* infrastructures

* 1st version of split kernel

* Fix issues raised by Lingpeng and fix the issue on batch_size

* update mla

* update mla_stage2

* 1st draft of v1 split program

* add kv_offset

* mla_splitkv_enhance_split_alg_inte

* splitkv debug

* 1st version of reduce kernel

* metadata & kernel finish

* add reduce

* final_lse is optional now.

* update kernel

* bug fix

* bug fix 1

* modify reduce api

* update kernel

* fix max splits

* bug fix 3

* fix s80 early return

* update calculation of partial_indx

* add per split test

* make lse support by ref

* test split

* fix redundant calculation of head offset in reduce kernel

* add custom test

* Add support for 128 head size

Fix how to get head count

fff

* update comments

* 1. Let large work be assigned first.
2. Add tolerance to the tile which is slightly smaller than kv_limit.
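
Assigning the largest work first is the classic longest-processing-time (LPT) greedy heuristic. A minimal host-side sketch of the policy; the names and structure are illustrative, not the actual metadata kernel:

```python
import heapq

def assign_largest_first(costs, num_clusters):
    """Longest-processing-time greedy: hand out workloads in descending
    cost order, always to the currently least-loaded cluster."""
    # (current_load, cluster_id) min-heap
    loads = [(0, c) for c in range(num_clusters)]
    heapq.heapify(loads)
    assignment = {}
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, cluster = heapq.heappop(loads)
        assignment[idx] = cluster
        heapq.heappush(loads, (load + costs[idx], cluster))
    return assignment
```
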

* Calculate kv_limit dynamically

* Fix bug about difference in split_kv(bool)

* add test

* fix seed

* Add global tolerance 16 in kv seqlen because main kernel cannot handle small splits (kv_seqlen<4) well.

* Fix warp=1 error

* Add redundant mode to make the metadata output size fixed; add new param: no_redundant. Reduce can support redundant input in reduce_indptr as well.

fix comm

* fp8 setup

* first version of device metadata

aaa

* Add work_ptrs

* Compatibility to CUDA Graph

* Refactor code. Merge 2 iterations of generate work together.

* Make sure that each batch of workload can never be split into more than #cluster tiles.

* Adjust metadata. Get 1% perf gain.

* Parallelize most of metadata kernel

Make get_cost_top() parallelized.

aaa

* add scale

* 1. Use warp-level bitonic sort to sort batch idx based on their cost for reducing #splits. 2. Use CK's warp ops.
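
A warp-level bitonic sort runs a fixed compare-exchange network that maps directly onto warp shuffle ops (the partner index is a lane XOR). A scalar Python sketch of the same network, for illustration only; the actual kernel uses CK's warp ops:

```python
def bitonic_sort(keys):
    """Ascending bitonic sort over a power-of-two-length list, using the
    same fixed compare-exchange network a warp-level version would run."""
    n = len(keys)
    assert n & (n - 1) == 0, "bitonic networks need a power-of-two size"
    a = list(keys)
    k = 2
    while k <= n:  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:  # compare-exchange distance (a lane XOR in a warp)
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```
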

* fp8 function pass

* Fix issues:
1. avg_workload cannot handle any batch!
2. split_kv(bool) is not correct when all the clusters are full.

* fp8 ready

* fix

* persistent ready

* add nv acc test

* rename

* update metashape

* update reduce cu num

* update optest for mla

* fix cu num

* Update metadata and reduce kernels.

* rename kernels

* Add new param kv_granularity to metadata kernel.

* Introduce cal_workload_limit_global_v2

* Support qhead=128 cases.

* Change get_mla_metadata() api. Make some unimportant parameters optional through a dict.

* fix potential problem on calculating tot_qo_tiles

typo

* refactor metadata files

aaa

* update metadata v1_2

* update gqa_128 mla_ps & fix metadata v1_2

* Optimize mla metadata v1.2

* Optimize mla metadata v1.2 Part.2

* Optimize mla metadata v1.2 Part.3

* update qlen <=4

* fix mla qlen1

* Optimize mla metadata v1.2 Part.4

* Make reduce_final_map be optional in mla_reduce_v1

* Slightly increase reduce perf

* Add persistent mode for mla reduce kernel

* add mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* update deepseekv32 sparse mla metadata

* update mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* Adjust code for sparse attn

* Optimize the a16w8 kernel

* Improve metadata v1.1 perf

* Make metadata v1.1 support sparse attn

bug fix

tiny fix

* Remove redundant code in mla_reduce

* futile struggle

* Fix issue after merge. aiter main branch is using torch.library.infer_schema, which doesn't allow dict as a parameter. Thus, change the API for metadata.

* Adjust metadata v1.1 and make this branch ready to be merged to main branch.

* remove invalid co kernel

* Fix issue brought from f794ae4 which disabled hipify by default.

* support qolen>1 for sparse mla

* make code become prettier

* Fix issue in metadata v1.1

* Fix issue in test_mla.py

* Fix lint fails

* Fix sub-test fails in op_test/test_mla.py

* Fix regression in test_mla.py where mtp>1

* Add head_dim=128 support to reduce

* Add nhead=8 for pa and add assert to make sure the input tensors are in
float32.

* fix issue in vllm benchmark for deepseek: remove metadata v0 because it's not compatible with hip graph

* fix lint

* Revert all the change about mi350 gemm.

* add a8w8 and a16w8 kernel in mla mi350

* add A8W8 Non-persistent mode kernel

* Fix issue reported by Copilot

* add mla non-persistent test

* script: update a16w8 kernel

* rm test_mla_persistent_mi350.py and support mi350 in test_mla_persistent.py

* add mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps.co

* fix a8w8 num_kv_split=1

* Fix issue in metadata v1.2 on qo_tiles > 1

* fix ut bandwidth

* Use nhead=16 to simulate cases where nhead=16*N with N in range(32, 16*32+1, 16)

aaa

Fix regression in sparse attn from the fix in metadata v1.2 for multi qo tile issue

* Add new api get_mla_metadata_info

* fix lint format issues

* Adjust get_mla_metadata_info_v1's parameters.

* update A16W8 kernel

* update A16W8 kernel2

* update A16W8 for mi300

* fix ut and rename some kernels

* rename mla kernel name for head 128

* remove log

* fix format

* add natively back

* change zeros into empty

* fix with comments

---------

Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>
* workaround-retry tuning when encounter invalid pointer

* workaround-retry tuning when encounter invalid pointer

* fix lint error

* Update gemm_tuner.py

em timeout
…s when setting up triton (#1325)

* CI: Skip triton in Aiter standard and multigpu tests

* Add retries when building triton

* Add ninja installation
`hipGetDeviceProperties` is called during `torch_fp8` initialization, which triggers full HIP runtime initialization while global variables are still being constructed. There are two issues:

- Several other global variables are involved in the runtime initialization, and the initialization order of globals is not guaranteed, so the runtime may be initialized using uninitialized globals.

- A forked child process needs to initialize its own HIP runtime to get a proper GPU driver kernel context and handles. But because the parent process initialized the runtime globally, the forked child considers the runtime already initialized and uses it directly, even though it is actually invalid.

The fix is to ensure `hipGetDeviceProperties` is only called when actually needed, not during static initialization.

To repro the issue:
1. fork a child process
2. call torch.empty on the child process

It will get a `hipErrorInvalidValue` error.

Co-authored-by: Hui Zhou <huizhou@meta.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
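
The lazy-initialization pattern described above can be sketched in Python; the names are hypothetical and the real fix lives in the C++ extension:

```python
import functools

def _query_device_props():
    # Stand-in for the hipGetDeviceProperties call; returns fake data here.
    return {"gcn_arch_name": "gfx942"}

@functools.lru_cache(maxsize=1)
def device_props():
    # Deferred to first use: nothing runs at import/static-init time, so a
    # process that forks *before* any query initializes its own runtime
    # state instead of inheriting a half-built one from the parent.
    return _query_device_props()
```
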
…nning (#1315)

* update AOT, always pad x_scale when generating input, add UT

* update act_mul_mxfp4_quant, fused_rms_mxfp4_quant

* add LL FP4 configs and AOT files for TP8 shapes

* fix UT bug

* add LL TP2 and TP4 shapes
* Kernel naming: add reusable constexpr repr helper for gemm a16w16

* add missing params to the repr
junhaha666 and others added 22 commits November 23, 2025 22:25
* add ptpc deepseek ep moe tuned config

* add block deepseek ep moe tune config

* using 1stage moe for ptpc deepseek
* fuse routing kernels for small batches

* tune batch=1024
…#1295)

* Apply kernel_repr to attention kernels

Applied make_kernel_repr helper to 4 attention kernel files:
- pa_decode.py (6 kernels)
- pa_prefill.py (2 kernels)
- chunked_pa_prefill.py (1 kernel)
- mla_decode_rope.py (2 kernels)

Each kernel now has config-aware naming with constexpr parameters
included in the kernel metadata name.

Base: amd/satya/kernel_config_to_name

* Apply kernel_repr to attention kernels

* fix indentation error and add kernel_repr to a missed kernel

* Add descriptions to missing API descriptions

* remove unused imports

* fix runtime error

* revert lean atten to main

* lean atten repr and API desc

* formatting fix

* Update aiter/ops/triton/pod_attention.py
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
FP8 MQA optimizations and benchmarking script
…ng (#1366)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* Add missing comma

* saved modified tuned fmoe config for testing purposes

* apply black required formatting

* remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage

* remove unnecessary arguments

* apply black formatting

* simplify aiter.fmoe_g1u1_tkw1 call

* add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True

* add doweight_stage1 to query key

* modify elif to select run_1stage=True for tokens > 16

* apply black formatting

* removing csv and .co files as they will come in separate commit

* remove log logger.info(f"[get_2stage_cfgs] run_1stage")

---------

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
…1446)

* CI: Move some tests to MI355 due to the network issue of TW cluster

* Modify the GPU_ARCH of sglang tests
* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: call ar naive

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
@zhuyuhua-v zhuyuhua-v changed the title from "[Update] update dev/perf branch to main branch commit e0fb38e and cherry-pick mla_fix, mmd/fix/fwd_v3" to "[Update] update dev/perf branch to main branch commit 9a30baf and cherry-pick mla_fix, mmd/fix/fwd_v3" Nov 23, 2025
@zhuyuhua-v zhuyuhua-v changed the title from "[Update] update dev/perf branch to main branch commit 9a30baf and cherry-pick mla_fix, mmd/fix/fwd_v3" to "[Update] update dev/perf branch to main branch commit 9c2df5 and cherry-pick mla_fix, mmd/fix/fwd_v3" Nov 23, 2025
@zhuyuhua-v zhuyuhua-v marked this pull request as ready for review November 24, 2025 09:12
@zhuyuhua-v zhuyuhua-v requested a review from wuhuikx November 24, 2025 09:13
@wuhuikx wuhuikx merged commit 5f4c65e into dev/perf Nov 24, 2025
@wuhuikx wuhuikx deleted the yuhua/dev/perf branch November 24, 2025 09:18