[FLYDSL][GEMM] Full re-tuning for mixed stream-k a16w16 gemm & enhance co-issue #3469

Merged: valarLip merged 20 commits into main from xyt/hgemm_spk2_cfg2 on Jun 5, 2026
@xytpai xytpai commented Jun 1, 2026

A16W16 Gemm Tuning Distribution

Summary

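| Rank | Model | flydsl Share | flydsl Rows / Total Rows |
|-----:|-------|-------------:|--------------------------|
| 1 | gptoss | 83.15% | 74 / 89 |
| 2 | kimi & kimik2 | 77.70% | 115 / 148 |
| 3 | glm5 | 74.71% | 65 / 87 |
| 4 | deepseekv3 | 56.90% | 33 / 58 |
| 5 | deepseekv4 + opus backend | 50.86% | 178 / 350 |
| - | Overall | 63.52% | 465 / 732 |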
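
The shares in the summary can be re-derived from the row counts. A minimal sketch, using only the counts listed in the summary (the variable and function names are illustrative, not part of the PR):

```python
# (flydsl rows, total rows) per model family, copied from the summary table.
counts = {
    "gptoss": (74, 89),
    "kimi & kimik2": (115, 148),
    "glm5": (65, 87),
    "deepseekv3": (33, 58),
    "deepseekv4 + opus backend": (178, 350),
}


def share(flydsl_rows, total_rows):
    """Percentage of rows attributed to flydsl, rounded to two decimals."""
    return round(100.0 * flydsl_rows / total_rows, 2)


for model, (flydsl_rows, total_rows) in counts.items():
    print(f"{model}: {share(flydsl_rows, total_rows):.2f}%")

# The overall share is computed from summed counts, not by averaging shares.
overall_flydsl = sum(f for f, _ in counts.values())
overall_total = sum(t for _, t in counts.values())
print(f"Overall: {share(overall_flydsl, overall_total):.2f}% "
      f"({overall_flydsl} / {overall_total})")
```

Because the overall figure is a ratio of sums, the large deepseekv4 table (350 rows) weighs on it more than the smaller tables do.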

gptoss

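| N | K | flydsl Rows | Total Rows | flydsl Share |
|--:|--:|------------:|-----------:|-------------:|
| 128 | 2880 | 13 | 13 | 100.00% |
| 640 | 2880 | 13 | 13 | 100.00% |
| 2560 | 2880 | 11 | 13 | 84.62% |
| 2880 | 512 | 6 | 13 | 46.15% |
| 2880 | 2048 | 12 | 13 | 92.31% |
| 2880 | 4096 | 8 | 12 | 66.67% |
| 5120 | 2880 | 11 | 12 | 91.67% |
| Overall | All | 74 | 89 | 83.15% |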

kimi&kimik2

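| N | K | flydsl Rows | Total Rows | flydsl Share |
|--:|--:|------------:|-----------:|-------------:|
| 384 | 7168 | 61 | 64 | 95.31% |
| 1024 | 7168 | 30 | 49 | 61.22% |
| 2112 | 7168 | 10 | 15 | 66.67% |
| 3072 | 1536 | 1 | 3 | 33.33% |
| 4096 | 512 | 3 | 5 | 60.00% |
| 7168 | 512 | 10 | 12 | 83.33% |
| Overall | All | 115 | 148 | 77.70% |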

glm5

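| N | K | flydsl Rows | Total Rows | flydsl Share |
|--:|--:|------------:|-----------:|-------------:|
| 32 | 6144 | 0 | 11 | 0.00% |
| 128 | 6144 | 10 | 11 | 90.91% |
| 256 | 6144 | 9 | 11 | 81.82% |
| 2624 | 6144 | 8 | 11 | 72.73% |
| 4096 | 2048 | 10 | 11 | 90.91% |
| 6144 | 3072 | 10 | 11 | 90.91% |
| 6144 | 4096 | 9 | 10 | 90.00% |
| 6144 | 6144 | 8 | 10 | 80.00% |
| 38720 | 6144 | 1 | 1 | 100.00% |
| Overall | All | 65 | 87 | 74.71% |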

deepseekv3

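| N | K | flydsl Rows | Total Rows | flydsl Share |
|--:|--:|------------:|-----------:|-------------:|
| 256 | 7168 | 10 | 13 | 76.92% |
| 2112 | 7168 | 3 | 6 | 50.00% |
| 3072 | 1536 | 12 | 13 | 92.31% |
| 7168 | 2048 | 8 | 13 | 61.54% |
| 16160 | 7168 | 0 | 13 | 0.00% |
| Overall | All | 33 | 58 | 56.90% |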

deepseekv4 (+opus backend)

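| N | K | flydsl Rows | Total Rows | flydsl Share |
|--:|--:|------------:|-----------:|-------------:|
| 64 | 7168 | 63 | 68 | 92.65% |
| 384 | 7168 | 15 | 53 | 28.30% |
| 512 | 7168 | 30 | 68 | 44.12% |
| 1024 | 7168 | 40 | 93 | 43.01% |
| 2048 | 7168 | 30 | 68 | 44.12% |
| Overall | All | 178 | 350 | 50.86% |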
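
Per-shape breakdowns like the ones above could be produced by grouping a tuned CSV by (N, K) and counting the rows whose backend is flydsl. The sketch below is only an illustration: the column names `M`, `N`, `K` and `libtype`, the value `flydsl`, and every sample row are assumptions, not data or schema taken from this PR.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for a tuned GEMM CSV. Column names and rows are assumed for
# illustration only; check the real file header before reusing this.
SAMPLE_CSV = """M,N,K,libtype
1,128,2880,flydsl
2,128,2880,flydsl
1,2880,512,flydsl
2,2880,512,other
4,2880,512,other
"""


def backend_share_by_shape(csv_text, backend="flydsl"):
    """Return {(N, K): (backend_rows, total_rows, share_percent)}."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["N"]), int(row["K"]))
        total[key] += 1
        if row["libtype"].strip() == backend:
            hits[key] += 1
    return {
        key: (hits[key], total[key], round(100.0 * hits[key] / total[key], 2))
        for key in sorted(total)
    }


result = backend_share_by_shape(SAMPLE_CSV)
for (n, k), (backend_rows, total_rows, pct) in result.items():
    print(f"N={n} K={k}: {backend_rows} / {total_rows} = {pct:.2f}%")
```

To run it against a real file, read the file's text and pass it in place of `SAMPLE_CSV`, adjusting the column names to match the actual header.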

gptoss accuracy

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9014|±  |0.0082|
|     |       |strict-match    |     3|exact_match|↑  |0.3146|±  |0.0128|
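
The Stderr column is consistent with the standard error of a proportion. A rough check, assuming the gsm8k test split has 1319 examples (the sample size is not stated in this PR) and the sample standard error sqrt(p(1-p)/(n-1)):

```python
import math

N_EXAMPLES = 1319  # assumed gsm8k test-set size; not stated in the PR


def proportion_stderr(p, n):
    """Sample standard error of the mean of n 0/1 outcomes with mean p."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Values copied from the accuracy table above.
for name, p in [("flexible-extract", 0.9014), ("strict-match", 0.3146)]:
    print(f"{name}: {p:.4f} +/- {proportion_stderr(p, N_EXAMPLES):.4f}")
```

Under that assumption both rows reproduce the reported values to four decimals, which suggests the two filters were scored on the same set of examples.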

Commands for re-tuning

# gptoss
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/gptoss_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/gptoss_bf16_untuned_gemm.csv

# glm5
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/glm5_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/glm5_untuned_gemm_bf16.csv

# dsv3
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/dsv3_bf16_untuned_gemm.csv

# kimi
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/kimi_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/kimi_bf16_untuned_gemm.csv

# kimik2
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/kimik2_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/kimik2_bf16_untuned_gemm.csv

# dsv4
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/dsv4_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/dsv4_bf16_untuned_gemm.csv

---

# llama70b
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/llama70B_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/llama70B_untuned_gemm_bf16.csv

# llama405b
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/llama405B_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/llama405B_bf16_untuned_gemm.csv

# qwen32b
python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/model_configs/qwen32B_bf16_tuned_gemm.csv --input_file aiter/configs/model_configs/qwen32B_untuned_gemm_bf16.csv

python3 gradlib/gradlib/gemm_tuner.py --tuned_file /home/yuxu/workspace/prof/dsv4_tuned.csv --input_file /home/yuxu/workspace/prof/dsv4.csv --libtype triton
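
The per-model commands above differ only in their file names, so they can be generated from a small table. This is a dry-run sketch, not a script from the PR: the `JOBS` mapping and `tuner_argv` helper are hypothetical names, while the script path, flags, and CSV names are copied from the commands above. It only prints the commands and does not run the tuner.

```python
import shlex

TUNER = "gradlib/gradlib/gemm_tuner.py"
CONFIG_DIR = "aiter/configs/model_configs"

# (tuned CSV, untuned CSV) pairs copied from the commands above. The untuned
# file names do not follow a single pattern, so they are listed explicitly.
JOBS = {
    "gptoss": ("gptoss_bf16_tuned_gemm.csv", "gptoss_bf16_untuned_gemm.csv"),
    "glm5": ("glm5_bf16_tuned_gemm.csv", "glm5_untuned_gemm_bf16.csv"),
    "dsv3": ("dsv3_bf16_tuned_gemm.csv", "dsv3_bf16_untuned_gemm.csv"),
    "kimi": ("kimi_bf16_tuned_gemm.csv", "kimi_bf16_untuned_gemm.csv"),
    "kimik2": ("kimik2_bf16_tuned_gemm.csv", "kimik2_bf16_untuned_gemm.csv"),
    "dsv4": ("dsv4_bf16_tuned_gemm.csv", "dsv4_bf16_untuned_gemm.csv"),
}


def tuner_argv(tuned, untuned):
    """Build the argument list for one re-tuning run."""
    return [
        "python3", TUNER,
        "--tuned_file", f"{CONFIG_DIR}/{tuned}",
        "--input_file", f"{CONFIG_DIR}/{untuned}",
    ]


for model, (tuned, untuned) in JOBS.items():
    print(f"# {model}")
    print(shlex.join(tuner_argv(tuned, untuned)))
```

To actually launch a run, each argument list could be passed to `subprocess.run(argv, check=True)` from the repository root.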

@xytpai xytpai requested a review from a team June 1, 2026 15:25
github-actions Bot commented Jun 1, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|-------|-------|
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3469 --add-label <label>`.

@xytpai xytpai marked this pull request as draft June 1, 2026 15:36
@xytpai xytpai changed the title from "[FLYDSL A16W16 GEMM] Full re-tuning & enhance co-issue" to "[FLYDSL A16W16 Mixed GEMM] Full re-tuning & enhance co-issue" Jun 1, 2026
@xytpai xytpai changed the title from "[FLYDSL A16W16 Mixed GEMM] Full re-tuning & enhance co-issue" to "[FLYDSL][GEMM] Full re-tuning for mixed stream-k a16w16 gemm & enhance co-issue" Jun 1, 2026
@xytpai xytpai marked this pull request as ready for review June 2, 2026 13:34
@xytpai xytpai requested a review from valarLip June 4, 2026 16:10
@valarLip valarLip merged commit 4689f48 into main Jun 5, 2026
47 of 48 checks passed
@valarLip valarLip deleted the xyt/hgemm_spk2_cfg2 branch June 5, 2026 02:31