Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. #34100
gshtras merged 5 commits into vllm-project:main
Conversation
Reduces reg pressure and improves perf too. Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Code Review
The pull request successfully converts the wvSplitKQ kernel to use 16x16 MFMA instructions, which should improve performance and reduce register pressure as intended. The changes in csrc/rocm/skinny_gemms.cu reflect this conversion, including type changes and updated MFMA intrinsics. The test file tests/kernels/quantization/test_rocm_skinny_gemms.py has increased test coverage by adding more seeds, which is a positive change. However, there are a couple of issues that need to be addressed.
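For context on what the new shape computes (a sketch for readers unfamiliar with MFMA, not the kernel's actual code): a single 16x16 MFMA instruction — such as the mfma_f32_16x16x32 fp8 variants on CDNA3 — performs one tile update C += A·B per wavefront, with a 16x16 fp32 output tile and a deeper reduction dimension for fp8 inputs. A plain-Python model of that tile update, with illustrative shapes:

```python
# Plain-Python model of the C += A @ B tile update a 16x16 MFMA performs.
# Shapes are illustrative: M = N = 16 output tile, K = 32 reduction depth
# (fp8 MFMA variants use a deeper K than the fp16 ones).
M, N, K = 16, 16, 32

def mfma_tile(A, B, C):
    """Accumulate C[m][n] += sum_k A[m][k] * B[k][n], as one 'instruction'."""
    for m in range(M):
        for n in range(N):
            acc = C[m][n]
            for k in range(K):
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
    return C

# Tiny smoke run: A is all ones, B is identity-like, so every C element
# picks up exactly one product and ends up 1.0.
A = [[1.0] * K for _ in range(M)]
B = [[1.0 if k == n else 0.0 for n in range(N)] for k in range(K)]
C = [[0.0] * N for _ in range(M)]
C = mfma_tile(A, B, C)
print(C[0][0])
```

The register-pressure win in this PR comes from the smaller accumulator tile per instruction (16x16 rather than 32x32), which is what the per-lane `sum[n][y][...]` reshuffling in `skinny_gemms.cu` reflects.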
| """ | ||
| @pytest.mark.parametrize("xnorm", [False, True]) | ||
| @pytest.mark.parametrize("n", N_FACTORS_WVSPLITKRC) | ||
| @pytest.mark.parametrize("k", K_FACTORS_WVSPLITKRC) |
The test_rocm_wvsplitkrc_kernel test function has been commented out. If this kernel (wvSplitKrc) is still a functional part of the codebase, its tests should remain active to prevent regressions. If the wvSplitKrc kernel is no longer needed or is being deprecated, please remove it and its associated tests entirely. Commenting out tests without a clear explanation in the PR description can lead to undetected issues for the affected functionality.
| """ | ||
|
|
| accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][2], 0x102, 0xf, 0xf, | ||
| 1); // row_shl2 | ||
| accm16 += __builtin_amdgcn_mov_dpp(sum[n][y][10], 0x102, 0xf, 0xf, 1); | ||
| 1); // row_shl1 | ||
| accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][3], 0x103, 0xf, 0xf, |
The comments for the __builtin_amdgcn_mov_dpp operations are inconsistent. While sum[n][y][1] uses 0x101 (row_shl1), sum[n][y][2] uses 0x102 and sum[n][y][3] uses 0x103. The comment // row_shl1 is repeated for all three, which is misleading. Please update the comments to accurately reflect the shift values (e.g., // row_shl2, // row_shl3).
1); // row_shl1
accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][2], 0x102, 0xf, 0xf,
1); // row_shl2
accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][3], 0x103, 0xf, 0xf,
1); // row_shl3
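As a sanity check on the shift semantics behind those comments, here is a lane-level model in plain Python. It assumes (as DPP `row_shl:n` is documented to behave) that lane i reads lane i+n within its own 16-lane row, and that bound_ctrl=1 (the final `1` argument) zero-fills reads that cross the row boundary; the names `sums`/`accm` mirror the kernel's `sum`/`accm0` but this is not the kernel's code:

```python
# Lane-level model of the DPP reduction above: a 64-lane wave, rows of 16.
# Assumption: row_shl:n makes lane i read lane i+n within its 16-lane row,
# and bound_ctrl=1 zero-fills reads that would cross the row boundary.
ROW = 16

def mov_dpp_row_shl(src, n):
    """Per-lane vector where lane i sees src[i + n] (same row) or 0.0."""
    out = [0.0] * len(src)
    for i in range(len(src)):
        j = i + n
        if j < len(src) and j // ROW == i // ROW:  # stay inside the row
            out[i] = src[j]
    return out

# sums[0..3] play the role of sum[n][y][0..3]; accumulate into accm per lane.
wave = 64
sums = [[float(100 * s + lane) for lane in range(wave)] for s in range(4)]
accm = list(sums[0])
accm = [a + b for a, b in zip(accm, mov_dpp_row_shl(sums[1], 1))]  # row_shl1
accm = [a + b for a, b in zip(accm, mov_dpp_row_shl(sums[2], 2))]  # row_shl2
accm = [a + b for a, b in zip(accm, mov_dpp_row_shl(sums[3], 3))]  # row_shl3
# Lane 0 now holds sums[0][0] + sums[1][1] + sums[2][2] + sums[3][3];
# lanes near a row edge (e.g. lane 15) keep only the in-row contributions.
print(accm[0], accm[15])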
csrc/rocm/skinny_gemms.cu
Outdated
                                  1); // row_shl1
accm16 += __builtin_amdgcn_mov_dpp(sum[n][y][9], 0x101, 0xf, 0xf, 1);
accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][2], 0x102, 0xf, 0xf,
                                  1); // row_shl2
accm16 += __builtin_amdgcn_mov_dpp(sum[n][y][10], 0x102, 0xf, 0xf,
                                   1); // row_shl1
Similar to the previous comment, the comments for the __builtin_amdgcn_mov_dpp operations here are inconsistent. The comment // row_shl1 is repeated for different shift values (0x101, 0x102, 0x103). Please update the comments to accurately reflect the shift values (e.g., // row_shl2, // row_shl3).
1); // row_shl1
accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][2], 0x102, 0xf, 0xf,
1); // row_shl2
accm0 += __builtin_amdgcn_mov_dpp(sum[n][y][3], 0x103, 0xf, 0xf,
1); // row_shl3
@amd-hhashemi Will these changes affect performance on gfx942 and gfx950? I do not see any code path logic for different architectures. Can you share some perf results for gfx942 and gfx950, especially end to end, as well? For pull requests related to algorithms, we would like to understand the impact of the changes. Please also provide some e2e accuracy evaluation (gsm8k will do) to understand whether the optimization impacted accuracy; as the algorithm changes, the error accumulation pattern might change. Do you know if we need to keep multiple versions of wvSplitK? It seems the condition for triggering the kernel keeps changing based on the optimization approaches, which potentially makes the code harder to maintain.
I just added the mi355 e2e Llama-3.1-70B-Instruct-FP8 perf numbers to the description. This fp8 one should be solid now; it has bias and padding. Might just tweak the dispatch config conditions for important models. It would be good if we could combine the different versions; the problem is that the different solutions diverge in ways that cause overheads when trying to merge them. What is the best way to report e2e accuracy? I usually just run prompt tests and make sure the before/after output has not changed.
Checking prompts is not enough, as usually you can only send one prompt at a time. We also need to check the boundary condition handling of the algorithm, so we have to validate at large batch sizes. How to evaluate accuracy depends on the case.
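A minimal sketch of the kind of boundary-condition sweep being asked for here. `run_kernel` and `run_reference` are hypothetical stand-ins (modeled below as fp32-rounded vs. full-precision dot products); a real check would compare the skinny-gemm path against a reference GEMM across batch sizes, including ones adjacent to the kernel's dispatch thresholds:

```python
# Sketch: validate a fast kernel against a reference across many batch sizes,
# including odd and threshold-adjacent ones where boundary handling can break.
# run_kernel / run_reference are hypothetical stand-ins for the GPU path and
# a trusted reference; here they differ only in accumulation precision.
import struct

def f32(x):
    """Round a Python float to float32 precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

def run_reference(a, b):
    return sum(x * y for x, y in zip(a, b))  # full double precision

def run_kernel(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))  # accumulate in fp32, like a GPU path
    return acc

K = 1024
worst = 0.0
for batch in [1, 2, 3, 7, 8, 15, 16, 17, 64, 256]:  # hit boundary cases
    for row in range(batch):
        a = [((row * 31 + k) % 7 - 3) / 7.0 for k in range(K)]
        b = [((k * 3 + row) % 5 - 2) / 3.0 for k in range(K)]
        ref = run_reference(a, b)
        got = run_kernel(a, b)
        worst = max(worst, abs(got - ref) / max(1.0, abs(ref)))
assert worst < 1e-3, f"relative error too large: {worst}"
print(f"max relative error: {worst:.2e}")
```

The same sweep structure applies whether the tolerance is a numeric bound (kernel tests) or a task score delta (gsm8k-style e2e evaluation).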
If this is the case, rather than modifying existing code, can we have them as two different kernel implementations and dispatch them based on different conditions? We still want to maintain support for older architectures while optimizing for newer ones.
@tjtanaa Sorry, to clarify: the different kernel solutions are not targeting different products; they target different GEMM scenarios that are currently difficult for hipBlasLt to handle well.
I was just saying that trying to address these 3 scenarios in one kernel without introducing overheads is difficult. There is no intention of having different kernel paths for different products.
@tjtanaa Accuracy and perf numbers added to description.
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Reduces reg pressure and improves perf of wvSplitKQ too.
It only changes the fp8 skinny gemms.
Run CI validation of test_rocm_skinny_gemms.py after #34013
mi350 tests all PASS, see before/after this PR (run with 5x more seed variations) attached.
test_rocm_skinny_gemms_MI355_after.log
test_rocm_skinny_gemms_MI355_befor.log
test_rocm_skinny_gemms_mi325_after34100.log
MI355 Performance:
vllm bench serve
model=amd/Llama-3.1-70B-Instruct-FP8-KV
max_concurrency=4
num_prompts=400
in=1024
out=1024
Accuracy
branch-main:
local-completions ({'model': 'deepseek-r1-FP8-Dynamic', 'base_url': 'http://127.0.0.1:8000/v1/completions'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 4
branch-wvSplitKfp16x16
local-completions ({'model': 'deepseek-r1-FP8-Dynamic', 'base_url': 'http://127.0.0.1:8000/v1/completions'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 4
branch-wvSplitKfp16x16+FULL_AND_PIECEWISE
local-completions ({'model': 'deepseek-r1-FP8-Dynamic', 'base_url': 'http://127.0.0.1:8000/v1/completions'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 4