Layernorm performance optimization and partition size pybind #3

Closed
mawong-amd wants to merge 8 commits into vllm_upstream from vllm_upstream_mattwong_expmtal

Conversation

@mawong-amd mawong-amd commented Mar 4, 2024

This PR does two things:

  1. Provides performance optimizations for the fused_add_rms_norm kernels (used in some layernorms, e.g. the input and post-attention layernorms in each Llama decoder layer). The performance benefits largely originate from two optimizations: using shared memory to cache intermediate computation results (as opposed to the existing use of global memory), and using packed operations for FP16 inputs. Unrolling was attempted but was not found to affect performance. Another attempted optimization was specializing blockReduceSum/warpReduceSum to the AMD wavefront size of 64 (as opposed to the existing CUDA-compatible warp size of 32), which should in theory reduce the number of shuffles by a factor of (1024/32 * 5 + 5) / (1024/64 * 6 + 4) = 1.65, but this was also not found to measurably affect performance.

Typical performance improvements are on the order of 10%, based on the average runtime of the kernels as tested on Llama2-7B and Llama2-70B models on MI300X. Performance can probably be improved further, but we are running into diminishing returns since the total runtime of the layernorms is not large. One interesting observation is that the post-attention layernorm on LL2-70B consistently takes 4-5 us longer than the input layernorm to complete (16 us vs 20 us), while this behavior is not observed for LL2-7B. It is unclear why this is the case; it could be cache-related.

  2. Adds platform-specific paged attention v2 partition sizes and exposes them to Python.
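For context on item 1, here is a minimal pure-Python sketch of the semantics that a fused_add_rms_norm kernel implements (the residual add fused with RMS normalization). The function name matches the kernel being discussed, but the signature, return convention, and eps default here are illustrative assumptions, not the kernel's exact interface:

```python
import math

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Reference semantics of the fused kernel: the residual add and the
    RMS normalization are done in one pass over the hidden dimension."""
    # The summed hidden states are the intermediate results that the
    # optimized kernel caches in shared memory instead of re-reading
    # them from global memory.
    summed = [xi + ri for xi, ri in zip(x, residual)]
    variance = sum(v * v for v in summed) / len(summed)
    scale = 1.0 / math.sqrt(variance + eps)
    out = [v * scale * w for v, w in zip(summed, weight)]
    # summed is also written back as the updated residual stream.
    return out, summed
```

The fact that `summed` is needed twice (once for the variance reduction, once for the final scaled output) is what makes caching it worthwhile rather than recomputing or re-loading it.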

@mawong-amd mawong-amd requested a review from dllehr-amd March 4, 2024 19:03
@mawong-amd mawong-amd self-assigned this Mar 4, 2024
@dllehr-amd dllehr-amd requested a review from sanyalington March 4, 2024 20:28
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 73a9c85 to 85277d9 Compare March 4, 2024 21:04
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 85277d9 to 1396c2c Compare March 4, 2024 21:07
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 8c6423d to dcfc084 Compare March 5, 2024 21:13
@mawong-amd mawong-amd marked this pull request as draft March 8, 2024 19:26
@mawong-amd
Author

Pending work on the prefill side of things

AdrianAbeyta pushed a commit that referenced this pull request Mar 8, 2024
Rename remaining fp8_e5m2 to general fp8
@mawong-amd mawong-amd closed this Mar 27, 2024
@mawong-amd mawong-amd deleted the vllm_upstream_mattwong_expmtal branch March 27, 2024 19:50
gshtras pushed a commit that referenced this pull request Sep 27, 2024
Use 12660 and 12969 as base builds for v3 and v4 Dockerfiles
mawong-amd pushed a commit that referenced this pull request May 14, 2025
Co-authored-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Felix Marty <felmarty@amd.com>
