[ROCm] Clean up a bit the AITER FA backend #41942
Code Review
This pull request refactors the ROCm Aiter Flash Attention backend by removing several unused or redundant fields from various metadata dataclasses, such as AiterFlashAttentionMetadata, AiterFlashAttentionDecodeMetadata, and AiterChunkContextMetadata. It also optimizes the build process by making the seq_lens CPU transfer conditional on the presence of prefill or extend operations, reducing unnecessary device-to-host transfers. Additionally, the min_seqlen_q parameter was removed from the extend_forward method and hardcoded to 1 in internal calls. I have no feedback to provide as there were no review comments to evaluate.
2894882 to a9467f5
Some of the computed metadata is not used and can be removed. Also, avoid a CPU sync when the batch contains only decodes. Signed-off-by: Patrick Schlangen <pschlan@amd.com>
75154e6 to 3ebd35f
@pschlan-amd can you help to run with this environment flag?
Done:
/gemini review
Code Review
This pull request refactors the ROCm Aiter Flash Attention backend by removing unused fields from several metadata dataclasses and simplifying the build_for_cudagraph_capture method. Additionally, it optimizes performance by conditionally copying seq_lens to the CPU only when prefills or extends are present, thereby avoiding unnecessary blocking device-to-host transfers during decode-only iterations. I have no feedback to provide.
Hi @pschlan-amd, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
Head branch was pushed to by a user without write access
Signed-off-by: Patrick Schlangen <pschlan@amd.com> Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
Signed-off-by: Patrick Schlangen <pschlan@amd.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Purpose
Some of the metadata is unused and does not need to be computed.
Also, avoid a CPU sync when the batch contains only decodes. (In that case, seq_lens does not need to be copied from GPU to CPU.)
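The idea behind the conditional copy can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual backend code: `BatchSplit`, `seq_lens_for_build`, and the `copy_seq_lens_to_cpu` callback are hypothetical names, and the callback stands in for whatever blocking device-to-host transfer the real builder performs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BatchSplit:
    """Hypothetical per-batch request counts (names are illustrative)."""

    num_decodes: int
    num_extends: int
    num_prefills: int


def seq_lens_for_build(
    split: BatchSplit,
    copy_seq_lens_to_cpu: Callable[[], List[int]],
) -> Optional[List[int]]:
    """Return host-side seq_lens only when a prefill or extend needs them."""
    if split.num_prefills > 0 or split.num_extends > 0:
        # The blocking device-to-host transfer happens only on this path.
        return copy_seq_lens_to_cpu()
    # Decode-only batch: skip the transfer and the sync it implies.
    return None
```

With a decode-only `BatchSplit` the callback is never invoked, so no host sync is triggered; as soon as the batch contains a prefill or an extend, the copy runs exactly once.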
Test Plan
Run an example model with the AITER FA backend and check the lm_eval results.
Also, run a latency benchmark to gauge the impact of the change.
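One possible way to compare repeated latency runs before and after the change is sketched below. The helper `summarize_runs` is hypothetical and is not part of vLLM or its benchmark tooling; it assumes each run has been reduced to a single average latency value, collected by hand from the benchmark output.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(pre: Sequence[float], post: Sequence[float]) -> Dict[str, float]:
    """Summarize repeated latency measurements taken before and after a change."""
    pre_mean = mean(pre)
    post_mean = mean(post)
    return {
        "pre_mean": pre_mean,
        "post_mean": post_mean,
        # The sample standard deviation needs at least two runs.
        "pre_stdev": stdev(pre) if len(pre) > 1 else 0.0,
        "post_stdev": stdev(post) if len(post) > 1 else 0.0,
        # Negative values mean the post-change runs were faster.
        "change_pct": 100.0 * (post_mean - pre_mean) / pre_mean,
    }
```

Comparing `change_pct` against the two standard deviations gives a rough sense of whether a difference across three runs is larger than the run-to-run noise.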
Test Result
vllm bench latency
Pre change (3x)
Post change (3x)
lm_eval
Pre change
Post change
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.