Skip to content

[ROCm] Clean up a bit the AITER FA backend#41942

Merged
tjtanaa merged 9 commits into
vllm-project:mainfrom
pschlan-amd:rocm_aiter_fa_cleanup
May 11, 2026
Merged

[ROCm] Clean up a bit the AITER FA backend#41942
tjtanaa merged 9 commits into
vllm-project:mainfrom
pschlan-amd:rocm_aiter_fa_cleanup

Conversation

@pschlan-amd

@pschlan-amd pschlan-amd commented May 7, 2026

Copy link
Copy Markdown
Contributor

Purpose

Some of the meta data is not used and doesn't need to be computed.
Also, avoid a CPU sync when batch contains decode only. (In that case, it's not needed to copy seq_lens from GPU to CPU.)

Test Plan

Run an exemplary model with the AITER FA backend and check lm_eval results.
Also, run a latency benchmark to gauge impact of the change.

Test Result

vllm bench latency

python -m vllm.entrypoints.cli.main bench latency --model  meta-llama/Llama-3.1-8B --attention-backend ROCM_AITER_FA --numa-bind

Pre change (3x)

Avg latency: 0.6936792448939134 seconds
10% percentile latency: 0.6928847291506827 seconds
25% percentile latency: 0.6932902210392058 seconds
50% percentile latency: 0.6937075436580926 seconds
75% percentile latency: 0.6940363898174837 seconds
90% percentile latency: 0.694325171969831 seconds
99% percentile latency: 0.6948660616995767 seconds

Avg latency: 0.6860194624556849 seconds
10% percentile latency: 0.6854304550681263 seconds
25% percentile latency: 0.685635639121756 seconds
50% percentile latency: 0.6860365103930235 seconds
75% percentile latency: 0.6863664753036574 seconds
90% percentile latency: 0.686600391054526 seconds
99% percentile latency: 0.6873566200444475 seconds

Avg latency: 0.6915244271357854 seconds
10% percentile latency: 0.6910421045031399 seconds
25% percentile latency: 0.6911932142684236 seconds
50% percentile latency: 0.691451808437705 seconds
75% percentile latency: 0.6918098365422338 seconds
90% percentile latency: 0.6919960006605834 seconds
99% percentile latency: 0.6923786349594593 seconds

Post change (3x)

Avg latency: 0.6745571053897341 seconds
10% percentile latency: 0.6739433178212494 seconds
25% percentile latency: 0.6740635462338105 seconds
50% percentile latency: 0.6744294965174049 seconds
75% percentile latency: 0.674802612978965 seconds
90% percentile latency: 0.6753872703760863 seconds
99% percentile latency: 0.6759530151821673 seconds

Avg latency: 0.6746878373747071 seconds
10% percentile latency: 0.674094793247059 seconds
25% percentile latency: 0.6742286222288385 seconds
50% percentile latency: 0.6745670435484499 seconds
75% percentile latency: 0.6750414413399994 seconds
90% percentile latency: 0.6753575840964914 seconds
99% percentile latency: 0.6759539076080546 seconds

Avg latency: 0.6784423670886706 seconds
10% percentile latency: 0.6778691275510937 seconds
25% percentile latency: 0.6780405268073082 seconds
50% percentile latency: 0.6784036844037473 seconds
75% percentile latency: 0.6787772759562358 seconds
90% percentile latency: 0.6790890667587519 seconds
99% percentile latency: 0.6794963776925579 seconds

lm_eval

lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9090/v1/chat/completions,tokenized_requests=False,trust_remote_code=True,num_concurrent=8 --tasks gsm8k --output_path ./PRE-llama-3.1-8b-instruct --apply_chat_template --fewshot_as_multiturn --num_fewshot 8

Pre change

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.8302|±  |0.0103|
|     |       |strict-match    |     8|exact_match|↑  |0.8173|±  |0.0106|

Post change

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.8324|±  |0.0103|
|     |       |strict-match    |     8|exact_match|↑  |0.8234|±  |0.0105|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mergify mergify Bot added rocm Related to AMD ROCm v1 labels May 7, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 7, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the ROCm Aiter Flash Attention backend by removing several unused or redundant fields from various metadata dataclasses, such as AiterFlashAttentionMetadata, AiterFlashAttentionDecodeMetadata, and AiterChunkContextMetadata. It also optimizes the build process by making the seq_lens CPU transfer conditional on the presence of prefill or extend operations, reducing unnecessary device-to-host transfers. Additionally, the min_seqlen_q parameter was removed from the extend_forward method and hardcoded to 1 in internal calls. I have no feedback to provide as there were no review comments to evaluate.

@pschlan-amd pschlan-amd force-pushed the rocm_aiter_fa_cleanup branch from 2894882 to a9467f5 Compare May 7, 2026 12:14
@pschlan-amd pschlan-amd changed the title [ROCm] Clean up AITER FA backend [ROCm] Clean up a bit the AITER FA backend May 7, 2026
@pschlan-amd pschlan-amd marked this pull request as ready for review May 7, 2026 12:16
@pschlan-amd pschlan-amd requested a review from tjtanaa as a code owner May 7, 2026 12:16

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Some of the computed meta data is not used and can be removed.
Also, avoid a CPU sync when batch contains decode only.

Signed-off-by: Patrick Schlangen <pschlan@amd.com>
@pschlan-amd pschlan-amd force-pushed the rocm_aiter_fa_cleanup branch from 75154e6 to 3ebd35f Compare May 7, 2026 12:18
@tjtanaa

tjtanaa commented May 7, 2026

Copy link
Copy Markdown
Member

@pschlan-amd can you help to run with this environment flag? VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1, ROCM_AITER_FA supports two modes.

@pschlan-amd

pschlan-amd commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

@pschlan-amd can you help to run with this environment flag? VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1, ROCM_AITER_FA supports two modes.

Done:

vllm bench latency

Avg latency: 0.6988510643442472 seconds
10% percentile latency: 0.6875904533080757 seconds
25% percentile latency: 0.6977792493999004 seconds
50% percentile latency: 0.7008764864876866 seconds
75% percentile latency: 0.7022336339578032 seconds
90% percentile latency: 0.7028782699257136 seconds
99% percentile latency: 0.703364502042532 seconds

Note: I don't have access to the machine right now where I did the original benchmarks from the PR description, so I had to use another one, so numbers here are not 100 % comparable to the initial benchmarks.

Edit: Got access to the original machine, here's the result:

Avg latency: 0.6671127228687207 seconds
10% percentile latency: 0.6657795118633658 seconds
25% percentile latency: 0.6664963202783838 seconds
50% percentile latency: 0.6668758261948824 seconds
75% percentile latency: 0.6678977521369234 seconds
90% percentile latency: 0.6687722432427108 seconds
99% percentile latency: 0.6694269294058904 seconds

lm_eval

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.8309|±  |0.0103|
|     |       |strict-match    |     8|exact_match|↑  |0.8196|±  |0.0106|

(The PR shouldn't cause any functional difference; it's just removing written-but-never-used meta data.)

@pschlan-amd

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the ROCm Aiter Flash Attention backend by removing unused fields from several metadata dataclasses and simplifying the build_for_cudagraph_capture method. Additionally, it optimizes performance by conditionally copying seq_lens to the CPU only when prefills or extends are present, thereby avoiding unnecessary blocking device-to-host transfers during decode-only iterations. I have no feedback to provide.

@tjtanaa tjtanaa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 8, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) May 8, 2026 08:52
@mergify

mergify Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Hi @pschlan-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Patrick Schlangen <pschlan@amd.com>
auto-merge was automatically disabled May 8, 2026 09:01

Head branch was pushed to by a user without write access

@mergify

mergify Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Hi @pschlan-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@tjtanaa tjtanaa merged commit 5f1b313 into vllm-project:main May 11, 2026
65 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 11, 2026
@pschlan-amd pschlan-amd deleted the rocm_aiter_fa_cleanup branch May 11, 2026 14:45
hissu-hyvarinen pushed a commit to hissu-hyvarinen/vllm that referenced this pull request May 11, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants