[ROCm] Clean up a bit the AITER FA backend by pschlan-amd · Pull Request #41942 · vllm-project/vllm

pschlan-amd · 2026-05-07T10:33:15Z

Purpose

Some of the meta data is not used and doesn't need to be computed.
Also, avoid a CPU sync when batch contains decode only. (In that case, it's not needed to copy seq_lens from GPU to CPU.)

Test Plan

Run an exemplary model with the AITER FA backend and check lm_eval results.
Also, run a latency benchmark to gauge impact of the change.

Test Result

`vllm bench latency`

python -m vllm.entrypoints.cli.main bench latency --model  meta-llama/Llama-3.1-8B --attention-backend ROCM_AITER_FA --numa-bind

Pre change (3x)

Avg latency: 0.6936792448939134 seconds
10% percentile latency: 0.6928847291506827 seconds
25% percentile latency: 0.6932902210392058 seconds
50% percentile latency: 0.6937075436580926 seconds
75% percentile latency: 0.6940363898174837 seconds
90% percentile latency: 0.694325171969831 seconds
99% percentile latency: 0.6948660616995767 seconds

Avg latency: 0.6860194624556849 seconds
10% percentile latency: 0.6854304550681263 seconds
25% percentile latency: 0.685635639121756 seconds
50% percentile latency: 0.6860365103930235 seconds
75% percentile latency: 0.6863664753036574 seconds
90% percentile latency: 0.686600391054526 seconds
99% percentile latency: 0.6873566200444475 seconds

Avg latency: 0.6915244271357854 seconds
10% percentile latency: 0.6910421045031399 seconds
25% percentile latency: 0.6911932142684236 seconds
50% percentile latency: 0.691451808437705 seconds
75% percentile latency: 0.6918098365422338 seconds
90% percentile latency: 0.6919960006605834 seconds
99% percentile latency: 0.6923786349594593 seconds

Post change (3x)

Avg latency: 0.6745571053897341 seconds
10% percentile latency: 0.6739433178212494 seconds
25% percentile latency: 0.6740635462338105 seconds
50% percentile latency: 0.6744294965174049 seconds
75% percentile latency: 0.674802612978965 seconds
90% percentile latency: 0.6753872703760863 seconds
99% percentile latency: 0.6759530151821673 seconds

Avg latency: 0.6746878373747071 seconds
10% percentile latency: 0.674094793247059 seconds
25% percentile latency: 0.6742286222288385 seconds
50% percentile latency: 0.6745670435484499 seconds
75% percentile latency: 0.6750414413399994 seconds
90% percentile latency: 0.6753575840964914 seconds
99% percentile latency: 0.6759539076080546 seconds

Avg latency: 0.6784423670886706 seconds
10% percentile latency: 0.6778691275510937 seconds
25% percentile latency: 0.6780405268073082 seconds
50% percentile latency: 0.6784036844037473 seconds
75% percentile latency: 0.6787772759562358 seconds
90% percentile latency: 0.6790890667587519 seconds
99% percentile latency: 0.6794963776925579 seconds

`lm_eval`

lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9090/v1/chat/completions,tokenized_requests=False,trust_remote_code=True,num_concurrent=8 --tasks gsm8k --output_path ./PRE-llama-3.1-8b-instruct --apply_chat_template --fewshot_as_multiturn --num_fewshot 8

Pre change

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.8302|±  |0.0103|
|     |       |strict-match    |     8|exact_match|↑  |0.8173|±  |0.0106|

Post change

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.8324|±  |0.0103|
|     |       |strict-match    |     8|exact_match|↑  |0.8234|±  |0.0105|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

gemini-code-assist

Code Review

This pull request refactors the ROCm Aiter Flash Attention backend by removing several unused or redundant fields from various metadata dataclasses, such as AiterFlashAttentionMetadata, AiterFlashAttentionDecodeMetadata, and AiterChunkContextMetadata. It also optimizes the build process by making the seq_lens CPU transfer conditional on the presence of prefill or extend operations, reducing unnecessary device-to-host transfers. Additionally, the min_seqlen_q parameter was removed from the extend_forward method and hardcoded to 1 in internal calls. I have no feedback to provide as there were no review comments to evaluate.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Some of the computed meta data is not used and can be removed. Also, avoid a CPU sync when batch contains decode only. Signed-off-by: Patrick Schlangen <pschlan@amd.com>

tjtanaa · 2026-05-07T13:11:39Z

@pschlan-amd can you help to run with this environment flag? VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1, ROCM_AITER_FA supports two modes.

pschlan-amd · 2026-05-07T13:23:16Z

@pschlan-amd can you help to run with this environment flag? VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1, ROCM_AITER_FA supports two modes.

Done:

`vllm bench latency`

Avg latency: 0.6988510643442472 seconds
10% percentile latency: 0.6875904533080757 seconds
25% percentile latency: 0.6977792493999004 seconds
50% percentile latency: 0.7008764864876866 seconds
75% percentile latency: 0.7022336339578032 seconds
90% percentile latency: 0.7028782699257136 seconds
99% percentile latency: 0.703364502042532 seconds

Note: I don't have access to the machine right now where I did the original benchmarks from the PR description, so I had to use another one, so numbers here are not 100 % comparable to the initial benchmarks.

Edit: Got access to the original machine, here's the result:

Avg latency: 0.6671127228687207 seconds
10% percentile latency: 0.6657795118633658 seconds
25% percentile latency: 0.6664963202783838 seconds
50% percentile latency: 0.6668758261948824 seconds
75% percentile latency: 0.6678977521369234 seconds
90% percentile latency: 0.6687722432427108 seconds
99% percentile latency: 0.6694269294058904 seconds

`lm_eval`

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.8309|±  |0.0103|
|     |       |strict-match    |     8|exact_match|↑  |0.8196|±  |0.0106|

(The PR shouldn't cause any functional difference; it's just removing written-but-never-used meta data.)

pschlan-amd · 2026-05-07T16:10:37Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the ROCm Aiter Flash Attention backend by removing unused fields from several metadata dataclasses and simplifying the build_for_cudagraph_capture method. Additionally, it optimizes performance by conditionally copying seq_lens to the CPU only when prefills or extends are present, thereby avoiding unnecessary blocking device-to-host transfers during decode-only iterations. I have no feedback to provide.

tjtanaa

LGTM

mergify · 2026-05-08T08:56:41Z

Hi @pschlan-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

mergify · 2026-05-08T09:05:59Z

Hi @pschlan-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

Signed-off-by: Patrick Schlangen <pschlan@amd.com> Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

Signed-off-by: Patrick Schlangen <pschlan@amd.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

mergify Bot added rocm Related to AMD ROCm v1 labels May 7, 2026

github-project-automation Bot added this to AMD May 7, 2026

github-project-automation Bot moved this to Todo in AMD May 7, 2026

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

pschlan-amd force-pushed the rocm_aiter_fa_cleanup branch from 2894882 to a9467f5 Compare May 7, 2026 12:14

pschlan-amd changed the title ~~[ROCm] Clean up AITER FA backend~~ [ROCm] Clean up a bit the AITER FA backend May 7, 2026

pschlan-amd marked this pull request as ready for review May 7, 2026 12:16

pschlan-amd requested a review from tjtanaa as a code owner May 7, 2026 12:16

claude Bot reviewed May 7, 2026

View reviewed changes

[ROCm] Clean up AITER FA backend

3ebd35f

Some of the computed meta data is not used and can be removed. Also, avoid a CPU sync when batch contains decode only. Signed-off-by: Patrick Schlangen <pschlan@amd.com>

pschlan-amd force-pushed the rocm_aiter_fa_cleanup branch from 75154e6 to 3ebd35f Compare May 7, 2026 12:18

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

tjtanaa approved these changes May 8, 2026

View reviewed changes

tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 8, 2026

Merge branch 'main' into rocm_aiter_fa_cleanup

7e12cae

tjtanaa enabled auto-merge (squash) May 8, 2026 08:52

Fix ruff style issue

3f3eefb

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

auto-merge was automatically disabled May 8, 2026 09:01
Head branch was pushed to by a user without write access

pschlan-amd added 6 commits May 8, 2026 11:34

Fix mypy/ruff checks

1256b9a

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

Merge branch 'main' into rocm_aiter_fa_cleanup

bab678b

Merge branch 'main' into rocm_aiter_fa_cleanup

4e3fc84

Merge branch 'main' into rocm_aiter_fa_cleanup

72ade9d

Merge branch 'main' into rocm_aiter_fa_cleanup

e9f1213

Merge branch 'main' into rocm_aiter_fa_cleanup

4b42cc2

tjtanaa merged commit 5f1b313 into vllm-project:main May 11, 2026
65 checks passed

github-project-automation Bot moved this from Todo to Done in AMD May 11, 2026

pschlan-amd deleted the rocm_aiter_fa_cleanup branch May 11, 2026 14:45

hissu-hyvarinen pushed a commit to hissu-hyvarinen/vllm that referenced this pull request May 11, 2026

[ROCm] Clean up a bit the AITER FA backend (vllm-project#41942)

59855cd

Signed-off-by: Patrick Schlangen <pschlan@amd.com> Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[ROCm] Clean up a bit the AITER FA backend (vllm-project#41942)

0daf4d4

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[ROCm] Clean up a bit the AITER FA backend (vllm-project#41942)

6e01a18

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[ROCm] Clean up a bit the AITER FA backend (vllm-project#41942)

2003aba

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

[ROCm] Clean up a bit the AITER FA backend (vllm-project#41942)

4e4ba8f

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[ROCm] Clean up a bit the AITER FA backend (vllm-project#41942)

dc852df

Signed-off-by: Patrick Schlangen <pschlan@amd.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ROCm] Clean up a bit the AITER FA backend#41942

[ROCm] Clean up a bit the AITER FA backend#41942
tjtanaa merged 9 commits into
vllm-project:mainfrom
pschlan-amd:rocm_aiter_fa_cleanup

pschlan-amd commented May 7, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

tjtanaa commented May 7, 2026

Uh oh!

pschlan-amd commented May 7, 2026 •

edited

Loading

Uh oh!

pschlan-amd commented May 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

tjtanaa left a comment

Uh oh!

mergify Bot commented May 8, 2026

Uh oh!

mergify Bot commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

pschlan-amd commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

vllm bench latency

Pre change (3x)

Post change (3x)

lm_eval

Pre change

Post change

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

tjtanaa commented May 7, 2026

Uh oh!

pschlan-amd commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

vllm bench latency

lm_eval

Uh oh!

pschlan-amd commented May 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

tjtanaa left a comment

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented May 8, 2026

Uh oh!

mergify Bot commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pschlan-amd commented May 7, 2026 •

edited

Loading

`vllm bench latency`

`lm_eval`

pschlan-amd commented May 7, 2026 •

edited

Loading

`vllm bench latency`

`lm_eval`