
Upstreaming aiter triton attention backend as a new backend #28701

Merged: gshtras merged 8 commits into vllm-project:main from ROCm:upstreaming_triton_mla on Nov 19, 2025
Conversation

@maleksan85 (Contributor) commented Nov 14, 2025

Adding a new Triton MLA backend to handle prefills in DeepSeek (DS) models.
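The new backend is selected via the `VLLM_ATTENTION_BACKEND` environment variable, as the server command below shows. As a rough sketch of how env-var driven backend dispatch works (this is an illustration only; the actual selection logic lives inside vLLM and differs in detail, and `pick_attention_backend` / `KNOWN_BACKENDS` are hypothetical names):

```python
import os

# Hypothetical sketch of env-var driven backend selection; vLLM's real
# selector handles many more backends and platform checks.
KNOWN_BACKENDS = {"ROCM_AITER_MLA", "ROCM_AITER_TRITON_MLA", "TRITON_MLA"}

def pick_attention_backend(default: str = "TRITON_MLA") -> str:
    """Return the backend named by VLLM_ATTENTION_BACKEND, else a default."""
    name = os.environ.get("VLLM_ATTENTION_BACKEND", default)
    if name not in KNOWN_BACKENDS:
        raise ValueError(f"unknown attention backend: {name}")
    return name

os.environ["VLLM_ATTENTION_BACKEND"] = "ROCM_AITER_TRITON_MLA"
print(pick_attention_backend())  # ROCM_AITER_TRITON_MLA
```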

Server command:

VLLM_DISABLE_COMPILE_CACHE=1 \
AMDGCN_USE_BUFFER_OPS=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ATTENTION_BACKEND=ROCM_AITER_TRITON_MLA \
VLLM_ROCM_USE_AITER_MLA=1 \
VLLM_ROCM_USE_TRITON_ROPE=1 \
vllm serve /data/models/deepseek-ai/DeepSeek-R1-0528 \
    --host localhost \
    --port 8000 \
    --swap-space 64 \
    --disable-log-requests \
    --dtype auto \
    --max-model-len 8192 \
    --tensor-parallel-size 8 \
    --max-num-seqs 1024  \
    --trust-remote-code \
    --block-size 1 \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 131072 \
    --no-enable-prefix-caching \
    --async-scheduling \
    --enforce-eager
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/data/models/deepseek-ai/DeepSeek-R1-0528",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2023?"}
    ]}'
<think>\nOkay, the user is asking about the 2023 World Series winner. Hmm, this seems like a straightforward sports fact 
question, but I should double-check because the 2023 MLB season actually concluded in late 2023 (fall classic) for the 2023 
championship title. \n\nWait, I recall the Texas Rangers won their first-ever championship last year against the Diamondbacks. 
Let me mentally verify the details: It was a 4-1 series, Corey Seager got MVP, and Bruce Bochy managed them. The clinching 
Game 5 was especially dramatic with that late-inning comeback. \n\nThe user might just want a quick answer, but including the 
\"first title in franchise history\" context could add value since it's historically significant. No need to overcomplicate this unless 
they follow up with deeper questions about the series. \n\n...Though I wonder if they're testing if I know recent sports results? 
Either way, accurate and concise is best here. No signs this is homework – seems like casual curiosity. Just serve the facts 
cleanly.\n</think>....
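The same chat-completions request as the curl command above can be issued from Python with only the standard library (the model path matches the serve command; the server must be running at `localhost:8000`):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(question: str) -> urllib.request.Request:
    """Build the same chat-completions request the curl command sends."""
    payload = {
        "model": "/data/models/deepseek-ai/DeepSeek-R1-0528",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires the vllm serve instance above to be up.
    with urllib.request.urlopen(
        build_request("Who won the world series in 2023?")
    ) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```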

Correctness

lm_eval \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,pretrained=/data/models/deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8,add_bos_token=true,trust_remote_code=true" \
  --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size 64
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.976|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.972|±  |0.0105|

Benchmark

vllm bench serve \
  --host localhost \
  --port 8000 \
  --model /data/models/deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 64 \
  --num-prompts 256 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --ignore-eos

ROCM_AITER_TRITON_MLA

============ Serving Benchmark Result ============
Successful requests:                     256
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  126.35
Total input tokens:                      261888
Total generated tokens:                  262144
Request throughput (req/s):              2.03
Output token throughput (tok/s):         2074.70
Peak output token throughput (tok/s):    2304.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          4147.37
---------------Time to First Token----------------
Mean TTFT (ms):                          1971.39
Median TTFT (ms):                        1965.70
P99 TTFT (ms):                           3307.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.94
Median TPOT (ms):                        29.04
P99 TPOT (ms):                           30.26
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.94
Median ITL (ms):                         28.58
P99 ITL (ms):                            32.11
----------------End-to-end Latency----------------
Mean E2EL (ms):                          31580.53
Median E2EL (ms):                        31330.76
P99 E2EL (ms):                           32561.38
==================================================

ROCM_AITER_MLA

============ Serving Benchmark Result ============
Successful requests:                     256
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  125.53
Total input tokens:                      261888
Total generated tokens:                  262144
Request throughput (req/s):              2.04
Output token throughput (tok/s):         2088.33
Peak output token throughput (tok/s):    2304.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          4174.63
---------------Time to First Token----------------
Mean TTFT (ms):                          1655.11
Median TTFT (ms):                        1503.20
P99 TTFT (ms):                           2111.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.05
Median TPOT (ms):                        29.15
P99 TPOT (ms):                           30.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.05
Median ITL (ms):                         28.71
P99 ITL (ms):                            29.96
----------------End-to-end Latency----------------
Mean E2EL (ms):                          31375.56
Median E2EL (ms):                        31374.27
P99 E2EL (ms):                           31465.52
==================================================
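Comparing the two runs above, the new `ROCM_AITER_TRITON_MLA` backend is within about 0.7% of `ROCM_AITER_MLA` on total token throughput and essentially tied on TPOT, with the main gap in mean TTFT. A quick sanity check of those deltas (numbers copied from the tables above):

```python
# Headline numbers copied from the two benchmark tables above.
results = {
    "ROCM_AITER_TRITON_MLA": {
        "total_tok_s": 4147.37, "mean_ttft_ms": 1971.39, "mean_tpot_ms": 28.94,
    },
    "ROCM_AITER_MLA": {
        "total_tok_s": 4174.63, "mean_ttft_ms": 1655.11, "mean_tpot_ms": 29.05,
    },
}

def rel_delta(new_val: float, base: float) -> float:
    """Relative change of new_val vs. base, in percent."""
    return (new_val - base) / base * 100.0

new, base = results["ROCM_AITER_TRITON_MLA"], results["ROCM_AITER_MLA"]
print(f"throughput delta: {rel_delta(new['total_tok_s'], base['total_tok_s']):+.2f}%")
print(f"mean TTFT delta:  {rel_delta(new['mean_ttft_ms'], base['mean_ttft_ms']):+.2f}%")
print(f"mean TPOT delta:  {rel_delta(new['mean_tpot_ms'], base['mean_tpot_ms']):+.2f}%")
```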

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@mergify bot added the rocm (Related to AMD ROCm) and v1 labels Nov 14, 2025
@maleksan85 maleksan85 marked this pull request as ready for review November 14, 2025 17:53

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Aleksandr Malyshev added 2 commits November 14, 2025 23:36
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Aleksandr Malyshev added 2 commits November 19, 2025 00:41
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@maleksan85 changed the title from "Upstreaming aiter triton attention backend along with current triton one" to "Upstreaming aiter triton attention backend as a new backend" Nov 19, 2025
@gshtras added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 19, 2025
@gshtras gshtras enabled auto-merge (squash) November 19, 2025 17:58
@gshtras gshtras merged commit ac10fd3 into vllm-project:main Nov 19, 2025
48 checks passed
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…ject#28701)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…ject#28701)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
