
Upstreaming aiter triton attention backend as a new backend #28701

Merged: gshtras merged 8 commits into vllm-project:main from ROCm:upstreaming_triton_mla on Nov 19, 2025
Conversation

@maleksan85 (Contributor) commented Nov 14, 2025

Adding a new Triton MLA backend to handle prefills in DeepSeek (DS) models.
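The new backend is selected via the `VLLM_ATTENTION_BACKEND` environment variable, as the server command below shows. As a rough sketch of how env-var driven backend dispatch works (this is an illustration only; the actual selection logic lives inside vLLM and differs in detail, and `pick_attention_backend` / `KNOWN_BACKENDS` are hypothetical names):

```python
import os

# Hypothetical sketch of env-var driven backend selection; vLLM's real
# selector handles many more backends and platform checks.
KNOWN_BACKENDS = {"ROCM_AITER_MLA", "ROCM_AITER_TRITON_MLA", "TRITON_MLA"}

def pick_attention_backend(default: str = "TRITON_MLA") -> str:
    """Return the backend named by VLLM_ATTENTION_BACKEND, else a default."""
    name = os.environ.get("VLLM_ATTENTION_BACKEND", default)
    if name not in KNOWN_BACKENDS:
        raise ValueError(f"unknown attention backend: {name}")
    return name

os.environ["VLLM_ATTENTION_BACKEND"] = "ROCM_AITER_TRITON_MLA"
print(pick_attention_backend())  # ROCM_AITER_TRITON_MLA
```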

Server command:

VLLM_DISABLE_COMPILE_CACHE=1 \
AMDGCN_USE_BUFFER_OPS=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ATTENTION_BACKEND=ROCM_AITER_TRITON_MLA \
VLLM_ROCM_USE_AITER_MLA=1 \
VLLM_ROCM_USE_TRITON_ROPE=1 \
vllm serve /data/models/deepseek-ai/DeepSeek-R1-0528 \
    --host localhost \
    --port 8000 \
    --swap-space 64 \
    --disable-log-requests \
    --dtype auto \
    --max-model-len 8192 \
    --tensor-parallel-size 8 \
    --max-num-seqs 1024  \
    --trust-remote-code \
    --block-size 1 \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 131072 \
    --no-enable-prefix-caching \
    --async-scheduling \
    --enforce-eager
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/data/models/deepseek-ai/DeepSeek-R1-0528",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2023?"}
    ]}'
<think>\nOkay, the user is asking about the 2023 World Series winner. Hmm, this seems like a straightforward sports fact 
question, but I should double-check because the 2023 MLB season actually concluded in late 2023 (fall classic) for the 2023 
championship title. \n\nWait, I recall the Texas Rangers won their first-ever championship last year against the Diamondbacks. 
Let me mentally verify the details: It was a 4-1 series, Corey Seager got MVP, and Bruce Bochy managed them. The clinching 
Game 5 was especially dramatic with that late-inning comeback. \n\nThe user might just want a quick answer, but including the 
\"first title in franchise history\" context could add value since it's historically significant. No need to overcomplicate this unless 
they follow up with deeper questions about the series. \n\n...Though I wonder if they're testing if I know recent sports results? 
Either way, accurate and concise is best here. No signs this is homework – seems like casual curiosity. Just serve the facts 
cleanly.\n</think>....
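The same chat-completions request as the curl command above can be issued from Python with only the standard library (the model path matches the serve command; the server must be running at `localhost:8000`):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(question: str) -> urllib.request.Request:
    """Build the same chat-completions request the curl command sends."""
    payload = {
        "model": "/data/models/deepseek-ai/DeepSeek-R1-0528",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires the vllm serve instance above to be up.
    with urllib.request.urlopen(
        build_request("Who won the world series in 2023?")
    ) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```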

Correctness

lm_eval \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,pretrained=/data/models/deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8,add_bos_token=true,trust_remote_code=true" \
  --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size 64
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.976|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.972|±  |0.0105|

Benchmark

vllm bench serve \
  --host localhost \
  --port 8000 \
  --model /data/models/deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 64 \
  --num-prompts 256 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --ignore-eos

ROCM_AITER_TRITON_MLA

============ Serving Benchmark Result ============
Successful requests:                     256
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  126.35
Total input tokens:                      261888
Total generated tokens:                  262144
Request throughput (req/s):              2.03
Output token throughput (tok/s):         2074.70
Peak output token throughput (tok/s):    2304.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          4147.37
---------------Time to First Token----------------
Mean TTFT (ms):                          1971.39
Median TTFT (ms):                        1965.70
P99 TTFT (ms):                           3307.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.94
Median TPOT (ms):                        29.04
P99 TPOT (ms):                           30.26
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.94
Median ITL (ms):                         28.58
P99 ITL (ms):                            32.11
----------------End-to-end Latency----------------
Mean E2EL (ms):                          31580.53
Median E2EL (ms):                        31330.76
P99 E2EL (ms):                           32561.38
==================================================

ROCM_AITER_MLA

============ Serving Benchmark Result ============
Successful requests:                     256
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  125.53
Total input tokens:                      261888
Total generated tokens:                  262144
Request throughput (req/s):              2.04
Output token throughput (tok/s):         2088.33
Peak output token throughput (tok/s):    2304.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          4174.63
---------------Time to First Token----------------
Mean TTFT (ms):                          1655.11
Median TTFT (ms):                        1503.20
P99 TTFT (ms):                           2111.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.05
Median TPOT (ms):                        29.15
P99 TPOT (ms):                           30.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.05
Median ITL (ms):                         28.71
P99 ITL (ms):                            29.96
----------------End-to-end Latency----------------
Mean E2EL (ms):                          31375.56
Median E2EL (ms):                        31374.27
P99 E2EL (ms):                           31465.52
==================================================
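Comparing the two runs above, the new `ROCM_AITER_TRITON_MLA` backend is within about 0.7% of `ROCM_AITER_MLA` on total token throughput and essentially tied on TPOT, with the main gap in mean TTFT. A quick sanity check of those deltas (numbers copied from the tables above):

```python
# Headline numbers copied from the two benchmark tables above.
results = {
    "ROCM_AITER_TRITON_MLA": {
        "total_tok_s": 4147.37, "mean_ttft_ms": 1971.39, "mean_tpot_ms": 28.94,
    },
    "ROCM_AITER_MLA": {
        "total_tok_s": 4174.63, "mean_ttft_ms": 1655.11, "mean_tpot_ms": 29.05,
    },
}

def rel_delta(new_val: float, base: float) -> float:
    """Relative change of new_val vs. base, in percent."""
    return (new_val - base) / base * 100.0

new, base = results["ROCM_AITER_TRITON_MLA"], results["ROCM_AITER_MLA"]
print(f"throughput delta: {rel_delta(new['total_tok_s'], base['total_tok_s']):+.2f}%")
print(f"mean TTFT delta:  {rel_delta(new['mean_ttft_ms'], base['mean_ttft_ms']):+.2f}%")
print(f"mean TPOT delta:  {rel_delta(new['mean_tpot_ms'], base['mean_tpot_ms']):+.2f}%")
```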

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@mergify bot added the rocm (Related to AMD ROCm) and v1 labels Nov 14, 2025
@maleksan85 maleksan85 marked this pull request as ready for review November 14, 2025 17:53

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Aleksandr Malyshev added 2 commits November 14, 2025 23:36
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Aleksandr Malyshev added 2 commits November 19, 2025 00:41
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@maleksan85 changed the title from "Upstreaming aiter triton attention backend along with current triton one" to "Upstreaming aiter triton attention backend as a new backend" Nov 19, 2025
@gshtras added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 19, 2025
@gshtras gshtras enabled auto-merge (squash) November 19, 2025 17:58
@gshtras gshtras merged commit ac10fd3 into vllm-project:main Nov 19, 2025
48 checks passed
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…ject#28701)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…ject#28701)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
