[Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (#27368)
LucasWilkinson merged 2 commits into vllm-project:main
Conversation
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Code Review
This pull request correctly fixes a bug in the FlashMLA metadata builder for decode scenarios with a query length greater than one. The change properly calculates num_q_tokens_per_head_k and passes it to get_mla_metadata, which resolves the performance degradation and crashes noted in the description. The provided benchmarks clearly demonstrate the significant speedup achieved by this fix. The implementation is correct and well-targeted. Overall, this is an excellent and important bug fix.
mgoin
left a comment
Is there an eval we can run to validate this? I assume we could do deepseek with mtp enabled
LucasWilkinson
left a comment
LGTM; thanks for tracking this down!
nit: can you add a small note that we use the max, but all the query lens should be the same
@mgoin will do!
LucasWilkinson
left a comment
LGTM (assuming evals pass; don't merge till then, but I don't see any reason they won't)
@mgoin @LucasWilkinson confirmed evals look good.
…lm-project#27368) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…lm-project#27368) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Purpose
As of #26541, FlashMLA supports `q_len > 1` in the decode pipeline. The `get_mla_metadata` call was not updated to match, however, leading to poor performance (and, potentially, crashes) in these cases. This PR is a simple bug fix that achieves a substantial speedup, especially at small batch sizes.

Note: uses the benchmarks in #26835 (not yet merged)
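For context, the core of the fix can be sketched as below. This is a minimal illustration, not the actual vLLM call site: `num_q_tokens_per_head_k` mirrors the argument name this PR passes to `get_mla_metadata`, while the head counts used in the example are hypothetical values.

```python
def num_q_tokens_per_head_k(q_len: int, num_q_heads: int, num_kv_heads: int) -> int:
    # Before this fix, the metadata builder implicitly assumed q_len == 1,
    # so for q_len > 1 the FlashMLA kernel was scheduled with too little
    # work per KV head, causing the slowdown described above.
    return q_len * num_q_heads // num_kv_heads

# MLA decode: many query heads share a single latent KV head.
print(num_q_tokens_per_head_k(1, 128, 1))  # 128: old implicit assumption
print(num_q_tokens_per_head_k(2, 128, 1))  # 256: q_len > 1 (e.g. MTP / spec decode)
```

Folding `q_len` into the ratio is what lets the scheduler provision enough query tokens per KV head when each decode request carries more than one query token.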
cc @LucasWilkinson
Test Plan
python benchmarks/attention_benchmarks/benchmark.py --config benchmarks/attention_benchmarks/configs/flashmla_bugfix_demo.yaml

Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.