[DSv4] Tune default value of `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` by ywang96 · Pull Request #41526 · vllm-project/vllm

ywang96 · 2026-05-03T01:21:57Z

Empirically, 1024 is a better default value for turning on multi-stream. In a disaggregated (disagg) setting, this env var already disables multi-stream for prefill nodes and enables it for decode nodes (since it's rare to have one rank handling > 1024 requests), which is what we intend.
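A minimal sketch of the gating behavior described above, not the actual vLLM implementation. The function name `should_use_multi_stream` and the `<=` boundary are assumptions made for illustration; the only grounded parts are the threshold value and the idea that small (decode-like) batches take the multi-stream path while large (prefill-like) batches do not.

```python
def should_use_multi_stream(num_tokens: int, threshold: int = 1024) -> bool:
    """Hypothetical gate: use multi-stream only for small token counts.

    Assumption: the comparison is inclusive (<=). The real check in vLLM
    may use a different boundary or additional conditions.
    """
    return num_tokens <= threshold


# A decode-like step on one rank usually carries few tokens.
decode_tokens = 256
# A prefill-like step with an 8k prompt carries far more than the threshold.
prefill_tokens = 8192

print(should_use_multi_stream(decode_tokens))   # small batch: multi-stream on
print(should_use_multi_stream(prefill_tokens))  # large batch: multi-stream off
```

With a threshold of 1024, a disaggregated deployment would end up with multi-stream off on prefill ranks and on for decode ranks without any per-role configuration.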

For an aggregated (agg) setup, data on Blackwell indicates that 1024 is a better default than 4096.

GB300 DEP4, 8k/1k

  ┌───────────┬────────────────────┐
  │ Threshold │ Throughput (tok/s) │
  ├───────────┼────────────────────┤
  │ 1024      │ 25,881.9           │
  ├───────────┼────────────────────┤
  │ 256       │ 25,529.8           │
  ├───────────┼────────────────────┤
  │ 512       │ 24,351.5           │
  ├───────────┼────────────────────┤
  │ 2048      │ 23,443.1           │
  └───────────┴────────────────────┘

GB200 DEP8, 8k/1k

  ┌───────────┬────────────────────┐
  │ Threshold │ Throughput (tok/s) │
  ├───────────┼────────────────────┤
  │ 4096      │ 33,435.4           │
  ├───────────┼────────────────────┤
  │ 1024      │ 33,044.5           │
  ├───────────┼────────────────────┤
  │ 2048      │ 33,042.6           │
  ├───────────┼────────────────────┤
  │ 512       │ 32,085.7           │
  ├───────────┼────────────────────┤
  │ 128       │ 32,015.5           │
  ├───────────┼────────────────────┤
  │ 256       │ 31,323.8           │
  └───────────┴────────────────────┘
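To make the two tables easier to compare, the sketch below computes the best threshold per setup and how far a given threshold trails the best one. The throughput numbers are copied from the tables above; the helper names `best_threshold` and `gap_to_best_pct` are made up for this example.

```python
# Throughput (tok/s) by threshold, copied from the tables above.
RESULTS = {
    "GB300 DEP4": {1024: 25881.9, 256: 25529.8, 512: 24351.5, 2048: 23443.1},
    "GB200 DEP8": {
        4096: 33435.4,
        1024: 33044.5,
        2048: 33042.6,
        512: 32085.7,
        128: 32015.5,
        256: 31323.8,
    },
}


def best_threshold(results: dict[int, float]) -> int:
    """Return the threshold with the highest measured throughput."""
    return max(results, key=results.get)


def gap_to_best_pct(results: dict[int, float], threshold: int) -> float:
    """Percent by which `threshold` trails the best throughput in `results`."""
    best = max(results.values())
    return 100.0 * (best - results[threshold]) / best


for setup, results in RESULTS.items():
    print(
        f"{setup}: best={best_threshold(results)}, "
        f"1024 trails best by {gap_to_best_pct(results, 1024):.2f}%"
    )
```

On these numbers, 1024 is the top row for GB300 DEP4 and trails 4096 by roughly 1.2% on GB200 DEP8. Note that the GB300 table does not include a 4096 run.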


Co-authored-by: Copilot <copilot@github.com>


gemini-code-assist

Code Review

This pull request updates the default value for VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD from 4096 to 1024 in vllm/envs.py. A review comment points out that the existing documentation comments on lines 1689-1690 are now outdated and should be updated to align with the new threshold and its rationale.

gemini-code-assist · 2026-05-03T01:23:06Z

    # is ~4096; B200 (132 SMs) is expected ~3072.
    "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD": lambda: int(
-        os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "4096")
+        os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "1024")


The documentation comment on lines 1689-1690 is now outdated and contradicts the new default value. It states that the empirical crossover is ~4096 for B300 and ~3072 for B200, which no longer aligns with the decision to set the default to 1024 based on the new empirical data provided in this PR. Please update the comment to reflect the current findings and the rationale for the 1024 threshold to avoid confusing users and developers.
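For readers who want to see how the new default interacts with an explicit override, here is a self-contained sketch that mirrors the lazy-lambda pattern visible in the diff. It does not import vLLM; the dictionary name `environment_variables` and the helper `read_threshold` are assumptions for illustration only.

```python
import os

ENV_NAME = "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD"

# Stand-in for the env var table, following the lambda pattern in the diff.
# The value is read lazily, so changes to the environment made before the
# lookup are picked up.
environment_variables = {
    ENV_NAME: lambda: int(os.getenv(ENV_NAME, "1024")),
}


def read_threshold() -> int:
    """Evaluate the lambda and return the current threshold as an int."""
    return environment_variables[ENV_NAME]()


# Unset: falls back to the new default.
os.environ.pop(ENV_NAME, None)
print(read_threshold())

# Explicit override, e.g. to restore the previous default of 4096.
os.environ[ENV_NAME] = "4096"
print(read_threshold())

# Clean up so the example leaves the environment as it found it.
os.environ.pop(ENV_NAME, None)
```

In a real deployment the equivalent override would be to export the variable in the shell before launching the server, so setups that measured better with a larger threshold can keep it.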


…vllm-project#41526) Co-authored-by: Copilot <copilot@github.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>


…vllm-project#41526) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

add · 6a8068e

claude Bot reviewed May 3, 2026

gemini-code-assist Bot reviewed May 3, 2026

comment · 0e1a35b

ywang96 merged commit 856ec48 into main May 3, 2026
7 of 9 checks passed

ywang96 deleted the tune-threshold branch May 3, 2026 01:32

ywang96 added a commit that referenced this pull request May 3, 2026

[DSv4] Tune default value of VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD (…

f98b274

…#41526) Co-authored-by: Copilot <copilot@github.com>

chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026

[DSv4] Tune default value of VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD (…

24c7a40

…vllm-project#41526) Co-authored-by: Copilot <copilot@github.com>

ywang96 merged 2 commits into main from tune-threshold
