[DSv4] Tune default value of VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD #41526
Code Review
This pull request updates the default value for VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD from 4096 to 1024 in vllm/envs.py. A review comment points out that the existing documentation comments on lines 1689-1690 are now outdated and should be updated to align with the new threshold and its rationale.
      # is ~4096; B200 (132 SMs) is expected ~3072.
      "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD": lambda: int(
-         os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "4096")
+         os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "1024")
The documentation comment on lines 1689-1690 is now outdated and contradicts the new default value. It states that the empirical crossover is ~4096 for B300 and ~3072 for B200, which no longer aligns with the decision to set the default to 1024 based on the new empirical data provided in this PR. Please update the comment to reflect the current findings and the rationale for the 1024 threshold to avoid confusing users and developers.
…#41526) Co-authored-by: Copilot <copilot@github.com>
…vllm-project#41526) Co-authored-by: Copilot <copilot@github.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>
…vllm-project#41526) Co-authored-by: Copilot <copilot@github.com>
…vllm-project#41526) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Empirically, 1024 is a better default value for turning on multi-stream. In a disaggregated (disagg) setting, this env var already disables multi-stream on prefill nodes and enables it on decode nodes (since it's rare for one rank to handle > 1024 requests), which is the intended behavior.
For an aggregated (agg) setup, data on Blackwell indicates that 1024 is a better default than 4096.
GB300 DEP4, 8k/1k
GB200 DEP8, 8k/1k
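
To make the gating described above concrete, here is a minimal sketch of how a token threshold like this one could decide whether multi-stream is used. Only the `os.getenv` lookup and the `"1024"` default come from the diff; the helper names `get_threshold` and `should_use_multi_stream` are hypothetical, and the `<=` comparison is an assumption inferred from the description (small decode batches enable multi-stream, large prefill batches do not). The actual check in vLLM may differ.

```python
import os

ENV_NAME = "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD"
DEFAULT_THRESHOLD = "1024"  # new default introduced by this PR


def get_threshold() -> int:
    # Same lookup shape as the lambda in vllm/envs.py shown in the diff.
    return int(os.getenv(ENV_NAME, DEFAULT_THRESHOLD))


def should_use_multi_stream(num_tokens: int) -> bool:
    # ASSUMPTION: multi-stream is used when the batch is at or below the
    # threshold. Whether vLLM uses `<=` or `<` is not stated in the PR.
    return num_tokens <= get_threshold()


if __name__ == "__main__":
    os.environ.pop(ENV_NAME, None)        # fall back to the default
    print(should_use_multi_stream(256))   # decode-sized batch
    print(should_use_multi_stream(8192))  # prefill-sized batch

    os.environ[ENV_NAME] = "4096"         # restore the previous default
    print(should_use_multi_stream(2048))
```

Because the value is read from the environment, setting `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD=4096` before launch should restore the previous default for deployments where the old crossover still fits better.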