Merged
8 changes: 4 additions & 4 deletions vllm/envs.py
@@ -245,7 +245,7 @@
 VLLM_DEBUG_WORKSPACE: bool = False
 VLLM_DISABLE_SHARED_EXPERTS_STREAM: bool = False
 VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD: int = 256
-VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD: int = 4096
+VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD: int = 1024
 VLLM_COMPILE_CACHE_SAVE_FORMAT: Literal["binary", "unpacked"] = "binary"
 VLLM_USE_V2_MODEL_RUNNER: bool = False
 VLLM_LOG_MODEL_INSPECTION: bool = False
@@ -1686,10 +1686,10 @@ def _get_or_set_default() -> str:
     # tokens the FP8 main GEMM has idle SMs to share with the bf16 aux GEMMs
     # and overlap is a 5-45% win; above it the FP8 GEMM saturates the device
     # and the cross-stream sync becomes pure overhead. Set to 0 to disable
-    # the multi-stream path entirely. Empirical crossover on B300 (148 SMs)
-    # is ~4096; B200 (132 SMs) is expected ~3072.
+    # the multi-stream path entirely. See PR #41526 for the empirical
+    # results behind the default value of 1024 tokens.
     "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD": lambda: int(
-        os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "4096")
+        os.getenv("VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", "1024")
Contributor (severity: high)
The documentation comment on lines 1689-1690 is now outdated and contradicts the new default value. It states that the empirical crossover is ~4096 for B300 and ~3072 for B200, which no longer aligns with the decision to set the default to 1024 based on the new empirical data provided in this PR. Please update the comment to reflect the current findings and the rationale for the 1024 threshold to avoid confusing users and developers.

     ),
     # Format for saving torch.compile cache artifacts
     # - "binary": saves as binary file