[Perf] Increase default max splits for FA3 full cudagraphs #25495
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Code Review
This pull request increases the default value for VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH from 16 to 64, aiming to improve performance for FA3 full cudagraphs. The change is applied consistently in both the type-checking block and the runtime environment variable definition. While the change itself is straightforward, I've identified a maintainability issue with the duplicated default value. I've left a comment suggesting to use a constant to avoid potential inconsistencies in the future.
vllm/envs.py
Outdated
  lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH",
-                       "16")),
+                       "64")),
The default value '64' is hardcoded here and also in the TYPE_CHECKING block at line 122. This duplication can lead to inconsistencies if the value is updated in one place but not the other. To improve maintainability and prevent potential bugs, consider defining this default value as a constant and referencing it in both locations. For example, you could add _DEFAULT_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH = 64 at the module level and use this constant here and at line 122.
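A minimal sketch of that suggestion, assuming a simplified stand-in for vllm/envs.py rather than the real module. The constant name comes from the comment above; the `environment_variables` dict layout and the `read_max_num_splits` helper are hypothetical and only illustrate the single-source-of-truth pattern.

```python
import os
from typing import TYPE_CHECKING

# Single source of truth for the default (name taken from the review comment;
# the value 64 is the one shown in the diff under review).
_DEFAULT_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH = 64

if TYPE_CHECKING:
    # Seen only by type checkers; reuses the constant instead of a second literal.
    VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = (
        _DEFAULT_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH)

# Assumed simplified layout: a mapping from variable name to a lazy reader.
environment_variables = {
    "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH":
    lambda: int(
        os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH",
                  str(_DEFAULT_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH))),
}


def read_max_num_splits() -> int:
    """Hypothetical helper: evaluate the lazy reader for this variable."""
    return environment_variables[
        "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH"]()
```

With this layout, changing the default means editing one line, and an explicit environment override still takes precedence over the constant.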
…ect#25495) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
#25274 provides evidence that 32 would be a much better default.
With full CUDA graphs (full-CG) potentially becoming the default in #25444, this seems like a good time to improve it.