[Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit
#22735
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request refactors the logic for handling TRTLLM attention and simplifies how the KV cache layout is determined. The caching of the environment variable check in use_trtllm_attention is a good performance improvement. The simplification of get_kv_cache_layout is also a positive change for maintainability. I've identified a couple of areas for improvement: an incorrect type hint in the new supports_trtllm_attention function, and a logic flaw in get_kv_cache_layout that can lead to misleading log messages. My review includes suggestions to address these points.
/gemini review
Code Review
This pull request introduces two main improvements. First, it refactors the logic for determining whether to use TensorRT-LLM attention by caching the environment-dependent checks. This is a good performance optimization as it avoids repeated lookups of environment variables and platform capabilities on every forward pass. The new supports_trtllm_attention function with @lru_cache is a clean way to achieve this. Second, it simplifies the get_kv_cache_layout function by removing a redundant check for VLLM_USE_TRTLLM_ATTENTION. The justification that this is handled by the _KV_CACHE_LAYOUT_OVERRIDE mechanism is sound and makes the code easier to understand. The new implementation also makes the priority of different configuration sources clearer. Overall, the changes are well-reasoned and improve both performance and maintainability.
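For readers unfamiliar with the pattern being reviewed, here is a minimal sketch of caching an environment-dependent check behind `functools.lru_cache`. It is not the actual vLLM code: the helper `_has_blackwell_gpu` and the exact env-var semantics are placeholders for illustration only.

```python
import os
from functools import lru_cache


def _has_blackwell_gpu() -> bool:
    # Hypothetical stand-in for the platform-capability probe; in vLLM the
    # real check consults the CUDA device's compute capability.
    return False


@lru_cache(maxsize=1)
def supports_trtllm_attention() -> bool:
    # Environment- and platform-dependent part of the decision. Neither the
    # env var nor the GPU capability changes after startup, so compute the
    # result once and reuse it on every subsequent call.
    env_value = os.environ.get("VLLM_USE_TRTLLM_ATTENTION")
    if env_value is not None:
        # An explicit user setting wins over auto-detection.
        return env_value == "1"
    return _has_blackwell_gpu()
```

Because `lru_cache` memoizes the boolean, repeated calls during forward passes cost only a cache lookup instead of re-reading the environment and querying the platform.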
mgoin left a comment
This makes sense to me, thanks for the refactor.
Need to fix merge conflict
This pull request has merge conflicts that must be resolved before it can be merged.
Head branch was pushed to by a user without write access
Force-pushed from 912ea26 to 6cf8c3c
done
Signed-off-by: NickLucche <[email protected]>
Force-pushed from 7835044 to 989ed93
Still seeing some OOMs in recent tests
LGTM, thanks for the clean check
Eventually, it may also make sense to not have a dependency on kv cache layout because trtllm natively supports both HND and NHD layouts. The cubins for NHD would have to be added though.
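For context on the HND/NHD distinction mentioned above, the sketch below shows the two paged-KV-cache block layouts under the commonly used naming (H = kv heads, N = tokens per block, D = head dim). The concrete shapes are illustrative assumptions, not values taken from this PR.

```python
import torch

# Illustrative block shapes only.
num_blocks, block_size, num_kv_heads, head_dim = 4, 16, 8, 128

# HND: heads-major layout within each cache block.
kv_hnd = torch.empty(num_blocks, num_kv_heads, block_size, head_dim)

# NHD: tokens-major layout of the same data.
kv_nhd = torch.empty(num_blocks, block_size, num_kv_heads, head_dim)

# Switching between the two is a transpose of the middle dimensions.
assert kv_hnd.transpose(1, 2).shape == kv_nhd.shape
```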
Let's try to keep `get_kv_cache_layout` as simple as possible; if we start adding if/else checks on backends, this is going to become a mess very quickly. We have two ways to set the kv cache layout right now:

- `envs.VLLM_KV_CACHE_LAYOUT`, which should allow the user to specify their preferred layout.
- `_KV_CACHE_LAYOUT_OVERRIDE`, for all other runtime use-cases that require force-setting a layout, with priority, in code.

I would argue we could probably get away with a single variable to model this behavior, but this PR maintains the two options for now, as well as the priority between the two.
Also, the layout should be set once during startup and then become read-only.
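Taken together, the two mechanisms and the startup-only requirement could be modeled roughly as below. This is only a sketch: it assumes the in-code override wins over the user env var (as the description suggests) and that "NHD" is the fallback default, which is not stated on this page.

```python
import os
from functools import lru_cache
from typing import Optional

# Set in code during startup by components that must force a layout
# (e.g. a backend whose kernels only support HND); None means no override.
_KV_CACHE_LAYOUT_OVERRIDE: Optional[str] = None


@lru_cache(maxsize=1)
def get_kv_cache_layout() -> str:
    # Resolved once and then effectively read-only thanks to the cache.
    # Assumed priority: in-code override > user env var > default.
    if _KV_CACHE_LAYOUT_OVERRIDE is not None:
        return _KV_CACHE_LAYOUT_OVERRIDE
    return os.environ.get("VLLM_KV_CACHE_LAYOUT", "NHD")
```

The `lru_cache` makes the result read-only after the first call, matching the "set once during startup" requirement, provided any override is installed before anything queries the layout.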
What this PR does is move `VLLM_USE_TRTLLM_ATTENTION` out of `get_kv_cache_layout`. As this variable is only used within FA on B200, and FA already forces the layout to HND for B200 here https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L281C21-L281C40, the check in `get_kv_cache_layout` should be redundant.
cc @mgoin for checking correctness here.
The second change is a small refactor of `use_trtllm_attention`: since it's called on every forward pass, we avoid checking the environment state every single time and just cache it. A cleaner refactor will probably be due once things are more stable; in particular, we should probably extend it to support multiple, just like the Hybrid mem allocator. A sketch of the caching split follows below.
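A rough sketch of the split described above: the env-dependent bit is computed once behind `lru_cache`, while the per-forward function keeps only the cheap request-dependent checks. The helper name `_trtllm_env_enabled` and the specific per-call conditions are hypothetical, not the actual vLLM logic.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def _trtllm_env_enabled() -> bool:
    # Environment state cannot change after startup, so read it exactly once.
    value = os.environ.get("VLLM_USE_TRTLLM_ATTENTION")
    return value is None or value == "1"


def use_trtllm_attention(num_qo_heads: int, num_kv_heads: int,
                         is_prefill: bool) -> bool:
    # Called on every forward pass: only request-dependent checks run here;
    # the env-dependent bit is served from the cached helper above.
    if not _trtllm_env_enabled():
        return False
    # Hypothetical per-call conditions, purely for illustration.
    return not is_prefill and num_qo_heads % num_kv_heads == 0
```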
cc @tdoublep @LucasWilkinson