-
-
Notifications
You must be signed in to change notification settings - Fork 11.3k
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue #21005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue #21005
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
|
This pull request was exported from Phabricator. Differential Revision: D78292345 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant performance optimization to the FreeKVCacheBlockQueue by implementing a doubly linked list with sentinel nodes. This change effectively removes expensive __eq__ comparisons on KVCacheBlock dataclasses, which should improve performance as demonstrated by the new benchmark. The implementation is a classic and well-executed approach.
My review focuses on ensuring the robustness of this new implementation. I've identified a couple of areas where adding validation checks could prevent potential crashes from state inconsistencies, making the system more resilient. These changes should have a negligible performance impact while significantly improving debuggability and correctness guarantees.
574454a to
53e5f05
Compare
53e5f05 to
f0a6c84
Compare
|
This pull request was exported from Phabricator. Differential Revision: D78292345 |
1 similar comment
|
This pull request was exported from Phabricator. Differential Revision: D78292345 |
f0a6c84 to
36411c3
Compare
…project#21005) Summary: Pull Request resolved: vllm-project#21005 # Optimizations As a common trick for doubly linked list implementation, introducing fake head and tail nodes would significantly reduce the implementation overhead, and help us to get rid of dataclass.__eq__ comparison easily. - No dataclass.__eq__ invocation - Shorter code - Branchless All these combined should yield significant perf improvement for this piece of code. # Observations Per vLLM profiling, kv_cache_manager.allocate_slots consumed non-negligible cost for each prefill. |{F1980260529}|{F1980260481}|{F1980260497}| By zooming in, we could see the stack of FreeKVCacheBlockQueue.popleft is non-trivial. popleft -> remove -> string.__eq__ which is mainly coming from dataclasses (i.e. KVCacheBlock) equal comparison. Per [dataclasses python library doc](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) ``` dataclasses.dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False, match_args=True, kw_only=False, slots=False, weakref_slot=False) eq: If true (the default), an __eq__() method will be generated. This method compares the class as if it were a tuple of its fields, in order. Both instances in the comparison must be of the identical type. If the class already defines __eq__(), this parameter is ignored. ``` Test Plan: # Result Typically, block_size is set to 16, so in production usage, we might likely allocate 10-1000 blocks. In this range, the optimization gave us up to ~1ms TTFT savings (the improvements are more significant on long inputs). |After|Before| |{F1980286936}|{F1980286941}| Rollback Plan: Reviewed By: CuiCoco Differential Revision: D78292345 Signed-off-by: Jialin Ouyang <[email protected]>
36411c3 to
608f1f5
Compare
Signed-off-by: Jialin Ouyang <[email protected]>
Signed-off-by: Jialin Ouyang <[email protected]>
yeqcharlotte
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@JialinOuyang-Meta some graph in your test plan is broken could you fix it?
cc: @njhill @WoosukKwon could you take another look?
|
resolve #21141 |
Signed-off-by: Jialin Ouyang <[email protected]>
Signed-off-by: Jialin Ouyang <[email protected]>
njhill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JialinOuyang-Meta! I have a few small comments, perhaps could be done as a follow-on.
Sorry for the delay, I was on vacation for the last week.
| if not self.free_list_head: | ||
| if (self.fake_free_list_head.next_free_block | ||
| is self.fake_free_list_tail | ||
| or self.fake_free_list_head.next_free_block is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this second check needed? Would self.fake_free_list_head.next_free_block ever be None ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logically, it should NOT be needed. But this would make pyre happy, otherwise, it would complain about we can't assign Optional[KVCacheBlock] to KVCacheBlock in L256.
Just curious, what's the typically way in vLLM to suppress pyre without such extra checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using asserts or selective ignore directives
| if self.fake_free_list_tail.prev_free_block is None: | ||
| raise RuntimeError( | ||
| "prev_free_block of fake_free_list_tail should always exist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this check is needed, or it should be an assert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, only added to make pyre stop complaining against Optional[KVCacheBlock] -> KVCacheBlock assignments in L305.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine but it should be an assert in this case.
| if self.fake_free_list_head.next_free_block is None: | ||
| raise RuntimeError( | ||
| "next_free_block of fake_free_list_head should always exist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment. If we want integrity check here it should be an assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar reply as above. Would love to hear the best practice to suppress pyre without extra checks or computations.
And I will definitely address all your comments once I get some guidances on this point :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert should work the same way and is more correct since this is an integrity check.
| # Create a fake head and a tail block for the doubly linked list to | ||
| # reduce branching in the code | ||
| # | ||
| # The implementation garenteed that the fake head and tail | ||
| # are NEVER got popped, so we could safely assume each real blocks | ||
| # in the queue has prev and next blocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggested rewording of the comment
# Create a fake head tail blocks for the doubly linked list to
# reduce branching in the code.
#
# The implementation guarantees that the fake head and tail
# are NEVER popped, so we can safely assume each real block
# in the queue has prev and next blocks.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate that! Will address all your comments in a followup PR.
No worry. And appreciate your inputs! |
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]> Signed-off-by: x22x22 <[email protected]>
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]>
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]>
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]> Signed-off-by: Paul Pak <[email protected]>
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
…vllm-project#21005) Signed-off-by: Jialin Ouyang <[email protected]>
Summary:
Optimizations
As a common trick for doubly linked list implementation, introducing fake head and tail nodes would significantly reduce the implementation overhead, and help us to get rid of dataclass.eq comparison easily.
All these combined should yield significant perf improvement for this piece of code.
Observations
Per vLLM profiling, kv_cache_manager.allocate_slots consumed non-negligible cost for each prefill.

|{F1980260529}|{F1980260481}|{F1980260497}|
By zooming in, we could see the stack of FreeKVCacheBlockQueue.popleft is non-trivial. popleft -> remove -> string.eq which is mainly coming from dataclasses (i.e. KVCacheBlock) equal comparison.
Per dataclasses python library doc
Test Plan:
Result
Typically, block_size is set to 16, so in production usage, we might likely allocate 10-1000 blocks. In this range, the optimization gave us up to ~1ms TTFT savings (the improvements are more significant on long inputs).
Benchmark
After

|
Before
Stack
After


Before
Rollback Plan:
Reviewed By: CuiCoco
Differential Revision: D78292345