[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule #7822
LGTM, some comments
fd9c7c7 to 020ac13
…er schedule (vllm-project#7822) Signed-off-by: Alvant <[email protected]>
Closes #7619
The investigation in #7619 showed that the root cause of block manager v2's low throughput with prefix caching is that it does not mark prefix cache hit blocks as computed right after scheduling a batch. Specifically, the life cycle of a prefix cache block is as follows:
Here is a simple illustration. Note that we assume each sequence is in a different batch.
Meanwhile, block manager v1 marks the block as computed right after the prefill is scheduled:
This PR fixes the issue by marking allocated blocks as touched and letting the scheduler mark them as computed once the batch is scheduled, which matches the behavior of block manager v1.
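The touched-then-computed idea can be sketched with a toy model. This is only an illustration under stated assumptions, not the code in this PR: `Block`, `ToyPrefixCache`, `allocate`, `mark_touched_as_computed`, and `schedule_batch` are hypothetical names, and blocks are identified by a plain integer hash.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    content_hash: int
    computed: bool = False


@dataclass
class ToyPrefixCache:
    """Hypothetical stand-in for a prefix-caching block allocator."""
    cached: Dict[int, Block] = field(default_factory=dict)
    touched: List[Block] = field(default_factory=list)

    def allocate(self, content_hash: int) -> Block:
        block = self.cached.get(content_hash)
        if block is None:
            block = Block(content_hash)
            self.cached[content_hash] = block
            # Newly allocated: record it as touched, but do not mark it
            # computed yet, so other sequences in the same batch still
            # compute it instead of skipping it.
            self.touched.append(block)
        return block

    def mark_touched_as_computed(self) -> None:
        # Called by the scheduler once the batch has been scheduled.
        for block in self.touched:
            block.computed = True
        self.touched.clear()


def schedule_batch(cache: ToyPrefixCache, batch: List[List[int]]) -> List[int]:
    """Return, per sequence, how many blocks were already computed."""
    hits = []
    for seq_hashes in batch:
        blocks = [cache.allocate(h) for h in seq_hashes]
        hits.append(sum(b.computed for b in blocks))
    # Mark right after scheduling, not when the sequence is freed.
    cache.mark_touched_as_computed()
    return hits


cache = ToyPrefixCache()
prompt = [101, 102, 103]
first_hits = schedule_batch(cache, [prompt, prompt])  # same batch: no reuse
second_hits = schedule_batch(cache, [prompt])         # next batch: full reuse
print(first_hits, second_hits)
```

In this sketch, two sequences sharing a prefix inside one batch both compute it, while any sequence in the following batch can reuse all three blocks without waiting for the earlier sequences to finish.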
Benchmark on L4
Command
cc @cadedaniel @rkooo567 @Yard1