Prefix caching and deallocation mechanism #2511
jadielam wants to merge 18 commits into vllm-project:main
Conversation
@zhuohan123 I have added the prefix caching and deallocation of prefixes in this PR. Feel free to take a look. I will keep adding test cases, but I think it is in a state where we can start the review.
Force-pushed from 2a6ee48 to 6bf3c45
Pardon the long comment that follows, but I am trying to be as precise and thorough as possible, striving for a solution that is correct. Below I describe two alternative methodologies for deallocating prefixes; in this PR we are implementing solution 1.
vllm/config.py
Outdated
How about profiling the maximum capacity from the currently available GPU memory if the user doesn't set this parameter, and raising an error when the server cannot afford a larger maximum capacity?
@esmeetu I agree with you. I think we should be more sophisticated: instead of measuring prefix_pool_max_capacity in terms of the number of prefixes the prefix pool can contain, we should measure the max capacity in terms of the percentage of GPU memory we are willing to dedicate to the prefix pool.
This makes the implementation of the PrefixPool class a little more involved, but all the complexity can be contained inside that class, which is a good thing.
For now I am going to keep it the way it is, and in a future PR I can introduce that improvement. I will move fast on it: as soon as this one is accepted and merged (if accepted), I will move on to implementing that more sophisticated approach.
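To make the idea concrete, here is a minimal sketch of how a GPU-memory fraction could be converted into a block budget for the pool. All names (`prefix_pool_block_budget`, its parameters) are hypothetical, not vLLM's actual API; the KV-cache sizing formula assumes one key and one value tensor per layer.

```python
def prefix_pool_block_budget(
    gpu_memory_bytes: int,
    pool_memory_fraction: float,
    num_layers: int,
    num_heads: int,
    head_dim: int,
    block_size: int,
    dtype_bytes: int = 2,  # fp16 by default
) -> int:
    """Return how many KV-cache blocks the prefix pool may hold.

    Hypothetical helper: converts a user-facing "fraction of GPU memory"
    knob into the block count the pool implementation actually tracks.
    """
    # Bytes one block occupies: key + value tensors across all layers.
    bytes_per_block = 2 * num_layers * num_heads * head_dim * block_size * dtype_bytes
    budget_bytes = int(gpu_memory_bytes * pool_memory_fraction)
    return budget_bytes // bytes_per_block
```

With this shape, the percentage-based knob stays in the constructor while the rest of the class keeps reasoning purely in blocks.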
@esmeetu The last commit has changed the PrefixPool implementation to use number of blocks instead of number of prefixes.
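For readers following along, a block-capped prefix pool with least-recently-used deallocation might look roughly like the sketch below. This is illustrative only, not the PR's actual implementation: the real class would also track block-table state and coordinate with the scheduler, and the eviction policy here is a plain LRU assumption.

```python
from collections import OrderedDict
from typing import Optional, Tuple

class PrefixPool:
    """Sketch: cache prefixes keyed by their token ids, capped by a total
    block budget, evicting the least recently used prefix when full."""

    def __init__(self, block_size: int, max_blocks: int) -> None:
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.used_blocks = 0
        # Maps prefix token ids -> block count; insertion order = LRU order.
        self._prefixes: "OrderedDict[Tuple[int, ...], int]" = OrderedDict()

    def _truncate(self, token_ids: Tuple[int, ...]) -> Tuple[int, ...]:
        # Only whole blocks are cacheable; a partial trailing block is dropped.
        num_blocks = len(token_ids) // self.block_size
        return tuple(token_ids[: num_blocks * self.block_size])

    def add(self, token_ids: Tuple[int, ...]) -> bool:
        key = self._truncate(token_ids)
        if not key:
            return False  # shorter than one block: nothing to cache
        if key in self._prefixes:
            self._prefixes.move_to_end(key)  # refresh LRU position
            return True
        needed = len(key) // self.block_size
        if needed > self.max_blocks:
            return False
        # Deallocate least recently used prefixes until the new one fits.
        while self.used_blocks + needed > self.max_blocks:
            _, freed = self._prefixes.popitem(last=False)
            self.used_blocks -= freed
        self._prefixes[key] = needed
        self.used_blocks += needed
        return True

    def get(self, token_ids: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
        key = self._truncate(token_ids)
        if key in self._prefixes:
            self._prefixes.move_to_end(key)
            return key
        return None
```

The point of the block-based cap is visible in `add`: capacity is accounted in the same unit the KV-cache allocator uses, so a few long prefixes can exhaust the pool just as surely as many short ones.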
Force-pushed from 574d9a8 to 7685f71
These issues have been resolved by doing the following in the
Force-pushed from f19a2f2 to 0fd470c
@WoosukKwon @simon-mo @zhuohan123 I think this PR is ready for review. All issues regarding memory leaks and sync with the master branch have been resolved.
vllm/core/scheduler.py
Outdated
Note to self: Remember to remove these comments when discussed
Force-pushed from 25c0c21 to 8f83072
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
This branch introduces caching of prefixes and deallocation of old prefixes.
I have already added some tests for the PrefixPool class and will be adding extra tests for the caching and deallocation mechanisms.