[Core] [Bugfix] Refactor block manager subsystem for better testability #3492
youkaichao merged 96 commits into vllm-project:main
Conversation
@simon-mo ready for review
```python
# Now that the batch has been created, we can assume all blocks in the
# batch will have been computed before the next scheduling invocation.
for seq_group in scheduler_outputs.scheduled_seq_groups:
```
This assumes the model execution won't fail and there's no retry. Is that the general assumption in the engine now?
Add this assumption to the comments? What happens to later preemption?
comment added

> What happens to later preemption?

these lines do not change prefix caching behavior wrt later preemption (either case requires recomputing the blocks)
oh btw @rkooo567 this assumption is not introduced in this PR; the scheduler already assumes that its schedule is executed by the engine. Retries and failures are not yet in scope for vLLM.
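The invariant discussed above can be sketched in a few lines. This is an illustrative stand-in, not vLLM's actual classes: `Block` and `mark_batch_blocks_computed` are hypothetical names showing why no per-sequence-group bookkeeping is needed once a batch is scheduled.

```python
# Sketch of the assumption above: after scheduling, every block touched by
# the batch can be marked "computed", because the engine is assumed to
# execute the schedule without retries or failures.

class Block:
    def __init__(self, block_id: int) -> None:
        self.block_id = block_id
        self.computed = False

def mark_batch_blocks_computed(scheduled_blocks: list[Block]) -> None:
    # No per-sequence-group tracking: the whole batch is assumed to have
    # been computed before the next scheduling invocation.
    for block in scheduled_blocks:
        block.computed = True

blocks = [Block(i) for i in range(4)]
mark_batch_blocks_computed(blocks)
# every block in the batch is now flagged computed
```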
simon-mo left a comment
Thanks for the great work. The new structure is looking great. I don't have major issues, so I'm approving. I left some comments, mostly due to my confusion when reading the code, which I imagine others will run into as well.
tests/core/block/e2e/conftest.py
Outdated
```python
def generator_outer():
    for llm in generator_inner():
        yield llm
        del llm
```
why do we need another level of wrapper? would it work without it? if not, please add a comment explaining why
I see the usage of the llm generator below, but I'm still confused, since we are only yielding one llm instance
oh good catch, not necessary
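The point above can be shown with a minimal sketch: a single generator already yields the one LLM instance and runs teardown after the consumer finishes, so the `generator_outer` wrapper adds nothing. `LLM` here is a hypothetical stand-in class; in the real conftest this generator would be wrapped with `@pytest.fixture`.

```python
# Stand-in for the real LLM object, just to show lifecycle.
class LLM:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

def generator_inner():
    llm = LLM()
    yield llm        # the test body runs while we are suspended here
    llm.close()      # teardown once the consumer resumes/exhausts us

# Iterating the inner generator directly is equivalent to wrapping it in
# generator_outer: both yield exactly one LLM instance.
for llm in generator_inner():
    pass  # test body would go here; llm.close() runs when the loop resumes
```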
vllm/core/block/interfaces.py
Outdated
```python
    pass


class DeviceAwareBlockAllocator(ABC):
```
I do wonder whether this can inherit from BlockAllocator somehow, so we are only re-defining allocate_mutable and allocate_immutable.
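One way the inheritance idea above could look, as a hedged sketch: let `DeviceAwareBlockAllocator` extend `BlockAllocator` and re-declare only the allocation methods whose signatures gain a `device` argument. The interfaces below are simplified stand-ins, not vLLM's actual definitions.

```python
from abc import ABC, abstractmethod
from enum import Enum

class Device(Enum):
    GPU = 0
    CPU = 1

class BlockAllocator(ABC):
    @abstractmethod
    def allocate_mutable(self):
        ...

    @abstractmethod
    def free(self, block) -> None:
        ...

class DeviceAwareBlockAllocator(BlockAllocator):
    # Only the allocation methods are re-declared with a device parameter;
    # free() is inherited unchanged from BlockAllocator.
    @abstractmethod
    def allocate_mutable(self, device: Device):
        ...
```

The trade-off is that the device-aware signature no longer matches the base one, which is why the PR may have kept the two interfaces separate.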
```python
    return self.lora_request.lora_int_id if self.lora_request else 0


def hash_of_block(self, logical_idx: int) -> int:
    # TODO This can produce incorrect hash when block size > prompt size
```
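A rough sketch of the content-hashing scheme the TODO above refers to, with illustrative names and a free-function signature rather than vLLM's method: a block's hash covers all token ids from the start of the sequence through the end of that block, so equal prefixes map to equal hashes. When the block size exceeds the prompt length, the first block is never actually filled, yet it would be hashed as if complete, which is presumably the incorrectness the TODO flags.

```python
def hash_of_block(token_ids: list[int], block_size: int, logical_idx: int) -> int:
    # Hash the whole prefix ending at this block, so sequences sharing a
    # prefix share block hashes (the basis for prefix caching).
    num_tokens = block_size * (logical_idx + 1)
    prefix = tuple(token_ids[:num_tokens])
    return hash(prefix)
```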
```python
def access_all_blocks_in_seq(self, seq, now):
    pass
```
```python
def mark_blocks_as_computed(self, seq_group: SequenceGroup):
    # We ignore the sequence group as it's not necessary. After the batch is
    # formed by the scheduler, we do not need to mark blocks from individual
    # sequence groups as computed -- all blocks in the batch can be marked
    # as computed.
```
kinda -- the plumbing is here to validate the design, but prefix caching isn't tested e2e (need #3667)
```python
    caching.

    Args:
        create_block (Block.Factory): A factory function for creating new
```
can we parameterize the type somehow, so we constrain it to NaiveBlock? it only works there.
hm, open to more typing, but here we need it to return an unspecialized block. the prefix caching allocator will specify to the naive block allocator that it should construct prefix caching blocks instead of the default naive block.
I will add a comment
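The factory indirection described above can be sketched as follows. All names here (`NaiveAllocator`, `PrefixCachingBlock`, `BlockFactory`) are illustrative stand-ins for the interfaces in this PR: the naive allocator takes an unspecialized `Block` factory, and a prefix-caching layer reuses it by passing a factory that constructs its own block type instead.

```python
from typing import Callable, List

class Block:
    def __init__(self, token_ids: List[int]) -> None:
        self.token_ids = token_ids

class PrefixCachingBlock(Block):
    def __init__(self, token_ids: List[int]) -> None:
        super().__init__(token_ids)
        # Prefix-caching blocks additionally carry a content hash.
        self.content_hash = hash(tuple(token_ids))

# The factory deliberately returns an unspecialized Block; callers may
# narrow the concrete type it constructs.
BlockFactory = Callable[[List[int]], Block]

class NaiveAllocator:
    def __init__(self, create_block: BlockFactory = Block) -> None:
        self._create_block = create_block

    def allocate(self, token_ids: List[int]) -> Block:
        return self._create_block(token_ids)

# A prefix-caching layer swaps the factory without changing the allocator:
caching = NaiveAllocator(create_block=PrefixCachingBlock)
```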
```python
def free(self, block: Block) -> None:
    block_id = block.block_id
    block.block_id = None
```
why is this needed? please add a comment
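A hedged guess at the rationale being asked about, with illustrative stand-in classes rather than vLLM's real ones: clearing `block.block_id` on free poisons the handle, so any later use of the freed block fails fast instead of silently addressing a block id that may have been handed to another sequence.

```python
from typing import List, Optional

class Block:
    def __init__(self, block_id: int) -> None:
        self.block_id: Optional[int] = block_id

class Allocator:
    def __init__(self) -> None:
        self._free_ids: List[int] = []

    def free(self, block: Block) -> None:
        block_id = block.block_id
        # A second free (or any use after free) now trips this assert
        # instead of corrupting the free list.
        assert block_id is not None, "double free"
        block.block_id = None  # poison the handle
        self._free_ids.append(block_id)
```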
```python
Refcount = int


class NaiveBlockAllocator(BlockAllocator):
```
I think this should be called CoWBlockAllocator btw because it's actually quite powerful
I thought about this -- the conflict is that the prefix caching allocator also supports CoW (it isn't unique to this allocator).
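The refcount-driven copy-on-write that both allocators support, per the thread above, can be sketched like this. `CowTracker` and its methods are hypothetical names, not vLLM's API: a fork is just a refcount bump, and a write to a shared block (refcount > 1) triggers a copy into a fresh block, while an exclusively owned block is written in place.

```python
Refcount = int
BlockId = int

class CowTracker:
    def __init__(self) -> None:
        self._refcounts: dict = {}
        self._next_id: BlockId = 0

    def allocate(self) -> BlockId:
        block_id = self._next_id
        self._next_id += 1
        self._refcounts[block_id] = 1
        return block_id

    def fork(self, block_id: BlockId) -> BlockId:
        # Sharing is only a refcount bump; no data is copied yet.
        self._refcounts[block_id] += 1
        return block_id

    def cow_on_write(self, block_id: BlockId) -> BlockId:
        if self._refcounts[block_id] == 1:
            return block_id  # exclusive owner: write in place
        # Shared: release our reference and copy into a fresh block.
        self._refcounts[block_id] -= 1
        return self.allocate()
```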
Thanks for the review 🙏. Applying feedback
This PR refactors the block manager / allocator / prefix caching to make things easier to test. Concretely, it establishes the following layers (diagrams from https://docs.google.com/document/d/1ipAypZYfZgloP_08sLi1Z2BADHMauWScbT_H1F0ThCQ/edit?pli=1):
The interfaces are such that the BlockAllocator can be implemented via a NaiveBlockAllocator (e.g. no prefix caching) or a PrefixCachingBlockAllocator. The value of this approach is in its separation of concerns and consequent improved testability.
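The layering described above can be illustrated with a small sketch; the classes are simplified stand-ins for the PR's interfaces, showing how one `BlockAllocator` interface admits both a naive and a prefix-caching implementation so the block manager can be tested against either.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

class BlockAllocator(ABC):
    @abstractmethod
    def allocate_immutable(self, token_ids: List[int]) -> int:
        ...

class NaiveBlockAllocator(BlockAllocator):
    def __init__(self) -> None:
        self._next = 0

    def allocate_immutable(self, token_ids: List[int]) -> int:
        block_id = self._next  # always a fresh block; no content reuse
        self._next += 1
        return block_id

class PrefixCachingBlockAllocator(BlockAllocator):
    def __init__(self) -> None:
        self._next = 0
        self._cache: Dict[Tuple[int, ...], int] = {}

    def allocate_immutable(self, token_ids: List[int]) -> int:
        key = tuple(token_ids)
        if key not in self._cache:  # reuse blocks with identical content
            self._cache[key] = self._next
            self._next += 1
        return self._cache[key]
```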
This PR is missing the following features before BlockManagerV2 can become the default:
- computed bit

Testing

Design

Key APIs

Important interactions
- Swapping GPU/CPU
- Copy-on-write
- Prefix caching promotion