
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule #7822

Merged: 3 commits merged into vllm-project:main on Aug 26, 2024

Conversation

@comaniac (Collaborator) commented on Aug 23, 2024

Closes #7619

Per the investigation in #7619, the root cause of block manager v2's low throughput with prefix caching is that v2 does not mark prefix-cache-hit blocks as computed right after a batch is scheduled. Specifically, the life cycle of a prefix cache block is as follows:

  1. The block is allocated by the first sequence of a batch. At this moment it is added to "cached blocks" but not marked as computed; otherwise the remaining sequences in the same batch would skip the computation of this block and produce incorrect output.
  2. When the sequence finishes (prefill + decode), its blocks are freed and added to the evictor.
  3. When a sequence in a following batch allocates the same block, the block is reactivated from the evictor and marked as computed.

Here is a simple illustration; note that we assume each sequence is in a different batch.

seq 1: [allocate-block-uncomputed] -- [prefill] --[decode1] --  ... -- [decodeN] -- [free-block]
seq 2:                                [allocate-block-uncomputed] -- ...
...
seq N:                                                                                          [allocate-block-computed] -- ...

Meanwhile, block manager v1 marks the block as computed right after the prefill is scheduled:

seq 1: [allocate-block-uncomputed] -- [prefill] --[decode1] --  ... -- [decodeN] -- [free-block]
seq 2:                                [allocate-block-computed] -- ...
...
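To make the difference between the two policies concrete, here is a toy, runnable sketch (class and method names are hypothetical and do not match the actual vLLM code) in which three sequences share one prefix block; seq 2 arrives while seq 1 is still running, and seq N arrives after the block has been freed:

```python
from typing import Dict, Set, Tuple

class ToyAllocator:
    """Toy prefix-caching allocator (illustrative only, not the vLLM API)."""

    def __init__(self, mark_after_schedule: bool):
        self.mark_after_schedule = mark_after_schedule
        self.cached: Dict[str, int] = {}    # content hash -> block id
        self.refcount: Dict[int, int] = {}  # block id -> active references
        self.computed: Set[int] = set()     # blocks whose KV entries are valid
        self.evictor: Set[int] = set()      # fully freed blocks kept for reuse
        self._next_id = 0

    def allocate(self, content_hash: str) -> Tuple[int, bool]:
        """Return (block_id, can_skip_prefill_for_this_block)."""
        if content_hash in self.cached:
            bid = self.cached[content_hash]
            if bid in self.evictor:
                self.evictor.discard(bid)
                # Old v2 behavior: this is the *only* place a block
                # ever becomes computed.
                self.computed.add(bid)
            self.refcount[bid] = self.refcount.get(bid, 0) + 1
            return bid, bid in self.computed
        bid, self._next_id = self._next_id, self._next_id + 1
        self.cached[content_hash] = bid
        self.refcount[bid] = 1
        return bid, False

    def free(self, bid: int) -> None:
        self.refcount[bid] -= 1
        if self.refcount[bid] == 0:
            self.evictor.add(bid)  # content is kept; block may be revived

    def after_schedule(self, bid: int) -> None:
        # v1 / this PR: the scheduler marks the block as computed as soon
        # as the batch that fills it has been scheduled.
        if self.mark_after_schedule:
            self.computed.add(bid)


def simulate(mark_after_schedule: bool):
    alloc = ToyAllocator(mark_after_schedule)
    hits = []
    b1, h1 = alloc.allocate("shared-prefix")  # seq 1
    hits.append(h1)
    alloc.after_schedule(b1)
    b2, h2 = alloc.allocate("shared-prefix")  # seq 2, while seq 1 still runs
    hits.append(h2)
    alloc.after_schedule(b2)
    alloc.free(b1)
    alloc.free(b2)                            # both finish; block -> evictor
    b3, h3 = alloc.allocate("shared-prefix")  # seq N, after the free
    hits.append(h3)
    return hits

print("old v2 (mark on evictor reuse):     ", simulate(False))  # [False, False, True]
print("v1 / this PR (mark after schedule): ", simulate(True))   # [False, True, True]
```

Under the old v2 policy the shared block only becomes computed via the evictor path, so every sequence that overlaps the block's holder recomputes the prefix; marking right after scheduling makes it a cache hit from the second batch on.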

This PR fixes the issue by marking newly allocated blocks as touched and letting the scheduler mark them as computed right after scheduling, matching the behavior of block manager v1.
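A rough sketch of the mechanism, assuming illustrative class and method names rather than the exact vLLM interfaces: the allocator records newly filled blocks as "touched", and the scheduler promotes all touched blocks to "computed" once the batch is fixed, so only later batches treat them as hits.

```python
from typing import Set

class TouchedBlockTracker:
    """Illustrative stand-in for the prefix-caching allocator's bookkeeping."""

    def __init__(self) -> None:
        self.touched: Set[int] = set()
        self.computed: Set[int] = set()

    def allocate(self, block_id: int) -> None:
        # A freshly allocated full block is only "touched": sequences in the
        # *same* batch must still compute it, so it is not yet "computed".
        if block_id not in self.computed:
            self.touched.add(block_id)

    def mark_touched_as_computed(self) -> None:
        # Called by the scheduler once per scheduling step, after the batch
        # is formed. From the next batch on, these blocks are cache hits.
        self.computed |= self.touched
        self.touched.clear()


tracker = TouchedBlockTracker()

# Batch 1: two sequences share a prefix stored in blocks 0 and 1.
for block_id in (0, 1):
    tracker.allocate(block_id)
assert tracker.computed == set()    # batch 1 still computes these blocks

tracker.mark_touched_as_computed()  # scheduler, right after scheduling

# Batch 2: the same prefix is now a usable cache hit.
assert tracker.computed == {0, 1}
```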

Benchmark on L4

Command

python3 benchmarks/benchmark_prefix_caching.py \
    --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8 \
    --output-len 200 \
    --enable-prefix-caching \
    [--use-v2-block-manager]
| Branch | Block Manager | Warmup (s) | Processed (s) |
|--------|---------------|------------|---------------|
| main   | v1            | 14.5       | 13.4          |
| main   | v2            | 23.6       | 13.4          |
| PR     | v1            | 14.5       | 13.3          |
| PR     | v2            | 14.4       | 13.3          |

cc @cadedaniel @rkooo567 @Yard1


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@Yard1 (Collaborator) left a comment


LGTM, some comments

Review comments on vllm/core/block/prefix_caching_block.py (3 threads, all resolved)
@comaniac added the ready label on Aug 23, 2024
@comaniac merged commit 2deb029 into vllm-project:main on Aug 26, 2024
42 checks passed
@comaniac deleted the fix-v2-prefix-cache branch on Aug 26, 2024 at 18:24
triple-Mu pushed a commit to triple-Mu/vllm_official that referenced this pull request Sep 4, 2024
Jeffwan pushed a commit to aibrix/vllm that referenced this pull request Sep 19, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024