
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule #7822

Merged: 3 commits merged into vllm-project:main on Aug 26, 2024

Conversation

@comaniac (Collaborator) commented on Aug 23, 2024

Closes #7619

Per the investigation in #7619, the root cause of block manager v2's low throughput with prefix caching is that v2 does not mark prefix-cache-hit blocks as computed right after a batch is scheduled. Specifically, the life cycle of a prefix cache block is as follows:

  1. The block is allocated by the first sequence of a batch. At this moment it is added to "cached blocks" but not marked as computed; otherwise the remaining sequences in the same batch would skip the computation of this block and produce incorrect output.
  2. When the sequence finishes (prefill + decode), its blocks are freed and added to the evictor.
  3. When a sequence in a following batch allocates the same block, the block is reactivated from the evictor and marked as computed.

Here is a simple illustration; note that we assume each sequence is in a different batch.

seq 1: [allocate-block-uncomputed] -- [prefill] --[decode1] --  ... -- [decodeN] -- [free-block]
seq 2:                                [allocate-block-uncomputed] -- ...
...
seq N:                                                                                          [allocate-block-computed] -- ...

Meanwhile, block manager v1 marks the block as computed right after the prefill is scheduled:

seq 1: [allocate-block-uncomputed] -- [prefill] --[decode1] --  ... -- [decodeN] -- [free-block]
seq 2:                                [allocate-block-computed] -- ...
...
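To make the difference between the two policies concrete, here is a toy, runnable sketch (class and method names are hypothetical and do not match the actual vLLM code) in which three sequences share one prefix block; seq 2 arrives while seq 1 is still running, and seq N arrives after the block has been freed:

```python
from typing import Dict, Set, Tuple

class ToyAllocator:
    """Toy prefix-caching allocator (illustrative only, not the vLLM API)."""

    def __init__(self, mark_after_schedule: bool):
        self.mark_after_schedule = mark_after_schedule
        self.cached: Dict[str, int] = {}    # content hash -> block id
        self.refcount: Dict[int, int] = {}  # block id -> active references
        self.computed: Set[int] = set()     # blocks whose KV entries are valid
        self.evictor: Set[int] = set()      # fully freed blocks kept for reuse
        self._next_id = 0

    def allocate(self, content_hash: str) -> Tuple[int, bool]:
        """Return (block_id, can_skip_prefill_for_this_block)."""
        if content_hash in self.cached:
            bid = self.cached[content_hash]
            if bid in self.evictor:
                self.evictor.discard(bid)
                # Old v2 behavior: this is the *only* place a block
                # ever becomes computed.
                self.computed.add(bid)
            self.refcount[bid] = self.refcount.get(bid, 0) + 1
            return bid, bid in self.computed
        bid, self._next_id = self._next_id, self._next_id + 1
        self.cached[content_hash] = bid
        self.refcount[bid] = 1
        return bid, False

    def free(self, bid: int) -> None:
        self.refcount[bid] -= 1
        if self.refcount[bid] == 0:
            self.evictor.add(bid)  # content is kept; block may be revived

    def after_schedule(self, bid: int) -> None:
        # v1 / this PR: the scheduler marks the block as computed as soon
        # as the batch that fills it has been scheduled.
        if self.mark_after_schedule:
            self.computed.add(bid)


def simulate(mark_after_schedule: bool):
    alloc = ToyAllocator(mark_after_schedule)
    hits = []
    b1, h1 = alloc.allocate("shared-prefix")  # seq 1
    hits.append(h1)
    alloc.after_schedule(b1)
    b2, h2 = alloc.allocate("shared-prefix")  # seq 2, while seq 1 still runs
    hits.append(h2)
    alloc.after_schedule(b2)
    alloc.free(b1)
    alloc.free(b2)                            # both finish; block -> evictor
    b3, h3 = alloc.allocate("shared-prefix")  # seq N, after the free
    hits.append(h3)
    return hits

print("old v2 (mark on evictor reuse):     ", simulate(False))  # [False, False, True]
print("v1 / this PR (mark after schedule): ", simulate(True))   # [False, True, True]
```

Under the old v2 policy the shared block only becomes computed via the evictor path, so every sequence that overlaps the block's holder recomputes the prefix; marking right after scheduling makes it a cache hit from the second batch on.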

This PR fixes the issue by marking newly allocated blocks as touched and letting the scheduler mark them as computed right after scheduling, matching the behavior of block manager v1.
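A rough sketch of the mechanism, assuming illustrative class and method names rather than the exact vLLM interfaces: the allocator records newly filled blocks as "touched", and the scheduler promotes all touched blocks to "computed" once the batch is fixed, so only later batches treat them as hits.

```python
from typing import Set

class TouchedBlockTracker:
    """Illustrative stand-in for the prefix-caching allocator's bookkeeping."""

    def __init__(self) -> None:
        self.touched: Set[int] = set()
        self.computed: Set[int] = set()

    def allocate(self, block_id: int) -> None:
        # A freshly allocated full block is only "touched": sequences in the
        # *same* batch must still compute it, so it is not yet "computed".
        if block_id not in self.computed:
            self.touched.add(block_id)

    def mark_touched_as_computed(self) -> None:
        # Called by the scheduler once per scheduling step, after the batch
        # is formed. From the next batch on, these blocks are cache hits.
        self.computed |= self.touched
        self.touched.clear()


tracker = TouchedBlockTracker()

# Batch 1: two sequences share a prefix stored in blocks 0 and 1.
for block_id in (0, 1):
    tracker.allocate(block_id)
assert tracker.computed == set()    # batch 1 still computes these blocks

tracker.mark_touched_as_computed()  # scheduler, right after scheduling

# Batch 2: the same prefix is now a usable cache hit.
assert tracker.computed == {0, 1}
```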

Benchmark on L4

Command

python3 benchmarks/benchmark_prefix_caching.py \
    --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8 \
    --output-len 200 \
    --enable-prefix-caching \
    [--use-v2-block-manager]
| Branch | Block Manager | Warmup (s) | Processed (s) |
|--------|---------------|------------|---------------|
| main   | v1            | 14.5       | 13.4          |
| main   | v2            | 23.6       | 13.4          |
| PR     | v1            | 14.5       | 13.3          |
| PR     | v2            | 14.4       | 13.3          |

cc @cadedaniel @rkooo567 @Yard1


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@Yard1 (Collaborator) left a comment


LGTM, some comments

Review comments on vllm/core/block/prefix_caching_block.py (3 threads, all resolved)
@comaniac added the ready label on Aug 23, 2024
@comaniac merged commit 2deb029 into vllm-project:main on Aug 26, 2024
42 checks passed
@comaniac deleted the fix-v2-prefix-cache branch on Aug 26, 2024 at 18:24
triple-Mu pushed a commit to triple-Mu/vllm_official that referenced this pull request Sep 4, 2024
Jeffwan pushed a commit to aibrix/vllm that referenced this pull request Sep 19, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024