[Feature][PP] Support Dynamic Chunk for Chunked Pipeline Parallelism #7896

MengqingCao merged 14 commits into vllm-project:main from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request implements a dynamic chunk sizing feature for Chunked Pipeline Parallelism (CPP) to mitigate bubbles in long-sequence scenarios. By profiling prefill latency at startup and fitting a quadratic model, the scheduler can now predict and apply optimal chunk sizes at runtime. The solution includes online refinement capabilities to continuously improve prediction accuracy based on actual execution times, significantly enhancing load balancing for variable-length sequences.

Highlights
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
Suggested PR Title:

[Scheduler][Feature] Implement profiling-based dynamic chunk sizing for Ascend

Suggested PR Summary:

### What this PR does / why we need it?
This pull request introduces a profiling-based dynamic chunk sizing strategy for the Ascend backend to optimize prefill performance. It implements a `ProfilingChunkScheduler` that utilizes a quadratic latency model to predict optimal chunk sizes based on request history length. The system includes a `ProfilingChunkManager` for model fitting and online refinement using real-time execution data. A critical bug was identified in the `ProfilingChunkManager` where the `_set_time_done` attribute was not initialized, which would result in an `AttributeError` during the first batch execution.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `profiling_chunk_config` option within the engine's additional configuration, allowing users to enable and tune the dynamic chunking behavior.

### How was this patch tested?
The implementation extends the vLLM v1 scheduler and engine core via monkey-patching and configuration updates; however, no new automated tests were included in this PR.
```python
self._profiling_done = False
```
The attribute _set_time_done is used in vllm_ascend/patch/platform/patch_profiling_chunk.py (line 106) to guard the target latency calibration, but it is not initialized in the ProfilingChunkManager.__init__ method. This will cause an AttributeError when the first batch execution timing is recorded in the engine core process.
```suggestion
self._profiling_done = False
self._set_time_done = False
```
yiz-liu left a comment:
Add or modify performance test cases.
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
```python
L = np.array(seq_lens, dtype=np.float64)
T = np.array(latencies, dtype=np.float64)

if len(L) < 8:
```
Why do we constrain the length of L to 8 here?
These thresholds all gate the amount of fitting data. With too little data, the fit between sequence length and execution time can be inaccurate. They are all empirical values; for clarity, I will introduce a named variable to represent them.
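To illustrate the point above, a minimal sketch (hypothetical names such as `fit_latency_model` and `MIN_FIT_POINTS`; not the PR's actual code) of a quadratic latency fit that refuses to fit until enough samples are available:

```python
import numpy as np

MIN_FIT_POINTS = 8  # empirical threshold discussed in this thread


def fit_latency_model(seq_lens, latencies):
    """Fit T(L) ~= a*L^2 + b*L + c; return None when there is too little data."""
    L = np.array(seq_lens, dtype=np.float64)
    T = np.array(latencies, dtype=np.float64)
    if len(L) < MIN_FIT_POINTS:
        return None  # too few samples -> the fit would be unreliable
    # np.polyfit returns coefficients highest power first: (a, b, c)
    a, b, c = np.polyfit(L, T, deg=2)
    return a, b, c
```

With fewer than eight samples the function returns `None` instead of a possibly wildly wrong model, which mirrors the guard in the reviewed snippet.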
```python
    True if fitting succeeded, False otherwise
    """
    num_points = len(chunked_data)
    if num_points < 5:
```
Is num_points < 5 an empirical value?
```python
    num_points,
)
return False
if num_points > 30:
```
Ditto: why does num_points > 30 mean the model is considered fitted here?
Ditto. Additionally, this is the online calibration stage, and users need several pieces of real data for warm-up. This phase should not run too long, or it would affect the user experience, so the amount of fitting data is capped. In practice, 30 data points showed a good fitting effect, and online calibration can typically be completed after consuming approximately 4 data points.
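The bounded warm-up window described above can be sketched as follows (an illustrative sketch with assumed names `OnlineCalibrator`, `MIN_REFIT_POINTS`, and `MAX_REFIT_POINTS`; the merged code may differ):

```python
from collections import deque

MIN_REFIT_POINTS = 5    # below this, skip refitting (empirical)
MAX_REFIT_POINTS = 30   # cap the window so warm-up stays short (empirical)


class OnlineCalibrator:
    """Keep a bounded window of (seq_len, latency) samples for online refitting."""

    def __init__(self):
        # deque with maxlen automatically drops the oldest sample once full
        self.samples = deque(maxlen=MAX_REFIT_POINTS)

    def add(self, seq_len, latency):
        self.samples.append((seq_len, latency))

    def ready(self):
        return len(self.samples) >= MIN_REFIT_POINTS
```

The `deque(maxlen=...)` keeps only the newest 30 samples, so a long-running server never accumulates an unbounded calibration set.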
```python
if fitted_a < 0:
    logger.warning("Fitted a=%.2e is not positive. Setting a=1e-9.", fitted_a)
    fitted_a = 1e-9

if fitted_b < 0:
    logger.warning("Fitted b=%.2e is not positive. Setting b=0.0.", fitted_b)
    fitted_b = 0.0
It seems the post-processing of a and b could be extracted into a function.
Sure. We will extract it into a utility function.
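One possible shape for such a utility (a hypothetical sketch, with the helper name `clamp_coefficient` assumed, not taken from the merged code):

```python
import logging

logger = logging.getLogger(__name__)


def clamp_coefficient(name, value, floor):
    """Clamp a fitted coefficient to a non-negative floor, logging when it fires."""
    if value < 0:
        logger.warning("Fitted %s=%.2e is not positive. Setting %s=%.1e.",
                       name, value, name, floor)
        return floor
    return value


# Usage mirroring the reviewed snippet:
# fitted_a = clamp_coefficient("a", fitted_a, 1e-9)
# fitted_b = clamp_coefficient("b", fitted_b, 0.0)
```

This keeps the two clamping branches identical in behavior while removing the duplicated warning logic.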
Please add the details of this patch, like the others, in https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/patch/__init__.py
```python
    and request.num_computed_tokens > 0
):
    predicted_chunk = self.profiling_chunk_manager.predict_chunk_size(
        history_len=request.num_computed_tokens,
```
I think it would be better to use a consistent naming convention for num_computed_tokens and history_len.
Sure. We plan to collect more test results and include them in a subsequent documentation PR, which will cover interface specifications, user guides, and more performance data.
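The thread above touches on how the scheduler turns the fitted model into a chunk size. One plausible inversion, shown as an illustrative sketch (the name `predict_chunk_size`, its parameters, and the simplified total-cost model T(L) = a*L^2 + b*L are all assumptions, not the PR's exact implementation), solves for the chunk s whose incremental cost matches a target latency:

```python
import math


def predict_chunk_size(history_len, a, b, target_latency,
                       min_chunk=256, max_chunk=32768):
    """Pick a chunk size s so that extending a prefill from history_len to
    history_len + s costs roughly target_latency, under a quadratic
    total-cost model T(L) = a*L^2 + b*L.

    The incremental cost T(h+s) - T(h) expands to
    a*s^2 + (2*a*h + b)*s, so we solve
    a*s^2 + (2*a*h + b)*s - target_latency = 0 for the positive root.
    """
    p = 2.0 * a * history_len + b
    if a <= 0:
        # Degenerate (linear) model: each token costs roughly b.
        s = target_latency / p if p > 0 else max_chunk
    else:
        s = (-p + math.sqrt(p * p + 4.0 * a * target_latency)) / (2.0 * a)
    # Clamp to sane bounds so one request cannot monopolize a batch.
    return max(min_chunk, min(int(s), max_chunk))
```

Note how the predicted chunk naturally shrinks as `history_len` grows: with more KV cache behind it, each token costs more, so a smaller chunk fits the same latency budget. This is exactly the load-balancing behavior the PR is after.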
```python
# The method below is based on the upstream Scheduler.schedule()
# with profiling-based chunk sizing applied to both RUNNING requests
# (chunked prefill continuation) and WAITING requests (new prefill).
# Modified sections are marked with ">>> PROFILING CHUNK" comments.
```
This makes the code very clear! 👍
…llm-project#7896)

### What this PR does / why we need it?

Chunked Pipeline Parallelism (CPP), with its relatively low communication overhead, has become a key feature for ultra-long-sequence and variable-length-sequence scenarios. However, ordinary CPP suffers from empty slots in long sequences as the KV cache grows, reducing computational efficiency.

We keep the execution time of each scheduled batch roughly equal by using a dynamic chunk strategy. Specifically, we profile requests of different lengths and fit the functional relationship between sequence length and execution time. During scheduling, we calculate the appropriate chunk size for a given target execution time.

Specific modifications:

Patch modifications
1. Patch Scheduler to add dynamic chunking logic
2. Minimal patch to EngineCore to add logic for calling Profile

Changes on the vllm-ascend side
1. Add timing logic in worker & model runner, with minor adaptation of dummy_run to the CPP Profile
2. Create ProfileChunkManager to uniformly manage dynamic chunking resolution, function fitting, and other utility functions

- vLLM version: v0.19.0
- vLLM main: vllm-project/vllm@620e892

---------

Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Xiaochao Wang <wangxiaochao6@hisilicon.com>
Signed-off-by: gaojc <gaojingchun1@huawei.com>
Co-authored-by: gaojc <gaojc0714@163.com>
Signed-off-by: xulei_ict <xulei292@huawei.com>
Performance Results:

256K sequence, Qwen3-235B, 32K chunk:
- TTFT: PP 61.4 s -> CPP 53.5 s (15% benefit)

1K~32K sequences, DeepSeek V3.1:
- CPP: TTFT 9.9 s, throughput 23635 tps
- PCP: TTFT 11.4 s, throughput 20734 tps