
[Feature][PP] Support Dynamic Chunk for Chunked Pipeline Parallelism#7896

Merged

MengqingCao merged 14 commits into vllm-project:main from gjc0824:refactor/profiling-chunk on Apr 17, 2026
Conversation

@gjc0824 (Contributor) commented Apr 1, 2026

What this PR does / why we need it?

Chunked Pipeline Parallelism (CPP), with its relatively low communication overhead, has become a key feature for ultra-long-sequence and variable-length-sequence scenarios. However, ordinary CPP suffers from empty slots (pipeline bubbles) on long sequences as the KV cache grows, which reduces computational efficiency.

We use a dynamic chunk strategy to keep the execution time of each scheduled batch roughly equal. Specifically, we profile requests of different lengths and fit the functional relationship between sequence length and execution time. During scheduling, we then calculate an appropriate chunk size for a given target execution time.
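As a rough, self-contained sketch of the profile-and-fit idea described above (assuming a quadratic latency model T(L) = a*L^2 + b*L + c; the function names and thresholds here are illustrative, not the PR's actual code):

```python
import numpy as np

MIN_FIT_POINTS = 8  # empirical: too few samples make the fit unreliable

def fit_latency_model(seq_lens, latencies):
    """Fit T(L) = a*L^2 + b*L + c to profiled (sequence length, latency) pairs."""
    L = np.array(seq_lens, dtype=np.float64)
    T = np.array(latencies, dtype=np.float64)
    if len(L) < MIN_FIT_POINTS:
        return None  # not enough data for a trustworthy quadratic fit
    a, b, c = np.polyfit(L, T, deg=2)
    return a, b, c

def predict_chunk_size(model, target_time):
    """Solve a*L^2 + b*L + c = target_time for the largest L within the budget."""
    a, b, c = model
    disc = b * b - 4.0 * a * (c - target_time)
    if a <= 0 or disc < 0:
        return 0  # degenerate model or unreachable target; caller falls back
    return int(round((-b + np.sqrt(disc)) / (2.0 * a)))
```

A scheduler could call such a predictor once per step; the real implementation must additionally account for each request's already-computed history length, which this sketch omits.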

Specific modifications:

Patch modifications

  1. Patch the Scheduler to add dynamic chunking logic
  2. Apply a minimal patch to EngineCore to add the logic for invoking profiling

Changes on the vllm-ascend side

  1. Add timing logic to the worker and model runner, with a minor adaptation of dummy_run for CPP profiling
  2. Create ProfileChunkManager to centrally manage dynamic chunk resolution, function fitting, and other utility functions

Performance Results:

  • Fixed-length scenario (Qwen3-235B, 256k sequence, 32k chunk)
    TTFT: PP 61.4 s -> CPP 53.5 s, ~15% improvement
  • Variable-length scenario (DeepSeek V3.1, 1k~32k sequences)
    CPP: TTFT 9.9 s, throughput 23635 tps
    PCP: TTFT 11.4 s, throughput 20734 tps

@gemini-code-assist (Contributor)

Summary of Changes

This pull request implements a dynamic chunk sizing feature for Chunked Pipeline Parallelism (CPP) to mitigate bubbles in long-sequence scenarios. By profiling prefill latency at startup and fitting a quadratic model, the scheduler can now predict and apply optimal chunk sizes at runtime. The solution includes online refinement capabilities to continuously improve prediction accuracy based on actual execution times, significantly enhancing load balancing for variable-length sequences.

Highlights

  • Dynamic Chunking Scheduler: Introduced ProfilingChunkScheduler, which dynamically adjusts chunk sizes based on prefill latency profiling to optimize load balancing in Chunked Pipeline Parallelism.
  • Profiling and Modeling: Added a profiling mechanism that fits a quadratic latency model to prefill execution times, enabling runtime prediction of optimal chunk sizes.
  • Online Refinement: Implemented online model refinement by recording execution timing after each model step, allowing the predictor to adapt to changing workload characteristics.
  • Infrastructure Support: Updated EngineCore and Worker components to support startup profiling and execution timing feedback, ensuring compatibility with multiprocessing environments.


@github-actions (Contributor) Bot commented Apr 1, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Contributor) Bot left a comment

Code Review

Suggested PR Title:

[Scheduler][Feature] Implement profiling-based dynamic chunk sizing for Ascend

Suggested PR Summary:

### What this PR does / why we need it?
This pull request introduces a profiling-based dynamic chunk sizing strategy for the Ascend backend to optimize prefill performance. It implements a `ProfilingChunkScheduler` that utilizes a quadratic latency model to predict optimal chunk sizes based on request history length. The system includes a `ProfilingChunkManager` for model fitting and online refinement using real-time execution data. A critical bug was identified in the `ProfilingChunkManager` where the `_set_time_done` attribute was not initialized, which would result in an `AttributeError` during the first batch execution.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `profiling_chunk_config` option within the engine's additional configuration, allowing users to enable and tune the dynamic chunking behavior.

### How was this patch tested?
The implementation extends the vLLM v1 scheduler and engine core via monkey-patching and configuration updates; however, no new automated tests were included in this PR.

Comment on lines +323 to +324
self._profiling_done = False

gemini-code-assist (Contributor)

Severity: high

The attribute _set_time_done is used in vllm_ascend/patch/platform/patch_profiling_chunk.py (line 106) to guard the target latency calibration, but it is not initialized in the ProfilingChunkManager.__init__ method. This will cause an AttributeError when the first batch execution timing is recorded in the engine core process.

Suggested change:

self._profiling_done = False
self._set_time_done = False

@yiz-liu (Collaborator) left a comment

Add or modify performance test cases.

@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch 8 times, most recently from 417025d to 1b3bc22 Compare April 8, 2026 08:40
gjc0824 added 7 commits April 9, 2026 17:20
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch 3 times, most recently from 7fb08e6 to f5b8ce1 Compare April 10, 2026 11:24
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch 3 times, most recently from 5b071c6 to a9bfeec Compare April 12, 2026 08:46
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch from 32d2a6c to 49061cd Compare April 12, 2026 09:31
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch 2 times, most recently from f973c57 to 84a24a3 Compare April 13, 2026 02:39
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch from 84a24a3 to 389d2b1 Compare April 13, 2026 02:49
@MengqingCao (Collaborator) left a comment

Thanks for your great work! Could you post more performance results on dynamic CPP? The design doc, usage doc, and test cases should also be supplemented; I'm okay with adding them in a follow-up PR.

L = np.array(seq_lens, dtype=np.float64)
T = np.array(latencies, dtype=np.float64)

if len(L) < 8:
Collaborator

Why do we constrain the length of L to 8 here?

Contributor Author

These thresholds all gate the amount of fitting data: with too little data, the fit between sequence length and time becomes inaccurate. They are all empirical values. For clarity, I will introduce named constants for them.

True if fitting succeeded, False otherwise
"""
num_points = len(chunked_data)
if num_points < 5:
Collaborator

Is num_points < 5 an empirical value?

Contributor Author

ditto

num_points,
)
return False
if num_points > 30:
Collaborator

Ditto. Also, why does num_points > 30 mean the model is considered fitted here?

Contributor Author

Ditto. Additionally, this is the online calibration stage, where users warm up the model with several pieces of real data. This stage should not run too long, to avoid hurting the user experience, so the amount of fitting data is capped. In practice, 30 data points showed a good fitting effect, and online calibration completes after consuming roughly 4 data points.
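A minimal sketch of this kind of sample-count gating (illustrative only; the constant and function names are hypothetical, not the PR's):

```python
MIN_ONLINE_POINTS = 5   # empirical: fewer samples give an unreliable fit
MAX_ONLINE_POINTS = 30  # empirical: enough samples to consider the model calibrated

def refit_decision(num_points):
    """Decide what to do with the online sample window after each model step."""
    if num_points < MIN_ONLINE_POINTS:
        return "collect"      # keep gathering (chunk_len, exec_time) samples
    if num_points > MAX_ONLINE_POINTS:
        return "calibrated"   # window full: stop refitting, treat model as stable
    return "refit"            # enough fresh data: re-fit the latency model
```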

Comment on lines +152 to +158
if fitted_a < 0:
logger.warning("Fitted a=%.2e is not positive. Setting a=1e-9.", fitted_a)
fitted_a = 1e-9

if fitted_b < 0:
logger.warning("Fitted b=%.2e is not positive. Setting b=0.0.", fitted_b)
fitted_b = 0.0
Collaborator

It seems the post-processing of a and b could be extracted into a function.

Contributor Author

Sure. We will extract it into a utility function.
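One way the suggested extraction could look (a sketch only; the helper name is mine, not from the PR, and it preserves the original check-against-zero semantics):

```python
import logging

logger = logging.getLogger(__name__)

def clamp_fitted_coefficient(value, name, floor):
    """Replace a negative fitted coefficient with a safe floor, warning when triggered."""
    if value < 0:
        logger.warning("Fitted %s=%.2e is not positive. Setting %s=%.1e.",
                       name, value, name, floor)
        return floor
    return value

# Usage replacing the duplicated post-processing above:
# fitted_a = clamp_fitted_coefficient(fitted_a, "a", 1e-9)
# fitted_b = clamp_fitted_coefficient(fitted_b, "b", 0.0)
```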

@@ -0,0 +1,179 @@
#
Collaborator

Contributor Author

Sure.

and request.num_computed_tokens > 0
):
predicted_chunk = self.profiling_chunk_manager.predict_chunk_size(
history_len=request.num_computed_tokens,
Collaborator

I think it would be better to use consistent naming conventions for num_computed_tokens and history_len.

Contributor Author

Sure.

@gjc0824 (Contributor Author) commented Apr 15, 2026

> Thanks for your great work! Could you post more performance results on dynamic cpp and the design doc, usage doc, test cases should also be supplemented, I'm okay with adding them in a follow-up pr.

Sure. We plan to collect more test results and include them in the subsequent document PR. The subsequent PR will include interface specifications, user guides, and more performance data.

Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch 2 times, most recently from 6503a4b to b7b096c Compare April 15, 2026 06:21
@MengqingCao (Collaborator) left a comment

LGTM now, thx!

@MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Apr 16, 2026
… values

Signed-off-by: gaojc <gaojingchun1@huawei.com>
@gjc0824 gjc0824 force-pushed the refactor/profiling-chunk branch from b7b096c to 99e3184 Compare April 16, 2026 15:46
# The method below is based on the upstream Scheduler.schedule()
# with profiling-based chunk sizing applied to both RUNNING requests
# (chunked prefill continuation) and WAITING requests (new prefill).
# Modified sections are marked with ">>> PROFILING CHUNK" comments.
Collaborator

This makes the code very clear! 👍
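In spirit, the patch-based approach being praised here (replacing an upstream method while marking the modified sections) can be sketched as follows; this is an illustrative monkey-patch pattern with made-up class and function names, not the PR's code:

```python
# Illustrative monkey-patching pattern: swap in a modified method at import
# time while keeping a handle to the original for delegation.

class Scheduler:
    def schedule(self):
        return "base schedule"

_original_schedule = Scheduler.schedule  # keep the upstream implementation

def profiling_chunk_schedule(self):
    # >>> PROFILING CHUNK: dynamic chunk-sizing logic would be inserted here <<<
    result = _original_schedule(self)
    return f"profiled {result}"

Scheduler.schedule = profiling_chunk_schedule  # apply the patch
```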

@MengqingCao MengqingCao merged commit bec6314 into vllm-project:main Apr 17, 2026
52 checks passed
serlar pushed a commit to serlar/vllm-ascend that referenced this pull request Apr 18, 2026
…llm-project#7896)

### What this PR does / why we need it?
Chunked Pipeline Parallelism (CPP), with its relatively low
communication overhead, has become a key feature for ultra-long
sequences and variable-length sequence scenarios. However, ordinary CPP
faces the problem of empty slots in long sequences as the KVCache grows,
reducing computational efficiency.

We control the execution time of each scheduling batch to be equal by
using a dynamic chunk strategy. Specifically, we profile requests of
different lengths and fit the functional relationship between sequence
length and execution time. During the scheduling process, we calculate
the appropriate chunk size based on the given target execution time.

Specific modifications:

Patch modifications
1. Patch Scheduler to add dynamic chunking logic
2. Minimal patch to EngineCore to add logic for calling Profile

Changes on the vllm-ascend side
1. Add timing logic in worker & modelrunner, minor adaptation of
dummy_run to CPP Profile
2. Create ProfileChunkManager to uniformly manage dynamic chunking
resolution, function fitting, and other utility functions

- vLLM version: v0.19.0
- vLLM main:
vllm-project/vllm@620e892
---------
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Zeng Shu <shuzeng@huawei.com>
Signed-off-by: Xiaochao Wang <wangxiaochao6@hisilicon.com>
Signed-off-by: gaojc <gaojingchun1@huawei.com>
Co-authored-by: gaojc <gaojc0714@163.com>
Signed-off-by: xulei_ict <xulei292@huawei.com>
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 20, 2026
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
tfhddd pushed a commit to ascend-gha-runners/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026

Labels

module:core, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet


4 participants