[PD Disaggregation] Add the KVCache transfer latency monitor metric#7944
SCDESPERTATE wants to merge 24 commits into sgl-project:main from
Conversation
Summary of Changes
Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a crucial new metric for monitoring KVCache transfer latency in a disaggregated environment. By providing a detailed breakdown of this specific performance aspect, it enables operators to better diagnose and optimize performance bottlenecks related to KVCache movement between prefill and decode nodes.
Highlights
- New Metric Introduction: I've added a new Prometheus Histogram metric, sglang:kvcache_transfer_latency, to specifically monitor the time taken for KVCache data transfer between prefill and decode nodes in a PD disaggregation setup. This provides a more granular view than the existing TTFT metric.
- Latency Measurement Mechanism: I've implemented a mechanism to capture the start and end timestamps of KVCache transfer operations (send_kvcache and send_kvcache_slice). The duration is accumulated per request across all chunks, and the total latency is reported to the metrics collector once all KVCache data for a request has been successfully transferred.
- Metrics Integration: The SchedulerMetricsCollector is now passed to the KV manager in both the prefill and decode components, allowing the MooncakeConn (responsible for KVCache transfer) to observe and report the new latency metric.
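The per-chunk accumulation scheme described in the highlights can be sketched as follows. This is a minimal illustration assuming the table layout discussed in this thread (bootstrap room → session id → accumulated time); the class and method names are hypothetical stand-ins, not the PR's actual code.

```python
import time

class LatencySketch:
    """Illustrative per-request, per-chunk transfer-latency accumulator."""

    def __init__(self, collector=None):
        # `collector` stands in for SchedulerMetricsCollector; optional so
        # the sketch also works when monitoring is disabled.
        self.collector = collector
        self.table = {}  # room -> {session_id -> accumulated signed time}

    def before_chunk(self, room, session):
        # Subtract the start timestamp; paired with the add in after_chunk,
        # each chunk contributes (end - start) to the running total.
        sessions = self.table.setdefault(room, {})
        sessions[session] = sessions.get(session, 0.0) - time.time()

    def after_chunk(self, room, session, is_last):
        self.table[room][session] += time.time()
        if is_last and self.collector is not None:
            # Report the summed latency once all chunks have been sent,
            # and drop the bookkeeping entry for the finished request.
            self.collector.observe_kvcache_transfer_latency(
                self.table[room].pop(session)
            )
```

A request that transfers its KVCache in several chunks calls `before_chunk`/`after_chunk` around each send, and only the final call (with `is_last=True`) reports to the collector.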
Code Review
This pull request introduces a new histogram metric, kvcache_transfer_latency, to monitor the performance of KVCache transfers in a PD disaggregation setup. The changes involve adding the metric definition, capturing timestamps around the transfer process, and reporting the collected latency. The review focuses on improving code clarity, consistency, and maintainability by suggesting simplifications to the latency calculation logic, recommending a more appropriate timer, and refactoring constants.
    if self.scheduler_metrics_collector is not None:
        if req.room not in self.kvcache_transfer_latency_table:
            self.kvcache_transfer_latency_table[req.room] = {
                req.mooncake_session_id: -time.time()
            }
        else:
            if (
                req.mooncake_session_id
                not in self.kvcache_transfer_latency_table[req.room]
            ):
                self.kvcache_transfer_latency_table[req.room][
                    req.mooncake_session_id
                ] = -time.time()
            else:
                self.kvcache_transfer_latency_table[req.room][
                    req.mooncake_session_id
                ] -= time.time()
The nested if statements for checking the existence of req.room and req.mooncake_session_id in self.kvcache_transfer_latency_table can be simplified using setdefault to avoid redundant checks and improve readability. This approach reduces the lines of code and makes the intent clearer.
    if self.scheduler_metrics_collector is not None:
        latency_table = self.kvcache_transfer_latency_table.setdefault(req.room, {})
        latency_table[req.mooncake_session_id] = -time.time()
The pre-commit formatter would make the code more complicated if this suggestion were adopted; as it stands, the current change is the most concise one as far as I can see.
@whybeyoung @stmatengss Can you sync on the design with the PR owner? I think this is related to the fine-grained monitoring/profiling for PD.
Sure. I will handle this PR.
        # Convenience function for logging to gauge.
        gauge.labels(**self.labels).set(data)

    def observe_kvcache_transfer_latency(self, value: float):
Should the return type be None?
Great catch! Appreciate it
    )

    if self.scheduler_metrics_collector is not None:
        self.kvcache_transfer_latency_table: Dict[int, Dict[str, float]] = {}
int: req.room, str: session id, float: time. Is my understanding correct?
Yes, you are right.
    if self.scheduler_metrics_collector is None:
        return

    if before_transfer:
Why not implement it like this?

    if xxx not in kvcache_transfer_latency_table:
        kvcache_transfer_latency_table[xxx] = 0
    kvcache_transfer_latency_table[xxx] -= time()
Not exactly: kvcache_transfer_latency_table[xxx] is itself a dict. But I find your suggestion useful for polishing lines 960~966 😊
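The signed-timestamp trick discussed above (subtract each chunk's start time, add each chunk's end time, so the entry converges to the summed per-chunk durations) can be made concrete with a small self-contained sketch. The flat string key and the explicit `ts` argument are illustrative simplifications; the PR's real table is nested per room and session and uses `time.time()` directly.

```python
def record(table, key, ts, before_transfer):
    """Accumulate signed timestamps: -start on the way in, +end on the
    way out, so each begin/end pair adds (end - start) to the total."""
    table.setdefault(key, 0.0)
    table[key] += -ts if before_transfer else ts

table = {}
record(table, "room42", 100.0, True)    # chunk 1 begins
record(table, "room42", 100.5, False)   # chunk 1 ends (+0.5)
record(table, "room42", 101.0, True)    # chunk 2 begins
record(table, "room42", 101.25, False)  # chunk 2 ends (+0.25)
# table["room42"] is now 0.75: the summed transfer time of both chunks.
```

Passing fixed timestamps keeps the example deterministic; the intermediate negative values are harmless because only the final sum is ever reported.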
        f"Losing connection with prefill instance (bootstrap_addr: {failed_bootstrap_addr}), affected {len(affected_rooms)} requests"
    )

    def _collect_kv_transfer_timestamp(
I think two separate functions would be cleaner: _collect_kv_transfer_timestamp_begin and _collect_kv_transfer_timestamp_end.
    @@ -26,6 +26,7 @@
    )
    from sglang.srt.disaggregation.common.utils import group_concurrent_contiguous
It seems you didn't implement the same latency monitor for NIXL.
Yes. So far I'm not very familiar with the NIXL code; maybe later I'll go through it and add support. For interface compatibility, though, the collector field has to be added anyway, otherwise an exception would be raised.
    if kv_chunk.is_last:
        if self.scheduler_metrics_collector is not None:
            self.scheduler_metrics_collector.observe_kvcache_transfer_latency(
                self.kvcache_transfer_latency_table[req.room].pop(
Can kvcache_transfer_latency_table be a part of scheduler_metrics_collector? It can reduce the code changes in conn.py side.
Might make sense, but my concern is that such a change would disrupt the design of the collector, since all of its members' types come from Prometheus. Moreover, operations like timestamp collection are better kept out of the collector class, since I believe they are disaggregation-specific.
Following up, I think the kvcache_transfer_latency_table and its related operations could be abstracted into a class and placed in the python/sglang/srt/disaggregation/utils.py file. What do you think?
|
Could you implement the same function in the other backends (Ascend, Nixl, and Fake)? @SCDESPERTATE
As far as I see, backend
    disaggregation_mode: DisaggregationMode,
    server_args: ServerArgs,
    is_mla_backend: Optional[bool] = False,
    scheduler_metrics_collector: Optional[SchedulerMetricsCollector] = None,
Same as the other PR: don't pass the collector into the KV manager.
After a deeper analysis, I find it difficult to achieve fine-grained latency tracking without passing the collector to the KVManager and coordinating inside conn.py 🤔 The purpose of this PR is to track the KV transfer latency of the exact network stack, reflecting real-time network performance. However, if timestamp collection were only allowed in prefill.py, generality would be preserved, but other irrelevant latencies like request queueing, scheduler dispatching, and result polling would be included in this metric, which would mislead operators. Hence, I think passing the collector into the KVManager is necessary in this case.
Do you happen to have any updates to this PR? @SCDESPERTATE
Motivation
In the current metric design, TTFT is too coarse-grained to effectively monitor various detailed performance aspects in the PD disaggregation scenario. One such aspect is the KVCache transfer latency between prefill and decode nodes on a per-request basis. Thus, this Pull Request (PR) introduces this metric to assist operators in better monitoring the KVCache transfer performance within the PD disaggregation setup.
Modifications
- Added a new metric, kvcache_transfer_latency, recorded as a Histogram.
- Captured timestamps around the KVCache transfer operations (send_kvcache and send_kvcache_slice).

Here is an example of the metrics:
Checklist