[PD Disaggregation] Add the KVCache transfer latency monitor metric#7944

Open
SCDESPERTATE wants to merge 24 commits into sgl-project:main from SCDESPERTATE:add_send_kvcache_lat_metric

Conversation

@SCDESPERTATE
Contributor

Motivation

In the current metric design, TTFT is too coarse-grained to effectively monitor detailed performance aspects in the PD disaggregation scenario. One such aspect is the per-request KVCache transfer latency between prefill and decode nodes. This PR therefore introduces a metric for it, to help operators better monitor KVCache transfer performance in a PD disaggregation setup.

Modifications

  • Add a new metric named kvcache_transfer_latency, exposed as a Prometheus Histogram.
  • Capture timestamps before and after the actual KVCache transfer process (send_kvcache and send_kvcache_slice).
  • Accumulate the transfer duration for each chunk, and report the total duration to the metric collector once all KVCache data for a request has been successfully transferred.
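
The accumulation scheme described above can be sketched roughly as follows. This is an illustrative sketch only; the names (`latency_table`, `before_chunk_transfer`, `on_last_chunk`, the `observe` callback) are hypothetical, not the PR's actual identifiers:

```python
import time

# room id -> {session id: accumulated signed time}
latency_table = {}

def before_chunk_transfer(room, session_id):
    # Subtract the start timestamp; repeated chunks keep subtracting.
    sessions = latency_table.setdefault(room, {})
    sessions[session_id] = sessions.get(session_id, 0.0) - time.time()

def after_chunk_transfer(room, session_id):
    # Add the end timestamp; the running value is now the total
    # transfer duration accumulated across all chunks so far.
    latency_table[room][session_id] += time.time()

def on_last_chunk(room, session_id, observe):
    # Report the accumulated duration once all chunks have transferred,
    # and drop the per-request bookkeeping entry.
    observe(latency_table[room].pop(session_id))

# Usage: two chunks for one request, then report on the last chunk.
observed = []
before_chunk_transfer(1, "sess")
after_chunk_transfer(1, "sess")
before_chunk_transfer(1, "sess")
after_chunk_transfer(1, "sess")
on_last_chunk(1, "sess", observed.append)
```

The sign-flip trick means only one float per (room, session) needs to be stored: start times are subtracted, end times added, and the sum is the total elapsed transfer time.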

Here is an example of the metrics:

$ curl http://10.13.3.164:8188/metrics
# HELP sglang:kvcache_transfer_latency Histogram of kvcache transfer latency in seconds.
# TYPE sglang:kvcache_transfer_latency histogram
sglang:kvcache_transfer_latency_sum{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 4.168236494064331
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.001",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.002",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.004",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 9.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.006",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 17.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.008",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 44.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.01",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 60.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.02",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 138.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.04",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 202.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.06",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 209.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.08",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 210.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.1",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 215.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.2",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.4",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.6",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.8",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="1.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="2.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="4.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="6.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="8.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="10.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="20.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="40.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="60.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="80.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="100.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="200.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="400.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="+Inf",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_count{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 920568.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
# HELP sglang:cached_tokens_total Number of cached prompt tokens.
# TYPE sglang:cached_tokens_total counter
sglang:cached_tokens_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 2688.0
# HELP sglang:num_aborted_requests_total Number of requests aborted.
# TYPE sglang:num_aborted_requests_total counter
sglang:num_aborted_requests_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 157.0
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang:num_grammar_queue_reqs gauge
sglang:num_grammar_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang:num_prefill_prealloc_queue_reqs gauge
sglang:num_prefill_prealloc_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_prefill_infight_queue_reqs The number of requests in the prefill infight queue.
# TYPE sglang:num_prefill_infight_queue_reqs gauge
sglang:num_prefill_infight_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang:num_decode_prealloc_queue_reqs gauge
sglang:num_decode_prealloc_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang:num_decode_transfer_queue_reqs gauge
sglang:num_decode_transfer_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 8825.75532746315
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.6",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 4.0
sglang:e2e_request_latency_seconds_bucket{le="4.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 17.0
sglang:e2e_request_latency_seconds_bucket{le="6.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 18.0
sglang:e2e_request_latency_seconds_bucket{le="8.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 18.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 20.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 23.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 53.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="400.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="800.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_count{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
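
As a rough illustration of how the sglang:kvcache_transfer_latency buckets above can be consumed, the sketch below estimates the median transfer latency from the cumulative bucket counts in the dump, using the same linear interpolation that PromQL's histogram_quantile() applies. Real dashboards would query Prometheus directly; this is only a simplified offline calculation:

```python
# (upper bound "le", cumulative count), copied from the dump above;
# buckets past 0.2s are omitted since the count no longer changes.
buckets = [
    (0.001, 0), (0.002, 0), (0.004, 9), (0.006, 17), (0.008, 44),
    (0.01, 60), (0.02, 138), (0.04, 202), (0.06, 209), (0.08, 210),
    (0.1, 215), (0.2, 216),
]

def estimate_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Interpolate linearly inside the bucket containing the rank.
            in_bucket = count - prev_count
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return buckets[-1][0]

p50 = estimate_quantile(0.5, buckets)  # roughly 0.016 s for this dump
```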

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial new metric for monitoring KVCache transfer latency in a disaggregated environment. By providing a detailed breakdown of this specific performance aspect, it enables operators to better diagnose and optimize performance bottlenecks related to KVCache movement between prefill and decode nodes.

Highlights

  • New Metric Introduction: I've added a new Prometheus Histogram metric, sglang:kvcache_transfer_latency, to specifically monitor the time taken for KVCache data transfer between prefill and decode nodes in a PD disaggregation setup. This provides a more granular view than the existing TTFT metric.
  • Latency Measurement Mechanism: I've implemented a mechanism to capture the start and end timestamps of KVCache transfer operations (send_kvcache and send_kvcache_slice). The duration is accumulated per request across all chunks, and the total latency is reported to the metrics collector once all KVCache data for a request has been successfully transferred.
  • Metrics Integration: The SchedulerMetricsCollector is now passed to the KV manager in both the prefill and decode components, allowing the MooncakeConn (responsible for KVCache transfer) to observe and report the new latency metric.
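
For context, the collector side of such a metric might look roughly like the following with the prometheus_client library. The metric name, label set, and bucket boundaries are taken from the metrics dump in the PR description, but the exact wiring inside SchedulerMetricsCollector may differ; `observe_kvcache_transfer_latency` and the label values here are illustrative:

```python
from prometheus_client import CollectorRegistry, Histogram

# A dedicated registry so this sketch doesn't collide with globals.
registry = CollectorRegistry()

kvcache_transfer_latency = Histogram(
    "sglang:kvcache_transfer_latency",
    "Histogram of kvcache transfer latency in seconds.",
    labelnames=["engine_type", "model_name"],
    # Bucket bounds mirroring the dump above; +Inf is added automatically.
    buckets=[0.001, 0.002, 0.004, 0.006, 0.008, 0.01, 0.02, 0.04, 0.06,
             0.08, 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40,
             60, 80, 100, 200, 400],
    registry=registry,
)

def observe_kvcache_transfer_latency(value: float) -> None:
    # Convenience wrapper mirroring the collector method named in the PR;
    # the label values below are placeholders.
    kvcache_transfer_latency.labels(
        engine_type="unified", model_name="demo-model"
    ).observe(value)

observe_kvcache_transfer_latency(0.015)
```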

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new histogram metric, kvcache_transfer_latency, to monitor the performance of KVCache transfers in a PD disaggregation setup. The changes involve adding the metric definition, capturing timestamps around the transfer process, and reporting the collected latency. The review focuses on improving code clarity, consistency, and maintainability by suggesting simplifications to the latency calculation logic, recommending a more appropriate timer, and refactoring constants.

Comment on lines +553 to +569
if self.scheduler_metrics_collector is not None:
    if req.room not in self.kvcache_transfer_latency_table:
        self.kvcache_transfer_latency_table[req.room] = {
            req.mooncake_session_id: -time.time()
        }
    else:
        if (
            req.mooncake_session_id
            not in self.kvcache_transfer_latency_table[req.room]
        ):
            self.kvcache_transfer_latency_table[req.room][
                req.mooncake_session_id
            ] = -time.time()
        else:
            self.kvcache_transfer_latency_table[req.room][
                req.mooncake_session_id
            ] -= time.time()

medium

The nested if statements for checking the existence of req.room and req.mooncake_session_id in self.kvcache_transfer_latency_table can be simplified using setdefault to avoid redundant checks and improve readability. This approach reduces the lines of code and makes the intent clearer.

                            if self.scheduler_metrics_collector is not None:
                                latency_table = self.kvcache_transfer_latency_table.setdefault(req.room, {})
                                latency_table[req.mooncake_session_id] = -time.time()
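
For comparison, here is a small self-contained check showing that a setdefault-based form can also preserve the accumulate-on-repeat behavior of the nested version. Names are illustrative, and an explicit `now` argument replaces time.time() so the comparison is deterministic:

```python
def record_start_nested(table, room, session, now):
    # The nested-check form from the diff, with `now` injected.
    if room not in table:
        table[room] = {session: -now}
    elif session not in table[room]:
        table[room][session] = -now
    else:
        table[room][session] -= now

def record_start_setdefault(table, room, session, now):
    # Flattened form: setdefault creates the per-room dict on demand,
    # and get(..., 0.0) keeps the accumulate-on-repeat semantics.
    sessions = table.setdefault(room, {})
    sessions[session] = sessions.get(session, 0.0) - now

t_nested, t_setdefault = {}, {}
for now in (10.0, 20.0):
    record_start_nested(t_nested, 1, "session-a", now)
    record_start_setdefault(t_setdefault, 1, "session-a", now)
```

Both tables end up identical after repeated starts, so the flattening is behavior-preserving.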

Contributor Author

Adopting this suggestion would actually make the code more complicated after pre-commit formatting; as far as I can see, the current change is the most concise one.

@SCDESPERTATE SCDESPERTATE marked this pull request as draft July 11, 2025 04:27
@SCDESPERTATE SCDESPERTATE marked this pull request as ready for review July 11, 2025 12:09
@ShangmingCai
Collaborator

@whybeyoung @stmatengss Can you sync on the design with the PR owner? I think this is related to the fine-grained monitor/profiling for PD.

@stmatengss
Collaborator

@whybeyoung @stmatengss Can you sync on the design with the PR owner? I think this is related to the fine-grained monitor/profiling for PD.

Sure. I will handle this PR.

# Convenience function for logging to gauge.
gauge.labels(**self.labels).set(data)

def observe_kvcache_transfer_latency(self, value: float):
Collaborator

Should the return type be annotated as None?

Contributor Author

Great catch! Appreciate it

)

if self.scheduler_metrics_collector is not None:
self.kvcache_transfer_latency_table: Dict[int, Dict[str, float]] = {}
Collaborator

int: req.room, str: session id, float: time. Is my understanding correct?

Contributor Author

Yes, you are right.

if self.scheduler_metrics_collector is None:
return

if before_transfer:
Collaborator

Why not implement it like this?

if xxx not in kvcache_transfer_latency_table:
    kvcache_transfer_latency_table[xxx] = 0
kvcache_transfer_latency_table[xxx] -= time()

Contributor Author

Not exactly: kvcache_transfer_latency_table[xxx] is itself a dict. But I find your suggestion useful for polishing lines 960~966 😊

f"Losing connection with prefill instance (bootstrap_addr: {failed_bootstrap_addr}), affected {len(affected_rooms)} requests"
)

def _collect_kv_transfer_timestamp(
Collaborator

I think two separate functions would be more general: _collect_kv_transfer_timestamp_begin and _collect_kv_transfer_timestamp_end.

@@ -26,6 +26,7 @@
)
from sglang.srt.disaggregation.common.utils import group_concurrent_contiguous
Collaborator

It seems you don't implement the same latency monitor for NIXL.

Contributor Author

Yes. So far I'm not very familiar with the NIXL code; I may go through it later and add support for it. But for interface compatibility, the collector field has to be added; otherwise an exception would be raised.

if kv_chunk.is_last:
if self.scheduler_metrics_collector is not None:
self.scheduler_metrics_collector.observe_kvcache_transfer_latency(
self.kvcache_transfer_latency_table[req.room].pop(
Collaborator

Can kvcache_transfer_latency_table be part of scheduler_metrics_collector? That would reduce the code changes on the conn.py side.

Contributor Author

It might make sense, but my concern is that such a change would disrupt the design of the collector, since all its members' types come from Prometheus. Moreover, operations like timestamp collection seem better kept out of the collector class, since I believe they are disaggregation-specific.

Contributor Author (@SCDESPERTATE, Aug 4, 2025)

Following up, I think the kvcache_transfer_latency_table and its related operations could be abstracted into a class and placed into the python/sglang/srt/disaggregation/utils.py file. What do you think?

@stmatengss left a comment

LGTM.

@stmatengss
Collaborator

Could you implement the same function in the other backends (Ascend, Nixl, and Fake)? @SCDESPERTATE

@SCDESPERTATE
Contributor Author

Could you implement the same function in the other backends (Ascend, Nixl, and Fake)? @SCDESPERTATE

As far as I can see, the Ascend backend inherits the transfer process defined in Mooncake, and Fake doesn't involve data transfer, so only NIXL needs to be altered. However, since I would have to spend some time going through its implementation to be sure the support is correct, the NIXL implementation is left to the relevant developers. That said, class KVCacheTransferLatencyMonitor's interfaces are not Mooncake-specific, so it is plug-and-play for the NIXL backend.

disaggregation_mode: DisaggregationMode,
server_args: ServerArgs,
is_mla_backend: Optional[bool] = False,
scheduler_metrics_collector: Optional[SchedulerMetricsCollector] = None,
Collaborator

Same as the other PR: don't pass the collector into the KV manager.

Contributor Author

Got it.

Contributor Author

After a deeper analysis, I find it difficult to achieve fine-grained latency tracking without passing the collector into the KVManager and coordinating inside conn.py 🤔 The purpose of this PR is to track the KV transfer latency of the network stack itself, reflecting real-time network performance. If timestamp collection were only allowed in prefill.py, generality would be preserved, but irrelevant latencies such as request queueing, scheduler dispatching, and result polling would be included in the metric, which would mislead operators. Hence, I think passing the collector into the KVManager is necessary in this case.
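
To make the argument concrete, here is a minimal sketch of timing only the transfer call inside the KV manager. The class and method names are hypothetical stand-ins (the stub collector only implements the observe hook), and it uses time.monotonic() as the timer, which is generally preferable to time.time() for measuring durations:

```python
import time
from typing import Optional


class CollectorStub:
    """Stand-in for SchedulerMetricsCollector; only the observe hook matters."""

    def __init__(self):
        self.observed = []

    def observe_kvcache_transfer_latency(self, value: float) -> None:
        self.observed.append(value)


class KVManagerSketch:
    """Times only the transfer call itself, excluding queueing/scheduling."""

    def __init__(self, collector: Optional[CollectorStub] = None):
        self.collector = collector

    def send_kvcache(self, transfer_fn):
        start = time.monotonic()
        transfer_fn()  # the actual network transfer happens here
        elapsed = time.monotonic() - start
        if self.collector is not None:
            self.collector.observe_kvcache_transfer_latency(elapsed)


collector = CollectorStub()
manager = KVManagerSketch(collector)
manager.send_kvcache(lambda: None)  # no-op transfer for illustration
```

Because the timestamps bracket only `transfer_fn`, the reported latency reflects the network path rather than scheduler overhead, which is the point being argued above.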

@stmatengss
Collaborator

Do you happen to have any updates to this PR? @SCDESPERTATE
