[PD Disaggregation] Add the KVCache transfer latency monitor metric#7944

Open
SCDESPERTATE wants to merge 24 commits into sgl-project:main from SCDESPERTATE:add_send_kvcache_lat_metric

Conversation

@SCDESPERTATE
Contributor

Motivation

In the current metric design, TTFT is too coarse-grained to effectively monitor detailed performance aspects in the PD disaggregation scenario. One such aspect is the per-request KVCache transfer latency between prefill and decode nodes. This PR therefore introduces a metric for it, to help operators better monitor KVCache transfer performance in a PD disaggregation setup.

Modifications

  • Add a new metric named kvcache_transfer_latency, exposed as a Prometheus Histogram.
  • Capture timestamps before and after the actual KVCache transfer process (send_kvcache and send_kvcache_slice).
  • Accumulate the transfer duration for each chunk, and report the total duration to the metric collector once all KVCache data for a request has been successfully transferred.
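
The accumulation scheme described above can be sketched roughly as follows. This is an illustrative sketch only; the names (`latency_table`, `before_chunk_transfer`, `on_last_chunk`, the `observe` callback) are hypothetical, not the PR's actual identifiers:

```python
import time

# room id -> {session id: accumulated signed time}
latency_table = {}

def before_chunk_transfer(room, session_id):
    # Subtract the start timestamp; repeated chunks keep subtracting.
    sessions = latency_table.setdefault(room, {})
    sessions[session_id] = sessions.get(session_id, 0.0) - time.time()

def after_chunk_transfer(room, session_id):
    # Add the end timestamp; the running value is now the total
    # transfer duration accumulated across all chunks so far.
    latency_table[room][session_id] += time.time()

def on_last_chunk(room, session_id, observe):
    # Report the accumulated duration once all chunks have transferred,
    # and drop the per-request bookkeeping entry.
    observe(latency_table[room].pop(session_id))

# Usage: two chunks for one request, then report on the last chunk.
observed = []
before_chunk_transfer(1, "sess")
after_chunk_transfer(1, "sess")
before_chunk_transfer(1, "sess")
after_chunk_transfer(1, "sess")
on_last_chunk(1, "sess", observed.append)
```

The sign-flip trick means only one float per (room, session) needs to be stored: start times are subtracted, end times added, and the sum is the total elapsed transfer time.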

Here is an example of the metrics:

$ curl http://10.13.3.164:8188/metrics
# HELP sglang:kvcache_transfer_latency Histogram of kvcache transfer latency in seconds.
# TYPE sglang:kvcache_transfer_latency histogram
sglang:kvcache_transfer_latency_sum{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 4.168236494064331
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.001",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.002",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.004",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 9.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.006",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 17.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.008",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 44.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.01",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 60.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.02",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 138.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.04",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 202.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.06",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 209.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.08",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 210.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.1",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 215.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.2",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.4",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.6",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="0.8",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="1.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="2.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="4.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="6.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="8.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="10.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="20.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="40.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="60.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="80.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="100.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="200.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="400.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_bucket{engine_type="unified",le="+Inf",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
sglang:kvcache_transfer_latency_count{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 216.0
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 920568.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
# HELP sglang:cached_tokens_total Number of cached prompt tokens.
# TYPE sglang:cached_tokens_total counter
sglang:cached_tokens_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 2688.0
# HELP sglang:num_aborted_requests_total Number of requests aborted.
# TYPE sglang:num_aborted_requests_total counter
sglang:num_aborted_requests_total{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 157.0
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang:num_grammar_queue_reqs gauge
sglang:num_grammar_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang:num_prefill_prealloc_queue_reqs gauge
sglang:num_prefill_prealloc_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_prefill_infight_queue_reqs The number of requests in the prefill infight queue.
# TYPE sglang:num_prefill_infight_queue_reqs gauge
sglang:num_prefill_infight_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang:num_decode_prealloc_queue_reqs gauge
sglang:num_decode_prealloc_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang:num_decode_transfer_queue_reqs gauge
sglang:num_decode_transfer_queue_reqs{engine_type="unified",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 8825.75532746315
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.6",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 4.0
sglang:e2e_request_latency_seconds_bucket{le="4.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 17.0
sglang:e2e_request_latency_seconds_bucket{le="6.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 18.0
sglang:e2e_request_latency_seconds_bucket{le="8.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 18.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 20.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 23.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 53.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="400.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="800.0",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
sglang:e2e_request_latency_seconds_count{model_name="/home/moyun.zty/models/deepseek-ai__DeepSeek-R1"} 217.0
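
As a rough illustration of how the sglang:kvcache_transfer_latency buckets above can be consumed, the sketch below estimates the median transfer latency from the cumulative bucket counts in the dump, using the same linear interpolation that PromQL's histogram_quantile() applies. Real dashboards would query Prometheus directly; this is only a simplified offline calculation:

```python
# (upper bound "le", cumulative count), copied from the dump above;
# buckets past 0.2s are omitted since the count no longer changes.
buckets = [
    (0.001, 0), (0.002, 0), (0.004, 9), (0.006, 17), (0.008, 44),
    (0.01, 60), (0.02, 138), (0.04, 202), (0.06, 209), (0.08, 210),
    (0.1, 215), (0.2, 216),
]

def estimate_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Interpolate linearly inside the bucket containing the rank.
            in_bucket = count - prev_count
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return buckets[-1][0]

p50 = estimate_quantile(0.5, buckets)  # roughly 0.016 s for this dump
```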

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial new metric for monitoring KVCache transfer latency in a disaggregated environment. By providing a detailed breakdown of this specific performance aspect, it enables operators to better diagnose and optimize performance bottlenecks related to KVCache movement between prefill and decode nodes.

Highlights

  • New Metric Introduction: I've added a new Prometheus Histogram metric, sglang:kvcache_transfer_latency, to specifically monitor the time taken for KVCache data transfer between prefill and decode nodes in a PD disaggregation setup. This provides a more granular view than the existing TTFT metric.
  • Latency Measurement Mechanism: I've implemented a mechanism to capture the start and end timestamps of KVCache transfer operations (send_kvcache and send_kvcache_slice). The duration is accumulated per request across all chunks, and the total latency is reported to the metrics collector once all KVCache data for a request has been successfully transferred.
  • Metrics Integration: The SchedulerMetricsCollector is now passed to the KV manager in both the prefill and decode components, allowing the MooncakeConn (responsible for KVCache transfer) to observe and report the new latency metric.
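
For context, the collector side of such a metric might look roughly like the following with the prometheus_client library. The metric name, label set, and bucket boundaries are taken from the metrics dump in the PR description, but the exact wiring inside SchedulerMetricsCollector may differ; `observe_kvcache_transfer_latency` and the label values here are illustrative:

```python
from prometheus_client import CollectorRegistry, Histogram

# A dedicated registry so this sketch doesn't collide with globals.
registry = CollectorRegistry()

kvcache_transfer_latency = Histogram(
    "sglang:kvcache_transfer_latency",
    "Histogram of kvcache transfer latency in seconds.",
    labelnames=["engine_type", "model_name"],
    # Bucket bounds mirroring the dump above; +Inf is added automatically.
    buckets=[0.001, 0.002, 0.004, 0.006, 0.008, 0.01, 0.02, 0.04, 0.06,
             0.08, 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40,
             60, 80, 100, 200, 400],
    registry=registry,
)

def observe_kvcache_transfer_latency(value: float) -> None:
    # Convenience wrapper mirroring the collector method named in the PR;
    # the label values below are placeholders.
    kvcache_transfer_latency.labels(
        engine_type="unified", model_name="demo-model"
    ).observe(value)

observe_kvcache_transfer_latency(0.015)
```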

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new histogram metric, kvcache_transfer_latency, to monitor the performance of KVCache transfers in a PD disaggregation setup. The changes involve adding the metric definition, capturing timestamps around the transfer process, and reporting the collected latency. The review focuses on improving code clarity, consistency, and maintainability by suggesting simplifications to the latency calculation logic, recommending a more appropriate timer, and refactoring constants.

Comment on lines +553 to +569
if self.scheduler_metrics_collector is not None:
    if req.room not in self.kvcache_transfer_latency_table:
        self.kvcache_transfer_latency_table[req.room] = {
            req.mooncake_session_id: -time.time()
        }
    else:
        if (
            req.mooncake_session_id
            not in self.kvcache_transfer_latency_table[req.room]
        ):
            self.kvcache_transfer_latency_table[req.room][
                req.mooncake_session_id
            ] = -time.time()
        else:
            self.kvcache_transfer_latency_table[req.room][
                req.mooncake_session_id
            ] -= time.time()

medium

The nested if statements for checking the existence of req.room and req.mooncake_session_id in self.kvcache_transfer_latency_table can be simplified using setdefault to avoid redundant checks and improve readability. This approach reduces the lines of code and makes the intent clearer.

                            if self.scheduler_metrics_collector is not None:
                                latency_table = self.kvcache_transfer_latency_table.setdefault(req.room, {})
                                latency_table[req.mooncake_session_id] = -time.time()
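
For comparison, here is a small self-contained check showing that a setdefault-based form can also preserve the accumulate-on-repeat behavior of the nested version. Names are illustrative, and an explicit `now` argument replaces time.time() so the comparison is deterministic:

```python
def record_start_nested(table, room, session, now):
    # The nested-check form from the diff, with `now` injected.
    if room not in table:
        table[room] = {session: -now}
    elif session not in table[room]:
        table[room][session] = -now
    else:
        table[room][session] -= now

def record_start_setdefault(table, room, session, now):
    # Flattened form: setdefault creates the per-room dict on demand,
    # and get(..., 0.0) keeps the accumulate-on-repeat semantics.
    sessions = table.setdefault(room, {})
    sessions[session] = sessions.get(session, 0.0) - now

t_nested, t_setdefault = {}, {}
for now in (10.0, 20.0):
    record_start_nested(t_nested, 1, "session-a", now)
    record_start_setdefault(t_setdefault, 1, "session-a", now)
```

Both tables end up identical after repeated starts, so the flattening is behavior-preserving.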

Contributor Author

Adopting this suggestion would actually make the code more complicated after pre-commit formatting; as far as I can see, the current change is the most concise one.

@SCDESPERTATE SCDESPERTATE marked this pull request as draft July 11, 2025 04:27
@SCDESPERTATE SCDESPERTATE marked this pull request as ready for review July 11, 2025 12:09
@ShangmingCai
Collaborator

@whybeyoung @stmatengss Can you sync on the design with the PR owner? I think this is related to the fine-grained monitor/profiling for PD.

@stmatengss
Collaborator

@whybeyoung @stmatengss Can you sync on the design with the PR owner? I think this is related to the fine-grained monitor/profiling for PD.

Sure. I will handle this PR.

# Convenience function for logging to gauge.
gauge.labels(**self.labels).set(data)

def observe_kvcache_transfer_latency(self, value: float):
Collaborator

Should the return type be annotated as None?

Contributor Author

Great catch! Appreciate it

)

if self.scheduler_metrics_collector is not None:
self.kvcache_transfer_latency_table: Dict[int, Dict[str, float]] = {}
Collaborator

int: req.room, str: session id, float: time. Is my understanding correct?

Contributor Author

Yes, you are right.

if self.scheduler_metrics_collector is None:
return

if before_transfer:
Collaborator

Why not implement it like this?

if xxx not in kvcache_transfer_latency_table:
    kvcache_transfer_latency_table[xxx] = 0
kvcache_transfer_latency_table[xxx] -= time()

Contributor Author

Not exactly: kvcache_transfer_latency_table[xxx] is itself a dict. But I find your suggestion useful for polishing lines 960~966 😊

f"Losing connection with prefill instance (bootstrap_addr: {failed_bootstrap_addr}), affected {len(affected_rooms)} requests"
)

def _collect_kv_transfer_timestamp(
Collaborator

I think two separate functions would be more general: _collect_kv_transfer_timestamp_begin and _collect_kv_transfer_timestamp_end.

@@ -26,6 +26,7 @@
)
from sglang.srt.disaggregation.common.utils import group_concurrent_contiguous
Collaborator

It seems you don't implement the same latency monitor for NIXL.

Contributor Author

Yes. So far I'm not very familiar with the NIXL code; I may go through it later and add support for it. But for interface compatibility, the collector field has to be added; otherwise an exception would be raised.

if kv_chunk.is_last:
if self.scheduler_metrics_collector is not None:
self.scheduler_metrics_collector.observe_kvcache_transfer_latency(
self.kvcache_transfer_latency_table[req.room].pop(
Collaborator

Can kvcache_transfer_latency_table be part of scheduler_metrics_collector? That would reduce the code changes on the conn.py side.

Contributor Author

It might make sense, but my concern is that such a change would disrupt the design of the collector, since all its members' types come from Prometheus. Moreover, operations like timestamp collection seem better kept out of the collector class, since I believe they are disaggregation-specific.

Contributor Author (@SCDESPERTATE, Aug 4, 2025)

Following up, I think the kvcache_transfer_latency_table and its related operations could be abstracted into a class and placed into the python/sglang/srt/disaggregation/utils.py file. What do you think?

@stmatengss left a comment

LGTM.

@stmatengss
Collaborator

Could you implement the same function in the other backends (Ascend, Nixl, and Fake)? @SCDESPERTATE

@SCDESPERTATE
Contributor Author

Could you implement the same function in the other backends (Ascend, Nixl, and Fake)? @SCDESPERTATE

As far as I can see, the Ascend backend inherits the transfer process defined in Mooncake, and Fake doesn't involve data transfer, so only NIXL needs to be altered. However, since I would have to spend some time going through its implementation to be sure the support is correct, the NIXL implementation is left to the relevant developers. That said, class KVCacheTransferLatencyMonitor's interfaces are not Mooncake-specific, so it is plug-and-play for the NIXL backend.

disaggregation_mode: DisaggregationMode,
server_args: ServerArgs,
is_mla_backend: Optional[bool] = False,
scheduler_metrics_collector: Optional[SchedulerMetricsCollector] = None,
Collaborator

Same as the other PR: don't pass the collector into the KV manager.

Contributor Author

Got it.

Contributor Author

After a deeper analysis, I find it difficult to achieve fine-grained latency tracking without passing the collector into the KVManager and coordinating inside conn.py 🤔 The purpose of this PR is to track the KV transfer latency of the network stack itself, reflecting real-time network performance. If timestamp collection were only allowed in prefill.py, generality would be preserved, but irrelevant latencies such as request queueing, scheduler dispatching, and result polling would be included in the metric, which would mislead operators. Hence, I think passing the collector into the KVManager is necessary in this case.
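
To make the argument concrete, here is a minimal sketch of timing only the transfer call inside the KV manager. The class and method names are hypothetical stand-ins (the stub collector only implements the observe hook), and it uses time.monotonic() as the timer, which is generally preferable to time.time() for measuring durations:

```python
import time
from typing import Optional


class CollectorStub:
    """Stand-in for SchedulerMetricsCollector; only the observe hook matters."""

    def __init__(self):
        self.observed = []

    def observe_kvcache_transfer_latency(self, value: float) -> None:
        self.observed.append(value)


class KVManagerSketch:
    """Times only the transfer call itself, excluding queueing/scheduling."""

    def __init__(self, collector: Optional[CollectorStub] = None):
        self.collector = collector

    def send_kvcache(self, transfer_fn):
        start = time.monotonic()
        transfer_fn()  # the actual network transfer happens here
        elapsed = time.monotonic() - start
        if self.collector is not None:
            self.collector.observe_kvcache_transfer_latency(elapsed)


collector = CollectorStub()
manager = KVManagerSketch(collector)
manager.send_kvcache(lambda: None)  # no-op transfer for illustration
```

Because the timestamps bracket only `transfer_fn`, the reported latency reflects the network path rather than scheduler overhead, which is the point being argued above.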

@stmatengss
Collaborator

Do you happen to have any updates to this PR? @SCDESPERTATE
