[PD metrics] Add latency Histogram metrics of each stage for generate requests #8710
Conversation
Summary of Changes
Hello @acelyc111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances our performance monitoring capabilities by introducing detailed latency metrics for various stages of request processing within the system. My primary goal is to provide granular insights into the time spent during prefill and decode operations, enabling more effective identification and resolution of performance bottlenecks. This is achieved by instrumenting key code paths to record stage-specific latencies and exposing these as Prometheus Histogram metrics.
Highlights
- **New Latency Tracking Method:** I've introduced a new `add_latency` method within the `Req` class, which allows for precise timing of different request processing stages. This method records the time elapsed since the last recorded point for a given stage.
- **Granular Latency Measurement Points:** I've integrated calls to the new `add_latency` method at key points across the prefill and decode pipelines. This includes stages like `prepare`, `bootstrap`, `forward`, `transfer_kv_cache`, and `waiting`, providing granular insights into where time is spent.
- **Prometheus Histogram for Stage Latency:** I've added a new Prometheus `Histogram` metric, `sglang:request_latency_seconds`, to the `SchedulerMetricsCollector`. This Histogram is configured with exponential buckets to effectively capture and visualize latency distributions for each stage, allowing for better bottleneck identification.
- **Metrics Utility Refactoring:** I've refactored the `exponential_buckets` utility function into its own dedicated file (`sglang/srt/metrics/utils.py`) to improve code organization and reusability for metric-related utilities.
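The per-stage timing pattern described above can be sketched as follows. This is a minimal, hypothetical illustration: the field names (`last_tic`, `stage_latencies`) are assumptions, and the real `Req` class in sglang carries many more attributes.

```python
import time

class Req:
    """Trimmed-down request object; illustrative only, not sglang's actual Req."""

    def __init__(self):
        # Timestamp of the last recorded point for this request.
        self.last_tic = time.monotonic()
        # Accumulated latency per stage name, in seconds.
        self.stage_latencies = {}

    def add_latency(self, stage: str) -> None:
        # Record the time elapsed since the last recorded point,
        # then reset the reference point for the next stage.
        now = time.monotonic()
        elapsed = now - self.last_tic
        self.stage_latencies[stage] = self.stage_latencies.get(stage, 0.0) + elapsed
        self.last_tic = now

req = Req()
req.add_latency("prepare")
req.add_latency("forward")
print(sorted(req.stage_latencies))
```

Because each call resets the reference timestamp, consecutive calls naturally partition the request lifetime into non-overlapping stage durations.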
Code Review
This pull request introduces latency histogram metrics for various stages of generate requests in the disaggregated prefill/decode architecture. The changes involve adding a new Histogram metric in the SchedulerMetricsCollector, a new add_latency method to the Req class to record stage latencies, and instrumenting the code at key points in the request lifecycle to call this method. The implementation is mostly correct, but I've identified a potential AttributeError when metrics are disabled and a misleading comment in the metric definition. My feedback includes suggestions to fix these issues.
This is great - thank you so much for this PR (@SCDESPERTATE for #7317). Having more observability around the P/D pieces is crucial. Tagging @zhyncs and @ByronHsu to take a look. Let's see if we can get these PRs in soon.
@acelyc111 - NVIDIA GPU CI seems to be failing. An example is https://github.com/sgl-project/sglang/actions/runs/16723290245/job/47354428066?pr=8710
Fixed, please trigger the CI again, thanks!
@ishandhanani Actually, #7944 is the similar work for monitoring KV cache transfer latency, not #7317 😂. The discussion and review have already been opened on that PR.
@acelyc111 - another merge conflict |
@ishandhanani Resolved. |
@zhyncs Please take a look, thanks!
For the prepare stage of the request, would it be better to use create_time as last_tic? This way we can include the tokenizer time in the metrics.
To get fine-grained stats on the time spent in the tokenizer, prefill, and detokenizer stages, I think it's better to keep these stages separate; the metrics added in this PR only cover the latencies in the prefill and decode stages.
@ishandhanani @zhyncs Is this ready to be merged? I just want to reduce the divergence between our company's internal repo and the community repo, thanks!
hnyls2002 left a comment
LGTM. @ShangmingCai @trevor-m @ByronHsu Maybe you can take a look at the PD-disaggregation latency records.
```python
from __future__ import annotations

import enum
```
move this below copyright
Motivation
In PD disaggregation deployments, a request on either the prefill or decode side goes through several stages. When troubleshooting online issues, we can currently only observe the queue lengths of requests in these stages (e.g. `num_prefill_bootstrap_queue_reqs`, `num_decode_prealloc_queue_reqs`), but not how long requests spend in each stage, so it's hard to tell whether the queues are churning while holding a steady length, or simply stuck.

Modifications
Introduce a new metric named `sglang:request_latency_seconds`. It carries the same labels as the other scheduler metrics, plus an extra `stage` label, which can take values such as `prepare`, `bootstrap`, `forward`, `transfer_kv_cache`, and `waiting`. The latency of each stage is recorded by calling the new `add_latency` method.
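The exponential bucket layout used for the histogram can be sketched as below. This `exponential_buckets` is an illustrative reimplementation; the exact signature of the utility the PR moves into `sglang/srt/metrics/utils.py` may differ.

```python
def exponential_buckets(start: float, factor: float, count: int) -> list:
    # Bucket upper bounds grow geometrically: start, start*factor, start*factor^2, ...
    return [start * (factor ** i) for i in range(count)]

# Geometric buckets let sub-millisecond stages (e.g. prepare) and
# multi-second stalls (e.g. waiting) land in distinguishable buckets
# of a single Prometheus Histogram.
buckets = exponential_buckets(0.001, 2.0, 5)
print(buckets)  # [0.001, 0.002, 0.004, 0.008, 0.016]
```

Each observed stage latency would then be routed into the first bucket whose upper bound exceeds it, which is how a Prometheus `Histogram` aggregates the distribution per `stage` label.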
Accuracy Test
Benchmark & Profiling
Checklist