[PD metrics] Add latency Histogram metrics of each stage for generate requests #8710
Conversation
Summary of Changes
Hello @acelyc111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances our performance monitoring capabilities by introducing detailed latency metrics for various stages of request processing within the system. My primary goal is to provide granular insights into the time spent during prefill and decode operations, enabling more effective identification and resolution of performance bottlenecks. This is achieved by instrumenting key code paths to record stage-specific latencies and exposing these as Prometheus Histogram metrics.
Highlights
- **New Latency Tracking Method:** I've introduced a new `add_latency` method within the `Req` class, which allows for precise timing of different request processing stages. This method records the time elapsed since the last recorded point for a given stage.
- **Granular Latency Measurement Points:** I've integrated calls to the new `add_latency` method at key points across the prefill and decode pipelines. This includes stages like `prepare`, `bootstrap`, `forward`, `transfer_kv_cache`, and `waiting`, providing granular insights into where time is spent.
- **Prometheus Histogram for Stage Latency:** I've added a new Prometheus `Histogram` metric, `sglang:request_latency_seconds`, to the `SchedulerMetricsCollector`. This Histogram is configured with exponential buckets to effectively capture and visualize latency distributions for each stage, allowing for better bottleneck identification.
- **Metrics Utility Refactoring:** I've refactored the `exponential_buckets` utility function into its own dedicated file (`sglang/srt/metrics/utils.py`) to improve code organization and reusability for metric-related utilities.
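The per-stage timing pattern described above can be sketched as follows. This is a minimal, hypothetical illustration: the field names (`last_tic`, `stage_latencies`) are assumptions, and the real `Req` class in sglang carries many more attributes.

```python
import time

class Req:
    """Trimmed-down request object; illustrative only, not sglang's actual Req."""

    def __init__(self):
        # Timestamp of the last recorded point for this request.
        self.last_tic = time.monotonic()
        # Accumulated latency per stage name, in seconds.
        self.stage_latencies = {}

    def add_latency(self, stage: str) -> None:
        # Record the time elapsed since the last recorded point,
        # then reset the reference point for the next stage.
        now = time.monotonic()
        elapsed = now - self.last_tic
        self.stage_latencies[stage] = self.stage_latencies.get(stage, 0.0) + elapsed
        self.last_tic = now

req = Req()
req.add_latency("prepare")
req.add_latency("forward")
print(sorted(req.stage_latencies))
```

Because each call resets the reference timestamp, consecutive calls naturally partition the request lifetime into non-overlapping stage durations.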
Code Review
This pull request introduces latency histogram metrics for various stages of generate requests in the disaggregated prefill/decode architecture. The changes involve adding a new Histogram metric in the SchedulerMetricsCollector, a new add_latency method to the Req class to record stage latencies, and instrumenting the code at key points in the request lifecycle to call this method. The implementation is mostly correct, but I've identified a potential AttributeError when metrics are disabled and a misleading comment in the metric definition. My feedback includes suggestions to fix these issues.
This is great - thank you so much for this PR (@SCDESPERTATE for #7317). Having more observability around the P/D pieces is crucial. Tagging @zhyncs and @ByronHsu to take a look. Let's see if we can get these PRs in soon.
@acelyc111 - NVIDIA GPU CI seems to be failing. An example is https://github.com/sgl-project/sglang/actions/runs/16723290245/job/47354428066?pr=8710
Fixed, please trigger the CI again, thanks!
@ishandhanani Actually, #7944 is the similar work for monitoring KV cache transfer latency, not #7317 😂. The discussion and review have already been opened on that PR.
@acelyc111 - another merge conflict |
@ishandhanani Resolved. |
@zhyncs Please take a look, thanks!
For the prepare stage of the request, would it be better to use create_time as last_tic? This way we can include the tokenizer time in the metrics.
To get fine-grained stats on the time spent in the tokenizer, prefill, and detokenizer stages, I think it's better to keep these stages separate; the metrics added in this PR only cover the latencies in the prefill and decode stages.
@ishandhanani @zhyncs Is this ready to be merged? I just want to reduce the divergence between our company's internal repo and the community repo, thanks!
hnyls2002 left a comment
LGTM. @ShangmingCai @trevor-m @ByronHsu Maybe you can take a look at the PD-disaggregation latency records.
```python
from __future__ import annotations

import enum
```
move this below copyright
Motivation
In PD disaggregation deployments, a request on either the prefill or decode side goes through several stages. When troubleshooting online issues, we can currently only observe the queue lengths of requests in these stages (e.g. `num_prefill_bootstrap_queue_reqs`, `num_decode_prealloc_queue_reqs`), but not how long requests spend in each stage, so it's hard to tell whether the queues are churning while holding a steady length, or simply stuck.

Modifications
Introduce a new metric named `sglang:request_latency_seconds`. It carries the same labels as the other scheduler metrics, plus an extra `stage` label, which can take values such as `prepare`, `bootstrap`, `forward`, `transfer_kv_cache`, and `waiting`. The latency of each stage is recorded by calling the new `add_latency` method.
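The exponential bucket layout used for the histogram can be sketched as below. This `exponential_buckets` is an illustrative reimplementation; the exact signature of the utility the PR moves into `sglang/srt/metrics/utils.py` may differ.

```python
def exponential_buckets(start: float, factor: float, count: int) -> list:
    # Bucket upper bounds grow geometrically: start, start*factor, start*factor^2, ...
    return [start * (factor ** i) for i in range(count)]

# Geometric buckets let sub-millisecond stages (e.g. prepare) and
# multi-second stalls (e.g. waiting) land in distinguishable buckets
# of a single Prometheus Histogram.
buckets = exponential_buckets(0.001, 2.0, 5)
print(buckets)  # [0.001, 0.002, 0.004, 0.008, 0.016]
```

Each observed stage latency would then be routed into the first bucket whose upper bound exceeds it, which is how a Prometheus `Histogram` aggregates the distribution per `stage` label.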
Accuracy Test
Benchmark & Profiling
Checklist