
[PD metrics] Add latency Histogram metrics of each stage for generate requests #8710

Merged
hnyls2002 merged 2 commits into sgl-project:main from acelyc111:fix_some_pd_metrics
Sep 15, 2025

Conversation

@acelyc111
Collaborator

@acelyc111 acelyc111 commented Aug 2, 2025

Motivation

In PD disaggregation deployments, a request passes through several stages on both the prefill and decode sides. When troubleshooting online issues, we can only observe the queue lengths at these stages (e.g. num_prefill_bootstrap_queue_reqs, num_decode_prealloc_queue_reqs), but not how long requests spend in each stage. That makes it hard to tell whether a steady queue length means requests are flowing through quickly or sitting motionless.

Modifications

Introduce a new metric named sglang:request_latency_seconds. It carries the same labels as the other scheduler metrics, plus an extra stage label, which can take the following values:

    # prefill
    PREFILL_WAITING = "prefill_waiting"

    # disaggregation prefill
    PREFILL_PREPARE = "prefill_prepare"
    PREFILL_BOOTSTRAP = "prefill_bootstrap"
    PREFILL_FORWARD = "prefill_forward"
    PREFILL_TRANSFER_KV_CACHE = "prefill_transfer_kv_cache"

    # disaggregation decode
    DECODE_PREPARE = "decode_prepare"
    DECODE_BOOTSTRAP = "decode_bootstrap"
    DECODE_WAITING = "decode_waiting"
    DECODE_TRANSFERRED = "decode_transferred"

The per-stage values are easy to record with this method:

    # stage A operations
    req.add_latency("stage_A")
    # stage B operations
    req.add_latency("stage_B")
    # stage C operations
    req.add_latency("stage_C")
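The bookkeeping behind the snippet above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the `observe` callback stands in for the Prometheus Histogram's `labels(stage=...).observe`, and the `last_tic` attribute name is an assumption.

```python
import time


class Req:
    """Sketch of per-request stage timing (names are assumptions)."""

    def __init__(self, observe=lambda stage, seconds: None):
        # observe stands in for histogram.labels(stage=...).observe
        self.observe = observe
        self.last_tic = time.monotonic()

    def add_latency(self, stage: str) -> float:
        # Attribute the time elapsed since the last recorded point to
        # `stage`, then reset the reference point for the next stage.
        now = time.monotonic()
        elapsed = now - self.last_tic
        self.last_tic = now
        self.observe(stage, elapsed)
        return elapsed
```

Because each call measures from the previous call, consecutive calls partition the request's lifetime across stages without gaps or double counting.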

Accuracy Test

Benchmark & Profiling

Checklist

@acelyc111 acelyc111 changed the title [PD metrics] Add latency Histogram metrics of each stage for generate… [PD metrics] Add latency Histogram metrics of each stage for generate requests Aug 2, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @acelyc111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances our performance monitoring capabilities by introducing detailed latency metrics for various stages of request processing within the system. My primary goal is to provide granular insights into the time spent during prefill and decode operations, enabling more effective identification and resolution of performance bottlenecks. This is achieved by instrumenting key code paths to record stage-specific latencies and exposing these as Prometheus Histogram metrics.

Highlights

  • New Latency Tracking Method: I've introduced a new add_latency method within the Req class, which allows for precise timing of different request processing stages. This method records the time elapsed since the last recorded point for a given stage.
  • Granular Latency Measurement Points: I've integrated calls to the new add_latency method at key points across the prefill and decode pipelines. This includes stages like prepare, bootstrap, forward, transfer_kv_cache, and waiting, providing granular insights into where time is spent.
  • Prometheus Histogram for Stage Latency: I've added a new Prometheus Histogram metric, sglang:request_latency_seconds, to the SchedulerMetricsCollector. This Histogram is configured with exponential buckets to effectively capture and visualize latency distributions for each stage, allowing for better bottleneck identification.
  • Metrics Utility Refactoring: I've refactored the exponential_buckets utility function into its own dedicated file (sglang/srt/metrics/utils.py) to improve code organization and reusability for metric-related utilities.
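As a rough illustration of the bucket layout mentioned above: a sketch under the assumption that exponential_buckets takes a start value, growth factor, and bucket count (the actual signature in sglang/srt/metrics/utils.py may differ).

```python
def exponential_buckets(start: float, width: float, length: int) -> list:
    """Geometric bucket boundaries: start, start*width, start*width**2, ..."""
    return [start * (width ** i) for i in range(length)]


# For example, 20 buckets doubling from 1 ms cover roughly 1 ms up to
# ~9 minutes, which suits per-stage latencies spanning many orders of
# magnitude (the actual start/width/count used by the PR may differ).
buckets = exponential_buckets(0.001, 2.0, 20)
```

Exponential (geometric) buckets keep relative resolution roughly constant, so both millisecond-scale forward steps and minute-scale queueing delays land in informative buckets.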

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces latency histogram metrics for various stages of generate requests in the disaggregated prefill/decode architecture. The changes involve adding a new Histogram metric in the SchedulerMetricsCollector, a new add_latency method to the Req class to record stage latencies, and instrumenting the code at key points in the request lifecycle to call this method. The implementation is mostly correct, but I've identified a potential AttributeError when metrics are disabled and a misleading comment in the metric definition. My feedback includes suggestions to fix these issues.

@ishandhanani
Collaborator

ishandhanani commented Aug 4, 2025

This is great - thank you so much for this PR (@SCDESPERTATE for #7317). Having more observability around the P/D pieces is crucial.

Tagging @zhyncs and @ByronHsu to take a look. Let's see if we can get these PRs in soon.

@ishandhanani
Collaborator

@acelyc111 acelyc111 force-pushed the fix_some_pd_metrics branch from 81a95b9 to 8b1e111 on August 5, 2025 02:17
@acelyc111
Collaborator Author

> @acelyc111 - NVIDIA GPU CI seems to be failing. An example is https://github.com/sgl-project/sglang/actions/runs/16723290245/job/47354428066?pr=8710

Fixed, please trigger the CI again, thanks!

@SCDESPERTATE
Contributor

SCDESPERTATE commented Aug 5, 2025

@ishandhanani Actually, #7944 is the similar work (monitoring the KV cache transfer latency), not #7317 😂. The discussion and review have already been opened on that PR.

@ishandhanani
Collaborator

@acelyc111 - another merge conflict

@acelyc111
Collaborator Author

> @acelyc111 - another merge conflict

@ishandhanani Resolved.

@acelyc111
Collaborator Author

@zhyncs Please take a look, thanks!

@acelyc111 acelyc111 force-pushed the fix_some_pd_metrics branch from 3450c56 to 963e5dd on August 12, 2025 02:11
Collaborator


For the prepare stage of the request, would it be better to use create_time as last_tic? This way we can include the tokenizer time in the metrics.

Collaborator Author


To get fine-grained stats on the time spent in the tokenizer, prefill, and detokenizer stages, I think it's better to keep those stages separate; the metrics added in this PR only cover latencies in the prefill and decode stages.

@acelyc111 acelyc111 force-pushed the fix_some_pd_metrics branch from 963e5dd to 5fcc273 on August 23, 2025 05:09
@acelyc111
Collaborator Author

acelyc111 commented Aug 23, 2025

@ishandhanani @zhyncs Is this ready to be merged? I'd like to reduce the divergence between our company's internal repo and the community repo, thanks!

@zhyncs zhyncs self-assigned this Sep 14, 2025
@zhyncs
Collaborator

zhyncs commented Sep 14, 2025

@hnyls2002

Collaborator

@hnyls2002 hnyls2002 left a comment


LGTM. @ShangmingCai @trevor-m @ByronHsu Maybe you can take a look at the PD-disaggregation latency records.

from __future__ import annotations

import enum

Collaborator


move this below copyright

@hnyls2002 hnyls2002 merged commit b1721ed into sgl-project:main Sep 15, 2025
80 of 85 checks passed
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
6 participants