SGLang Tracing: Add trace-level, trace-module, and unify tracing/request-stage-metrics#13152
SGLang Tracing: Add trace-level, trace-module, and unify tracing/request-stage-metrics#13152sufeng-buaa wants to merge 53 commits intosgl-project:mainfrom
Conversation
Summary of ChangesHello @sufeng-buaa, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly overhauls the tracing and metrics collection system to enhance flexibility, reduce overhead, and broaden its applicability. By transitioning to a class-based Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant refactoring of the tracing system in sglang. It replaces the global tracing functions with a class-based design centered around RequestTimeRecorder, unifying tracing with request stage metrics. It also adds --trace-level and --trace-module for more granular control over tracing, replacing the old --enable-trace flag. The changes are extensive, touching documentation, server arguments, and core scheduler logic. My review focused on ensuring the new API is used consistently, the documentation is accurate, and the refactoring is sound. I've identified a few issues in the documentation that need correction and a critical typo in a function name that would lead to a runtime error. Overall, this is a solid enhancement to the project's observability features.
c12a958 to
e75d844
Compare
|
All feedback from Bot Assist has been addressed. |
python/sglang/srt/tracing/trace.py
Outdated
|
|
||
|
|
||
| @dataclass | ||
| class SglangTraceEvent: |
There was a problem hiding this comment.
| class SglangTraceEvent: | |
| class SGLangTraceEvent: |
nit: we should probably use the correct uppercase and lowercase of SGLang.
There was a problem hiding this comment.
I have renamed all 'Sglang***' to 'SGLang***'
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
|
Could you please take a look at my code when you have time @fzyzcjy ? They still seem very busy. |
ShangmingCai
left a comment
There was a problem hiding this comment.
LGTM. Let's wait for the CI and see if other reviewers have any other comments.
6468590 to
595367d
Compare
python/sglang/srt/tracing/trace.py
Outdated
| global get_cur_time_ns | ||
| if not opentelemetry_imported: | ||
| tracing_enabled = False | ||
| logger.warning("opentelemetry package is not installed!!! audo disable tracing") |
There was a problem hiding this comment.
There’s a minor typo: “auto”.
More importantly, I’m wondering whether we should raise an exception and terminate the main process to clearly alert users when required dependencies are missing—ensuring they understand that tracing will not function. We’ve encountered this issue repeatedly: users forget to install the necessary dependencies and later contact us confused about why no traces are being uploaded.
Moreover, I found that both vLLM and TensorRT-LLM adopt this approach—they raise exceptions in such scenarios to prevent silent failures.
There was a problem hiding this comment.
That makes sense, I'll make the changes.
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
Deprecated. Please see the new PR: #17862 |



Motivation
The PR is response to #10916. For details on the motivation and visual output, please refer to the issue.
To reduce span overhead, we added the trace-level feature. To support broader use cases beyond request tracing, we introduced trace-module.
Modifications
I thought about unifying TimeStat too, but it would require too many changes, so I gave up on that. May I will push a draft patch later.
Instrumentation Overhead Evaluation
The overhead of each instrumentation point remains almost unchanged compared to before. See #9962 and #10804
Checklist