Refactor: observability code cleanup by sufeng-buaa · Pull Request #17862 · sgl-project/sglang

sufeng-buaa · 2026-01-28T06:26:57Z

Motivation

Following #17482, this PR organizes the observability-related code, removes redundant code, and unifies the interfaces for time statistics, request latency metrics, and request tracing.

Modifications

Move observability-related code to python/sglang/srt/observability
Remove redundant code, refactor inappropriate code, and correct non-standard naming.
Timestamp recording
- python/sglang/srt/observability/req_time_stats.py defines APIServerReqTimeStats, DPControllerReqTimeStats, and SchedulerReqTimeStats, which record timestamps for the tokenizer/gRPC server, the DP controller, and the scheduler, respectively. Each class provides a series of set_*_time methods to record timestamps, along with get methods to calculate latency.
- Use monotonic time uniformly. The difference between monotonic time and realtime is updated on each incoming request so that timestamps can be converted to realtime when necessary.
Request latency metrics
- Define the base class ReqTimeStatsBase for APIServerReqTimeStats, DPControllerReqTimeStats, and SchedulerReqTimeStats, and integrate a metrics collector into it. Each set_*_time method exports latency information to the metrics collector.
Request Tracing
- ReqTimeStatsBase also integrates a trace context, and each set_*_time method exports trace spans. ReqTimeStatsBase defines __getstate__ and __setstate__ to propagate the trace context.
- Refactor the tracing package and optimize the span structure.
- Support trace levels, and allow adjusting the trace level dynamically via an HTTP API.
- Support tracing for requests with parallel_sample_num > 1.
- Support tracing for request retraction.
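
As a rough illustration of the timestamp and latency design listed above, the sketch below records monotonic timestamps through set_*_time methods and pushes each latency to a collector. All names here (SketchReqTimeStats, ListCollector, set_received_time, get_ttft, and so on) are hypothetical and are not taken from req_time_stats.py.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Offset between wall-clock time and the monotonic clock, sampled once here.
# The PR description says a similar offset is refreshed on each request.
REALTIME_MINUS_MONOTONIC = time.time() - time.perf_counter()


def to_realtime(monotonic_ts: float) -> float:
    """Convert a perf_counter() reading to an approximate Unix timestamp."""
    return monotonic_ts + REALTIME_MINUS_MONOTONIC


@dataclass
class ListCollector:
    """Hypothetical stand-in for a metrics collector."""

    observed: List[Tuple[str, float]] = field(default_factory=list)

    def observe(self, name: str, seconds: float) -> None:
        self.observed.append((name, seconds))


@dataclass
class SketchReqTimeStats:
    """Hypothetical per-request stats object; not the PR's actual class."""

    collector: ListCollector
    received_time: float = 0.0
    first_token_time: float = 0.0
    finished_time: float = 0.0

    def set_received_time(self, ts: Optional[float] = None) -> None:
        self.received_time = time.perf_counter() if ts is None else ts

    def set_first_token_time(self, ts: Optional[float] = None) -> None:
        self.first_token_time = time.perf_counter() if ts is None else ts
        # Export the latency as soon as the timestamp is known.
        self.collector.observe("ttft", self.get_ttft())

    def set_finished_time(self, ts: Optional[float] = None) -> None:
        self.finished_time = time.perf_counter() if ts is None else ts
        self.collector.observe("e2e", self.get_e2e_latency())

    def get_ttft(self) -> float:
        return self.first_token_time - self.received_time

    def get_e2e_latency(self) -> float:
        return self.finished_time - self.received_time
```

Latencies are differences of monotonic readings, so they are unaffected by wall-clock adjustments; to_realtime is only needed when a timestamp has to be shown as a calendar time.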

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-28T06:27:01Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

sufeng-buaa · 2026-01-28T06:28:52Z

/tag-and-rerun-ci

sufeng-buaa · 2026-01-29T03:23:21Z

/gemini review

gemini-code-assist · 2026-01-29T03:23:24Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

ShangmingCai

The disaggregation part LGTM.

zhanghaotong · 2026-01-29T06:50:34Z

python/sglang/srt/observability/req_time_stats.py

+global_diff_realtime_monotonic = time.time() - time.perf_counter()
+
+
+def calibrate_time_diff():


Using global_diff_realtime_monotonic is a good idea for converting perf_counter values to timestamps. However, when DP > 1, you may have a single tokenization manager but multiple schedulers running across different devices. In that case, you need to be careful: each process/device can have its own monotonic clock offset, so the conversion may be inconsistent across ranks unless those offsets are synchronized or computed per rank.

Monotonic time is typically obtained via the Linux kernel's vDSO (or a system call), and the Linux kernel ensures timestamp consistency across different CPUs. Even in the extremely rare cases where it does not, non-observability functionality relies only on time retrieved within the same process, and observability features can fully tolerate such minimal timing discrepancies.

But if the tokenizer manager and the scheduler are running on completely different machines, will the monotonic time they obtain still be consistent?

Fixed. Monotonic time is now corrected during deserialization by propagating it together with a diff.
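
One way such a correction could work is sketched below; this is an assumption about the approach, not the code merged in the PR. The sender attaches its own realtime-minus-monotonic offset when the object is serialized, and the receiver shifts each timestamp onto its own monotonic clock in __setstate__. The names SketchTimeStats, TIMESTAMP_FIELDS, local_offset, and sender_offset are hypothetical, and the sketch assumes the wall clocks of the two hosts are reasonably synchronized (for example via NTP).

```python
import time
from typing import Any, Dict

# Hypothetical list of attributes that hold monotonic timestamps.
TIMESTAMP_FIELDS = ("received_time", "dispatched_time")


def local_offset() -> float:
    """Wall-clock minus monotonic time for the current process."""
    return time.time() - time.perf_counter()


class SketchTimeStats:
    def __init__(self) -> None:
        now = time.perf_counter()
        self.received_time = now
        self.dispatched_time = now

    def __getstate__(self) -> Dict[str, Any]:
        state = self.__dict__.copy()
        # Ship the sender's offset so the receiver can re-base timestamps.
        state["sender_offset"] = local_offset()
        return state

    def __setstate__(self, state: Dict[str, Any]) -> None:
        state = dict(state)
        sender_offset = state.pop("sender_offset")
        # sender_mono + sender_offset is wall-clock time; subtracting the
        # receiver's offset expresses it on the receiver's monotonic clock.
        shift = sender_offset - local_offset()
        for name in TIMESTAMP_FIELDS:
            state[name] += shift
        self.__dict__.update(state)
```

Within a single host the two offsets are nearly equal, so the shift is close to zero and the timestamps pass through essentially unchanged.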

sufeng-buaa · 2026-01-30T10:44:19Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant and well-executed refactoring of the observability code. The changes centralize timing, metrics, and tracing logic into new ReqTimeStats and TraceContext objects, which greatly improves code organization and maintainability. The new API for tracing is much cleaner and more powerful, with features like dynamic trace levels. The use of monotonic time and careful handling of time across processes are also commendable. Overall, this is a high-quality refactoring that enhances the observability of the system. My review includes a few minor suggestions for improving the documentation.

docs/references/production_request_trace.md

sufeng-buaa · 2026-02-01T05:15:05Z

/rerun-failed-ci

sherlockwu

Thanks for the cleanup of the metrics and tracing code! LGTM. A few small nits below.

sherlockwu · 2026-02-18T00:56:33Z

python/sglang/srt/managers/io_struct.py

    # Detailed breakdown of cached tokens by source (device/host/storage)
    cached_tokens_details: Optional[List[Optional[Dict[str, Any]]]] = None

+    # for observability


nit: For observability

same as other places.

sherlockwu · 2026-02-18T01:21:18Z

python/sglang/srt/observability/trace.py

+        self.abort()
+
+        if attrs:
+            self.root_span.set_attributes(attrs)


Is it possible for root_span to be None (e.g. after deserialization)?

It looks impossible under current logic -- but probably safer to add an explicit guard / null check

ok, I have added it.
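
A minimal sketch of the kind of guard discussed here, using a stand-in span class. SketchTraceContext, FakeSpan, and set_root_attrs are hypothetical names; only the set_attributes call mirrors the diff above.

```python
from typing import Any, Dict, Optional


class FakeSpan:
    """Stand-in for a tracing span that accepts attributes."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attributes(self, attrs: Dict[str, Any]) -> None:
        self.attributes.update(attrs)


class SketchTraceContext:
    def __init__(self, root_span: Optional[FakeSpan] = None) -> None:
        # root_span may be None, e.g. if the context was rebuilt without one.
        self.root_span = root_span

    def set_root_attrs(self, attrs: Optional[Dict[str, Any]] = None) -> bool:
        """Attach attrs to the root span; return False if there is no span."""
        if self.root_span is None:
            return False
        if attrs:
            self.root_span.set_attributes(attrs)
        return True
```

Returning a boolean instead of raising keeps tracing failures from affecting request handling, which is one reasonable choice for observability code.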

sufeng-buaa · 2026-02-19T13:30:58Z

/rerun-failed-ci

sufeng-buaa · 2026-02-23T04:17:33Z

/rerun-failed-ci

sufeng-buaa · 2026-02-23T14:21:01Z

/rerun-failed-ci

sufeng-buaa · 2026-02-23T15:07:38Z

/rerun-failed-ci

sufeng-buaa · 2026-02-23T16:22:40Z

/rerun-failed-ci

sufeng-buaa · 2026-02-23T16:57:12Z

/rerun-failed-ci

Signed-off-by: Feng Su <sufeng@linux.alibaba.com>

sufeng-buaa · 2026-02-24T05:01:16Z

/rerun-failed-ci

sufeng-buaa · 2026-02-24T06:42:01Z

/rerun-failed-ci

sherlockwu · 2026-02-24T17:22:52Z

/rerun-failed-ci

Adapt HiCache host-tier metrics to follow the canonical SchedulerStats -> SchedulerMetricsCollector -> log_stats() pipeline established in #17862. Mirrors the LoRA conditional-metrics pattern:
- Add hicache_host_used_tokens/hicache_host_total_tokens to SchedulerStats
- Add enable_hierarchical_cache flag to SchedulerMetricsCollector
- Populate stats in _log_hicache_stats() before log_stats() push
- Remove standalone HiCacheMetricsCollector class
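
The stats -> collector -> log_stats() flow named in this commit message can be sketched roughly as follows. The field and flag names come from the message itself; the class bodies, the num_running_reqs field, and the gauges dictionary are assumptions made for illustration and do not reflect the real SchedulerStats or SchedulerMetricsCollector.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SketchSchedulerStats:
    num_running_reqs: int = 0
    # Only meaningful when the hierarchical cache is enabled.
    hicache_host_used_tokens: int = 0
    hicache_host_total_tokens: int = 0


class SketchMetricsCollector:
    def __init__(self, enable_hierarchical_cache: bool = False) -> None:
        self.enable_hierarchical_cache = enable_hierarchical_cache
        # Hypothetical in-memory gauge store standing in for a real backend.
        self.gauges: Dict[str, float] = {}

    def log_stats(self, stats: SketchSchedulerStats) -> None:
        self.gauges["num_running_reqs"] = stats.num_running_reqs
        # Conditional metrics: exported only when the feature flag is on.
        if self.enable_hierarchical_cache:
            self.gauges["hicache_host_used_tokens"] = stats.hicache_host_used_tokens
            self.gauges["hicache_host_total_tokens"] = stats.hicache_host_total_tokens
```

Keeping the conditional fields on the shared stats object means a single log_stats() call remains the only export path, so no separate collector class is needed.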

feng-95 · 2026-02-27T04:02:24Z

Hi @sufeng-buaa, Thanks for the great refactoring work!

While reviewing the observability code, I noticed that the service name is fixed as "sglang". Would it make sense to make it compatible with the standard OTEL_SERVICE_NAME environment variable? This could make it easier to integrate with existing OpenTelemetry setups.

If you are open to this enhancement, I'd be more than happy to implement it and submit a follow-up PR. Let me know what you think!

sufeng-buaa · 2026-02-27T04:21:12Z

The tracing feature is still pretty early. I haven't really thought hard about the naming yet. Feel free to send PRs and help standardize it.
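
A minimal sketch of what the proposed follow-up might look like, assuming the only requirement is to prefer OTEL_SERVICE_NAME and fall back to the current fixed name. The helper resolve_service_name and the constant DEFAULT_SERVICE_NAME are hypothetical and not part of SGLang.

```python
import os
from typing import Mapping, Optional

DEFAULT_SERVICE_NAME = "sglang"


def resolve_service_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OTEL_SERVICE_NAME if set and non-empty, else the default."""
    env = os.environ if env is None else env
    name = env.get("OTEL_SERVICE_NAME", "").strip()
    return name or DEFAULT_SERVICE_NAME
```

Passing the environment as a parameter keeps the helper easy to test; callers in production code would simply call resolve_service_name() with no arguments.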

Signed-off-by: Feng Su <sufeng@linux.alibaba.com>

sufeng-buaa requested review from ByronHsu, CatherineSue, JustinTong0323, ShangmingCai, Ying1123, ch-wan, fzyzcjy, hanming-lu, hnyls2002, ispobock, merrymercy, slin1237, xiezhq-hermann and yizhang2077 as code owners January 28, 2026 06:26

github-actions bot added the documentation label Jan 28, 2026

github-actions bot added the run-ci label Jan 28, 2026

sufeng-buaa mentioned this pull request Jan 29, 2026

SGLang Tracing: Add trace-level, trace-module, and unify tracing/request-stage-metrics #13152

Closed

4 tasks

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch from fbb3f87 to c381672 Compare January 29, 2026 05:45

ShangmingCai reviewed Jan 29, 2026

zhanghaotong reviewed Jan 29, 2026

sufeng-buaa mentioned this pull request Jan 30, 2026

[Roadmap] roadmap of request tracing (2025 Q4 and 2026 Q1) #13511

Open

16 tasks

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch 2 times, most recently from 2c1c70f to 8092b3f Compare January 30, 2026 10:44

gemini-code-assist bot reviewed Jan 30, 2026

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch 4 times, most recently from 584e0e6 to d42426b Compare February 18, 2026 01:15

sherlockwu approved these changes Feb 18, 2026

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch 2 times, most recently from d76ad22 to d233882 Compare February 19, 2026 10:12

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch 2 times, most recently from e558920 to 281156b Compare February 23, 2026 03:11

merrymercy approved these changes Feb 23, 2026

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch 2 times, most recently from 74c4f00 to 74112dd Compare February 24, 2026 00:19

Refactor: observability code cleanup

9f765ab

Signed-off-by: Feng Su <sufeng@linux.alibaba.com>

sufeng-buaa force-pushed the sufeng-buaa/observability_integration branch from 74112dd to 9f765ab Compare February 24, 2026 02:28

merrymercy merged commit 3b89302 into sgl-project:main Feb 25, 2026
256 of 271 checks passed

sufeng-buaa linked an issue Feb 27, 2026 that may be closed by this pull request

[Refactor ideas] Per request time stats tracing #17482

Closed

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026

Refactor: observability code cleanup (sgl-project#17862)

fd7f6d1

Signed-off-by: Feng Su <sufeng@linux.alibaba.com>

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

Refactor: observability code cleanup (sgl-project#17862)

226bc31

Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
