Refactor: observability code cleanup#17862
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
/tag-and-rerun-ci |
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
fbb3f87 to
c381672
Compare
ShangmingCai
left a comment
There was a problem hiding this comment.
The disaggregation part LGTM.
| global_diff_realtime_monotonic = time.time() - time.perf_counter() | ||
|
|
||
|
|
||
| def calibrate_time_diff(): |
There was a problem hiding this comment.
Using global_diff_realtime_monotonic is a good idea for converting perf_counter values to timestamps. However, when DP > 1, you may have a single tokenization manager but multiple schedulers running across different devices. In that case, you need to be careful: each process/device can have its own monotonic clock offset, so the conversion may be inconsistent across ranks unless those offsets are synchronized or computed per rank.
There was a problem hiding this comment.
Monotonic time is typically obtained via the Linux kernel's vDSO (or system call), and the Linux kernel ensures timestamp consistency across different CPUs. Even in extremely rare cases, non-observability functionalities only rely on time retrieved within the same process, while observability features can fully tolerate such minimal timing discrepancies.
There was a problem hiding this comment.
But if the tokenize manager and the scheduler are running on completely different machines, will the monotonic time they obtain still be consistent?
There was a problem hiding this comment.
Fixed. Correct monotonic time during deserialization by propagating it with a diff.
2c1c70f to
8092b3f
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and well-executed refactoring of the observability code. The changes centralize timing, metrics, and tracing logic into new ReqTimeStats and TraceContext objects, which greatly improves code organization and maintainability. The new API for tracing is much cleaner and more powerful, with features like dynamic trace levels. The use of monotonic time and careful handling of time across processes are also commendable. Overall, this is a high-quality refactoring that enhances the observability of the system. My review includes a few minor suggestions for improving the documentation.
|
/rerun-failed-ci |
584e0e6 to
d42426b
Compare
sherlockwu
left a comment
There was a problem hiding this comment.
Thanks for the cleanup of the metrics and tracing code! LGTM. A few small nits below.
| # Detailed breakdown of cached tokens by source (device/host/storage) | ||
| cached_tokens_details: Optional[List[Optional[Dict[str, Any]]]] = None | ||
|
|
||
| # for observability |
There was a problem hiding this comment.
nit: For observability
same as other places.
| self.abort() | ||
|
|
||
| if attrs: | ||
| self.root_span.set_attributes(attrs) |
There was a problem hiding this comment.
Is it possible for root_span to be None (e.g. after deserialization)?
It looks impossible under current logic -- but probably safer to add an explicit guard / null check
There was a problem hiding this comment.
ok, I have added it.
d76ad22 to
d233882
Compare
|
/rerun-failed-ci |
e558920 to
281156b
Compare
|
/rerun-failed-ci |
|
/rerun-failed-ci |
3 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
74c4f00 to
74112dd
Compare
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
74112dd to
9f765ab
Compare
|
/rerun-failed-ci |
2 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
Adapt HiCache host-tier metrics to follow the canonical SchedulerStats -> SchedulerMetricsCollector -> log_stats() pipeline established in #17862. Mirrors the LoRA conditional-metrics pattern: - Add hicache_host_used_tokens/hicache_host_total_tokens to SchedulerStats - Add enable_hierarchical_cache flag to SchedulerMetricsCollector - Populate stats in _log_hicache_stats() before log_stats() push - Remove standalone HiCacheMetricsCollector class
|
Hi @sufeng-buaa, Thanks for the great refactoring work! While reviewing the observability code, I noticed that the service name is fixed as "sglang". Would it make sense to make it compatible with the standard OTEL_SERVICE_NAME environment variable? This could make it easier to integrate with existing OpenTelemetry setups. If you are open to this enhancement, I'd be more than happy to implement it and submit a follow-up PR. Let me know what you think! |
The tracing feature is still pretty early. I haven't really thought hard about the naming yet. Feel free to send PRs and help standardize it. |
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
Motivation
According to #17482, organize the observability-related code, remove redundant code, and unify the interfaces for time statistics, request latency metrics, and request tracing.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci