Closed
Changes from all commits
53 commits
ae001a6
Sglang Tracing: Unify tracing and req stage metrics, add trace-level …
sufeng-buaa Nov 13, 2025
8395446
Sglang Tracing: update doc
sufeng-buaa Nov 13, 2025
5dac5da
Sglang Tracing: update test cases
sufeng-buaa Nov 13, 2025
e75d844
fix lint
sufeng-buaa Nov 13, 2025
71f9d2a
rename 'NoOpTimeRecorder' to 'NoOpStageContext'
sufeng-buaa Nov 13, 2025
a847eba
sglang tracing: fix crash when enable tracing but not install otlp
sufeng-buaa Nov 13, 2025
cd6c8d8
sglang tracing: rename 'Sglang***' to 'SGLang***'
sufeng-buaa Nov 13, 2025
a1c82a8
sglang tracing: fix dependence install
sufeng-buaa Nov 13, 2025
3ef8db2
trace: add more explanations
sufeng-buaa Nov 17, 2025
6d3735a
fix tracing doc
ShangmingCai Nov 17, 2025
2ed4718
lint
ShangmingCai Nov 17, 2025
9b13760
Merge branch 'main' into sufeng-buaa/unify-trace-metric
ShangmingCai Nov 17, 2025
c9efa58
trace: update doc
sufeng-buaa Nov 17, 2025
ea30182
trace: standardize naming
sufeng-buaa Nov 18, 2025
44d4e1d
trace: fix lint
sufeng-buaa Nov 18, 2025
bbc98f0
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Nov 25, 2025
dd7e5bb
SGLang Tracing: delete bootstrap_room span and 'trace_context' header
sufeng-buaa Nov 26, 2025
f2a92d0
sglang tracing: add module name to root span
sufeng-buaa Nov 26, 2025
2f6f961
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Nov 26, 2025
2be1623
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Nov 27, 2025
6c57dec
sglang tracing: remove SGLang prefix from classes related to trace
sufeng-buaa Nov 27, 2025
7ca0972
fix lint
sufeng-buaa Nov 28, 2025
66df528
fix 'prev_span' condition
sufeng-buaa Nov 28, 2025
bfddf4a
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 1, 2025
7631089
fix root span not be closed when scheduler have no propagation_context
sufeng-buaa Dec 2, 2025
5bbeb9d
fix warning for last extra decode when enabling overlap schedule
sufeng-buaa Dec 2, 2025
0f8dc9b
add more test cases
sufeng-buaa Dec 2, 2025
6f9b330
simplify trace function name
sufeng-buaa Dec 8, 2025
c6aa18c
check exception path
sufeng-buaa Dec 8, 2025
e5f5e2a
fix null reference when propagate_context is null
sufeng-buaa Dec 8, 2025
f120734
fix abort on error path
sufeng-buaa Dec 9, 2025
b4a75e4
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 10, 2025
a71fcf3
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 12, 2025
23d330c
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 14, 2025
f37e296
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 15, 2025
3911a8d
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 17, 2025
371f229
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 22, 2025
595367d
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 24, 2025
1c69b16
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 25, 2025
2620d22
fix conflict
sufeng-buaa Dec 26, 2025
3b856a5
raise Exception if otel init error
sufeng-buaa Dec 26, 2025
4d995b5
fix lint
sufeng-buaa Dec 26, 2025
dddee15
fix conflict
sufeng-buaa Dec 27, 2025
68df11f
optimize code
sufeng-buaa Dec 27, 2025
606d3cf
fix conflict
sufeng-buaa Dec 27, 2025
59133cd
fix conflict
sufeng-buaa Dec 28, 2025
91566f0
remove SGLang prefix from classes related to trace
sufeng-buaa Dec 30, 2025
6358231
Merge 'upstream/main' and fix conflict
sufeng-buaa Jan 4, 2026
6c0c233
fix
sufeng-buaa Jan 4, 2026
93f998b
Merge branch 'upstream/main' and fix conflict
sufeng-buaa Jan 5, 2026
66b2fbf
Merge branch 'upstream/main' and fix conflict
sufeng-buaa Jan 6, 2026
954d642
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Jan 8, 2026
c04a00f
optimize code
sufeng-buaa Jan 10, 2026
108 changes: 37 additions & 71 deletions docs/references/production_request_trace.md
@@ -1,6 +1,6 @@
# Production Request Tracing

SGlang exports request trace data based on the OpenTelemetry Collector. You can enable tracing by adding the `--enable-trace` and configure the OpenTelemetry Collector endpoint using `--otlp-traces-endpoint` when launching the server.
SGLang exports request trace data based on the OpenTelemetry Collector. You can enable tracing by adding `--trace-level` and configuring the OpenTelemetry Collector endpoint with `--otlp-traces-endpoint` when launching the server. The `--trace-level` option accepts values from `0` to `3`, where `0` disables tracing and higher values produce more detailed traces. Additionally, you can use `--trace-module` to specify the module to trace; currently, only `request` is supported.

You can find example screenshots of the visualization in https://github.com/sgl-project/sglang/issues/8965.

@@ -17,39 +17,39 @@ This section explains how to configure the request tracing and export the trace
pip install opentelemetry-sdk opentelemetry-api opentelemetry-exporter-otlp opentelemetry-exporter-otlp-proto-grpc
```

2. launch opentelemetry collector and jaeger
2. Launch OpenTelemetry collector and Jaeger
```bash
docker compose -f examples/monitoring/tracing_compose.yaml up -d
```

3. start your SGLang server with tracing enabled
3. Start your SGLang server with tracing enabled
```bash
# set env variables
export SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS=500
export SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE=64
# start the prefill and decode server
python -m sglang.launch_server --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 <other option>
python -m sglang.launch_server --trace-level 3 --otlp-traces-endpoint 0.0.0.0:4317 [--trace-module request] <other option>
# start the mini lb
python -m sglang_router.launch_router --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 <other option>
```

Replace `0.0.0.0:4317` with the actual endpoint of the opentelemetry collector. If you launched the openTelemetry collector with tracing_compose.yaml, the default receiving port is 4317.
Replace `0.0.0.0:4317` with the actual endpoint of the OpenTelemetry Collector. If you launched the OpenTelemetry Collector with tracing_compose.yaml, the default receiving port is 4317.

To use the HTTP/protobuf span exporter, set the following environment variable and point to an HTTP endpoint, for example, `http://0.0.0.0:4318/v1/traces`.
```bash
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
```


4. raise some requests
4. Send some requests
5. Observe whether trace data is being exported
* Access port 16686 of Jaeger using a web browser to visualize the request traces.
* The OpenTelemetry Collector also exports trace data in JSON format to `/tmp/otel_trace.json`. In a follow-up patch, we will provide a tool to convert this data into a Perfetto-compatible format, enabling visualization of requests in the Perfetto UI.

## How to add Tracing for slices you're interested in?
## How to add Tracing for slices you're interested in? (API introduction)
We have already inserted instrumentation points in the tokenizer and scheduler main threads. If you wish to trace additional request execution segments or perform finer-grained tracing, please use the APIs from the tracing package as described below.

1. initialization
1. Initialization

Every process involved in tracing during the initialization phase should execute:
```python
# (initialization code collapsed in the diff view: @@ -63,98 +63,64 @@)
```
The "thread label" can be regarded as the name of the thread, used to distinguish different threads in the visualization view.

2. Mark the beginning and end of a request
2. Create a time recorder for a request
Each request needs to call `TraceMetricContext()` to initialize a time recorder, which is used to generate slice spans and request stage metrics. You can either store it within the request object or maintain it as a global variable. A set of APIs for managing the global time recorder is provided in `python/sglang/srt/tracing/trace_metric_wrapper.py`.
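The lifecycle described above can be sketched with a minimal stand-in (the real `TraceMetricContext` lives in SGLang's tracing package; the mock class and the `Req` attribute name below are hypothetical, used only to show the storage pattern):

```python
class MockTraceMetricContext:
    """Hypothetical stand-in: the real context opens the request span on creation."""

    def __init__(self, rid):
        self.rid = rid
        self.started = True   # trace_req_start() is implied by construction
        self.finished = False

    def trace_req_finish(self):
        self.finished = True


class Req:
    def __init__(self, rid):
        self.rid = rid
        # Store the recorder on the request object (one of the two options above).
        self.trace_metric_ctx = MockTraceMetricContext(rid)


req = Req("req-123")
# ... tokenizer / scheduler slices happen here ...
req.trace_metric_ctx.trace_req_finish()
print(req.trace_metric_ctx.started, req.trace_metric_ctx.finished)  # -> True True
```

Both the construction and `trace_req_finish()` would happen in the same process, matching the constraint stated below.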

3. Mark the beginning and end of a request
```
trace_req_start(rid, bootstrap_room)
trace_req_finish(rid)
# The time recorder calls trace_req_start() by default when it is created.
trace_metric_ctx.trace_req_finish()
```
These two APIs must be called within the same process, for example, in the tokenizer.
`TraceMetricContext()` and `trace_req_finish()` must be called within the same process, for example, in the tokenizer.

3. Add tracing for slice
4. Add tracing for a slice

* Add slice tracing normally:
```python
trace_slice_start("slice A", rid)
trace_slice_end("slice A", rid)
trace_metric_ctx.slice_start(RequestStage.TOKENIZER)
trace_metric_ctx.slice_end(RequestStage.TOKENIZER)
```

- Use the "anonymous" flag to not specify a slice name at the start of the slice, allowing the slice name to be determined by trace_slice_end.
- Use `RequestStage.ANONYMOUS` to avoid specifying a slice name at the start of the slice, allowing the slice name to be determined by `slice_end`.
<br>Note: Anonymous slices must not be nested.
```python
trace_slice_start("", rid, anonymous = True)
trace_slice_end("slice A", rid)
trace_metric_ctx.slice_start(RequestStage.ANONYMOUS)
trace_metric_ctx.slice_end(RequestStage.TOKENIZER)
```

- In trace_slice_end, use auto_next_anon to automatically create the next anonymous slice, which can reduce the number of instrumentation points needed.
- In `slice_end`, use `auto_next_anon` to automatically create the next anonymous slice, which reduces the number of instrumentation points needed.
```python
trace_slice_start("", rid, anonymous = True)
trace_slice_end("slice A", rid, auto_next_anon = True)
trace_slice_end("slice B", rid, auto_next_anon = True)
trace_slice_end("slice C", rid, auto_next_anon = True)
trace_slice_end("slice D", rid)
trace_metric_ctx.slice_start(RequestStage.ANONYMOUS)
trace_metric_ctx.slice_end(RequestStage.A, auto_next_anon = True)
trace_metric_ctx.slice_end(RequestStage.B, auto_next_anon = True)
trace_metric_ctx.slice_end(RequestStage.C, auto_next_anon = True)
trace_metric_ctx.slice_end(RequestStage.D)
```
- The end of the last slice in a thread must be marked with `thread_finish_flag=True`; otherwise, the thread's span will not be properly generated.
```python
trace_slice_end("slice D", rid, thread_finish_flag = True)
trace_metric_ctx.slice_end(RequestStage.D, thread_finish_flag = True)
```
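The anonymous-slice and `auto_next_anon` semantics above can be sketched with a toy recorder. Everything here is a hypothetical mock (the stage names and the tuple-based bookkeeping are assumptions for illustration); the real context emits OpenTelemetry spans instead of storing tuples:

```python
import time
from enum import Enum


class RequestStage(str, Enum):
    # Hypothetical subset of stage names, for illustration only.
    ANONYMOUS = "anonymous"
    TOKENIZE = "tokenize"
    SCHEDULE = "schedule"
    FORWARD = "forward"


class MockTraceMetricContext:
    """Toy recorder mimicking slice_start / slice_end semantics."""

    def __init__(self):
        self.finished = []   # (name, start_ts, end_ts) of closed slices
        self._open = None    # (name, start_ts) of the currently open slice
        self.thread_done = False

    def slice_start(self, stage):
        self._open = (stage, time.perf_counter())

    def slice_end(self, stage, auto_next_anon=False, thread_finish_flag=False):
        name, start = self._open
        # An anonymous slice takes its final name from slice_end.
        final = stage if name == RequestStage.ANONYMOUS else name
        self.finished.append((final, start, time.perf_counter()))
        self._open = None
        if auto_next_anon:
            self.slice_start(RequestStage.ANONYMOUS)
        if thread_finish_flag:
            self.thread_done = True


ctx = MockTraceMetricContext()
ctx.slice_start(RequestStage.ANONYMOUS)
ctx.slice_end(RequestStage.TOKENIZE, auto_next_anon=True)
ctx.slice_end(RequestStage.SCHEDULE, auto_next_anon=True)
ctx.slice_end(RequestStage.FORWARD, thread_finish_flag=True)
print([name.value for name, _, _ in ctx.finished])  # -> ['tokenize', 'schedule', 'forward']
```

Note how each `auto_next_anon=True` call immediately opens the next anonymous slice, so only the final `slice_end` per stage is needed.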

4. When the request execution flow transfers to another thread, the trace context needs to be explicitly propagated.
- sender: Execute the following code before sending the request to another thread via ZMQ
```python
trace_context = trace_get_proc_propagate_context(rid)
req.trace_context = trace_context
```
5. When the request execution flow transfers to another thread, the thread context needs to be explicitly rebuilt.
- receiver: Execute the following code after receiving the request via ZMQ
```python
trace_set_proc_propagate_context(rid, req.trace_context)
```

5. When the request execution flow transfers to another node(PD disaggregation), the trace context needs to be explicitly propagated.
- sender: Execute the following code before sending the request to node thread via http
```python
trace_context = trace_get_remote_propagate_context(bootstrap_room_list)
headers = {"trace_context": trace_context}
session.post(url, headers=headers)
```
- receiver: Execute the following code after receiving the request via http
```python
trace_set_remote_propagate_context(request.headers['trace_context'])
trace_metric_ctx.rebuild_thread_context()
```
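The hand-off pattern can be sketched with a mock context passed through an in-process queue standing in for ZMQ (the class, its fields, and `rebuild_thread_context` behavior are assumptions for illustration; the real method re-attaches the receiving thread to the request's trace):

```python
import queue
import threading


class MockTraceMetricContext:
    """Hypothetical stand-in carrying just enough state to rebuild spans."""

    def __init__(self, rid):
        self.rid = rid
        self.trace_context = {"trace_id": f"trace-{rid}"}  # assumed payload
        self.rebuilt_in = []

    def rebuild_thread_context(self):
        # Record which thread rebuilt the context (the real API would
        # create a thread span parented to the request span here).
        self.rebuilt_in.append(threading.current_thread().name)


q = queue.Queue()
ctx = MockTraceMetricContext("req-123")
q.put(ctx)  # sender: ship the context along with the request (e.g. over ZMQ)


def worker():
    received = q.get()              # receiver side
    received.rebuild_thread_context()


t = threading.Thread(target=worker, name="scheduler")
t.start()
t.join()
print(ctx.rebuilt_in)  # -> ['scheduler']
```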

## How to Extend the Tracing Framework to Support Complex Tracing Scenarios

The currently provided tracing package still has potential for further development. If you wish to build more advanced features upon it, you must first understand its existing design principles.

The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a two-level trace context structure and a four-level span structure: `SglangTraceReqContext`, `SglangTraceThreadContext`. Their relationship is as follows:
The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a three-level trace context (and span) structure: `TraceReqContext`, `TraceThreadContext`, and `TraceSliceContext`. Their relationship is as follows:
```
SglangTraceReqContext (req_id="req-123")
├── SglangTraceThreadContext(thread_label="scheduler", tp_rank=0)
TraceReqContext (req_id="req-123")
├── TraceThreadContext(thread_label="scheduler", tp_rank=0)
| └── TraceSliceContext(slice_name="prefill")
|
└── SglangTraceThreadContext(thread_label="scheduler", tp_rank=1)
└── TraceThreadContext(thread_label="scheduler", tp_rank=1)
└── TraceSliceContext(slice_name="prefill")
```
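The hierarchy above can be modeled structurally as nested containers. This is only a shape sketch — the real classes wrap OpenTelemetry spans and carry more state; the field names here are assumed from the diagram:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceSliceContext:
    slice_name: str


@dataclass
class TraceThreadContext:
    thread_label: str
    tp_rank: int
    # Currently traced (possibly nested) slices for this thread.
    slices: List[TraceSliceContext] = field(default_factory=list)


@dataclass
class TraceReqContext:
    req_id: str
    # One thread context per (thread_label, tp_rank), keyed by rank here.
    threads: Dict[int, TraceThreadContext] = field(default_factory=dict)


req = TraceReqContext("req-123")
for rank in (0, 1):
    th = TraceThreadContext("scheduler", rank)
    th.slices.append(TraceSliceContext("prefill"))
    req.threads[rank] = th

print(sorted(req.threads), [t.slices[0].slice_name for t in req.threads.values()])
```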

Each traced request maintains a global `SglangTraceReqContext`. For every thread processing the request, a corresponding `SglangTraceThreadContext` is recorded and composed within the `SglangTraceReqContext`. Within each thread, every currently traced slice (possibly nested) is stored in a list.
Each traced request maintains a global `TraceReqContext` and creates a corresponding request span. For every thread that processes the request, a `TraceThreadContext` is recorded and a thread span is created. The `TraceThreadContext` is nested within the `TraceReqContext`, and each currently traced code slice (possibly nested) is stored in its associated `TraceThreadContext`.

In addition to the above hierarchy, each slice also records its previous slice via `Span.add_link()`, which can be used to trace the execution flow.

When the request execution flow transfers to a new thread, the trace context needs to be explicitly propagated. In the framework, this is represented by `SglangTracePropagateContext`, which contains the context of the request span and the previous slice span.


We designed a four-level span structure, consisting of `bootstrap_room_span`, `req_root_span`, `thread_span`, and `slice_span`. Among them, `req_root_span` and `thread_span` correspond to `SglangTraceReqContext` and `SglangTraceThreadContext`, respectively, and `slice_span` is stored within the `SglangTraceThreadContext`. The `bootstrap_room_span` is designed to accommodate the separation of PD-disaggregation. On different nodes, we may want to add certain attributes to the `req_root_span`. However, if the `req_root_span` is shared across all nodes, the Prefill and Decode nodes would not be allowed to add attributes due to the constraints imposed by OpenTelemetry's design.

```
bootstrap room span
├── router req root span
| └── router thread span
| └── slice span
├── prefill req root span
| ├── tokenizer thread span
| | └── slice span
| └── scheduler thread span
| └── slice span
└── decode req root span
├── tokenizer thread span
| └── slice span
└── scheduler thread span
└── slice span
```
31 changes: 20 additions & 11 deletions python/sglang/srt/disaggregation/decode.py
@@ -47,7 +47,7 @@
prepare_abort,
)
from sglang.srt.layers.dp_attention import get_attention_tp_size
from sglang.srt.managers.schedule_batch import FINISH_ABORT, RequestStage, ScheduleBatch
from sglang.srt.managers.schedule_batch import FINISH_ABORT, ScheduleBatch
from sglang.srt.managers.utils import GenerationBatchResult
from sglang.srt.mem_cache.allocator import BaseTokenToKVPoolAllocator
from sglang.srt.mem_cache.base_prefix_cache import BasePrefixCache
@@ -60,7 +60,7 @@
ReqToTokenPool,
)
from sglang.srt.mem_cache.swa_memory_pool import SWAKVPool
from sglang.srt.tracing.trace import trace_event_batch, trace_slice_end
from sglang.srt.tracing.trace_metric_wrapper import RequestStage, trace_event_batch
from sglang.srt.utils import get_int_env_var
from sglang.srt.utils.torch_memory_saver_adapter import TorchMemorySaverAdapter

@@ -344,8 +344,9 @@ def add(self, req: Req, is_retracted: bool = False) -> None:
prefill_dp_rank=req.data_parallel_rank,
)

req.add_latency(RequestStage.DECODE_PREPARE)
trace_slice_end(RequestStage.DECODE_PREPARE, req.rid, auto_next_anon=True)
req.trace_metric_ctx.slice_end(
RequestStage.DECODE_PREPARE, auto_next_anon=True
)
self.queue.append(
DecodeRequest(req=req, kv_receiver=kv_receiver, waiting_for_input=False)
)
@@ -354,6 +355,7 @@ def _check_if_req_exceed_kv_capacity(self, req: Req) -> bool:
if len(req.origin_input_ids) > self.max_total_num_tokens:
message = f"Request {req.rid} exceeds the maximum number of tokens: {len(req.origin_input_ids)} > {self.max_total_num_tokens}"
logger.error(message)
req.trace_metric_ctx.abort(abort_info={"abort_info": message})
prepare_abort(req, message, status_code=HTTPStatus.BAD_REQUEST)
self.scheduler.stream_output([req], req.return_logprob)
return True
@@ -473,6 +475,9 @@ def pop_preallocated(
)
failed_reqs.append(decode_req)
indices_to_remove.add(i)
decode_req.req.trace_metric_ctx.abort(
abort_info=decode_req.req.finished_reason
)

# Then, preallocate the remaining requests if possible
for i, decode_req in enumerate(self.queue):
@@ -578,9 +583,9 @@ def pop_preallocated(
decode_req.req.time_stats.decode_transfer_queue_entry_time = (
time.perf_counter()
)
decode_req.req.add_latency(RequestStage.DECODE_BOOTSTRAP)
trace_slice_end(
RequestStage.DECODE_BOOTSTRAP, decode_req.req.rid, auto_next_anon=True

decode_req.req.trace_metric_ctx.slice_end(
RequestStage.DECODE_BOOTSTRAP, auto_next_anon=True
)

self.queue = [
@@ -762,9 +767,8 @@ def _commit_transfer_to_req(self, decode_req: DecodeRequest) -> None:

decode_req.kv_receiver.clear()
decode_req.kv_receiver = None
trace_slice_end(
decode_req.req.trace_metric_ctx.slice_end(
RequestStage.DECODE_TRANSFERRED,
decode_req.req.rid,
auto_next_anon=True,
)
decode_req.req.time_stats.wait_queue_entry_time = time.perf_counter()
@@ -788,6 +792,9 @@ def pop_transferred(self, rids_to_check: Optional[List[str]] = None) -> List[Req
except Exception as e:
error_message += f" with exception {e}"
logger.error(error_message)
decode_req.req.trace_metric_ctx.abort(
abort_info={"abort_info": error_message}
)
prepare_abort(
decode_req.req,
error_message,
@@ -806,6 +813,7 @@ def pop_transferred(self, rids_to_check: Optional[List[str]] = None) -> List[Req
self._commit_transfer_to_req(decode_req)
indices_to_remove.add(i)
transferred_reqs.append(decode_req.req)

elif poll in [
KVPoll.Bootstrapping,
KVPoll.WaitingForInput,
@@ -818,7 +826,6 @@ def pop_transferred(self, rids_to_check: Optional[List[str]] = None) -> List[Req
for i in indices_to_remove:
idx = self.queue[i].metadata_buffer_index
assert idx != -1
self.queue[i].req.add_latency(RequestStage.DECODE_TRANSFERRED)
self.req_to_metadata_buffer_idx_allocator.free(idx)

self.queue = [
@@ -967,7 +974,9 @@ def get_new_prebuilt_batch(self: Scheduler) -> Optional[ScheduleBatch]:
# we can only add at least `num_not_used_batch` new batch to the running queue
if i < num_not_used_batch:
can_run_list.append(req)
req.add_latency(RequestStage.DECODE_WAITING)
req.trace_metric_ctx.slice_end(
RequestStage.DECODE_WAITING, auto_next_anon=True
)
req.init_next_round_input(self.tree_cache)
else:
waiting_queue.append(req)