Closed
Changes from all commits
53 commits
ae001a6
Sglang Tracing: Unify tracing and req stage metrics, add trace-level …
sufeng-buaa Nov 13, 2025
8395446
Sglang Tracing: update doc
sufeng-buaa Nov 13, 2025
5dac5da
Sglang Tracing: update test cases
sufeng-buaa Nov 13, 2025
e75d844
fix lint
sufeng-buaa Nov 13, 2025
71f9d2a
rename 'NoOpTimeRecorder' to 'NoOpStageContext'
sufeng-buaa Nov 13, 2025
a847eba
sglang tracing: fix crash when enable tracing but not install otlp
sufeng-buaa Nov 13, 2025
cd6c8d8
sglang tracing: rename 'Sglang***' to 'SGLang***'
sufeng-buaa Nov 13, 2025
a1c82a8
sglang tracing: fix dependence install
sufeng-buaa Nov 13, 2025
3ef8db2
trace: add more explanations
sufeng-buaa Nov 17, 2025
6d3735a
fix tracing doc
ShangmingCai Nov 17, 2025
2ed4718
lint
ShangmingCai Nov 17, 2025
9b13760
Merge branch 'main' into sufeng-buaa/unify-trace-metric
ShangmingCai Nov 17, 2025
c9efa58
trace: update doc
sufeng-buaa Nov 17, 2025
ea30182
trace: standardize naming
sufeng-buaa Nov 18, 2025
44d4e1d
trace: fix lint
sufeng-buaa Nov 18, 2025
bbc98f0
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Nov 25, 2025
dd7e5bb
SGLang Tracing: delete bootstrap_room span and 'trace_context' header
sufeng-buaa Nov 26, 2025
f2a92d0
sglang tracing: add module name to root span
sufeng-buaa Nov 26, 2025
2f6f961
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Nov 26, 2025
2be1623
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Nov 27, 2025
6c57dec
sglang tracing: remove SGLang prefix from classes related to trace
sufeng-buaa Nov 27, 2025
7ca0972
fix lint
sufeng-buaa Nov 28, 2025
66df528
fix 'prev_span' condition
sufeng-buaa Nov 28, 2025
bfddf4a
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 1, 2025
7631089
fix root span not be closed when scheduler have no propagation_context
sufeng-buaa Dec 2, 2025
5bbeb9d
fix warning for last extra decode when enabling overlap schedule
sufeng-buaa Dec 2, 2025
0f8dc9b
add more test cases
sufeng-buaa Dec 2, 2025
6f9b330
simplify trace function name
sufeng-buaa Dec 8, 2025
c6aa18c
check exception path
sufeng-buaa Dec 8, 2025
e5f5e2a
fix null reference when propagate_context is null
sufeng-buaa Dec 8, 2025
f120734
fix abort on error path
sufeng-buaa Dec 9, 2025
b4a75e4
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 10, 2025
a71fcf3
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 12, 2025
23d330c
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 14, 2025
f37e296
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 15, 2025
3911a8d
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 17, 2025
371f229
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 22, 2025
595367d
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 24, 2025
1c69b16
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Dec 25, 2025
2620d22
fix conflict
sufeng-buaa Dec 26, 2025
3b856a5
raise Exception if otel init error
sufeng-buaa Dec 26, 2025
4d995b5
fix lint
sufeng-buaa Dec 26, 2025
dddee15
fix conflict
sufeng-buaa Dec 27, 2025
68df11f
optimize code
sufeng-buaa Dec 27, 2025
606d3cf
fix conflict
sufeng-buaa Dec 27, 2025
59133cd
fix conflict
sufeng-buaa Dec 28, 2025
91566f0
remove SGLang prefix from classes related to trace
sufeng-buaa Dec 30, 2025
6358231
Merge 'upstream/main' and fix conflict
sufeng-buaa Jan 4, 2026
6c0c233
fix
sufeng-buaa Jan 4, 2026
93f998b
Merge branch 'upstream/main' and fix conflict
sufeng-buaa Jan 5, 2026
66b2fbf
Merge branch 'upstream/main' and fix conflict
sufeng-buaa Jan 6, 2026
954d642
Merge branch 'main' into sufeng-buaa/unify-trace-metric
sufeng-buaa Jan 8, 2026
c04a00f
optimize code
sufeng-buaa Jan 10, 2026
108 changes: 37 additions & 71 deletions docs/references/production_request_trace.md
@@ -1,6 +1,6 @@
# Production Request Tracing

SGlang exports request trace data based on the OpenTelemetry Collector. You can enable tracing by adding the `--enable-trace` and configure the OpenTelemetry Collector endpoint using `--otlp-traces-endpoint` when launching the server.
SGLang exports request trace data based on the OpenTelemetry Collector. You can enable tracing by adding `--trace-level` and configuring the OpenTelemetry Collector endpoint with `--otlp-traces-endpoint` when launching the server. The `--trace-level` option accepts values from `0` to `3`, where `0` disables tracing and higher values produce more detailed traces. Additionally, you can use `--trace-module` to specify the module to trace; currently, only `request` is supported.

You can find example screenshots of the visualization in https://github.com/sgl-project/sglang/issues/8965.

@@ -17,39 +17,39 @@ This section explains how to configure the request tracing and export the trace
pip install opentelemetry-sdk opentelemetry-api opentelemetry-exporter-otlp opentelemetry-exporter-otlp-proto-grpc
```

2. launch opentelemetry collector and jaeger
2. Launch OpenTelemetry collector and Jaeger
```bash
docker compose -f examples/monitoring/tracing_compose.yaml up -d
```

3. start your SGLang server with tracing enabled
3. Start your SGLang server with tracing enabled
```bash
# set env variables
export SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS=500
export SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE=64
# start the prefill and decode server
python -m sglang.launch_server --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 <other option>
python -m sglang.launch_server --trace-level 3 --otlp-traces-endpoint 0.0.0.0:4317 [--trace-module request] <other option>
# start the mini lb
python -m sglang_router.launch_router --enable-trace --otlp-traces-endpoint 0.0.0.0:4317 <other option>
```

Replace `0.0.0.0:4317` with the actual endpoint of the opentelemetry collector. If you launched the openTelemetry collector with tracing_compose.yaml, the default receiving port is 4317.
Replace `0.0.0.0:4317` with the actual endpoint of the OpenTelemetry Collector. If you launched the OpenTelemetry Collector with tracing_compose.yaml, the default receiving port is 4317.

To use the HTTP/protobuf span exporter, set the following environment variable and point to an HTTP endpoint, for example, `http://0.0.0.0:4318/v1/traces`.
```bash
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
```


4. raise some requests
4. Send some requests
5. Observe whether trace data is being exported
* Access port 16686 of Jaeger using a web browser to visualize the request traces.
* The OpenTelemetry Collector also exports trace data in JSON format to `/tmp/otel_trace.json`. In a follow-up patch, we will provide a tool to convert this data into a Perfetto-compatible format, enabling visualization of requests in the Perfetto UI.

## How to add Tracing for slices you're interested in?
## How to add Tracing for slices you're interested in? (API introduction)
We have already inserted instrumentation points in the tokenizer and scheduler main threads. If you wish to trace additional request execution segments or perform finer-grained tracing, please use the APIs from the tracing package as described below.

1. initialization
1. Initialization

Every process involved in tracing during the initialization phase should execute:
```python
# (initialization code collapsed in the diff view: @@ -63,98 +63,64 @@)
```
The "thread label" can be regarded as the name of the thread, used to distinguish different threads in the visualization view.

2. Mark the beginning and end of a request
2. Create a time recorder for a request
Each request needs to call `TraceMetricContext()` to initialize a time recorder, which is used to generate slice spans and request stage metrics. You can either store it within the request object or maintain it as a global variable. A set of APIs for managing the global time recorder is provided in `python/sglang/srt/tracing/trace_metric_wrapper.py`.
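The lifecycle described above can be sketched with a minimal stand-in (the real `TraceMetricContext` lives in SGLang's tracing package; the mock class and the `Req` attribute name below are hypothetical, used only to show the storage pattern):

```python
class MockTraceMetricContext:
    """Hypothetical stand-in: the real context opens the request span on creation."""

    def __init__(self, rid):
        self.rid = rid
        self.started = True   # trace_req_start() is implied by construction
        self.finished = False

    def trace_req_finish(self):
        self.finished = True


class Req:
    def __init__(self, rid):
        self.rid = rid
        # Store the recorder on the request object (one of the two options above).
        self.trace_metric_ctx = MockTraceMetricContext(rid)


req = Req("req-123")
# ... tokenizer / scheduler slices happen here ...
req.trace_metric_ctx.trace_req_finish()
print(req.trace_metric_ctx.started, req.trace_metric_ctx.finished)  # -> True True
```

Both the construction and `trace_req_finish()` would happen in the same process, matching the constraint stated below.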

3. Mark the beginning and end of a request
```
trace_req_start(rid, bootstrap_room)
trace_req_finish(rid)
# The time recorder calls trace_req_start() by default when it is created.
trace_metric_ctx.trace_req_finish()
```
These two APIs must be called within the same process, for example, in the tokenizer.
`TraceMetricContext()` and `trace_req_finish()` must be called within the same process, for example, in the tokenizer.

3. Add tracing for slice
4. Add tracing for a slice

* Add slice tracing normally:
```python
trace_slice_start("slice A", rid)
trace_slice_end("slice A", rid)
trace_metric_ctx.slice_start(RequestStage.TOKENIZER)
trace_metric_ctx.slice_end(RequestStage.TOKENIZER)
```

- Use the "anonymous" flag to not specify a slice name at the start of the slice, allowing the slice name to be determined by trace_slice_end.
- Use `RequestStage.ANONYMOUS` to avoid specifying a slice name at the start of the slice, allowing the slice name to be determined by `slice_end`.
<br>Note: Anonymous slices must not be nested.
```python
trace_slice_start("", rid, anonymous = True)
trace_slice_end("slice A", rid)
trace_metric_ctx.slice_start(RequestStage.ANONYMOUS)
trace_metric_ctx.slice_end(RequestStage.TOKENIZER)
```

- In trace_slice_end, use auto_next_anon to automatically create the next anonymous slice, which can reduce the number of instrumentation points needed.
- In `slice_end`, use `auto_next_anon` to automatically create the next anonymous slice, which reduces the number of instrumentation points needed.
```python
trace_slice_start("", rid, anonymous = True)
trace_slice_end("slice A", rid, auto_next_anon = True)
trace_slice_end("slice B", rid, auto_next_anon = True)
trace_slice_end("slice C", rid, auto_next_anon = True)
trace_slice_end("slice D", rid)
trace_metric_ctx.slice_start(RequestStage.ANONYMOUS)
trace_metric_ctx.slice_end(RequestStage.A, auto_next_anon = True)
trace_metric_ctx.slice_end(RequestStage.B, auto_next_anon = True)
trace_metric_ctx.slice_end(RequestStage.C, auto_next_anon = True)
trace_metric_ctx.slice_end(RequestStage.D)
```
- The end of the last slice in a thread must be marked with `thread_finish_flag=True`; otherwise, the thread's span will not be properly generated.
```python
trace_slice_end("slice D", rid, thread_finish_flag = True)
trace_metric_ctx.slice_end(RequestStage.D, thread_finish_flag = True)
```
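The anonymous-slice and `auto_next_anon` semantics above can be sketched with a toy recorder. Everything here is a hypothetical mock (the stage names and the tuple-based bookkeeping are assumptions for illustration); the real context emits OpenTelemetry spans instead of storing tuples:

```python
import time
from enum import Enum


class RequestStage(str, Enum):
    # Hypothetical subset of stage names, for illustration only.
    ANONYMOUS = "anonymous"
    TOKENIZE = "tokenize"
    SCHEDULE = "schedule"
    FORWARD = "forward"


class MockTraceMetricContext:
    """Toy recorder mimicking slice_start / slice_end semantics."""

    def __init__(self):
        self.finished = []   # (name, start_ts, end_ts) of closed slices
        self._open = None    # (name, start_ts) of the currently open slice
        self.thread_done = False

    def slice_start(self, stage):
        self._open = (stage, time.perf_counter())

    def slice_end(self, stage, auto_next_anon=False, thread_finish_flag=False):
        name, start = self._open
        # An anonymous slice takes its final name from slice_end.
        final = stage if name == RequestStage.ANONYMOUS else name
        self.finished.append((final, start, time.perf_counter()))
        self._open = None
        if auto_next_anon:
            self.slice_start(RequestStage.ANONYMOUS)
        if thread_finish_flag:
            self.thread_done = True


ctx = MockTraceMetricContext()
ctx.slice_start(RequestStage.ANONYMOUS)
ctx.slice_end(RequestStage.TOKENIZE, auto_next_anon=True)
ctx.slice_end(RequestStage.SCHEDULE, auto_next_anon=True)
ctx.slice_end(RequestStage.FORWARD, thread_finish_flag=True)
print([name.value for name, _, _ in ctx.finished])  # -> ['tokenize', 'schedule', 'forward']
```

Note how each `auto_next_anon=True` call immediately opens the next anonymous slice, so only the final `slice_end` per stage is needed.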

4. When the request execution flow transfers to another thread, the trace context needs to be explicitly propagated.
- sender: Execute the following code before sending the request to another thread via ZMQ
```python
trace_context = trace_get_proc_propagate_context(rid)
req.trace_context = trace_context
```
5. When the request execution flow transfers to another thread, the thread context needs to be explicitly rebuilt.
- receiver: Execute the following code after receiving the request via ZMQ
```python
trace_set_proc_propagate_context(rid, req.trace_context)
```

5. When the request execution flow transfers to another node(PD disaggregation), the trace context needs to be explicitly propagated.
- sender: Execute the following code before sending the request to node thread via http
```python
trace_context = trace_get_remote_propagate_context(bootstrap_room_list)
headers = {"trace_context": trace_context}
session.post(url, headers=headers)
```
- receiver: Execute the following code after receiving the request via http
```python
trace_set_remote_propagate_context(request.headers['trace_context'])
trace_metric_ctx.rebuild_thread_context()
```
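The hand-off pattern can be sketched with a mock context passed through an in-process queue standing in for ZMQ (the class, its fields, and `rebuild_thread_context` behavior are assumptions for illustration; the real method re-attaches the receiving thread to the request's trace):

```python
import queue
import threading


class MockTraceMetricContext:
    """Hypothetical stand-in carrying just enough state to rebuild spans."""

    def __init__(self, rid):
        self.rid = rid
        self.trace_context = {"trace_id": f"trace-{rid}"}  # assumed payload
        self.rebuilt_in = []

    def rebuild_thread_context(self):
        # Record which thread rebuilt the context (the real API would
        # create a thread span parented to the request span here).
        self.rebuilt_in.append(threading.current_thread().name)


q = queue.Queue()
ctx = MockTraceMetricContext("req-123")
q.put(ctx)  # sender: ship the context along with the request (e.g. over ZMQ)


def worker():
    received = q.get()              # receiver side
    received.rebuild_thread_context()


t = threading.Thread(target=worker, name="scheduler")
t.start()
t.join()
print(ctx.rebuilt_in)  # -> ['scheduler']
```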

## How to Extend the Tracing Framework to Support Complex Tracing Scenarios

The currently provided tracing package still has potential for further development. If you wish to build more advanced features upon it, you must first understand its existing design principles.

The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a two-level trace context structure and a four-level span structure: `SglangTraceReqContext`, `SglangTraceThreadContext`. Their relationship is as follows:
The core of the tracing framework's implementation lies in the design of the span structure and the trace context. To aggregate scattered slices and enable concurrent tracking of multiple requests, we have designed a three-level trace context (and span) structure: `TraceReqContext`, `TraceThreadContext`, and `TraceSliceContext`. Their relationship is as follows:
```
SglangTraceReqContext (req_id="req-123")
├── SglangTraceThreadContext(thread_label="scheduler", tp_rank=0)
TraceReqContext (req_id="req-123")
├── TraceThreadContext(thread_label="scheduler", tp_rank=0)
| └── TraceSliceContext(slice_name="prefill")
|
└── SglangTraceThreadContext(thread_label="scheduler", tp_rank=1)
└── TraceThreadContext(thread_label="scheduler", tp_rank=1)
└── TraceSliceContext(slice_name="prefill")
```
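The hierarchy above can be modeled structurally as nested containers. This is only a shape sketch — the real classes wrap OpenTelemetry spans and carry more state; the field names here are assumed from the diagram:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceSliceContext:
    slice_name: str


@dataclass
class TraceThreadContext:
    thread_label: str
    tp_rank: int
    # Currently traced (possibly nested) slices for this thread.
    slices: List[TraceSliceContext] = field(default_factory=list)


@dataclass
class TraceReqContext:
    req_id: str
    # One thread context per (thread_label, tp_rank), keyed by rank here.
    threads: Dict[int, TraceThreadContext] = field(default_factory=dict)


req = TraceReqContext("req-123")
for rank in (0, 1):
    th = TraceThreadContext("scheduler", rank)
    th.slices.append(TraceSliceContext("prefill"))
    req.threads[rank] = th

print(sorted(req.threads), [t.slices[0].slice_name for t in req.threads.values()])
```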

Each traced request maintains a global `SglangTraceReqContext`. For every thread processing the request, a corresponding `SglangTraceThreadContext` is recorded and composed within the `SglangTraceReqContext`. Within each thread, every currently traced slice (possibly nested) is stored in a list.
Each traced request maintains a global `TraceReqContext` and creates a corresponding request span. For every thread that processes the request, a `TraceThreadContext` is recorded and a thread span is created. The `TraceThreadContext` is nested within the `TraceReqContext`, and each currently traced code slice (possibly nested) is stored in its associated `TraceThreadContext`.

In addition to the above hierarchy, each slice also records its previous slice via `Span.add_link()`, which can be used to trace the execution flow.

When the request execution flow transfers to a new thread, the trace context needs to be explicitly propagated. In the framework, this is represented by `SglangTracePropagateContext`, which contains the context of the request span and the previous slice span.


We designed a four-level span structure, consisting of `bootstrap_room_span`, `req_root_span`, `thread_span`, and `slice_span`. Among them, `req_root_span` and `thread_span` correspond to `SglangTraceReqContext` and `SglangTraceThreadContext`, respectively, and `slice_span` is stored within the `SglangTraceThreadContext`. The `bootstrap_room_span` is designed to accommodate the separation of PD-disaggregation. On different nodes, we may want to add certain attributes to the `req_root_span`. However, if the `req_root_span` is shared across all nodes, the Prefill and Decode nodes would not be allowed to add attributes due to the constraints imposed by OpenTelemetry's design.

```
bootstrap room span
├── router req root span
| └── router thread span
| └── slice span
├── prefill req root span
| ├── tokenizer thread span
| | └── slice span
| └── scheduler thread span
| └── slice span
└── decode req root span
├── tokenizer thread span
| └── slice span
└── scheduler thread span
└── slice span
```
31 changes: 20 additions & 11 deletions python/sglang/srt/disaggregation/decode.py
@@ -47,7 +47,7 @@
prepare_abort,
)
from sglang.srt.layers.dp_attention import get_attention_tp_size
from sglang.srt.managers.schedule_batch import FINISH_ABORT, RequestStage, ScheduleBatch
from sglang.srt.managers.schedule_batch import FINISH_ABORT, ScheduleBatch
from sglang.srt.managers.utils import GenerationBatchResult
from sglang.srt.mem_cache.allocator import BaseTokenToKVPoolAllocator
from sglang.srt.mem_cache.base_prefix_cache import BasePrefixCache
@@ -60,7 +60,7 @@
ReqToTokenPool,
)
from sglang.srt.mem_cache.swa_memory_pool import SWAKVPool
from sglang.srt.tracing.trace import trace_event_batch, trace_slice_end
from sglang.srt.tracing.trace_metric_wrapper import RequestStage, trace_event_batch
from sglang.srt.utils import get_int_env_var
from sglang.srt.utils.torch_memory_saver_adapter import TorchMemorySaverAdapter

@@ -344,8 +344,9 @@ def add(self, req: Req, is_retracted: bool = False) -> None:
prefill_dp_rank=req.data_parallel_rank,
)

req.add_latency(RequestStage.DECODE_PREPARE)
trace_slice_end(RequestStage.DECODE_PREPARE, req.rid, auto_next_anon=True)
req.trace_metric_ctx.slice_end(
RequestStage.DECODE_PREPARE, auto_next_anon=True
)
self.queue.append(
DecodeRequest(req=req, kv_receiver=kv_receiver, waiting_for_input=False)
)
@@ -354,6 +355,7 @@ def _check_if_req_exceed_kv_capacity(self, req: Req) -> bool:
if len(req.origin_input_ids) > self.max_total_num_tokens:
message = f"Request {req.rid} exceeds the maximum number of tokens: {len(req.origin_input_ids)} > {self.max_total_num_tokens}"
logger.error(message)
req.trace_metric_ctx.abort(abort_info={"abort_info": message})
prepare_abort(req, message, status_code=HTTPStatus.BAD_REQUEST)
self.scheduler.stream_output([req], req.return_logprob)
return True
@@ -473,6 +475,9 @@ def pop_preallocated(
)
failed_reqs.append(decode_req)
indices_to_remove.add(i)
decode_req.req.trace_metric_ctx.abort(
abort_info=decode_req.req.finished_reason
)

# Then, preallocate the remaining requests if possible
for i, decode_req in enumerate(self.queue):
@@ -578,9 +583,9 @@ def pop_preallocated(
decode_req.req.time_stats.decode_transfer_queue_entry_time = (
time.perf_counter()
)
decode_req.req.add_latency(RequestStage.DECODE_BOOTSTRAP)
trace_slice_end(
RequestStage.DECODE_BOOTSTRAP, decode_req.req.rid, auto_next_anon=True

decode_req.req.trace_metric_ctx.slice_end(
RequestStage.DECODE_BOOTSTRAP, auto_next_anon=True
)

self.queue = [
@@ -762,9 +767,8 @@ def _commit_transfer_to_req(self, decode_req: DecodeRequest) -> None:

decode_req.kv_receiver.clear()
decode_req.kv_receiver = None
trace_slice_end(
decode_req.req.trace_metric_ctx.slice_end(
RequestStage.DECODE_TRANSFERRED,
decode_req.req.rid,
auto_next_anon=True,
)
decode_req.req.time_stats.wait_queue_entry_time = time.perf_counter()
@@ -788,6 +792,9 @@ def pop_transferred(self, rids_to_check: Optional[List[str]] = None) -> List[Req
except Exception as e:
error_message += f" with exception {e}"
logger.error(error_message)
decode_req.req.trace_metric_ctx.abort(
abort_info={"abort_info": error_message}
)
prepare_abort(
decode_req.req,
error_message,
@@ -806,6 +813,7 @@ def pop_transferred(self, rids_to_check: Optional[List[str]] = None) -> List[Req
self._commit_transfer_to_req(decode_req)
indices_to_remove.add(i)
transferred_reqs.append(decode_req.req)

elif poll in [
KVPoll.Bootstrapping,
KVPoll.WaitingForInput,
@@ -818,7 +826,6 @@ def pop_transferred(self, rids_to_check: Optional[List[str]] = None) -> List[Req
for i in indices_to_remove:
idx = self.queue[i].metadata_buffer_index
assert idx != -1
self.queue[i].req.add_latency(RequestStage.DECODE_TRANSFERRED)
self.req_to_metadata_buffer_idx_allocator.free(idx)

self.queue = [
@@ -967,7 +974,9 @@ def get_new_prebuilt_batch(self: Scheduler) -> Optional[ScheduleBatch]:
# we can only add at least `num_not_used_batch` new batch to the running queue
if i < num_not_used_batch:
can_run_list.append(req)
req.add_latency(RequestStage.DECODE_WAITING)
req.trace_metric_ctx.slice_end(
RequestStage.DECODE_WAITING, auto_next_anon=True
)
req.init_next_round_input(self.tree_cache)
else:
waiting_queue.append(req)