Add request-level timestamp for when prefill finishes#14860
Add request-level timestamp for when prefill finishes#14860zhyncs merged 8 commits intosgl-project:mainfrom
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
| # For embeddings, there's no separate prefill phase, so omit `prefill_finished_ts`. | ||
| if ( | ||
| not isinstance(recv_obj, BatchEmbeddingOutput) | ||
| and state.first_token_time > 0 | ||
| ): | ||
| meta_info["prefill_finished_ts"] = state.first_token_time |
There was a problem hiding this comment.
no longer need to rely on ReqState.first_token_time for prefill_finished_ts, since we directly track this now.
There was a problem hiding this comment.
shall we keep not isinstance(recv_obj, BatchEmbeddingOutput)
There was a problem hiding this comment.
i think the whole if clause is no longer needed. because in this PR, we explicitly track prefill_finished_ts as a new field, and is added to meta_info with _add_metric_if_present() above, we shouldn't need this case to use state.first_token_time.
|
/tag-and-rerun-ci |
ShangmingCai
left a comment
There was a problem hiding this comment.
Logic LGTM. Do you have time to check on this PR @sufeng-buaa ?
| ): | ||
| if req.is_chunked <= 0: | ||
| if req.time_stats.prefill_finished_ts == 0.0: | ||
| req.time_stats.prefill_finished_ts = time.time() |
There was a problem hiding this comment.
Currently, all timestamps collected in TimeStats are obtained via time.perf_counter(). Additionally, with this in place, should prefill_end_time_host be removed? It seems that prefill_end_time_host does not record the actual prefill end time.
There was a problem hiding this comment.
but there doesn't seem to be a better way to maintain consistency with first_token_time (obtained by time.time()). It just feels a bit inconsistent. Maybe we should add a comment in TimeStats to clarify this?
There was a problem hiding this comment.
if i'm understanding correctly, prefill_end_time_host is used in calculating prefill_launch_latency, which specifically corresponds to the prefill kernel launch time. whereas for my newly created prefill_finished_ts timestamp is more useful for seeing where prefill as a whole ends (which encapsulates prefill_launch_delay, prefill_launch_latency, and other untracked operations in between).
good point about the time.perf_counter() units for other timestamps in TimeStats. i left a comment clarifying this difference. IMO we should still keep it consistent with first_token_time in this case, especially since the other _ts metrics reported at request-level are also the same unit, making it easier to compare them directly.
| if req.time_stats.prefill_finished_ts == 0.0: | ||
| req.time_stats.prefill_finished_ts = time.time() | ||
|
|
There was a problem hiding this comment.
Not sure about whether this is the best place to record prefill finish time.
ShangmingCai
left a comment
There was a problem hiding this comment.
LGTM as long as CI passes.
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci try again |
|
/rerun-failed-ci |
3 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
…n3_pp * 'main' of https://github.com/sgl-project/sglang: (74 commits) [bug fix][pp] fix inconsistent latency between tp (sgl-project#15379) Fix warp illegal instruction in kimi k2 thinking PCG (sgl-project#15306) Fix gpt-oss yarn with `truncate` argument (sgl-project#14270) Monkey patch deepseek-ocr's `v_head_dim` (sgl-project#15384) [model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups (sgl-project#15361) [PP] Fix dynamic chunking strategy for PP (sgl-project#15372) Fix issue: ENABLE_BELOW_SM90 cannot be enabled on aarch64 CPU (sgl-project#12967) Split test_piecewise_cuda_graph.py to optimize CI resource usage (sgl-project#15290) unified management of environment variables for vlm cuda ipc transport (sgl-project#14501) Mistral Large 3 NVFP4 TRTLLM MoE support (sgl-project#15049) fix: adjust time for test_epd_disaggregation.py (sgl-project#15354) Add doc for qwen3 next (sgl-project#15337) feat: DeepSeek-V3.2 Streaming tool call output (sgl-project#15278) Feature/trtllm mha workspace size configurable sgl-project#15089 (sgl-project#15131) [VLM] Support cos sin cache for Qwen3-VL & GLM-4.1V (sgl-project#15205) [Deepseek V3.2] Support Overlap Spec + NSA (sgl-project#15307) Add request-level timestamp for when prefill finishes (sgl-project#14860) [CI] Migrate LoRA tests to test/registered/lora/ (sgl-project#15176) Reserve more memory for DeepSeekOCR model and adjust server start timeout for DeepGEMM to reduce flakiness (sgl-project#15277) Fix condition check for require_gathered_buffer (sgl-project#15328) ...
Motivation
Currently, the
prefill_finished_tsfield is derived fromReqState.first_token_time. However, for non-streaming requests,ReqState.first_token_timeis never set, so we cannot rely on this value for when prefill finishes. As a result,prefill_finished_tsis omitted from the request-level metrics for non-streaming requests.Modifications
This PR adds a new explicit
prefill_finished_tsfield to theTimeStatsobject, and updates the value once a request completes prefill. Theprefill_finished_tsfield is populated in both streaming and non-streaming requests.Since this PR only adds the new metric field, it should have no impact on the model outputs or speed.
Log examples
Accuracy Tests
Benchmarking and Profiling
Checklist