Add request-level timestamp for when prefill finishes#14860

Merged
zhyncs merged 8 commits into sgl-project:main from scottjlee:sjl/1210-prefill-finish
Dec 17, 2025

Conversation

@scottjlee
Contributor

Motivation

Currently, the prefill_finished_ts field is derived from ReqState.first_token_time. However, for non-streaming requests, ReqState.first_token_time is never set, so we cannot rely on this value to determine when prefill finished. As a result, prefill_finished_ts is omitted from the request-level metrics for non-streaming requests.

Modifications

This PR adds a new explicit prefill_finished_ts field to the TimeStats object and updates the value once a request completes prefill. The prefill_finished_ts field is populated for both streaming and non-streaming requests.

Since this PR only adds the new metric field, it should have no impact on the model outputs or speed.
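Conceptually, the change can be sketched as below. This is a minimal illustration, not the actual sglang code: apart from `prefill_finished_ts` itself, the class layout and function name here are assumptions.

```python
import time
from dataclasses import dataclass


@dataclass
class TimeStats:
    # Hypothetical sketch of the new field. Note: most TimeStats fields use
    # time.perf_counter(); this one uses time.time() (epoch seconds) so it can
    # be compared directly with the other request-level *_ts metrics.
    prefill_finished_ts: float = 0.0


def on_prefill_finished(time_stats: TimeStats) -> None:
    # Record only the first time prefill completes for this request
    # (with chunked prefill, this point can be reached more than once).
    if time_stats.prefill_finished_ts == 0.0:
        time_stats.prefill_finished_ts = time.time()
```

Since the field defaults to 0.0, downstream code can treat a zero value as "never recorded" and omit it from the reported metrics.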

Log examples

  • Non-streaming:
[2025-12-11 01:42:24] Finish: obj=GenerateReqInput(validation_time=2.880580723285675e-05, received_time=1765417343.9740803, received_time_perf=14121634.43814618, rid='oBTZYgU-4YNCb4-980033869e849bc1', http_worker_ipc=None, video_data=None, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=False, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, decode_tp_size=None, reasoning=False, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False), out={'meta_info': {'id': 'oBTZYgU-4YNCb4-980033869e849bc1', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 23, 'weight_version': 'default', 'total_retractions': 0, 'queue_time': 0.0006293673068284988, 'prefill_launch_delay': 0.0007045920938253403, 'prefill_launch_latency': 0.024291932582855225, 'prefill_finished_ts': 1765417344.0011635, 'completion_tokens': 32, 'cached_tokens': 0, 'spec_accept_rate': 0.42857142857142855, 'spec_accept_length': 2.2857142857142856, 'spec_verify_ct': 14, 'spec_accept_token_num': 18, 'spec_draft_token_num': 42, 'e2e_latency': 0.15244436264038086, 'request_received_ts': 1765417343.9740803, 'request_sent_to_scheduler_ts': 1765417343.9751327, 'decode_finished_ts': 1765417344.1265247, 'inference_time': 0.15041274577379227, 'response_sent_to_client_ts': 1765417344.1267433}}
  • Streaming:
[2025-12-11 01:42:46] Finish: obj=GenerateReqInput(validation_time=2.5197863578796387e-05, received_time=1765417366.1434186, received_time_perf=14121656.607484436, rid='oBTZYgU-4YNCb4-980033869e849bc1', http_worker_ipc=None, video_data=None, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=True, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, decode_tp_size=None, reasoning=False, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False), out={'meta_info': {'id': 'oBTZYgU-4YNCb4-980033869e849bc1', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 23, 'weight_version': 'default', 'total_retractions': 0, 'queue_time': 0.0007969774305820465, 'prefill_launch_delay': 0.0005463473498821259, 'prefill_launch_latency': 0.011130349710583687, 'prefill_finished_ts': 1765417366.1574004, 'completion_tokens': 32, 'cached_tokens': 21, 'spec_accept_rate': 0.42857142857142855, 'spec_accept_length': 2.2857142857142856, 'spec_verify_ct': 14, 'spec_accept_token_num': 18, 'spec_draft_token_num': 42, 'e2e_latency': 0.14004898071289062, 'request_received_ts': 1765417366.1434186, 'request_sent_to_scheduler_ts': 1765417366.1445887, 'response_sent_to_client_ts': 1765417366.1584878, 'decode_finished_ts': 1765417366.2834675, 'inference_time': 0.13779237307608128, 'decode_throughput': 255.8506616612087}}

Accuracy Tests

Benchmarking and Profiling

Checklist


Comment on lines -1807 to -1812
# For embeddings, there's no separate prefill phase, so omit `prefill_finished_ts`.
if (
    not isinstance(recv_obj, BatchEmbeddingOutput)
    and state.first_token_time > 0
):
    meta_info["prefill_finished_ts"] = state.first_token_time
Contributor Author


We no longer need to rely on ReqState.first_token_time for prefill_finished_ts, since we track it directly now.

Collaborator


Shall we keep the `not isinstance(recv_obj, BatchEmbeddingOutput)` check?

Contributor Author


I think the whole if clause is no longer needed. Because this PR explicitly tracks prefill_finished_ts as a new field, which is added to meta_info with _add_metric_if_present() above, we shouldn't need this fallback to state.first_token_time.

@zhyncs
Collaborator

zhyncs commented Dec 11, 2025

/tag-and-rerun-ci

Collaborator

@ShangmingCai ShangmingCai left a comment


Logic LGTM. Do you have time to check on this PR, @sufeng-buaa?

):
    if req.is_chunked <= 0:
        if req.time_stats.prefill_finished_ts == 0.0:
            req.time_stats.prefill_finished_ts = time.time()
Collaborator


Currently, all timestamps collected in TimeStats are obtained via time.perf_counter(). Additionally, with this in place, should prefill_end_time_host be removed? It seems that prefill_end_time_host does not record the actual prefill end time.

Collaborator

@sufeng-buaa sufeng-buaa Dec 11, 2025


There doesn't seem to be a better way to maintain consistency with first_token_time (obtained via time.time()); it just feels a bit inconsistent. Maybe we should add a comment in TimeStats to clarify this?

Contributor Author


If I'm understanding correctly, prefill_end_time_host is used to calculate prefill_launch_latency, which specifically corresponds to the prefill kernel launch time. My newly created prefill_finished_ts timestamp, on the other hand, is more useful for seeing where prefill as a whole ends (encapsulating prefill_launch_delay, prefill_launch_latency, and other untracked operations in between).

Good point about the time.perf_counter() units for the other timestamps in TimeStats; I left a comment clarifying this difference. IMO we should still keep it consistent with first_token_time in this case, especially since the other _ts metrics reported at the request level use the same unit, making it easier to compare them directly.
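To illustrate the unit difference being discussed (a standalone sketch, not sglang code): `time.time()` returns wall-clock seconds since the Unix epoch, so it is directly comparable with the other request-level `*_ts` metrics, while `time.perf_counter()` is a monotonic clock with an arbitrary zero point, useful only for measuring durations within one process.

```python
import time

# Wall-clock epoch seconds: comparable across processes and against other
# absolute timestamps like request_received_ts or decode_finished_ts.
wall = time.time()

# Monotonic clock with an arbitrary origin (often system or process uptime):
# good for durations, meaningless as an absolute timestamp.
mono = time.perf_counter()

print(wall, mono)
```

Subtracting a perf_counter value from an epoch timestamp (or vice versa) yields garbage, which is why mixing the two units within one metrics object deserves a clarifying comment.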

Collaborator


LGTM

Comment on lines +110 to +112
if req.time_stats.prefill_finished_ts == 0.0:
    req.time_stats.prefill_finished_ts = time.time()
Collaborator


Not sure about whether this is the best place to record prefill finish time.

Collaborator

@ShangmingCai ShangmingCai left a comment


LGTM as long as CI passes.

@ShangmingCai
Collaborator

/rerun-failed-ci

@scottjlee
Contributor Author

/rerun-failed-ci

@zhyncs
Collaborator

zhyncs commented Dec 15, 2025

/rerun-failed-ci

@zhyncs zhyncs self-assigned this Dec 15, 2025
@scottjlee
Contributor Author

scottjlee commented Dec 16, 2025

/rerun-failed-ci try again

@scottjlee
Contributor Author

/rerun-failed-ci

3 similar comments
@scottjlee
Contributor Author

/rerun-failed-ci

@scottjlee
Contributor Author

/rerun-failed-ci

@scottjlee
Contributor Author

/rerun-failed-ci

@zhyncs
Collaborator

zhyncs commented Dec 17, 2025

@zhyncs zhyncs merged commit 169a75d into sgl-project:main Dec 17, 2025
395 of 424 checks passed