[Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP #4467
Conversation
Thanks for your contribution!
Force-pushed from 2c56b6d to 36ad6ed.
Pull Request Overview

This PR adds `draft_logprobs` support for the Speculative Decode MTP mode, enabling developers to capture intermediate prediction probabilities during speculative decoding. The primary goal is to enhance observability and debuggability of the speculative decoding process while maintaining full backward compatibility with existing OpenAI-compatible interfaces.

Key changes:
- Added an `include_draft_logprobs` parameter to the `/v1/completions` and `/v1/chat/completions` request APIs (see the request sketch after this list)
- Introduced a `draft_logprobs` field in response structures to carry intermediate draft token probabilities
- Extended token processing logic to handle and buffer draft logprobs separately from target logprobs in speculative decoding scenarios
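As a quick illustration of the new parameter, a minimal request-body sketch (the model name and prompt are placeholders; only `include_draft_logprobs` and the endpoints come from this PR):

```python
# Hypothetical request body for /v1/completions (use "messages" instead of "prompt"
# for /v1/chat/completions); only include_draft_logprobs is introduced by this PR.
payload = {
    "model": "eb-mtp-model",         # placeholder model name
    "prompt": "Hello, world",
    "logprobs": 2,                   # existing OpenAI-style logprobs request
    "include_draft_logprobs": True,  # new: also return draft-model logprobs
}
```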
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
File | Description |
---|---|
fastdeploy/entrypoints/openai/protocol.py | Added include_draft_logprobs request parameter and draft_logprobs response field to protocol definitions |
fastdeploy/entrypoints/openai/serving_completion.py | Implemented draft logprobs aggregation and processing logic for completion endpoints |
fastdeploy/entrypoints/openai/serving_chat.py | Implemented draft logprobs aggregation and processing logic for chat endpoints |
fastdeploy/engine/request.py | Extended CompletionOutput and RequestOutput classes with draft logprobs support and output type tracking |
fastdeploy/output/token_processor.py | Core token processing logic updated to extract, buffer, and merge draft/target logprobs in speculative decoding mode |
tests/output/test_process_batch_output.py | Added unit test infrastructure for validating speculative decoding with logprobs |
```python
if self.use_logprobs:
    mtype = int(self.output_tokens[1, 0].item())
    batch = self.output_tokens[2, 0]
    accept_num = [int(num[0]) for num in self.output_tokens[3 : batch + 3]]
    tokens = tokens[3 + MAX_BSZ : 3 + MAX_BSZ + batch * MAX_DRAFT_TOKENS * (K + 1)].reshape(
        [batch, MAX_DRAFT_TOKENS, K + 1]
    )
```
Copilot AI (Oct 17, 2025)
Magic numbers (3, K+1) are used without explanation. Consider extracting these as named constants like `METADATA_HEADER_SIZE = 3` and `TOPK_SIZE = K + 1` to improve code clarity and maintainability.
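A possible shape for that refactor, shown only as a sketch built on the quoted snippet (the constant names are the ones suggested above; `MAX_BSZ`, `MAX_DRAFT_TOKENS`, and `K` are assumed to be the module's existing constants):

```python
# Sketch of the suggested refactor, not code that exists in the PR.
METADATA_HEADER_SIZE = 3  # rows before the per-request accept counts (message type, batch size, ...)
TOPK_SIZE = K + 1         # sampled token plus the top-K candidates (assumed layout)

if self.use_logprobs:
    mtype = int(self.output_tokens[1, 0].item())
    batch = self.output_tokens[2, 0]
    accept_num = [
        int(num[0])
        for num in self.output_tokens[METADATA_HEADER_SIZE : batch + METADATA_HEADER_SIZE]
    ]
    tokens = tokens[
        METADATA_HEADER_SIZE + MAX_BSZ : METADATA_HEADER_SIZE + MAX_BSZ + batch * MAX_DRAFT_TOKENS * TOPK_SIZE
    ].reshape([batch, MAX_DRAFT_TOKENS, TOPK_SIZE])
```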
```python
for target, decode in zip(self._batch_result_buffer, draft_batch_result):
    target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
    target_batch_result.append(target)
self._batch_result_buffer = None
```
Copilot AI (Oct 17, 2025)
Potential issue if `self._batch_result_buffer` and `draft_batch_result` have different lengths. The `zip` operation will silently truncate to the shorter list, which could lead to data loss. Add length validation before the zip operation.
Suggested change:

```python
if len(self._batch_result_buffer) != len(draft_batch_result):
    llm_logger.error(
        f"Length mismatch: _batch_result_buffer ({len(self._batch_result_buffer)}) vs draft_batch_result ({len(draft_batch_result)}). Skipping postprocess for this batch."
    )
else:
    for target, decode in zip(self._batch_result_buffer, draft_batch_result):
        target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
        target_batch_result.append(target)
self._batch_result_buffer = None
```
```python
if logprobs_res and logprobs_res.content is not None:
    logprob_contents.extend(logprobs_res.content)

# draf_logprobs
```
Copilot AI (Oct 17, 2025)
Corrected spelling of 'draf_logprobs' to 'draft_logprobs'.

Suggested change:

```python
# draft_logprobs
```
```python
final_res = final_res_batch[idx]
prompt_token_ids = prompt_batched_token_ids[idx]
assert prompt_token_ids is not None
prompt_text = request.prompt
```
Copilot AI (Oct 17, 2025)
Variable `prompt_text` is assigned here but immediately reassigned at line 568 if `request.echo` is true. Consider moving this assignment inside the else block of the echo condition to avoid an unnecessary assignment.
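A sketch of the suggested restructuring (the echo branch body is elided, since that code is defined further down in serving_completion.py and is not quoted here):

```python
# Sketch only: assign prompt_text from the request only when echo will not overwrite it.
final_res = final_res_batch[idx]
prompt_token_ids = prompt_batched_token_ids[idx]
assert prompt_token_ids is not None

if request.echo:
    prompt_text = ...  # echo path: value computed later in the existing code (not shown)
else:
    prompt_text = request.prompt
```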
""" | ||
try: | ||
self.cached_generated_tokens.put_results(batch_result) | ||
if self.cfg.speculative_config.method and self.use_logprobs: |
Copilot AI (Oct 17, 2025)
[nitpick] The condition `self.cfg.speculative_config.method and self.use_logprobs` is checked here, but the same logic pattern appears multiple times. Consider extracting it into a property such as `is_speculative_with_logprobs` for better readability and maintainability.
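One way the suggested helper could look, shown purely as a sketch (the property does not exist in this PR; `cfg.speculative_config.method` and `use_logprobs` are the attributes visible in the quoted code, everything else is illustrative):

```python
class TokenProcessor:
    """Illustrative skeleton; the real class lives in fastdeploy/output/token_processor.py."""

    def __init__(self, cfg, use_logprobs: bool):
        self.cfg = cfg
        self.use_logprobs = use_logprobs

    @property
    def is_speculative_with_logprobs(self) -> bool:
        # True when a speculative method (e.g. MTP) is configured and logprobs were requested.
        return bool(self.cfg.speculative_config.method) and self.use_logprobs
```

Call sites such as `if self.cfg.speculative_config.method and self.use_logprobs:` would then read `if self.is_speculative_with_logprobs:`.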
```diff
 llm_logger.info(
-    f"Request: {task_id} finished, number of " f"generated tokens: {self.tokens_counter[task_id]}."
+    f"Request: {task_id} finished, number of "
+    f"generated tokens: {self.tokens_counter[task_id]}, token_id:{token_id},is_prefill:{is_prefill},recovery_stop:{recovery_stop}"
```
Copilot AI (Oct 17, 2025)
[nitpick] Log message formatting is inconsistent with spacing around colons. Add spaces after the colons for better readability: `token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}`.

Suggested change:

```python
f"generated tokens: {self.tokens_counter[task_id]}, token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}"
```
```python
] * MAX_DRAFT_TOKENS
processor.speculative_stats_step = 0

# processor._recycle_resources = Mock()
```
Copilot AI (Oct 17, 2025)
Commented-out code should be removed. If it is needed for future reference, document why it is commented out or remove it entirely.

Suggested change: delete the line `# processor._recycle_resources = Mock()`.
[Speculative Decoding] Add `draft_logprobs` Support for Speculative Decode MTP

Motivation

This PR adds `draft_logprobs` support for the Speculative Decode mode of MTP and extends the OpenAI-compatible interfaces so that developers can obtain the intermediate prediction probabilities produced during speculative decoding. The primary goal is to improve the observability and debuggability of speculative decoding while remaining fully backward compatible with the existing interfaces.
Modifications

1. New request parameter
   - `include_draft_logprobs` (bool): added to the `/v1/completions` and `/v1/chat/completions` request bodies; controls whether the response carries `draft_logprobs`.

2. New response field
   - `draft_logprobs`: included in the response when the request sets `include_draft_logprobs=true`.

3. Compatibility
   - Default behavior is unchanged; the existing `logprobs` field is not affected.

Usage or Command
Example 1: the `/completions` endpoint

Example response:
Example 2: the `/chat/completions` endpoint

Example response:
Accuracy Tests

With `include_draft_logprobs` enabled, the new field is used for observation only and does not affect the model's output.

Checklist
- The PR title carries the [Speculative Decoding] tag
- Code is formatted and pre-commit checks passed
- Unit tests cover the new include_draft_logprobs behavior
- Cherry-pick to the release branch if needed