
Conversation

sunlei1024
Collaborator

[Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP

Motivation

This PR adds draft_logprobs support to MTP's Speculative Decode mode and extends the OpenAI-compatible API so that developers can obtain intermediate prediction probabilities during speculative decoding.

Main goals:

  • Improve the observability and debuggability of Speculative Decode.
  • Provide richer intermediate data for research and experimentation, making it easier to analyze model behavior.

Modifications

1. New request parameter

  • include_draft_logprobs

    • Type: bool
    • Location: request body of /v1/completions and /v1/chat/completions
    • Purpose: controls whether the intermediate probabilities from speculative decoding (draft_logprobs) are returned

2. New response field

  • draft_logprobs

    • Included in the response when include_draft_logprobs=true.
    • Records the intermediate log probabilities of the candidate tokens produced during the Speculative Decode stage.

3. Compatibility

  • Behavior in non-Speculative Decode mode is unchanged.
  • Fully compatible with the existing OpenAI-compatible API; the existing logprobs field is unaffected.

Usage or Command

Example 1: /v1/completions endpoint

curl https://{ip}:{port}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "prompt": "Hello, world!",
    "logprobs": 5,
    "include_draft_logprobs": true
  }'

Example response:

{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "choices": [
    {
      "text": "Hello",
      "logprobs": [ ... ], 
      "draft_logprobs": [ ... ]
    }
  ]
}

Example 2: /v1/chat/completions endpoint

curl https://{ip}:{port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, world!"}
    ],
    "logprobs": true,
    "top_logprobs": 5,
    "include_draft_logprobs": true
  }'

Example response:

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello",
        "logprobs": [ ... ], 
        "draft_logprobs": [ ... ]
      }
    }
  ]
}
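
For programmatic access, here is a minimal Python sketch of the same chat request using the requests library; the base URL and model name are placeholders, and the exact shape of the draft_logprobs entries follows whatever the server returns (elided as [ ... ] above).

import requests

BASE_URL = "https://127.0.0.1:8000"  # placeholder; use your deployment's {ip}:{port}

payload = {
    "model": "your-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, world!"},
    ],
    "logprobs": True,
    "top_logprobs": 5,
    # New flag from this PR: also return draft-stage logprobs.
    "include_draft_logprobs": True,
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]

# logprobs is the standard OpenAI-compatible field; draft_logprobs is the new
# field carrying the intermediate probabilities from the draft (MTP) stage.
print(message["logprobs"])
print(message.get("draft_logprobs"))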

Accuracy Tests

  • The core Speculative Decode logic is unchanged; output consistency is covered by the existing unit tests.
  • With include_draft_logprobs enabled, the new field is for observation only and does not affect the model's output.
  • Request/response format validation has been performed against the OpenAI-compatible API.

Checklist

  • Add tag [Speculative Decoding]
  • Code formatted (pre-commit passed)
  • Added unit tests for include_draft_logprobs
  • Verified backward compatibility
  • (Optional) Accuracy benchmark on large-scale dataset
  • (Optional) Cherry-pick to release branch if needed


paddle-bot bot commented Oct 17, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Oct 17, 2025
@sunlei1024 sunlei1024 force-pushed the feat/draft_logprobs_cp branch from 2c56b6d to 36ad6ed on October 17, 2025 04:07
@Jiang-Jia-Jun Jiang-Jia-Jun requested a review from Copilot October 17, 2025 05:32
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds draft_logprobs support for the Speculative Decode MTP mode, enabling developers to capture intermediate prediction probabilities during speculative decoding. The primary goal is to enhance observability and debuggability of the speculative decoding process while maintaining full backward compatibility with existing OpenAI-compatible interfaces.

Key changes:

  • Added include_draft_logprobs parameter to /v1/completions and /v1/chat/completions request APIs
  • Introduced draft_logprobs field in response structures to carry intermediate draft token probabilities
  • Extended token processing logic to handle and buffer draft logprobs separately from target logprobs in speculative decoding scenarios
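
To make the last bullet concrete, below is a rough sketch of the buffering pattern, assuming hypothetical TARGET_MSG_TYPE / DRAFT_MSG_TYPE constants; the attribute names (_batch_result_buffer, draft_top_logprobs) follow the snippets reviewed further down, but the control flow here is simplified and is not the exact implementation.

if mtype == TARGET_MSG_TYPE:  # hypothetical constant for target-model output
    # Hold the target results until the matching draft pass arrives.
    self._batch_result_buffer = batch_result
elif mtype == DRAFT_MSG_TYPE and self._batch_result_buffer is not None:  # hypothetical constant
    # Attach the draft logprobs to the buffered target results and emit them together.
    merged = []
    for target, draft in zip(self._batch_result_buffer, batch_result):
        target.outputs.draft_top_logprobs = draft.outputs.draft_top_logprobs
        merged.append(target)
    self._batch_result_buffer = None
    self.postprocess(merged)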

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Summary per file:

  • fastdeploy/entrypoints/openai/protocol.py: Added the include_draft_logprobs request parameter and the draft_logprobs response field to the protocol definitions
  • fastdeploy/entrypoints/openai/serving_completion.py: Implemented draft logprobs aggregation and processing logic for the completion endpoint
  • fastdeploy/entrypoints/openai/serving_chat.py: Implemented draft logprobs aggregation and processing logic for the chat endpoint
  • fastdeploy/engine/request.py: Extended the CompletionOutput and RequestOutput classes with draft logprobs support and output type tracking
  • fastdeploy/output/token_processor.py: Updated core token processing logic to extract, buffer, and merge draft/target logprobs in speculative decoding mode
  • tests/output/test_process_batch_output.py: Added unit test infrastructure for validating speculative decoding with logprobs

Comment on lines +522 to +528
if self.use_logprobs:
    mtype = int(self.output_tokens[1, 0].item())
    batch = self.output_tokens[2, 0]
    accept_num = [int(num[0]) for num in self.output_tokens[3 : batch + 3]]
    tokens = tokens[3 + MAX_BSZ : 3 + MAX_BSZ + batch * MAX_DRAFT_TOKENS * (K + 1)].reshape(
        [batch, MAX_DRAFT_TOKENS, K + 1]
    )

Copilot AI Oct 17, 2025


Magic numbers (3, K+1) are used without explanation. Consider extracting these as named constants like METADATA_HEADER_SIZE = 3 and TOPK_SIZE = K + 1 to improve code clarity and maintainability.
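
A minimal sketch of what that refactor could look like, using the constant names proposed in the comment (they do not exist in the current code; the header-row comment is an assumption based on the snippet above):

METADATA_HEADER_SIZE = 3  # header rows (mtype, batch size, etc.) preceding the per-request accept counts
TOPK_SIZE = K + 1         # K top-logprob entries plus one, as used in the reshape above

accept_num = [int(num[0]) for num in self.output_tokens[METADATA_HEADER_SIZE : batch + METADATA_HEADER_SIZE]]
tokens = tokens[
    METADATA_HEADER_SIZE + MAX_BSZ : METADATA_HEADER_SIZE + MAX_BSZ + batch * MAX_DRAFT_TOKENS * TOPK_SIZE
].reshape([batch, MAX_DRAFT_TOKENS, TOPK_SIZE])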


Comment on lines +420 to +423
for target, decode in zip(self._batch_result_buffer, draft_batch_result):
    target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
    target_batch_result.append(target)
self._batch_result_buffer = None

Copilot AI Oct 17, 2025


Potential issue if self._batch_result_buffer and draft_batch_result have different lengths. The zip operation will silently truncate to the shorter list, which could lead to data loss. Add length validation before the zip operation.

Suggested change

- for target, decode in zip(self._batch_result_buffer, draft_batch_result):
-     target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
-     target_batch_result.append(target)
- self._batch_result_buffer = None
+ if len(self._batch_result_buffer) != len(draft_batch_result):
+     llm_logger.error(
+         f"Length mismatch: _batch_result_buffer ({len(self._batch_result_buffer)}) vs draft_batch_result ({len(draft_batch_result)}). Skipping postprocess for this batch."
+     )
+ else:
+     for target, decode in zip(self._batch_result_buffer, draft_batch_result):
+         target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
+         target_batch_result.append(target)
+ self._batch_result_buffer = None


if logprobs_res and logprobs_res.content is not None:
    logprob_contents.extend(logprobs_res.content)

# draf_logprobs

Copilot AI Oct 17, 2025


Corrected spelling of 'draf_logprobs' to 'draft_logprobs'.

Suggested change

- # draf_logprobs
+ # draft_logprobs


final_res = final_res_batch[idx]
prompt_token_ids = prompt_batched_token_ids[idx]
assert prompt_token_ids is not None
prompt_text = request.prompt

Copilot AI Oct 17, 2025


Variable prompt_text is assigned here but immediately reassigned at line 568 if request.echo is true. Consider moving this assignment inside the else block of the echo condition to avoid unnecessary assignment.


"""
try:
self.cached_generated_tokens.put_results(batch_result)
if self.cfg.speculative_config.method and self.use_logprobs:

Copilot AI Oct 17, 2025


[nitpick] The condition self.cfg.speculative_config.method and self.use_logprobs is checked here, but the same logic pattern appears multiple times. Consider extracting this into a property method like is_speculative_with_logprobs for better readability and maintainability.
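
A minimal sketch of that extraction, assuming the property lives on the token processor class (the name follows the reviewer's suggestion and is not part of the current code):

@property
def is_speculative_with_logprobs(self) -> bool:
    # True when speculative decoding is enabled and logprobs were requested,
    # i.e. when draft and target logprobs need to be buffered and merged.
    return bool(self.cfg.speculative_config.method) and self.use_logprobs

# Call sites would then read:
# if self.is_speculative_with_logprobs:
#     ...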


llm_logger.info(
    f"Request: {task_id} finished, number of " f"generated tokens: {self.tokens_counter[task_id]}."
    f"Request: {task_id} finished, number of "
    f"generated tokens: {self.tokens_counter[task_id]}, token_id:{token_id},is_prefill:{is_prefill},recovery_stop:{recovery_stop}"

Copilot AI Oct 17, 2025


[nitpick] Log message formatting is inconsistent with spacing around colons. Add spaces after colons for better readability: token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}.

Suggested change

- f"generated tokens: {self.tokens_counter[task_id]}, token_id:{token_id},is_prefill:{is_prefill},recovery_stop:{recovery_stop}"
+ f"generated tokens: {self.tokens_counter[task_id]}, token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}"


] * MAX_DRAFT_TOKENS
processor.speculative_stats_step = 0

# processor._recycle_resources = Mock()

Copilot AI Oct 17, 2025


Commented-out code should be removed. If this is needed for future reference, document why it's commented or remove it entirely.

Suggested change

- # processor._recycle_resources = Mock()


@gongshaotian gongshaotian changed the title [Feat] Add draft_logprobs for Speculative Decode MTP [Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP Oct 17, 2025