[Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP #4467
Conversation
Thanks for your contribution!
Force-pushed from 2c56b6d to 36ad6ed.
Pull Request Overview

This PR adds `draft_logprobs` support for the Speculative Decode MTP mode, enabling developers to capture intermediate prediction probabilities during speculative decoding. The primary goal is to enhance observability and debuggability of the speculative decoding process while maintaining full backward compatibility with existing OpenAI-compatible interfaces.

Key changes:
- Added an `include_draft_logprobs` parameter to the `/v1/completions` and `/v1/chat/completions` request APIs (see the request sketch after this list)
- Introduced a `draft_logprobs` field in response structures to carry intermediate draft token probabilities
- Extended token processing logic to handle and buffer draft logprobs separately from target logprobs in speculative decoding scenarios
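As a quick illustration of the new parameter, a minimal request-body sketch (the model name and prompt are placeholders; only `include_draft_logprobs` and the endpoints come from this PR):

```python
# Hypothetical request body for /v1/completions (use "messages" instead of "prompt"
# for /v1/chat/completions); only include_draft_logprobs is introduced by this PR.
payload = {
    "model": "eb-mtp-model",         # placeholder model name
    "prompt": "Hello, world",
    "logprobs": 2,                   # existing OpenAI-style logprobs request
    "include_draft_logprobs": True,  # new: also return draft-model logprobs
}
```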
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
File | Description |
---|---|
fastdeploy/entrypoints/openai/protocol.py | Added include_draft_logprobs request parameter and draft_logprobs response field to protocol definitions |
fastdeploy/entrypoints/openai/serving_completion.py | Implemented draft logprobs aggregation and processing logic for completion endpoints |
fastdeploy/entrypoints/openai/serving_chat.py | Implemented draft logprobs aggregation and processing logic for chat endpoints |
fastdeploy/engine/request.py | Extended CompletionOutput and RequestOutput classes with draft logprobs support and output type tracking |
fastdeploy/output/token_processor.py | Core token processing logic updated to extract, buffer, and merge draft/target logprobs in speculative decoding mode |
tests/output/test_process_batch_output.py | Added unit test infrastructure for validating speculative decoding with logprobs |
```python
if self.use_logprobs:
    mtype = int(self.output_tokens[1, 0].item())
    batch = self.output_tokens[2, 0]
    accept_num = [int(num[0]) for num in self.output_tokens[3 : batch + 3]]
    tokens = tokens[3 + MAX_BSZ : 3 + MAX_BSZ + batch * MAX_DRAFT_TOKENS * (K + 1)].reshape(
        [batch, MAX_DRAFT_TOKENS, K + 1]
    )
```
Copilot AI (Oct 17, 2025)
Magic numbers (3, K+1) are used without explanation. Consider extracting these as named constants like `METADATA_HEADER_SIZE = 3` and `TOPK_SIZE = K + 1` to improve code clarity and maintainability.
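A possible shape for that refactor, shown only as a sketch built on the quoted snippet (the constant names are the ones suggested above; `MAX_BSZ`, `MAX_DRAFT_TOKENS`, and `K` are assumed to be the module's existing constants):

```python
# Sketch of the suggested refactor, not code that exists in the PR.
METADATA_HEADER_SIZE = 3  # rows before the per-request accept counts (message type, batch size, ...)
TOPK_SIZE = K + 1         # sampled token plus the top-K candidates (assumed layout)

if self.use_logprobs:
    mtype = int(self.output_tokens[1, 0].item())
    batch = self.output_tokens[2, 0]
    accept_num = [
        int(num[0])
        for num in self.output_tokens[METADATA_HEADER_SIZE : batch + METADATA_HEADER_SIZE]
    ]
    tokens = tokens[
        METADATA_HEADER_SIZE + MAX_BSZ : METADATA_HEADER_SIZE + MAX_BSZ + batch * MAX_DRAFT_TOKENS * TOPK_SIZE
    ].reshape([batch, MAX_DRAFT_TOKENS, TOPK_SIZE])
```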
```python
for target, decode in zip(self._batch_result_buffer, draft_batch_result):
    target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
    target_batch_result.append(target)
self._batch_result_buffer = None
```
Copilot AI (Oct 17, 2025)
Potential issue if `self._batch_result_buffer` and `draft_batch_result` have different lengths. The `zip` operation will silently truncate to the shorter list, which could lead to data loss. Add length validation before the zip operation.
Suggested change:

```python
if len(self._batch_result_buffer) != len(draft_batch_result):
    llm_logger.error(
        f"Length mismatch: _batch_result_buffer ({len(self._batch_result_buffer)}) vs draft_batch_result ({len(draft_batch_result)}). Skipping postprocess for this batch."
    )
else:
    for target, decode in zip(self._batch_result_buffer, draft_batch_result):
        target.outputs.draft_top_logprobs = decode.outputs.draft_top_logprobs
        target_batch_result.append(target)
self._batch_result_buffer = None
```
```python
if logprobs_res and logprobs_res.content is not None:
    logprob_contents.extend(logprobs_res.content)

# draf_logprobs
```
Copilot AI (Oct 17, 2025)
Corrected spelling of 'draf_logprobs' to 'draft_logprobs'.

Suggested change:

```python
# draft_logprobs
```
```python
final_res = final_res_batch[idx]
prompt_token_ids = prompt_batched_token_ids[idx]
assert prompt_token_ids is not None
prompt_text = request.prompt
```
Copilot AI (Oct 17, 2025)
Variable `prompt_text` is assigned here but immediately reassigned at line 568 if `request.echo` is true. Consider moving this assignment inside the else block of the echo condition to avoid an unnecessary assignment.
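A sketch of the suggested restructuring (the echo branch body is elided, since that code is defined further down in serving_completion.py and is not quoted here):

```python
# Sketch only: assign prompt_text from the request only when echo will not overwrite it.
final_res = final_res_batch[idx]
prompt_token_ids = prompt_batched_token_ids[idx]
assert prompt_token_ids is not None

if request.echo:
    prompt_text = ...  # echo path: value computed later in the existing code (not shown)
else:
    prompt_text = request.prompt
```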
""" | ||
try: | ||
self.cached_generated_tokens.put_results(batch_result) | ||
if self.cfg.speculative_config.method and self.use_logprobs: |
Copilot AI (Oct 17, 2025)
[nitpick] The condition `self.cfg.speculative_config.method and self.use_logprobs` is checked here, but the same logic pattern appears multiple times. Consider extracting it into a property such as `is_speculative_with_logprobs` for better readability and maintainability.
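One way the suggested helper could look, shown purely as a sketch (the property does not exist in this PR; `cfg.speculative_config.method` and `use_logprobs` are the attributes visible in the quoted code, everything else is illustrative):

```python
class TokenProcessor:
    """Illustrative skeleton; the real class lives in fastdeploy/output/token_processor.py."""

    def __init__(self, cfg, use_logprobs: bool):
        self.cfg = cfg
        self.use_logprobs = use_logprobs

    @property
    def is_speculative_with_logprobs(self) -> bool:
        # True when a speculative method (e.g. MTP) is configured and logprobs were requested.
        return bool(self.cfg.speculative_config.method) and self.use_logprobs
```

Call sites such as `if self.cfg.speculative_config.method and self.use_logprobs:` would then read `if self.is_speculative_with_logprobs:`.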
```diff
 llm_logger.info(
-    f"Request: {task_id} finished, number of " f"generated tokens: {self.tokens_counter[task_id]}."
+    f"Request: {task_id} finished, number of "
+    f"generated tokens: {self.tokens_counter[task_id]}, token_id:{token_id},is_prefill:{is_prefill},recovery_stop:{recovery_stop}"
```
Copilot AI (Oct 17, 2025)
[nitpick] Log message formatting is inconsistent with spacing around colons. Add spaces after the colons for better readability: `token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}`.

Suggested change:

```python
f"generated tokens: {self.tokens_counter[task_id]}, token_id: {token_id}, is_prefill: {is_prefill}, recovery_stop: {recovery_stop}"
```
```python
] * MAX_DRAFT_TOKENS
processor.speculative_stats_step = 0

# processor._recycle_resources = Mock()
```
Copilot AI (Oct 17, 2025)
Commented-out code should be removed. If it is needed for future reference, document why it is commented out or remove it entirely.

Suggested change: delete the line `# processor._recycle_resources = Mock()`.
[Speculative Decoding] Add `draft_logprobs` Support for Speculative Decode MTP

Motivation

This PR adds `draft_logprobs` support for the Speculative Decode mode of MTP and extends the OpenAI-compatible interfaces so that developers can obtain the intermediate prediction probabilities produced during speculative decoding. The primary goal is to improve the observability and debuggability of speculative decoding while remaining fully backward compatible with the existing interfaces.
Modifications

1. New request parameter
   - `include_draft_logprobs` (bool): added to the `/v1/completions` and `/v1/chat/completions` request bodies; controls whether the response carries `draft_logprobs`.

2. New response field
   - `draft_logprobs`: included in the response when the request sets `include_draft_logprobs=true`.

3. Compatibility
   - Default behavior is unchanged; the existing `logprobs` field is not affected.

Usage or Command
Example 1: the `/completions` endpoint

Example response:
Example 2: the `/chat/completions` endpoint

Example response:
Accuracy Tests

With `include_draft_logprobs` enabled, the new field is used for observation only and does not affect the model's output.

Checklist
- The PR title carries the [Speculative Decoding] tag
- Code is formatted and pre-commit checks passed
- Unit tests cover the new include_draft_logprobs behavior
- Cherry-pick to the release branch if needed