
Conversation

@juncaipeng (Collaborator) commented Oct 30, 2025

Motivation

Refine splitwise deployment.

Modifications

  • Add a simple router that supports both mixed and splitwise deployments
  • Refine PD communication in the splitwise deployment
  • Add examples
  • Fix the unit test for splitwise deployment on multiple nodes
  • Run benchmarks with the splitwise deployment

TODO:

  • Add docs
  • Add a unit test that uses RDMA to transfer the cache

Usage or Command

Refer to the example scripts under examples/splitwise.

Accuracy Tests

Benchmark of v1 splitwise:

benchmark_duration: 16.98276040190831 s
============ Serving Benchmark Result ============
Successful requests:                     997       
Benchmark duration (s):                  16.98     
Total input tokens:                      1275938   
Total generated tokens:                  1994      
Request throughput (req/s):              58.707    
Output token throughput (tok/s):         117.41    
Total Token throughput (tok/s):          75248.78  
---------------Decode Speed (tok/s)---------------
Mean Decode:                             12.51     
Median Decode:                           12.01     
P80 Decode:                              14.32     
P95 Decode:                              17.76     
P99 Decode:                              29.91     
P99.9 Decode:                            48.12     
P99.95 Decode:                           48.38     
P99.99 Decode:                           48.58     
---------------Time to First Token----------------
Mean TTFT (ms):                          3036.59   
Median TTFT (ms):                        3092.77   
P80 TTFT (ms):                           3325.40   
P95 TTFT (ms):                           3671.44   
P99 TTFT (ms):                           4170.79   
P99.9 TTFT (ms):                         4279.39   
P99.95 TTFT (ms):                        4279.76   
P99.99 TTFT (ms):                        4280.05   
------------Infer Time to First Token-------------
Mean S_TTFT (ms):                        77.57     
Median S_TTFT (ms):                      78.24     
P80 S_TTFT (ms):                         103.77    
P95 S_TTFT (ms):                         128.60    
P99 S_TTFT (ms):                         147.21    
P99.9 S_TTFT (ms):                       158.14    
P99.95 S_TTFT (ms):                      160.26    
P99.99 S_TTFT (ms):                      161.96 

Benchmark of v2 splitwise (using the router):

benchmark_duration: 16.793025318998843 s
============ Serving Benchmark Result ============
Successful requests:                     997       
Benchmark duration (s):                  16.79     
Total input tokens:                      1275938   
Total generated tokens:                  1994      
Request throughput (req/s):              59.370    
Output token throughput (tok/s):         118.74    
Total Token throughput (tok/s):          76098.97  
---------------Decode Speed (tok/s)---------------
Mean Decode:                             13.51     
Median Decode:                           12.60     
P80 Decode:                              14.95     
P95 Decode:                              20.80     
P99 Decode:                              43.47     
P99.9 Decode:                            54.75     
P99.95 Decode:                           69.16     
P99.99 Decode:                           80.69     
---------------Time to First Token----------------
Mean TTFT (ms):                          3026.10   
Median TTFT (ms):                        3079.82   
P80 TTFT (ms):                           3209.71   
P95 TTFT (ms):                           4362.91   
P99 TTFT (ms):                           4838.31   
P99.9 TTFT (ms):                         4930.10   
P99.95 TTFT (ms):                        4931.17   
P99.99 TTFT (ms):                        4932.04   
------------Infer Time to First Token-------------
Mean S_TTFT (ms):                        72.28     
Median S_TTFT (ms):                      70.63     
P80 S_TTFT (ms):                         93.56     
P95 S_TTFT (ms):                         119.08    
P99 S_TTFT (ms):                         147.24    
P99.9 S_TTFT (ms):                       273.10    
P99.95 S_TTFT (ms):                      277.48    
P99.99 S_TTFT (ms):                      280.98 

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitted to the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@CLAassistant commented Oct 30, 2025

CLA assistant check
All committers have signed the CLA.

@paddle-bot (bot) commented Oct 30, 2025

Thanks for your contribution!


export CUDA_VISIBLE_DEVICES=0
export FD_DEBUG=1
export ENABLE_V1_KVCACHE_SCHEDULER=0
Collaborator:

This switch should already be deprecated, and it also turns on DEBUG logging here.

Collaborator Author:

DEBUG is enabled here to make troubleshooting easier; it can be removed later. ENABLE_V1_KVCACHE_SCHEDULER is still effective; v1 support will be adapted later.

self.ips = self.ips.split(",")

self.host_ip = get_host_ip()
self.port = port
Collaborator:

Is this pushing the service-layer port number down into the config? It should only be a configuration parameter at the APIServer layer.

Collaborator Author:

Yes. The config needs to know the API server's port so it can report it to the router.
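
For illustration, a minimal sketch of what such a registration report could look like; the endpoint path, payload fields, and helper name below are assumptions for this sketch, not the PR's actual code:

import requests  # assumed dependency

def report_to_router(router_addr: str, host_ip: str, api_port: int, role: str) -> None:
    """Hypothetical: an instance announces its API endpoint to the router."""
    payload = {
        "role": role,        # e.g. "prefill" or "decode"
        "host": host_ip,     # e.g. the value returned by get_host_ip()
        "port": api_port,    # the APIServer port the config must be aware of
    }
    # The /register path is illustrative only.
    requests.post(f"http://{router_addr}/register", json=payload, timeout=5)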

@juncaipeng force-pushed the pd branch 2 times, most recently from 7903bb4 to 2f16499 on November 4, 2025 07:22
logger = get_logger("cache_messager", "cache_messager.log")


def parse_args():
Collaborator:

With the logger declaration removed here, can the functions below still use logger normally?

Collaborator Author:

This file uses a logger tagged with the rank_id, so the declaration here was redundant.
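
A minimal sketch of the per-rank logger pattern described above, reusing the get_logger(name, file) call shape seen elsewhere in this PR (the rank-suffixed filename is an assumption):

from fastdeploy.utils import get_logger

rank_id = 0  # in the real file this would come from the worker's rank
logger = get_logger("cache_messager", f"cache_messager_rank{rank_id}.log")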


self.read_from_config()
self.postprocess()
self.init_cache_info()
Collaborator:

Please confirm with tingdan whether init_cache_info should be placed in the config step, and whether it affects profiling and weight reloading in training scenarios.

Collaborator Author:

ok

from fastdeploy.utils import llm_logger
from fastdeploy.utils import get_logger, llm_logger

config_logger = get_logger("config", "config.log")
Collaborator:

There are already quite a few log files; unless it is necessary, config should not add another one.

Collaborator Author:

config.log existed before; the config used to be written to llm_logger, and now it is all written to config.log.

error_msg=task["error_msg"],
)
)
output = RequestOutput.from_dict(task)
Collaborator:

Is this code equivalent to the version on the left, e.g. for fields like send_idx and finished?

Collaborator Author:

In my runs it is equivalent; from_dict covers more fields.
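
As a hedged illustration of the equivalence under discussion, using the RequestOutput class from the diff above (the dict fields beyond error_msg, send_idx, and finished are hypothetical):

# Both construction paths should yield the same RequestOutput for such a task;
# from_dict simply picks up every field present in the dict.
task = {
    "request_id": "req-0",  # hypothetical value
    "send_idx": 3,
    "finished": True,
    "error_msg": None,
}
output = RequestOutput.from_dict(task)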

@juncaipeng changed the title from "[Feature] [PD] splitwise deployment on multi node supports router" to "[Feature] [PD] add simple router and refine splitwise deployment" on Nov 4, 2025
@Jiang-Jia-Jun requested a review from Copilot on November 4, 2025 13:43
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR implements a router-based splitwise deployment architecture (v2) for distributed LLM inference, supporting flexible prefill/decode instance management across multiple nodes.

  • Introduces a new router service that manages prefill/decode instance registration and request routing
  • Adds support for dynamic instance health monitoring and automatic removal of unhealthy instances
  • Implements request ID propagation through the router to enable proper request tracking in splitwise deployments
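
As a hedged illustration of the health-monitoring behavior described above (every name, path, and interval below is an assumption, not the PR's actual router code):

import threading
import time

import requests  # assumed dependency

# Registered instances, keyed by "host:port" (illustrative).
INSTANCES = {"10.0.0.1:8000": {"role": "prefill"}}

def monitor_instances(interval_s: float = 10.0) -> None:
    """Periodically probe each instance and drop the ones that stop answering."""
    while True:
        for addr in list(INSTANCES):
            try:
                healthy = requests.get(f"http://{addr}/health", timeout=2).status_code == 200
            except requests.RequestException:
                healthy = False
            if not healthy:
                # Automatic removal of unhealthy instances.
                INSTANCES.pop(addr, None)
        time.sleep(interval_s)

threading.Thread(target=monitor_instances, daemon=True).start()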

Reviewed Changes

Copilot reviewed 33 out of 38 changed files in this pull request and generated 7 comments.

Summary per file:

  • tests/e2e/test_ernie_03b_pd_multi_node.py: New e2e test for multi-node splitwise deployment with router
  • tests/e2e/test_ernie_03b_pd.py: Updated to use router-based deployment (v2) instead of direct instance communication
  • requirements.txt: Added setproctitle and aistudio_sdk dependencies
  • fastdeploy/router/*.py: New router module with launch script, router server, and utilities
  • fastdeploy/config.py: Added RouterConfig class and splitwise version detection logic
  • fastdeploy/engine/args_utils.py: Added router and port CLI arguments, reorganized splitwise args
  • fastdeploy/engine/common_engine.py: Implemented router registration and refactored splitwise task processing
  • fastdeploy/entrypoints/openai/*.py: Added request_id and disaggregate_info support to API protocols
  • fastdeploy/splitwise/splitwise_connector.py: Enhanced logging and refactored message handling
  • fastdeploy/scheduler/local_scheduler.py: Added has_request method and enhanced logging
  • fastdeploy/worker/worker_process.py: Added execution time logging
  • examples/splitwise/*.sh: New example scripts for v0/v1/v2 splitwise deployments
  • docs/**/multi-node_deployment.md: Removed trailing whitespace
  • benchmarks/*.py: Code formatting fixes

)
from fastdeploy.model_executor.ops.gpu import get_output_kv_signal, set_data_ipc
from fastdeploy.utils import envs, get_logger

Copilot AI commented Nov 4, 2025:

The logger initialization at module level (lines 38-39 in the original code) was removed but the code still uses logger variable throughout the file (e.g., line 348, 378-379). This will cause NameError at runtime since logger is now only defined inside a function at line 844. Either restore the module-level logger or ensure all usage sites have access to the logger instance.

Suggested change
logger = get_logger()

Comment on lines 160 to 162
env_decode["CUDA_VISIBLE_DEVICES"] = "1"
env_prefill["ENABLE_V1_KVCACHE_SCHEDULER"] = "0"
env_decode["INFERENCE_MSG_QUEUE_ID"] = str(FD_API_PORT + 1)
Copilot AI commented Nov 4, 2025:

Line 161 incorrectly sets env_prefill["ENABLE_V1_KVCACHE_SCHEDULER"] instead of env_decode["ENABLE_V1_KVCACHE_SCHEDULER"]. This causes the environment variable to be set on the wrong process environment.

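Applying the fix Copilot describes, the decode environment block would presumably read:

env_decode["CUDA_VISIBLE_DEVICES"] = "1"
env_decode["ENABLE_V1_KVCACHE_SCHEDULER"] = "0"  # was mistakenly set on env_prefill
env_decode["INFERENCE_MSG_QUEUE_ID"] = str(FD_API_PORT + 1)
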
Comment on lines 242 to 244
env_decode["CUDA_VISIBLE_DEVICES"] = "1"
env_prefill["ENABLE_V1_KVCACHE_SCHEDULER"] = "0"
env_decode["INFERENCE_MSG_QUEUE_ID"] = str(FD_API_PORT + 1)
Copilot AI commented Nov 4, 2025:

Line 243 incorrectly sets env_prefill["ENABLE_V1_KVCACHE_SCHEDULER"] instead of env_decode["ENABLE_V1_KVCACHE_SCHEDULER"]. This causes the environment variable to be set on the wrong process environment.

Returns:
bool: True if the service is healthy, False otherwise.
"""
Copilot AI commented Nov 4, 2025:

The function doesn't ensure the base_url starts with 'http' before using it, unlike the test file version at line 74. If a URL without a protocol is passed, it will make an invalid request. Add the protocol check from the test version: if not base_url.startswith("http"): base_url = f"http://{base_url}"

Suggested change
"""
"""
if not base_url.startswith(("http://", "https://")):
    base_url = f"http://{base_url}"

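Putting the docstring context and the suggestion together, a self-contained sketch of such a health check could look like this (the /health path and the function name are assumptions):

import requests  # assumed dependency

def check_service_health(base_url: str, timeout_s: float = 2.0) -> bool:
    """Probe a service endpoint.

    Returns:
        bool: True if the service is healthy, False otherwise.
    """
    # Normalize the URL first, as the suggested change recommends.
    if not base_url.startswith(("http://", "https://")):
        base_url = f"http://{base_url}"
    try:
        # The /health path is illustrative; the PR's actual endpoint may differ.
        return requests.get(f"{base_url}/health", timeout=timeout_s).status_code == 200
    except requests.RequestException:
        return False
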
Comment on lines +1 to +3
pkill -9 -f python
pkill -9 -f fastdeploy
pkill -f -9 gunicorn
Copilot AI commented Nov 4, 2025:

Using pkill -9 to forcefully kill all Python processes is dangerous in a shared environment as it will terminate unrelated Python processes. Consider using more targeted process management (e.g., storing PIDs in files and killing specific processes) or at least warning users about this behavior in comments.

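A minimal sketch of the PID-file approach Copilot suggests (the path and helper name are hypothetical):

import os
import signal

PID_FILE = "/tmp/splitwise_pids.txt"  # each launch script appends its PID here

def kill_recorded_processes() -> None:
    """Kill only the processes recorded at launch time, not every python process."""
    if not os.path.exists(PID_FILE):
        return
    with open(PID_FILE) as f:
        for line in f:
            try:
                os.kill(int(line.strip()), signal.SIGKILL)
            except (ValueError, ProcessLookupError):
                pass  # blank line, or the process already exited
    os.remove(PID_FILE)
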
Comment on lines 418 to 419
# for idx, chunk in enumerate(chunks):
# print(f"\nchunk[{idx}]:\n{json.dumps(chunk, indent=2, ensure_ascii=False)}")
Copilot AI commented Nov 4, 2025:

This comment appears to contain commented-out code.

continue
os.kill(pid, signal.SIGKILL)
print(f"Killed process on port {port}, pid={pid}")
except subprocess.CalledProcessError:
Copilot AI commented Nov 4, 2025:

'except' clause does nothing but pass and there is no explanatory comment.

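A minimal fix, assuming the port lookup above is a subprocess call (e.g. lsof) that exits non-zero when no process is listening on the port:

except subprocess.CalledProcessError:
    # Nothing is listening on this port, so there is nothing to kill.
    pass
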
@juncaipeng force-pushed the pd branch 2 times, most recently from 1fd1ad6 to c7b0414 on November 5, 2025 04:05
)
item["layer_idx"] = current_layer_idx
if item["layer_idx"] == self.num_layers:
item["status"] = "finished"
Collaborator:

if 'error' not in item['status']:
    item["status"] = "finished"

Collaborator Author:

ok

@Jiang-Jia-Jun merged commit 08ca0f6 into PaddlePaddle:develop on Nov 6, 2025
10 of 13 checks passed