[router] PD Router Simplification and Reorganization #8838

Merged: slin1237 merged 1 commit into main from slin/pd-simplication on Aug 6, 2025
Conversation

@slin1237 (Collaborator) commented Aug 6, 2025

Motivation

The PD (Prefill-Decode) router in SGLang's Rust router implementation suffered from significant architectural complexity that made it difficult to maintain and extend. The existing implementation used a complex multi-layer type conversion system (OpenAI → PD types → JSON) with redundant wrapper methods, resulting in over 1800 lines of unnecessary code and manual field mapping that was error-prone and required updates in multiple places for new fields.

Key problems addressed:

  • Complex adapter layer: 1400+ lines of manual field mapping in request_adapter.rs
  • Dual type system: Maintaining both OpenAI and PD type definitions
  • SGLang extension pass-through issues: New fields required manual updates in 3+ places
  • Redundant wrapper methods: RouterTrait methods just calling PDRouter methods with no added value
  • Code organization: 5 scattered impl blocks with inconsistent organization

Modifications

This PR simplifies and reorganizes the PD router architecture:

1. Direct JSON Manipulation (Simplification)

  • New approach: OpenAI Request → JSON → Bootstrap Injection (eliminates intermediate type conversions)
  • Created: bootstrap_injector.rs with intelligent batch detection and field injection
  • Eliminated: Complex ToPdRequest trait and adapter pattern entirely
  • Result: All OpenAI and SGLang fields automatically preserved with zero manual mapping

2. Direct Implementation Architecture (Reorganization)

  • Eliminated: All wrapper methods where RouterTrait just called PDRouter methods
  • Moved: All routing logic directly into RouterTrait implementations (no indirection)
  • Reorganized: 5 scattered impl blocks → 3 well-organized blocks by functionality
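A minimal before/after illustration of the wrapper elimination: previously the `RouterTrait` method would just delegate to an inherent `PDRouter` method of the same name; now the logic lives directly in the trait impl. The trait shape and method bodies here are hypothetical, chosen only to show the structural change.

```rust
// Hypothetical sketch of direct trait implementation. The names
// (RouterTrait, PDRouter, route_generate) mirror this PR's description,
// but the signatures and bodies are illustrative assumptions.
trait RouterTrait {
    fn route_generate(&self, body: &str) -> String;
}

struct PDRouter {
    prefill_url: String,
    decode_url: String,
}

// Before: RouterTrait::route_generate called an inherent
// PDRouter::route_generate with no added value.
// After: the routing logic sits directly in the trait impl,
// removing one layer of indirection per request.
impl RouterTrait for PDRouter {
    fn route_generate(&self, body: &str) -> String {
        format!(
            "prefill={} decode={} body={}",
            self.prefill_url, self.decode_url, body
        )
    }
}

fn main() {
    let r = PDRouter {
        prefill_url: "http://prefill:8000".into(),
        decode_url: "http://decode:8001".into(),
    };
    assert!(r.route_generate("{}").contains("prefill"));
    println!("ok");
}
```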

3. Code Reduction

  • request_adapter.rs: 1400+ lines → 0 lines (100% reduction)
  • pd_types.rs: 500 lines → 60 lines (88% reduction)
  • pd_router.rs: Removed 370+ lines of wrapper methods and duplicates
  • Total: 1800+ lines eliminated while maintaining 100% functionality

4. Files Modified

  • Created: src/routers/bootstrap_injector.rs (direct JSON field injection)
  • Removed: src/routers/request_adapter.rs (adapter layer eliminated entirely)
  • Cleaned: src/routers/pd_types.rs (removed intermediate types)
  • Reorganized: src/routers/pd_router.rs (direct trait implementations)

Benchmark & Profiling

Before

OpenAI Request (GenerateRequest/ChatCompletionRequest/CompletionRequest)
    ↓ .to_pd_request() [ToPdRequest trait]
PD-Specific Types (GenerateReqInput/ChatReqInput) 
    ↓ .add_bootstrap_info() [Bootstrap trait]
Add bootstrap fields (host/port/room)
    ↓ serde_json::to_value()
JSON for backend

Benchmark:

SGLang Router Performance Benchmark Suite
=============================================

Quick Performance Overview:
  * Serialization (avg):          650 ns/req
  * Deserialization (avg):        912 ns/req
  * PD Adaptation (avg):         1053 ns/req
  * Total Pipeline (avg):        2615 ns/req

After

OpenAI Request (GenerateRequest/ChatCompletionRequest/CompletionRequest)
    ↓ serde_json::to_value()
JSON (with all fields preserved)
    ↓ inject_bootstrap_fields()
JSON with bootstrap fields added

Benchmark:

SGLang Router Performance Benchmark Suite
=============================================

Quick Performance Overview:
  * Serialization (avg):          484 ns/req
  * Deserialization (avg):        533 ns/req
  * Bootstrap Injection (avg):   1061 ns/req
  * Total Pipeline (avg):        2078 ns/req

Performance improvements achieved through architectural simplification:

  • Eliminated one type conversion step per request (OpenAI → PD types)
  • Reduced memory allocations from intermediate type creation
  • Removed function call overhead from wrapper methods
  • Direct JSON manipulation is more efficient than struct conversions

Benchmark Validation

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    24.0      
Max request concurrency:                 not set   
Successful requests:                     256       
Benchmark duration (s):                  91.97     
Total input tokens:                      1495240   
Total generated tokens:                  1229      
Total generated tokens (retokenized):    249       
Request throughput (req/s):              2.78      
Input token throughput (tok/s):          16258.14  
Output token throughput (tok/s):         13.36     
Total token throughput (tok/s):          16271.50  
Concurrency:                             126.62    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45488.91  
Median E2E Latency (ms):                 47299.31  
---------------Time to First Token----------------
Mean TTFT (ms):                          44253.95  
Median TTFT (ms):                        46615.94  
P99 TTFT (ms):                           80582.18  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Setup: two Llama 3.3 70B servers on one H100 node.

Checklist

  • Format your code according to the Code Formatting with Pre-Commit.
  • Add unit tests as outlined in the Running Unit Tests.
    • Added 10 comprehensive bootstrap injection tests
    • Maintained all existing test coverage (211 tests passing)
  • Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
    • Updated inline documentation for new architecture
    • Comprehensive planning documents included
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
    • Performance maintained with architectural improvements
    • All integration tests verify identical behavior
  • For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

@slin1237 slin1237 requested a review from ByronHsu as a code owner August 6, 2025 01:39
@slin1237 slin1237 force-pushed the slin/pd-simplication branch from 0cd8b5f to 25fddd6 Compare August 6, 2025 02:19
@slin1237 slin1237 merged commit 8c7bb39 into main Aug 6, 2025
24 checks passed
@slin1237 slin1237 deleted the slin/pd-simplication branch August 6, 2025 04:20

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025