[router] complete router oai spec by slin1237 · Pull Request #8828 · sgl-project/sglang

slin1237 · 2025-08-05T18:58:57Z

Motivation

This PR updates the SGLang Router's OpenAI API specification by implementing critical SGLang-specific extensions that were previously missing.

Key Problems Addressed:

Missing Sampling: No support for top_k, min_p, repetition_penalty, min_tokens
No Structured Generation: Missing regex, ebnf, json_schema constraints
Limited Control: No stop_token_ids, ignore_eos, no_stop_trim parameters
No LoRA Support: Missing model customization capabilities
No Reasoning Models: Missing O1-style model features (separate_reasoning, stream_reasoning)
Code Maintenance: Repetitive boilerplate in test files reducing maintainability

This improvement bridges the compatibility gap while maintaining full backward compatibility with existing OpenAI clients.

Modifications

Core SGLang Sampling & Generation Extensions

Completed Chat Completions API (ChatCompletionRequest):

// Advanced Sampling Parameters
pub top_k: Option<i32>,                    // Top-k sampling (-1 to disable)
pub min_p: Option<f32>,                    // Min-p sampling threshold
pub min_tokens: Option<u32>,               // Minimum tokens to generate
pub repetition_penalty: Option<f32>,       // Repetition penalty control

// Structured Generation Constraints
pub regex: Option<String>,                 // Regex pattern constraint
pub ebnf: Option<String>,                  // EBNF grammar constraint

// Advanced Generation Control
pub stop_token_ids: Option<Vec<i32>>,      // Token ID stop conditions
pub no_stop_trim: bool,                    // Skip stop token trimming
pub ignore_eos: bool,                      // Ignore end-of-sequence tokens
pub continue_final_message: bool,          // Continue from last message
pub skip_special_tokens: bool,             // Control detokenization
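The sampling fields above are all optional, so a router could check them before forwarding. The following is a minimal sketch of such a check, not the router's actual code: `SamplingExt` and `validate` are hypothetical names, and the ranges for `min_p`, `repetition_penalty`, and `min_tokens` are assumptions. Only the "-1 disables top-k" rule comes from the field comment.

```rust
/// Hypothetical subset of the sampling fields; not the router's actual struct.
#[derive(Debug, Default, Clone)]
pub struct SamplingExt {
    pub top_k: Option<i32>,
    pub min_p: Option<f32>,
    pub min_tokens: Option<u32>,
    pub repetition_penalty: Option<f32>,
}

impl SamplingExt {
    /// Checks the assumed ranges. `max_tokens` is the request's generation cap, if any.
    pub fn validate(&self, max_tokens: Option<u32>) -> Result<(), String> {
        if let Some(k) = self.top_k {
            // -1 disables top-k (per the field comment); otherwise require k >= 1.
            if k != -1 && k < 1 {
                return Err(format!("top_k must be -1 or >= 1, got {}", k));
            }
        }
        if let Some(p) = self.min_p {
            // Assumption: min_p is a probability threshold in [0, 1]. NaN is rejected.
            if !(0.0..=1.0).contains(&p) {
                return Err(format!("min_p must be in [0, 1], got {}", p));
            }
        }
        if let Some(r) = self.repetition_penalty {
            // Assumption: the penalty must be strictly positive. NaN is rejected.
            if !(r > 0.0) {
                return Err(format!("repetition_penalty must be > 0, got {}", r));
            }
        }
        if let (Some(min), Some(max)) = (self.min_tokens, max_tokens) {
            // Assumption: a minimum above the cap can never be satisfied.
            if min > max {
                return Err(format!("min_tokens {} exceeds max_tokens {}", min, max));
            }
        }
        Ok(())
    }
}
```

Fields left as `None` are skipped, so clients that send none of the extensions pass the check unchanged.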

Completed Completions API (CompletionRequest):

// Same advanced sampling and control parameters as chat completions
pub top_k: Option<i32>,
pub min_p: Option<f32>,
pub min_tokens: Option<u32>,
pub repetition_penalty: Option<f32>,
pub regex: Option<String>,
pub ebnf: Option<String>,
pub json_schema: Option<String>,           // JSON schema constraint (completions-specific)
pub stop_token_ids: Option<Vec<i32>>,
pub no_stop_trim: bool,
pub ignore_eos: bool,
pub skip_special_tokens: bool,
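The completions request carries three structured-generation fields (`regex`, `ebnf`, `json_schema`). The sketch below shows one way to resolve them to a single constraint. It assumes a request may set at most one of the three, which this PR does not state. `Constraints` and `ActiveConstraint` are hypothetical names used only for illustration.

```rust
/// Hypothetical stand-in for the structured-generation fields listed above.
#[derive(Debug, Default, Clone)]
pub struct Constraints {
    pub regex: Option<String>,
    pub ebnf: Option<String>,
    pub json_schema: Option<String>,
}

/// The single constraint a request resolves to.
#[derive(Debug, PartialEq)]
pub enum ActiveConstraint<'a> {
    None,
    Regex(&'a str),
    Ebnf(&'a str),
    JsonSchema(&'a str),
}

impl Constraints {
    /// Returns the one active constraint, or an error when several are set.
    /// Assumption: the three constraints are mutually exclusive.
    pub fn active(&self) -> Result<ActiveConstraint<'_>, String> {
        let set = [
            self.regex.is_some(),
            self.ebnf.is_some(),
            self.json_schema.is_some(),
        ]
        .iter()
        .filter(|&&present| present)
        .count();
        if set > 1 {
            return Err(format!("expected at most one constraint, got {}", set));
        }
        Ok(match (&self.regex, &self.ebnf, &self.json_schema) {
            (Some(r), _, _) => ActiveConstraint::Regex(r.as_str()),
            (_, Some(e), _) => ActiveConstraint::Ebnf(e.as_str()),
            (_, _, Some(j)) => ActiveConstraint::JsonSchema(j.as_str()),
            _ => ActiveConstraint::None,
        })
    }
}
```

Rejecting ambiguous requests early gives the client a clear error rather than leaving the choice of constraint to the backend.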

Advanced SGLang Features

LoRA & Model Customization:

// Flexible LoRA adapter support (single or batch)
#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(untagged)]
pub enum LoRAPath {
    Single(Option<String>),
    Batch(Vec<Option<String>>),
}

pub lora_path: Option<LoRAPath>,                    // LoRA adapter paths
pub session_params: Option<HashMap<String, Value>>, // Session management
pub return_hidden_states: bool,                     // Model hidden states
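To illustrate how the single-or-batch shape of `LoRAPath` might be consumed, here is a small sketch that looks up the adapter for one prompt in a batch. The serde attributes are omitted so the snippet depends only on the standard library. `adapter_for` is a hypothetical helper, and its semantics are assumptions: a `Single` value applies to every prompt, and a missing batch entry falls back to the base model.

```rust
/// Same shape as the enum above, without the serde derives.
#[derive(Debug, Clone, PartialEq)]
pub enum LoRAPath {
    Single(Option<String>),
    Batch(Vec<Option<String>>),
}

impl LoRAPath {
    /// Adapter for the `index`-th prompt; `None` means "use the base model".
    /// Assumption: `Single` applies to all prompts, and an out-of-range
    /// index in `Batch` is treated as "no adapter".
    pub fn adapter_for(&self, index: usize) -> Option<&str> {
        match self {
            LoRAPath::Single(path) => path.as_deref(),
            LoRAPath::Batch(paths) => paths.get(index).and_then(|p| p.as_deref()),
        }
    }
}
```

With `#[serde(untagged)]` on the real enum, a JSON string or `null` would map to `Single` and a JSON array to `Batch`, so clients can send either form in the same field.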

Reasoning Models (O1-style) Support:

pub separate_reasoning: bool,              // Separate reasoning from answer  
pub stream_reasoning: bool,                // Stream reasoning tokens
pub reasoning_content: Option<String>,     // Reasoning content in responses

Enhanced Response Types:

// Extended response capabilities  
pub matched_stop: Option<Value>,           // Which stop condition matched
pub hidden_states: Option<Vec<f32>>,       // Model hidden states
pub reasoning_content: Option<String>,     // O1-style reasoning content
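As a rough picture of what `separate_reasoning` asks for, the sketch below splits generated text into a reasoning part and an answer part. It assumes the model wraps its reasoning in `<think>` and `</think>` tags, which this PR does not specify. `split_reasoning` is a hypothetical function and is not taken from the router or the backend.

```rust
/// Splits `text` into (reasoning, answer).
/// Assumption: reasoning is delimited by `<think>` ... `</think>`.
/// Text without a complete tag pair is returned unchanged as the answer.
pub fn split_reasoning(text: &str) -> (Option<String>, String) {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";

    if let Some(start) = text.find(OPEN) {
        let after_open = start + OPEN.len();
        if let Some(rel_end) = text[after_open..].find(CLOSE) {
            let end = after_open + rel_end;
            let reasoning = text[after_open..end].trim().to_string();
            // The answer is everything outside the tag pair.
            let mut answer = String::from(&text[..start]);
            answer.push_str(&text[end + CLOSE.len()..]);
            return (Some(reasoning), answer.trim().to_string());
        }
    }
    (None, text.to_string())
}
```

In a response, the first element would correspond to `reasoning_content` and the second to the regular message content.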

Code Quality Improvements

Test Code Cleanup - Eliminated ~850 lines of repetitive boilerplate:

benchmark_integration.rs: Added 3 helper functions, refactored 5 test functions
request_processing.rs: Added 3 helper functions, cleaned 7+ sample functions
pd_types.rs: Added 1 helper function, refactored 6 test functions
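The kind of helper that removes this boilerplate can be sketched as follows. The struct and function names are hypothetical and do not match the actual test files. The idea is that one helper builds a default request, and each test overrides only the fields it cares about using struct update syntax.

```rust
/// Hypothetical, trimmed-down request type used only for this illustration.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ChatRequestFixture {
    pub model: String,
    pub prompt: String,
    pub top_k: Option<i32>,
    pub stream: bool,
}

/// Builds the baseline request shared by all tests.
pub fn default_chat_request() -> ChatRequestFixture {
    ChatRequestFixture {
        model: "test-model".to_string(),
        prompt: "Hello".to_string(),
        ..Default::default()
    }
}

/// A test-specific variant: only the differing field is spelled out.
pub fn top_k_request(k: i32) -> ChatRequestFixture {
    ChatRequestFixture {
        top_k: Some(k),
        ..default_chat_request()
    }
}
```

When a new field is added to the request type, only the shared helper needs updating, not every test that constructs a request.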

Accuracy Test

This PR does not modify model-side code, kernels, or model architecture. All changes are to the router's API type definitions and request handling logic. Accuracy testing is not applicable.

Benchmark & Profiling

Performance Impact: Minimal. Serialization and adaptation timings are unchanged by the new fields.

Serialization Performance (using existing benchmarks):

JSON Serialization: 452-556 ns/req (unchanged)
JSON Deserialization: 507-793 ns/req (unchanged)
PD Adaptation: 377-904 ns/req (unchanged)
Total Pipeline: ~2 μs (well under the 10 μs target)
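As a quick consistency check on these figures, the sketch below adds the three stage ranges end to end. It assumes the total pipeline is simply the sum of the stages run sequentially, which the PR does not state. `RangeNs`, `sum_ranges`, and `STAGES` are hypothetical names.

```rust
/// A latency range in nanoseconds per request: (low, high).
pub type RangeNs = (u64, u64);

/// The three stage ranges quoted above: serialization, deserialization, PD adaptation.
pub const STAGES: [RangeNs; 3] = [(452, 556), (507, 793), (377, 904)];

/// Adds stage ranges end to end, assuming the stages run one after another.
pub fn sum_ranges(stages: &[RangeNs]) -> RangeNs {
    stages
        .iter()
        .fold((0, 0), |(lo, hi), &(l, h)| (lo + l, hi + h))
}
```

Under that assumption, `sum_ranges(&STAGES)` gives 1336 to 2253 ns, roughly 1.3 to 2.3 μs, which is consistent with the quoted total of about 2 μs.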

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

gemini-code-assist

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to complete the SGLang Router's OpenAI API specification by incorporating a suite of SGLang-specific extensions. The primary goal is to bridge compatibility gaps and provide richer functionality for advanced use cases, such as fine-tuned control over sampling, structured output generation, and integration with specialized model features like LoRA and O1-style reasoning. This expansion ensures that the SGLang Router can fully leverage the capabilities of SGLang models while maintaining seamless integration with existing OpenAI API clients. Additionally, I've taken the opportunity to improve the maintainability of our test suite by reducing redundant code.

Highlights

Enhanced Sampling Parameters: I've added comprehensive SGLang-specific sampling parameters to both Chat Completions and Completions APIs, including top_k, min_p, min_tokens, and repetition_penalty.
Structured Generation Capabilities: New structured generation constraints like regex, ebnf, and json_schema (for Completions) are now supported, enabling more controlled and predictable model outputs.
Fine-Grained Generation Control: Advanced generation control options such as stop_token_ids, no_stop_trim, ignore_eos, continue_final_message, and skip_special_tokens have been integrated.
LoRA and Model Customization: The router now supports LoRA adapters via lora_path and includes session_params for session management, along with the ability to return hidden_states.
Reasoning Model Integration: Support for O1-style reasoning models has been added with separate_reasoning, stream_reasoning, and reasoning_content fields.
Test Code Refactoring and Cleanup: I've significantly cleaned up test code across benchmark_integration.rs, request_processing.rs, and pd_types.rs by introducing helper functions and refactoring existing tests, reducing boilerplate by approximately 850 lines.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request does a great job of extending the OpenAI API types with SGLang-specific features and refactoring the test suite to be more maintainable by removing significant boilerplate. The changes are well-structured and the intent is clear.

However, I've identified a critical issue: while the new API fields are correctly defined in the request structs, they are not being passed through the to_pd_request adapter functions. This means the new parameters would be dropped and ignored by the backend. I've provided suggestions to fix this for each of the affected request types. I also included a minor suggestion to improve the conciseness of one of the new helper functions.
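The failure mode described here can be sketched with two simplified types. Everything below is hypothetical apart from the name `to_pd_request`: the real structs have many more fields and the real adapter's signature is not shown in this thread. The lossy version copies only the prompt and lets every new field fall back to its default. The second version forwards each field explicitly.

```rust
/// Hypothetical, trimmed-down incoming request.
#[derive(Debug, Clone, Default)]
pub struct ChatRequestSketch {
    pub prompt: String,
    pub top_k: Option<i32>,
    pub regex: Option<String>,
    pub ignore_eos: bool,
}

/// Hypothetical, trimmed-down request sent on to the backend.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PdRequestSketch {
    pub text: String,
    pub top_k: Option<i32>,
    pub regex: Option<String>,
    pub ignore_eos: bool,
}

impl ChatRequestSketch {
    /// Lossy adapter: the new fields are silently reset to their defaults.
    pub fn to_pd_request_lossy(&self) -> PdRequestSketch {
        PdRequestSketch {
            text: self.prompt.clone(),
            ..Default::default()
        }
    }

    /// Pass-through adapter: every field is forwarded explicitly.
    pub fn to_pd_request(&self) -> PdRequestSketch {
        PdRequestSketch {
            text: self.prompt.clone(),
            top_k: self.top_k,
            regex: self.regex.clone(),
            ignore_eos: self.ignore_eos,
        }
    }
}
```

Listing every field in the pass-through version, without a `..Default::default()` fallback, makes the compiler flag any field that is later added to the target struct but not mapped.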

sgl-router/src/openai_api_types.rs

gemini-code-assist · 2025-08-06T01:30:22Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

81549361 · 2025-08-17T00:37:18Z

@slin1237
After starting the router, the following parameter does not take effect:

"chat_template_kwargs": {"enable_thinking": false}

slin1237 requested a review from ByronHsu as a code owner August 5, 2025 18:58

gemini-code-assist bot reviewed Aug 5, 2025

sgl-router/src/openai_api_types.rs (outdated, resolved)

[router] complete router oai spec

700afa0

slin1237 force-pushed the slin/spec branch from a997f2b to 700afa0 August 5, 2025 19:18

Merge branch 'main' into slin/spec

0c0b291

slin1237 merged commit 5d62b56 into main Aug 6, 2025
23 of 24 checks passed

slin1237 deleted the slin/spec branch August 6, 2025 01:30

slin1237 mentioned this pull request Aug 6, 2025

[router] upgrade router version to 0.1.9 #8844

Merged

6 tasks

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025

[router] complete router oai spec (sgl-project#8828)

8a6fd37

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

[router] complete router oai spec (sgl-project#8828)

4b7dc32

slin1237 merged 2 commits into main from slin/spec
