
[router] complete router oai spec #8828

Merged: slin1237 merged 2 commits into main from slin/spec on Aug 6, 2025
@slin1237 slin1237 commented Aug 5, 2025

Motivation

This PR updates the SGLang Router's OpenAI API specification by implementing critical SGLang-specific extensions that were previously missing.

Key Problems Addressed:

  • Missing Sampling: No support for top_k, min_p, repetition_penalty, min_tokens
  • No Structured Generation: Missing regex, ebnf, json_schema constraints
  • Limited Control: No stop_token_ids, ignore_eos, no_stop_trim parameters
  • No LoRA Support: Missing model customization capabilities
  • No Reasoning Models: Missing O1-style model features (separate_reasoning, stream_reasoning)
  • Code Maintenance: Repetitive boilerplate in test files reducing maintainability

This improvement bridges the compatibility gap while maintaining full backward compatibility with existing OpenAI clients.

Modifications

Core SGLang Sampling & Generation Extensions

Completed Chat Completions API (ChatCompletionRequest):

// Advanced Sampling Parameters
pub top_k: Option<i32>,                    // Top-k sampling (-1 to disable)
pub min_p: Option<f32>,                    // Min-p sampling threshold
pub min_tokens: Option<u32>,               // Minimum tokens to generate
pub repetition_penalty: Option<f32>,       // Repetition penalty control

// Structured Generation Constraints
pub regex: Option<String>,                 // Regex pattern constraint
pub ebnf: Option<String>,                  // EBNF grammar constraint

// Advanced Generation Control
pub stop_token_ids: Option<Vec<i32>>,      // Token ID stop conditions
pub no_stop_trim: bool,                    // Skip stop token trimming
pub ignore_eos: bool,                      // Ignore end-of-sequence tokens
pub continue_final_message: bool,          // Continue from last message
pub skip_special_tokens: bool,             // Control detokenization
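
For reviewers who want to see how these optional sampling fields might be checked before forwarding, here is a minimal std-only sketch. The `SamplingParams` struct and `validate_sampling` function are hypothetical and the accepted ranges are assumptions; this PR does not show the router's actual validation rules.

```rust
/// Hypothetical subset of the sampling fields listed above (not the router's type).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct SamplingParams {
    pub top_k: Option<i32>,
    pub min_p: Option<f32>,
    pub min_tokens: Option<u32>,
    pub repetition_penalty: Option<f32>,
}

/// Assumed range checks; the real router may accept or reject different values.
pub fn validate_sampling(p: &SamplingParams) -> Result<(), String> {
    if let Some(k) = p.top_k {
        // The field comment says -1 disables top-k; this sketch rejects 0 and other negatives.
        if k != -1 && k < 1 {
            return Err(format!("top_k must be -1 or >= 1, got {k}"));
        }
    }
    if let Some(m) = p.min_p {
        // Treat min_p as a probability threshold (assumption). NaN fails this check too.
        if !(0.0..=1.0).contains(&m) {
            return Err(format!("min_p must be in [0, 1], got {m}"));
        }
    }
    if let Some(r) = p.repetition_penalty {
        // Require a strictly positive penalty (assumption). NaN is rejected as well.
        if !(r > 0.0) {
            return Err(format!("repetition_penalty must be > 0, got {r}"));
        }
    }
    Ok(())
}
```

Because every field is an `Option`, a request that sets none of them passes unchanged, which matches the backward-compatibility goal stated in the motivation.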

Completed Completions API (CompletionRequest):

// Same advanced sampling and control parameters as chat completions
pub top_k: Option<i32>,
pub min_p: Option<f32>,
pub min_tokens: Option<u32>,
pub repetition_penalty: Option<f32>,
pub regex: Option<String>,
pub ebnf: Option<String>,
pub json_schema: Option<String>,           // JSON schema constraint (completions-specific)
pub stop_token_ids: Option<Vec<i32>>,
pub no_stop_trim: bool,
pub ignore_eos: bool,
pub skip_special_tokens: bool,
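
The three structured-generation fields (`regex`, `ebnf`, `json_schema`) are independent `Option`s, so a client could set more than one. The sketch below assumes a request should carry at most one constraint; that rule is an assumption for illustration and is not stated in this PR. `Constraint` and `pick_constraint` are hypothetical names.

```rust
/// Hypothetical view of whichever constraint a request carries.
#[derive(Debug, PartialEq)]
pub enum Constraint<'a> {
    Regex(&'a str),
    Ebnf(&'a str),
    JsonSchema(&'a str),
}

/// Collapse the three optional fields into at most one constraint.
/// Returns an error when more than one is set (assumed rule).
pub fn pick_constraint<'a>(
    regex: Option<&'a str>,
    ebnf: Option<&'a str>,
    json_schema: Option<&'a str>,
) -> Result<Option<Constraint<'a>>, String> {
    let mut found = Vec::new();
    if let Some(r) = regex {
        found.push(Constraint::Regex(r));
    }
    if let Some(e) = ebnf {
        found.push(Constraint::Ebnf(e));
    }
    if let Some(j) = json_schema {
        found.push(Constraint::JsonSchema(j));
    }
    match found.len() {
        0 => Ok(None),
        1 => Ok(found.pop()),
        n => Err(format!("expected at most one constraint, got {n}")),
    }
}
```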

Advanced SGLang Features

LoRA & Model Customization:

// Flexible LoRA adapter support (single or batch)
#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(untagged)]
pub enum LoRAPath {
    Single(Option<String>),
    Batch(Vec<Option<String>>),
}

pub lora_path: Option<LoRAPath>,                    // LoRA adapter paths
pub session_params: Option<HashMap<String, Value>>, // Session management
pub return_hidden_states: bool,                     // Model hidden states
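
As an illustration of how the single-or-batch shape of `LoRAPath` could be consumed, the sketch below expands it to one adapter entry per prompt. The enum is repeated here without the serde derives so the example stays std-only; `per_prompt_adapters` is a hypothetical helper and not part of this PR.

```rust
/// Same variants as the PR's enum, minus the serde attributes.
#[derive(Debug, Clone, PartialEq)]
pub enum LoRAPath {
    Single(Option<String>),
    Batch(Vec<Option<String>>),
}

/// Hypothetical helper: return one adapter entry per prompt in the batch.
/// A `Single` path is repeated; a `Batch` must already match the batch size.
pub fn per_prompt_adapters(
    path: &LoRAPath,
    batch_size: usize,
) -> Result<Vec<Option<String>>, String> {
    match path {
        LoRAPath::Single(p) => Ok(vec![p.clone(); batch_size]),
        LoRAPath::Batch(v) if v.len() == batch_size => Ok(v.clone()),
        LoRAPath::Batch(v) => Err(format!(
            "lora_path has {} entries but the batch has {} prompts",
            v.len(),
            batch_size
        )),
    }
}
```

A `None` entry is kept as-is, on the assumption that it means "no adapter for this prompt".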

Reasoning Models (O1-style) Support:

pub separate_reasoning: bool,              // Separate reasoning from answer  
pub stream_reasoning: bool,                // Stream reasoning tokens
pub reasoning_content: Option<String>,     // Reasoning content in responses

Enhanced Response Types:

// Extended response capabilities  
pub matched_stop: Option<Value>,           // Which stop condition matched
pub hidden_states: Option<Vec<f32>>,       // Model hidden states
pub reasoning_content: Option<String>,     // O1-style reasoning content

Code Quality Improvements

Test Code Cleanup - Eliminated ~850 lines of repetitive boilerplate:

  • benchmark_integration.rs: Added 3 helper functions, refactored 5 test functions
  • request_processing.rs: Added 3 helper functions, cleaned 7+ sample functions
  • pd_types.rs: Added 1 helper function, refactored 6 test functions
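
The helper functions themselves are not shown in this description. A common way to get this kind of reduction, sketched here under that caveat, is a single constructor for a baseline request plus struct-update syntax in each test. `ChatRequest` and `default_chat_request` are hypothetical stand-ins, not the names used in the test files.

```rust
/// Hypothetical, trimmed-down request type used only for this sketch.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ChatRequest {
    pub model: String,
    pub max_tokens: Option<u32>,
    pub top_k: Option<i32>,
    pub stream: bool,
    pub ignore_eos: bool,
}

/// One shared baseline so each test only spells out the fields it cares about.
pub fn default_chat_request() -> ChatRequest {
    ChatRequest {
        model: "test-model".to_string(),
        max_tokens: Some(16),
        ..Default::default()
    }
}

/// Example of a test-side override using struct-update syntax.
pub fn streaming_request() -> ChatRequest {
    ChatRequest {
        stream: true,
        ..default_chat_request()
    }
}
```

With this pattern, adding a new field to the struct touches the helper once instead of every test that builds a request.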

Accuracy Test

This PR does not modify model-side code, kernels, or model architecture. All changes are to the router's API type definitions and request handling logic. Accuracy testing is not applicable.

Benchmark & Profiling

Performance Impact: Minimal. The existing serialization benchmarks show no change.

Serialization Performance (using existing benchmarks):

  • JSON Serialization: 452-556 ns/req (unchanged)
  • JSON Deserialization: 507-793 ns/req (unchanged)
  • PD Adaptation: 377-904 ns/req (unchanged)
  • Total Pipeline: ~2 μs (well under the 10 μs target)
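
For readers unfamiliar with per-request figures like these, the sketch below shows one simple way to obtain an average ns/req number with only the standard library. It is not the repository's benchmark harness; `ns_per_call` and `render_request` are hypothetical, and the toy workload merely stands in for request serialization.

```rust
use std::time::Instant;

/// Average wall-clock nanoseconds per call of `f` over `iters` calls.
pub fn ns_per_call<F: FnMut()>(iters: u32, mut f: F) -> f64 {
    assert!(iters > 0, "iters must be positive");
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Toy workload (assumption): hand-rendered JSON for two fields.
pub fn render_request(model: &str, top_k: i32) -> String {
    format!("{{\"model\":\"{model}\",\"top_k\":{top_k}}}")
}
```

A call such as `ns_per_call(100_000, || { std::hint::black_box(render_request("m", 5)); })` gives a rough average; a dedicated benchmarking harness with warm-up and outlier handling is more reliable than a hand-rolled loop like this one.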

@slin1237 slin1237 requested a review from ByronHsu as a code owner August 5, 2025 18:58

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to complete the SGLang Router's OpenAI API specification by incorporating a suite of SGLang-specific extensions. The primary goal is to bridge compatibility gaps and provide richer functionality for advanced use cases, such as fine-tuned control over sampling, structured output generation, and integration with specialized model features like LoRA and O1-style reasoning. This expansion ensures that the SGLang Router can fully leverage the capabilities of SGLang models while maintaining seamless integration with existing OpenAI API clients. Additionally, I've taken the opportunity to improve the maintainability of our test suite by reducing redundant code.

Highlights

  • Enhanced Sampling Parameters: I've added comprehensive SGLang-specific sampling parameters to both Chat Completions and Completions APIs, including top_k, min_p, min_tokens, and repetition_penalty.
  • Structured Generation Capabilities: New structured generation constraints like regex, ebnf, and json_schema (for Completions) are now supported, enabling more controlled and predictable model outputs.
  • Fine-Grained Generation Control: Advanced generation control options such as stop_token_ids, no_stop_trim, ignore_eos, continue_final_message, and skip_special_tokens have been integrated.
  • LoRA and Model Customization: The router now supports LoRA adapters via lora_path and includes session_params for session management, along with the ability to return hidden_states.
  • Reasoning Model Integration: Support for O1-style reasoning models has been added with separate_reasoning, stream_reasoning, and reasoning_content fields.
  • Test Code Refactoring and Cleanup: I've significantly cleaned up test code across benchmark_integration.rs, request_processing.rs, and pd_types.rs by introducing helper functions and refactoring existing tests, reducing boilerplate by approximately 850 lines.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request does a great job of extending the OpenAI API types with SGLang-specific features and refactoring the test suite to be more maintainable by removing significant boilerplate. The changes are well-structured and the intent is clear.

However, I've identified a critical issue: while the new API fields are correctly defined in the request structs, they are not being passed through the to_pd_request adapter functions. This means the new parameters would be dropped and ignored by the backend. I've provided suggestions to fix this for each of the affected request types. I also included a minor suggestion to improve the conciseness of one of the new helper functions.
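
To make the concern concrete: a field can deserialize correctly into the request struct and still never reach the backend if the adapter does not copy it. The sketch below is a std-only illustration with hypothetical names (`CompletionRequest` here is a trimmed stand-in, and `to_backend_payload` stands in for the role of `to_pd_request`); it is not the router's code.

```rust
use std::collections::BTreeMap;

/// Trimmed, hypothetical request type for this sketch.
#[derive(Debug, Default)]
pub struct CompletionRequest {
    pub prompt: String,
    pub top_k: Option<i32>,
    pub regex: Option<String>,
    pub ignore_eos: bool,
}

/// Stand-in adapter: every new field must be forwarded explicitly,
/// otherwise it is silently dropped before reaching the backend.
pub fn to_backend_payload(req: &CompletionRequest) -> BTreeMap<&'static str, String> {
    let mut out = BTreeMap::new();
    out.insert("text", req.prompt.clone());
    if let Some(k) = req.top_k {
        out.insert("top_k", k.to_string());
    }
    if let Some(r) = &req.regex {
        out.insert("regex", r.clone());
    }
    if req.ignore_eos {
        out.insert("ignore_eos", "true".to_string());
    }
    out
}
```

A test that sets each new field and asserts it appears in the adapter output would catch this class of omission.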

@slin1237 slin1237 merged commit 5d62b56 into main Aug 6, 2025
23 of 24 checks passed
@slin1237 slin1237 deleted the slin/spec branch August 6, 2025 01:30

@81549361
@slin1237
After starting the router, the following parameter does not take effect:

"chat_template_kwargs": {"enable_thinking": false}

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025