Supports direct retrieval of tokenizer content and chat template content from the worker. #12513
Conversation
Signed-off-by: ybyang <[email protected]>
Summary of Changes
Hello @whybeyoung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the SGLang router's flexibility by enabling it to retrieve tokenizer and chat template configurations directly from the worker via a new gRPC service. This eliminates the previous requirement for local tokenizer file paths, simplifying deployment in distributed environments and improving operational resilience. The router now includes a fallback mechanism to fetch these assets from a worker if local paths are not specified, alongside improvements to handle large tokenizer data and more robust chat template rendering.
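To make the described fallback concrete, here is a minimal Rust sketch of the flow. All identifiers (`RouterConfig`, `fetch_tokenizer_files`, `resolve_tokenizer_source`) are illustrative assumptions, not the PR's actual names: prefer a locally configured tokenizer path; otherwise pick a gRPC worker, fetch the tokenizer files from it, and write them to disk before loading.

```rust
use std::collections::HashMap;

// Hypothetical slice of the router configuration used in this sketch.
struct RouterConfig {
    tokenizer_path: Option<String>,
    worker_urls: Vec<String>,
}

// Placeholder for the new RPC call; a real implementation would use the
// generated gRPC client to request tokenizer.json, tokenizer_config.json, etc.
fn fetch_tokenizer_files(_worker_url: &str) -> Result<HashMap<String, Vec<u8>>, String> {
    Ok(HashMap::new())
}

fn resolve_tokenizer_source(config: &RouterConfig) -> Result<String, String> {
    if let Some(path) = &config.tokenizer_path {
        // A local path is configured: use it directly, as before.
        return Ok(path.clone());
    }
    // Fallback: find a gRPC worker and fetch the tokenizer content from it.
    let worker = config
        .worker_urls
        .iter()
        .find(|u| u.starts_with("grpc://"))
        .ok_or_else(|| "no gRPC worker available for tokenizer fetch".to_string())?;
    let files = fetch_tokenizer_files(worker)?;
    // Write the fetched files to a directory and return that directory path,
    // so the existing path-based tokenizer loading code can be reused.
    let dir = std::env::temp_dir().join("router-tokenizer");
    std::fs::create_dir_all(&dir).map_err(|e| e.to_string())?;
    for (name, bytes) in files {
        std::fs::write(dir.join(name), bytes).map_err(|e| e.to_string())?;
    }
    Ok(dir.to_string_lossy().into_owned())
}
```

The actual PR wires this into router startup and performs the fetch over the new gRPC service; the sketch only shows the decision logic under those assumptions.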
Code Review
This pull request introduces a valuable feature allowing the sglang-router to fetch tokenizer information directly from the worker via a new gRPC endpoint. This is particularly useful for deployments where the router and worker do not share a filesystem. The implementation is solid, with changes correctly applied to both the Python worker and the Rust router, including new protobuf definitions and robust fallback logic. My review focuses on improving code clarity and maintainability by suggesting refactors to reduce code duplication and simplify some of the new logic. I've also pointed out a few minor style issues. Overall, this is a great addition to the project.
requested_files = list(request.requested_files) if request.requested_files else []
if not requested_files:
    # Default: read common tokenizer files
    requested_files = [
        "tokenizer.json",
        "tokenizer_config.json",
        "chat_template.jinja",
        "chat_template.json",
    ]

files = {}
base_dir = base_path

# Check if base_path is a file or directory
if os.path.isfile(base_path):
    base_dir = os.path.dirname(base_path)
    # If it's tokenizer.json, include it
    if os.path.basename(base_path) == "tokenizer.json":
        if "tokenizer.json" not in requested_files:
            requested_files.append("tokenizer.json")
Using a list for requested_files can lead to processing duplicate filenames if the client sends them, and membership checking is less efficient than with a set. It's more robust to use a set to automatically handle duplicates and provide faster lookups (O(1) average time complexity).
requested_files = set(request.requested_files)
if not requested_files:
    # Default: read common tokenizer files
    requested_files = {
        "tokenizer.json",
        "tokenizer_config.json",
        "chat_template.jinja",
        "chat_template.json",
    }

files = {}
base_dir = base_path

# Check if base_path is a file or directory
if os.path.isfile(base_path):
    base_dir = os.path.dirname(base_path)
    # If it's tokenizer.json, include it
    if os.path.basename(base_path) == "tokenizer.json":
        requested_files.add("tokenizer.json")
let worker_url = match &config.mode {
    crate::config::types::RoutingMode::Regular { worker_urls }
    | crate::config::types::RoutingMode::OpenAI { worker_urls } => {
        worker_urls
            .iter()
            .find(|url| url.starts_with("grpc://"))
            .ok_or_else(|| {
                "No gRPC worker URL found. Cannot fetch tokenizer from worker".to_string()
            })?
    }
    crate::config::types::RoutingMode::PrefillDecode { prefill_urls, decode_urls, .. } => {
        // Try prefill workers first, then decode workers
        // prefill_urls is Vec<(String, Option<u16>)>, decode_urls is Vec<String>
        prefill_urls
            .iter()
            .map(|(url, _)| url)
            .chain(decode_urls.iter())
            .find(|url| url.starts_with("grpc://"))
            .ok_or_else(|| {
                "No gRPC worker URL found. Cannot fetch tokenizer from worker".to_string()
            })?
    }
};
The logic for finding a gRPC worker URL is duplicated for different routing modes. This can be simplified by first creating an iterator over all worker URLs and then searching for a gRPC URL. This refactoring improves code clarity and reduces repetition.
let worker_urls_iter: Box<dyn Iterator<Item = &String>> = match &config.mode {
    crate::config::types::RoutingMode::Regular { worker_urls }
    | crate::config::types::RoutingMode::OpenAI { worker_urls } => Box::new(worker_urls.iter()),
    crate::config::types::RoutingMode::PrefillDecode { prefill_urls, decode_urls, .. } => {
        Box::new(prefill_urls.iter().map(|(url, _)| url).chain(decode_urls.iter()))
    }
};

let worker_url = worker_urls_iter
    .find(|url| url.starts_with("grpc://"))
    .ok_or_else(|| {
        "No gRPC worker URL found. Cannot fetch tokenizer from worker".to_string()
    })?;
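The suggestion relies on boxing both iterators so the two match arms produce a single concrete type, `Box<dyn Iterator<Item = &String>>`. A tiny standalone illustration of that pattern, unrelated to the router types:

```rust
// Both branches must evaluate to the same type, so each concrete iterator
// is boxed behind `dyn Iterator` before being returned.
fn numbers(use_pairs: bool) -> Box<dyn Iterator<Item = u32>> {
    let pairs = vec![(1u32, 'a'), (2, 'b')];
    let plain = vec![3u32, 4];
    if use_pairs {
        Box::new(pairs.into_iter().map(|(n, _)| n))
    } else {
        Box::new(plain.into_iter())
    }
}

fn main() {
    assert_eq!(numbers(true).collect::<Vec<_>>(), vec![1, 2]);
    assert_eq!(numbers(false).collect::<Vec<_>>(), vec![3, 4]);
}
```

The cost is one heap allocation and dynamic dispatch, which is irrelevant for a one-off lookup performed at startup.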
/// Load chat template from tokenizer config JSON
pub fn load_chat_template_from_config(config_path: &str) -> Result<Option<String>> {
    use tracing::info;
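The hunk above only shows the new function's signature. As a rough sketch of what a loader with this signature typically does (an assumption for illustration, not the PR's implementation; the error type is simplified to `Box<dyn Error>` rather than the crate's own `Result` alias):

```rust
use serde_json::Value;
use std::fs;

// Parse tokenizer_config.json and return its "chat_template" field, if any.
pub fn load_chat_template_from_config(
    config_path: &str,
) -> Result<Option<String>, Box<dyn std::error::Error>> {
    let content = fs::read_to_string(config_path)?;
    let config: Value = serde_json::from_str(&content)?;
    Ok(config
        .get("chat_template")
        .and_then(|v| v.as_str())
        .map(|s| s.to_string()))
}
```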
    tokenizer_config_content: Option<&str>,
    chat_template_content: Option<&str>,
) -> Result<Arc<dyn traits::Tokenizer>> {
    use tempfile::TempDir;
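This signature suggests the tokenizer is built from in-memory contents by writing them into a temporary directory and reusing the existing path-based loading code. A minimal sketch of that materialization step, assuming the `tempfile` crate (the function name `materialize_tokenizer_files` is made up for illustration):

```rust
use std::fs;
use tempfile::TempDir;

// Write the fetched contents to files in a fresh temporary directory so the
// existing file-based tokenizer loader can be pointed at dir.path().
fn materialize_tokenizer_files(
    tokenizer_content: &str,
    tokenizer_config_content: Option<&str>,
    chat_template_content: Option<&str>,
) -> std::io::Result<TempDir> {
    let dir = TempDir::new()?;
    fs::write(dir.path().join("tokenizer.json"), tokenizer_content)?;
    if let Some(cfg) = tokenizer_config_content {
        fs::write(dir.path().join("tokenizer_config.json"), cfg)?;
    }
    if let Some(tpl) = chat_template_content {
        fs::write(dir.path().join("chat_template.jinja"), tpl)?;
    }
    Ok(dir)
}
```

Returning the `TempDir` keeps the directory alive for as long as the caller needs it; dropping it removes the files.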
let final_chat_template = chat_template_path.or_else(|| {
    let config_path = temp_path.join("tokenizer_config.json");
    if config_path.exists() {
        // Try to extract chat_template field from tokenizer_config.json
        if let Ok(content) = fs::read_to_string(&config_path) {
            if let Ok(config_json) = serde_json::from_str::<serde_json::Value>(&content) {
                if let Some(template_str) = config_json.get("chat_template").and_then(|v| v.as_str()) {
                    // Write extracted template to a separate file
                    let extracted_path = temp_path.join("chat_template.jinja");
                    if fs::write(&extracted_path, template_str).is_ok() {
                        info!(
                            config_path = ?config_path,
                            extracted_path = ?extracted_path,
                            temp_dir = ?temp_path,
                            "Extracted chat template from tokenizer_config.json and saved to temporary file"
                        );
                        extracted_path.to_str().map(|s| s.to_string())
                    } else {
                        None
                    }
                } else {
                    None
                }
            } else {
                None
            }
        } else {
            None
        }
    } else {
        None
    }
});
The logic for discovering and extracting the chat template from tokenizer_config.json is deeply nested, which can make it hard to read and maintain. This can be flattened using the ? operator within the closure, which works with Option and makes the code more linear and easier to follow.
let final_chat_template = chat_template_path.or_else(|| {
    let config_path = temp_path.join("tokenizer_config.json");
    if !config_path.exists() {
        return None;
    }

    let content = fs::read_to_string(&config_path).ok()?;
    let config_json: serde_json::Value = serde_json::from_str(&content).ok()?;
    let template_str = config_json.get("chat_template")?.as_str()?;

    let extracted_path = temp_path.join("chat_template.jinja");
    if fs::write(&extracted_path, template_str).is_ok() {
        info!(
            config_path = ?config_path,
            extracted_path = ?extracted_path,
            temp_dir = ?temp_path,
            "Extracted chat template from tokenizer_config.json and saved to temporary file"
        );
        extracted_path.to_str().map(|s| s.to_string())
    } else {
        None
    }
});
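For readers less familiar with the pattern, a tiny self-contained example of using `?` on `Option` inside a closure (it only needs `serde_json` and is unrelated to the router code):

```rust
fn main() {
    let raw = r#"{"chat_template": "{{ messages }}"}"#;
    // The closure returns Option<String>; each `?` short-circuits to None
    // instead of adding another level of nested `if let`.
    let template: Option<String> = (|| {
        let config: serde_json::Value = serde_json::from_str(raw).ok()?;
        Some(config.get("chat_template")?.as_str()?.to_string())
    })();
    assert_eq!(template.as_deref(), Some("{{ messages }}"));
}
```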
Usage:
When the router is started without a locally specified tokenizer path or chat template, it falls back to retrieving the content from the worker.
./sglang-router \
    --worker-urls=grpc://127.0.0.1:8000