Add ChatCompletionRequest-style support to /v1/tokenize #23981
ishandhanani merged 4 commits into sgl-project:main from
Conversation
/tag-and-rerun-ci
Great job! We are using this API with the dynamo kvindexer for precise cache-aware routing.
Nice, this is helpful for cache-aware routing in general. On the sglang side, we also have KV cache event emission that can work together with this tokenize API for prefix-matching based routing. Good to see the integration with dynamo kvindexer moving forward!
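To make the routing idea concrete, here is an illustrative sketch (not dynamo kvindexer or sglang code; all names are hypothetical) of prefix-matching based routing: given the token IDs returned by the tokenize API and each worker's cached token sequences, send the request to the worker with the longest shared prefix.

```python
# Hypothetical sketch of prefix-matching routing; not actual router code.

def shared_prefix_len(a, b):
    """Length of the common prefix of two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(request_tokens, worker_caches):
    """worker_caches maps worker_id -> list of cached token-ID sequences.

    Returns the worker with the longest prefix match and that match length.
    """
    best_worker, best_len = None, -1
    for worker, seqs in worker_caches.items():
        match = max((shared_prefix_len(request_tokens, s) for s in seqs),
                    default=0)
        if match > best_len:
            best_worker, best_len = worker, match
    return best_worker, best_len

caches = {"w0": [[1, 2, 3, 4]], "w1": [[1, 2, 9]]}
print(pick_worker([1, 2, 3, 7], caches))  # ('w0', 3)
```

In practice the cached sequences would come from KV cache events rather than a static dict, but the longest-prefix selection is the core of the routing decision.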
ShangmingCai
left a comment
Looks good.
CC: @CatherineSue, do you have time to review this PR?
LGTM!
I don't see any CI failures from this PR. OK if I merge, @ShangmingCai?
/rerun-failed-ci
Should we add a test for this API?
/rerun-failed-ci

(4 similar comments)
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
Motivation
/v1/tokenize previously only accepted raw string prompts, which made it difficult to inspect the actual token sequence used by /v1/chat/completions. To build a cache-aware system on top of KV events, we need the actual token sequence that results from rendering the model's chat template.
This change allows /v1/tokenize to accept ChatCompletion-style messages input and return token IDs consistent with the chat completion path.

Qwen3.5-9B
Without tools
With tools
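As a rough illustration of the new input shape, the sketch below builds a ChatCompletionRequest-style payload for /v1/tokenize. The exact field names in the response (and the example model name and port) are assumptions for illustration, not taken from the PR diff.

```python
# Hypothetical sketch: request shape for the messages-based /v1/tokenize.
import json

def build_tokenize_request(messages, model):
    """Build a ChatCompletionRequest-style payload for /v1/tokenize."""
    return {"model": model, "messages": messages}

payload = build_tokenize_request(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    model="Qwen/Qwen2.5-7B-Instruct",  # example model, an assumption
)
body = json.dumps(payload)

# The payload would then be POSTed to the running server, e.g.:
#   resp = requests.post("http://localhost:30000/v1/tokenize", json=payload)
# and the response would carry the token IDs produced by rendering the
# chat template (response field names are assumptions here).
```

Because the server renders the chat template itself, the returned token IDs should match what /v1/chat/completions would actually prefill, which is what the KV-event-based routing above needs.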
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci