Add ChatCompletionRequest-style support to /v1/tokenize #23981

Merged
ishandhanani merged 4 commits into sgl-project:main from antgroup:ChatCompletionRequest_tokenizer
May 7, 2026

@huangtingwei9988
Collaborator

huangtingwei9988 commented Apr 29, 2026

Motivation

/v1/tokenize previously accepted only raw string prompts, which made it difficult to inspect the actual token sequence used by /v1/chat/completions.

To build a cache-aware system on top of KV events, we need the actual token sequence produced by rendering the model's chat template.

This change lets /v1/tokenize accept ChatCompletion-style messages input and return token IDs consistent with the chat completion path.

Qwen3.5-9B

Without tools

curl -sS http://127.0.0.1:31080/v1/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen-tokenize-test","messages":[{"role":"system","content":"You are concise."},{"role":"user","content":"Hello world"}],"max_tokens":1}'

{"tokens":[248045,8678,198,2523,513,61446,13,248046,198,248045,846,198,9419,1814,248046,198,248045,74455,198,248068,198],"count":21,"max_model_len":262144}
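
The response above carries three fields: `tokens`, `count`, and `max_model_len`. As a minimal sketch of how a caller might sanity-check that shape, the snippet below parses the response body copied verbatim from this example. The helper name `check_tokenize_response` is hypothetical and is not part of SGLang.

```python
import json

# Response body copied from the "Without tools" example above.
RAW = (
    '{"tokens":[248045,8678,198,2523,513,61446,13,248046,198,248045,846,198,'
    '9419,1814,248046,198,248045,74455,198,248068,198],'
    '"count":21,"max_model_len":262144}'
)


def check_tokenize_response(body: dict) -> int:
    """Validate the response shape and return the remaining context budget.

    Hypothetical helper for illustration only; not an SGLang API.
    """
    tokens = body["tokens"]
    if not all(isinstance(t, int) for t in tokens):
        raise ValueError("tokens must be a list of integer IDs")
    if body["count"] != len(tokens):
        raise ValueError("count does not match the number of token IDs")
    # Tokens still available before the prompt reaches max_model_len.
    return body["max_model_len"] - body["count"]


remaining = check_tokenize_response(json.loads(RAW))
print(remaining)  # 262123
```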

With tools

curl -sS http://127.0.0.1:31080/v1/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3___5-9B","messages":[{"role":"user","content":"What is the weather in Paris?"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather for a city.","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}]}'

{"tokens":[248045,8678,198,2,13455,271,2523,599,2528,310,279,2614,5568,25,271,27,15449,29,198,4754,1267,763,328,1628,487,328,1628,763,5046,4532,763,328,1882,8831,364,264,3177,10152,328,591,763,328,447,67017,487,328,13390,763,5046,1267,763,328,1640,487,328,12811,763,5046,8656,763,5046,1267,763,328,889,8934,2069,328,6081,763,4241,8656,1293,2069,328,6418,763,867,2069,328,60003,55791,763,819,92,198,510,15449,29,271,2592,488,4992,310,1562,264,709,25835,9559,303,279,2614,3443,440,5486,19900,25,271,248058,198,27,1628,28,8422,8901,1224,29,198,27,15704,28,8422,24109,62,16,29,198,927,62,16,198,510,15704,29,198,27,15704,28,8422,24109,62,17,29,198,1919,369,279,869,364,279,2018,5555,198,8761,628,9111,198,34493,4965,198,510,15704,29,198,510,1628,29,198,248059,271,27,95328,29,198,92065,25,198,12,5534,6526,26834,1732,279,5024,3443,25,449,8906,361,1628,28,1076,1419,1628,29,2424,1902,381,23283,2785,220,248058,248059,11535,9212,198,12,12296,4868,26834,381,5024,198,12,1394,1189,3300,9801,31626,364,678,709,1562,303,5629,3992,54588,279,709,1562,11,694,4045,1238,198,12,1368,1017,369,874,709,1562,2420,11,4087,279,3296,1040,4472,440,678,1428,6337,321,635,524,3184,279,1156,883,709,6526,198,510,95328,29,248046,198,248045,846,198,3710,369,279,8831,303,11751,30,248046,198,248045,74455,198,248068,198],"count":285,"max_model_len":262144}
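
The two curl calls above could also be issued from Python. The following is a rough sketch using only the standard library; the function names `build_tokenize_payload` and `post_tokenize` are assumptions made for illustration, and the host, port, model name, and tool definition are copied from the "With tools" request. The network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request
from typing import Optional


def build_tokenize_payload(model: str, messages: list, tools: Optional[list] = None) -> dict:
    """Build a ChatCompletion-style body for /v1/tokenize (hypothetical helper)."""
    payload = {"model": model, "messages": messages}
    if tools:
        # Tools are rendered into the chat template, so they change the token IDs.
        payload["tools"] = tools
    return payload


def post_tokenize(base_url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST the payload to /v1/tokenize and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/tokenize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Tool definition copied from the "With tools" curl request.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_tokenize_payload(
    "Qwen3___5-9B",
    [{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[weather_tool],
)

# Requires a running server; the address is the one used in the curl examples.
# result = post_tokenize("http://127.0.0.1:31080", payload)
# print(result["count"])
```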

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@huangtingwei9988
Collaborator Author

/tag-and-rerun-ci

@doujiang24
Contributor

doujiang24 commented Apr 29, 2026

Great job, we are using this API to work with dynamo kvindexer for precise cache-aware routing.
cc @ishandhanani @ShangmingCai @stmatengss

@stmatengss
Collaborator

> Great job, we are using this API to work with dynamo kvindexer for precise cache-aware routing. cc @ishandhanani @ShangmingCai @stmatengss

Nice, this is helpful for cache-aware routing in general. On the sglang side, we also have KV cache event emission that can work together with this tokenize API for prefix-matching based routing. Good to see the integration with dynamo kvindexer moving forward!
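
To make the prefix-matching idea concrete, here is a rough sketch of how a router might combine token IDs from /v1/tokenize with a per-worker record of cached blocks. Everything in it is an assumption for illustration: the block size, the chained SHA-256 hashing, and the names `block_hashes`, `matched_blocks`, and `pick_worker` are hypothetical, and this is not the SGLang KV event format or the dynamo kvindexer implementation.

```python
import hashlib
from typing import Dict, List, Set

BLOCK_SIZE = 4  # hypothetical; a real router would use the engine's KV block size


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chained hashes over full blocks; each hash covers the whole prefix so far."""
    hashes: List[str] = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + ",".join(map(str, block)).encode()).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes


def matched_blocks(request_hashes: List[str], cached: Set[str]) -> int:
    """Count leading blocks of the request that the worker already holds."""
    count = 0
    for block_hash in request_hashes:
        if block_hash not in cached:
            break
        count += 1
    return count


def pick_worker(tokens: List[int], worker_caches: Dict[str, Set[str]]) -> str:
    """Choose the worker with the longest cached prefix (first worker wins ties)."""
    hashes = block_hashes(tokens)
    return max(worker_caches, key=lambda w: matched_blocks(hashes, worker_caches[w]))


# Token IDs from the "Without tools" response in the PR description.
request = [248045, 8678, 198, 2523, 513, 61446, 13, 248046, 198, 248045, 846,
           198, 9419, 1814, 248046, 198, 248045, 74455, 198, 248068, 198]

# Pretend worker-a already cached the first 8 tokens; worker-b cached nothing.
worker_caches = {
    "worker-a": set(block_hashes(request[:8])),
    "worker-b": set(),
}

chosen = pick_worker(request, worker_caches)
print(chosen)  # worker-a
```

Chaining each block hash to its parent means a block only matches when the entire preceding prefix matches too, which is why the token IDs fed to the router need to be the ones the chat completion path actually produces.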

Collaborator

@ShangmingCai ShangmingCai left a comment

Looks good.

CC: @CatherineSue, do you have time to review this PR?

@ishandhanani
Collaborator

LGTM!

@ishandhanani
Collaborator

I don't see any CI failures caused by this PR. OK if I merge, @ShangmingCai?

@huangtingwei9988
Collaborator Author

/rerun-failed-ci

@ShangmingCai
Collaborator

Should we add a test for this API?

@huangtingwei9988
Collaborator Author

/rerun-failed-ci

@huangtingwei9988
Collaborator Author

/rerun-failed-ci

@ishandhanani
Collaborator

/rerun-failed-ci

@huangtingwei9988
Collaborator Author

/rerun-failed-ci

@huangtingwei9988
Collaborator Author

/rerun-failed-ci

@ishandhanani ishandhanani merged commit 27445f9 into sgl-project:main May 7, 2026
364 of 417 checks passed
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
6 participants