Refactor: Replace transformers with vLLM by hyeongyun0916 · Pull Request #234 · llm-d/llm-d-kv-cache

hyeongyun0916 · 2025-12-27T12:22:33Z

This PR refactors the tokenization system to use vLLM's tokenizer wrapper instead of the transformers library.

https://llm-d.slack.com/archives/C0A0SU5J68Y/p1764153758005369

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 · 2025-12-27T12:22:54Z

 			log.Log.Error(err, "failed to render chat template")
 			return err
 		}
+		addSpecialToken = false


In vLLM, add_special_tokens defaults to True for completion requests and to False for chat completion requests.

Can you add the reference in a comment?

Copilot

Pull request overview

This PR refactors the tokenization system to replace the transformers library with vLLM's tokenizer, streamlining the dependency chain and aligning with vLLM's tokenization approach.

Key Changes:

Replaced transformers library with vLLM for tokenization and chat template rendering
Renamed core functions and structs for consistency (RenderJinjaTemplate → ApplyChatTemplate, ChatMessage → Conversation, FetchChatTemplate → LoadTokenizerWithCache)
Added addSpecialToken parameter to Encode methods to control special token handling, with logic to disable it when chat templates are applied (as they already include special tokens)

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.

File	Description
pkg/preprocessing/chat_completions/tokenizer_wrapper.py	New Python wrapper using vLLM's get_tokenizer instead of transformers
pkg/preprocessing/chat_completions/render_jinja_template_wrapper.py	Removed old transformers-based wrapper
pkg/preprocessing/chat_completions/requirements.txt	Updated dependencies to use vllm-cpu instead of transformers, torch, and jinja2
pkg/preprocessing/chat_completions/cgo_functions.h	Updated function signatures for renamed functions (load_tokenizer_with_cache, apply_chat_template)
pkg/preprocessing/chat_completions/cgo_functions.c	Implemented new C functions with updated naming and bool return type for LoadTokenizerWithCache
pkg/preprocessing/chat_completions/cgo_functions.go	Updated Go structs and functions: ApplyChatTemplateRequest, LoadTokenizerWithCacheRequest, and corresponding methods
pkg/preprocessing/chat_completions/cgo_functions_test.go	Updated all test cases to use new function names and struct definitions
pkg/tokenization/tokenizer.go	Added HFCachedTokenizer and LocalCachedTokenizer types, updated Encode interface with addSpecialToken parameter, refactored ApplyChatTemplate implementations
pkg/tokenization/tokenizer_test.go	Updated test mocks and calls to match new Encode signature
pkg/tokenization/uds_tokenizer.go	Updated Encode signature and renamed RenderChatTemplate to ApplyChatTemplate
pkg/tokenization/pool.go	Updated Task struct and processTask logic to handle addSpecialToken parameter
pkg/tokenization/pool_test.go	Updated mock implementations and test expectations for new signatures
pkg/kvcache/indexer.go	Updated GetPodScores signature to use ApplyChatTemplateRequest
tests/e2e/redis_mock/e2e_test.go	Updated all test cases to use new function names and added addSpecialToken parameter throughout
tests/e2e/redis_mock/e2e_suite_test.go	Updated promptToEngineAndRequestKeys helper to accept addSpecialToken parameter
examples/testdata/data.go	Updated RenderReq type to ApplyChatTemplateRequest
examples/kv_events/online/main.go	Simplified chat completions endpoint by removing FetchChatTemplate call and using ApplyChatTemplate directly
go.mod	Moved go.uber.org/zap from indirect to direct dependency


hyeongyun0916 · 2025-12-27T16:29:40Z

-// BenchmarkRenderJinjaTemplate benchmarks the template rendering performance.
-func BenchmarkRenderJinjaTemplate(b *testing.B) {
+// BenchmarkApplyChatTemplate benchmarks the template rendering performance.
+func BenchmarkApplyChatTemplate(b *testing.B) {


It would probably be best to run a benchmark before and after the change to confirm there’s no regression.

Are you referring to a benchmark like the one below? @sagearc

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkApplyChatTemplate$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkApplyChatTemplate
BenchmarkApplyChatTemplate-96    4382    241736 ns/op    241602 ns/op_overall    105127 ns/op_warm    1053 B/op    9 allocs/op
PASS
ok    github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions    15.210s

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkRenderJinjaTemplate$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkRenderJinjaTemplate
[C] Py_InitializeGo - Already initialized in this process (PID: 1550466)
[C] Py_InitChatTemplateModule - Already initialized globally, returning
BenchmarkRenderJinjaTemplate-96    8854    254960 ns/op    254672 ns/op_overall    254645 ns/op_warm    15180 B/op    31 allocs/op
PASS
ok    github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions    7.982s

hyeongyun0916 · 2025-12-27T17:49:23Z

Renamed core functions and structs for consistency (RenderJinjaTemplate → ApplyChatTemplate, ChatMessage → Conversation, FetchChatTemplate → LoadTokenizerWithCache)

Removed FetchChatTemplate (no longer needed).
Added LoadTokenizerWithCache using vLLM's get_tokenizer.

vMaroon · 2026-01-01T13:44:02Z

-	ContinueFinalMessage      bool                   `json:"continue_final_message,omitempty"`
-	AddGenerationPrompt       bool                   `json:"add_generation_prompt,omitempty"`
-	ChatTemplateKWArgs        map[string]interface{} `json:"chat_template_kwargs,omitempty"`
+	LoadTokenizerWithCacheRequest LoadTokenizerWithCacheRequest `json:"load_tokenizer_with_cache_request,omitempty"`


I think we should separate between the tokenizer loading and the processing of a request.

It is true that tokenizer loading was previously lazy, but since LoRA support was added, #192 changed the logic so that the model/tokenizer info is required at startup.

In the current implementation, the tokenizer is already being loaded into the cache during the initialization phase. The request is essentially used as a key to retrieve the pre-loaded tokenizer from the cache.

While it would undergo a re-initialization process if a different model is requested, are you suggesting that we should treat such cases as an error instead?

We are dropping support for multiple models because it is not an actual use case in the current llm-d design: the indexer is bound to one EPP, and an EPP serves one base model.

When it comes to LoRAs, the request would have the LoRA name as the target model. If we kept dynamic/lazy tokenizer loading, we would need to check for every request whether it targets the base model or not. That is why #192 removed the loading of "missing" tokenizers.

To your question: it would not be treated as an error, but the only loaded tokenizer would be used regardless of the model name.

I agree. That’s why in LoadTokenizerWithCacheRequest, the model is derived from the configuration rather than the request.

However, if this still feels a bit ambiguous, I can try refactoring it to make the separation even more explicit. Should I go ahead with that?

Yes, thanks

vMaroon · 2026-01-07T13:40:02Z

I think this is good to go after resolving conflicts.

cc @delavet @osswangxining the changes within the UDS package are for linting only - assuming it is fine.

sagearc

This looks very good. Just a few minor comments in the review. Thanks @hyeongyun0916!

sagearc · 2026-01-07T16:01:19Z

 			log.Log.Error(err, "failed to render chat template")
 			return err
 		}
+		addSpecialToken = false


Can you add the reference in a comment?

sagearc · 2026-01-07T16:12:14Z

+        # Parse the JSON request
+        request = json.loads(request_json)
+        key = request.pop("key")
+        print("mhg", key, flush=True)


debug print?

Oh, I'm sorry. I'll do a quick sweep of the entire PR to make sure everything else is clean.

sagearc · 2026-01-07T16:16:24Z

+--index-url https://download.pytorch.org/whl/cpu
+--extra-index-url https://pypi.org/simple
+vllm-cpu>=0.11.0; sys_platform != 'darwin'
+vllm @ git+https://github.com/vllm-project/vllm.git@v0.11.0; sys_platform == 'darwin'


I couldn't find any references to vllm-cpu in the vLLM repo or docs, so I assume it is not an official package distribution. Maybe it would be safer to install vLLM from source with CPU flags?

Following your feedback, I'll replace the vllm-cpu dependency with a setup.sh script that builds from source. Let me know if you have any other suggestions.

I've added the setup.sh script to build from source as we discussed. It definitely takes more time to build compared to a simple pip install, but I agree it's a much safer approach given the package distribution issues.

Revisiting this: while integrating this PR into the inference-scheduler build, I found that a plain list of requirements is much easier to work with. I know we'll eventually drop the embeddings, but it doesn't seem like that is making it into v0.5.

hyeongyun0916 · 2026-01-08T15:36:53Z

I see that TestInstrumentedIndexBehavior/ConcurrentOperations failed, but it seems unrelated to the changes in this PR. It looks like an existing issue that should be addressed separately.

vMaroon · 2026-01-09T19:38:02Z

Leaving LGTM to @sagearc

sagearc · 2026-01-12T10:06:45Z

Looks good to me, thanks @hyeongyun0916!

sagearc · 2026-01-12T13:13:38Z

/lgtm

vMaroon · 2026-01-12T13:14:21Z

/approve

apply chat template

be99da0

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Copilot AI review requested due to automatic review settings December 27, 2025 12:22

hyeongyun0916 requested review from dannyharnik, elevran, kfirtoledo and vMaroon as code owners December 27, 2025 12:22

hyeongyun0916 commented Dec 27, 2025

Copilot started reviewing on behalf of hyeongyun0916 December 27, 2025 12:22

Copilot AI reviewed Dec 27, 2025

hyeongyun0916 added 2 commits December 27, 2025 13:03

lint

74289d8

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

apply copilot review

d494e14

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 mentioned this pull request Dec 27, 2025

Refactor: Replace daulet/tokenizers with vLLM #221

Closed

hyeongyun0916 commented Dec 27, 2025

add example_usage

a69810e

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 commented Dec 27, 2025

Comment thread pkg/preprocessing/chat_completions/requirements.txt Outdated

vMaroon reviewed Jan 1, 2026

Comment thread pkg/preprocessing/chat_completions/cgo_functions.go Outdated

vMaroon reviewed Jan 1, 2026

hyeongyun0916 and others added 3 commits January 3, 2026 01:47

Update pkg/preprocessing/chat_completions/requirements.txt

2fa2043

Co-authored-by: Edoardo Vacchi <evacchi@users.noreply.github.com> Signed-off-by: Hyunkyun Moon <mhg5303@gmail.com>

Merge commit 'd7fd2183a9e9e1fc97c392de6764c8fabd54e4a4' into vllm-app…

2644725

…lychattemplate Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

request with key

029be09

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 requested a review from delavet as a code owner January 5, 2026 06:31

vMaroon requested review from liu-cong and yankay January 5, 2026 06:31

edit data

e7b315c

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 requested a review from vMaroon January 5, 2026 11:25

hyeongyun0916 mentioned this pull request Jan 6, 2026

fix(cgo): resolve CStrings memory leaks #249

Merged

resolve CStrings memory leaks

c4161aa

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Merge commit 'b5dd010b535f13f132552b58b72397ae55369b5f' into vllm-app…

a6c021d

…lychattemplate Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

sagearc suggested changes Jan 7, 2026

sagearc suggested changes Jan 7, 2026

hyeongyun0916 added 3 commits January 8, 2026 13:42

apply review

7481fc6

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

edit

3c23b59

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

edit

fff3be7

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 closed this Jan 8, 2026

hyeongyun0916 reopened this Jan 8, 2026

add path

1e462fb

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 force-pushed the vllm-applychattemplate branch from e14bc41 to 1e462fb Compare January 8, 2026 15:13

hyeongyun0916 closed this Jan 8, 2026

hyeongyun0916 reopened this Jan 8, 2026

github-actions Bot added the lgtm Looks good to me, indicates that a PR is ready to be merged. label Jan 12, 2026

github-actions Bot approved these changes Jan 12, 2026

vMaroon merged commit a8ca9ba into llm-d:main Jan 12, 2026
9 of 10 checks passed

hhk7734 deleted the vllm-applychattemplate branch January 16, 2026 02:43

sagearc mentioned this pull request Jan 19, 2026

refactor: kv cache manager repo llm-d/llm-d-router#570

Merged

hhk7734 mentioned this pull request Apr 8, 2026

Add Moreh as a contributor to the adopters list llm-d/llm-d#1111

Merged

guygir pushed a commit to guygir/llm-d-kv-cache-manager that referenced this pull request Apr 20, 2026

ignore score in scheduler unit test (llm-d#234)

a963d95
