Refactor: Replace transformers with vLLM #234

Merged: vMaroon merged 14 commits into llm-d:main from moreh-dev:vllm-applychattemplate on Jan 12, 2026

Conversation

@hyeongyun0916
Collaborator

This PR refactors the tokenization system to use vLLM's tokenizer wrapper instead of the transformers library.

https://llm-d.slack.com/archives/C0A0SU5J68Y/p1764153758005369

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Copilot AI review requested due to automatic review settings December 27, 2025 12:22
Comment thread pkg/tokenization/pool.go
log.Log.Error(err, "failed to render chat template")
return err
}
addSpecialToken = false
Collaborator Author

Collaborator

Can you add the reference in a comment?

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the tokenization system to replace the transformers library with vLLM's tokenizer, streamlining the dependency chain and aligning with vLLM's tokenization approach.

Key Changes:

  • Replaced transformers library with vLLM for tokenization and chat template rendering
  • Renamed core functions and structs for consistency (RenderJinjaTemplate → ApplyChatTemplate, ChatMessage → Conversation, FetchChatTemplate → LoadTokenizerWithCache)
  • Added addSpecialToken parameter to Encode methods to control special token handling, with logic to disable it when chat templates are applied (as they already include special tokens)
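
To illustrate the last point, here is a minimal Go sketch of why addSpecialToken is switched off once a chat template has been applied. Everything in it is a hypothetical stand-in for illustration (the `bos` marker, `encode`, `applyChatTemplate`, `tokenize`); it is not the code in pkg/tokenization, and it assumes the template emits the beginning-of-sequence token itself.

```go
package main

import (
	"fmt"
	"strings"
)

// bos is a hypothetical beginning-of-sequence marker used only in this sketch.
const bos = "<s>"

// encode stands in for a tokenizer's Encode method. It splits on whitespace
// and prepends bos when addSpecialToken is true.
func encode(input string, addSpecialToken bool) []string {
	tokens := strings.Fields(input)
	if addSpecialToken {
		tokens = append([]string{bos}, tokens...)
	}
	return tokens
}

// applyChatTemplate stands in for chat template rendering. The assumption
// here is that the template already writes the special tokens into the prompt.
func applyChatTemplate(messages []string) string {
	return bos + " " + strings.Join(messages, " ")
}

// tokenize mirrors the decision described above: after a chat template has
// been applied, special tokens are not added a second time.
func tokenize(prompt string, messages []string) []string {
	addSpecialToken := true
	if len(messages) > 0 {
		prompt = applyChatTemplate(messages)
		addSpecialToken = false
	}
	return encode(prompt, addSpecialToken)
}

func main() {
	// Plain prompt: the tokenizer adds the special token.
	fmt.Println(tokenize("hello world", nil))
	// Chat request: the template added it, so the tokenizer must not.
	fmt.Println(tokenize("", []string{"hello", "world"}))
}
```

Both calls end up with exactly one leading marker. If addSpecialToken stayed true on the chat path, the marker would appear twice, which would change the token sequence that the KV-cache keys are derived from.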

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| pkg/preprocessing/chat_completions/tokenizer_wrapper.py | New Python wrapper using vLLM's get_tokenizer instead of transformers |
| pkg/preprocessing/chat_completions/render_jinja_template_wrapper.py | Removed old transformers-based wrapper |
| pkg/preprocessing/chat_completions/requirements.txt | Updated dependencies to use vllm-cpu instead of transformers, torch, and jinja2 |
| pkg/preprocessing/chat_completions/cgo_functions.h | Updated function signatures for renamed functions (load_tokenizer_with_cache, apply_chat_template) |
| pkg/preprocessing/chat_completions/cgo_functions.c | Implemented new C functions with updated naming and bool return type for LoadTokenizerWithCache |
| pkg/preprocessing/chat_completions/cgo_functions.go | Updated Go structs and functions: ApplyChatTemplateRequest, LoadTokenizerWithCacheRequest, and corresponding methods |
| pkg/preprocessing/chat_completions/cgo_functions_test.go | Updated all test cases to use new function names and struct definitions |
| pkg/tokenization/tokenizer.go | Added HFCachedTokenizer and LocalCachedTokenizer types, updated Encode interface with addSpecialToken parameter, refactored ApplyChatTemplate implementations |
| pkg/tokenization/tokenizer_test.go | Updated test mocks and calls to match new Encode signature |
| pkg/tokenization/uds_tokenizer.go | Updated Encode signature and renamed RenderChatTemplate to ApplyChatTemplate |
| pkg/tokenization/pool.go | Updated Task struct and processTask logic to handle addSpecialToken parameter |
| pkg/tokenization/pool_test.go | Updated mock implementations and test expectations for new signatures |
| pkg/kvcache/indexer.go | Updated GetPodScores signature to use ApplyChatTemplateRequest |
| tests/e2e/redis_mock/e2e_test.go | Updated all test cases to use new function names and added addSpecialToken parameter throughout |
| tests/e2e/redis_mock/e2e_suite_test.go | Updated promptToEngineAndRequestKeys helper to accept addSpecialToken parameter |
| examples/testdata/data.go | Updated RenderReq type to ApplyChatTemplateRequest |
| examples/kv_events/online/main.go | Simplified chat completions endpoint by removing FetchChatTemplate call and using ApplyChatTemplate directly |
| go.mod | Moved go.uber.org/zap from indirect to direct dependency |

Comment thread go.mod
Comment thread pkg/preprocessing/chat_completions/tokenizer_wrapper.py Outdated
Comment thread pkg/preprocessing/chat_completions/tokenizer_wrapper.py Outdated
Comment thread pkg/preprocessing/chat_completions/cgo_functions.go Outdated
Comment thread pkg/preprocessing/chat_completions/tokenizer_wrapper.py Outdated
Comment thread pkg/preprocessing/chat_completions/tokenizer_wrapper.py Outdated
Comment thread pkg/tokenization/tokenizer.go Outdated
Comment thread pkg/preprocessing/chat_completions/tokenizer_wrapper.py Outdated
Comment thread pkg/preprocessing/chat_completions/cgo_functions.c Outdated
Comment thread pkg/preprocessing/chat_completions/cgo_functions.c Outdated
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
// BenchmarkRenderJinjaTemplate benchmarks the template rendering performance.
func BenchmarkRenderJinjaTemplate(b *testing.B) {
// BenchmarkApplyChatTemplate benchmarks the template rendering performance.
func BenchmarkApplyChatTemplate(b *testing.B) {
Collaborator Author

It would probably be best to run a benchmark before and after the change to confirm there’s no regression.

Are you referring to a benchmark like the one below? @sagearc

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkApplyChatTemplate$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkApplyChatTemplate
BenchmarkApplyChatTemplate-96    	    4382	    241736 ns/op	    241602 ns/op_overall	    105127 ns/op_warm	    1053 B/op	       9 allocs/op
PASS
ok  	github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions	15.210s

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkRenderJinjaTemplate$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkRenderJinjaTemplate
[C] Py_InitializeGo - Already initialized in this process (PID: 1550466)
[C] Py_InitChatTemplateModule - Already initialized globally, returning
BenchmarkRenderJinjaTemplate-96    	    8854	    254960 ns/op	    254672 ns/op_overall	    254645 ns/op_warm	   15180 B/op	      31 allocs/op
PASS
ok  	github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions	7.982s
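
For readers wondering about the extra ns/op_overall and ns/op_warm columns in the output above: columns like these can be produced with `testing.B.ReportMetric`. The sketch below shows one way to do that. It is an assumption-laden illustration, not the benchmark in this repository: `render` is a hypothetical stand-in for the template call, and "warm" is assumed to mean "every iteration except the first", which this thread does not confirm.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"time"
)

// render is a hypothetical stand-in for the call being measured.
func render(messages []string) string {
	return strings.Join(messages, "\n")
}

// benchRender times every iteration ("overall") and, separately, every
// iteration after the first ("warm"), then reports both as custom metrics.
func benchRender(b *testing.B) {
	messages := []string{"system: be brief", "user: hello"}
	var overall, warm time.Duration

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		start := time.Now()
		_ = render(messages)
		elapsed := time.Since(start)

		overall += elapsed
		if i > 0 { // assumption: the first iteration is the cold one
			warm += elapsed
		}
	}

	b.ReportMetric(float64(overall.Nanoseconds())/float64(b.N), "ns/op_overall")
	if b.N > 1 {
		b.ReportMetric(float64(warm.Nanoseconds())/float64(b.N-1), "ns/op_warm")
	}
}

func main() {
	// testing.Init registers the testing flags when running outside "go test".
	testing.Init()
	res := testing.Benchmark(benchRender)
	fmt.Println(res.String(), res.MemString())
}
```

Reporting a warm figure separately keeps one-time costs, such as interpreter or tokenizer initialization, from hiding the steady-state cost per call when comparing before and after.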

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Comment thread pkg/preprocessing/chat_completions/requirements.txt Outdated
@hyeongyun0916
Collaborator Author

hyeongyun0916 commented Dec 27, 2025

  • Renamed core functions and structs for consistency (RenderJinjaTemplate → ApplyChatTemplate, ChatMessage → Conversation)
  • Removed FetchChatTemplate (no longer needed).
  • Added LoadTokenizerWithCache using vLLM's get_tokenizer.

Comment thread pkg/preprocessing/chat_completions/cgo_functions.go Outdated
ContinueFinalMessage bool `json:"continue_final_message,omitempty"`
AddGenerationPrompt bool `json:"add_generation_prompt,omitempty"`
ChatTemplateKWArgs map[string]interface{} `json:"chat_template_kwargs,omitempty"`
LoadTokenizerWithCacheRequest LoadTokenizerWithCacheRequest `json:"load_tokenizer_with_cache_request,omitempty"`
Member

I think we should separate tokenizer loading from the processing of a request.

It is true that tokenizer loading used to be lazy, but since LoRA support was added, #192 changed the logic so that the model/tokenizer info is required at startup.

Collaborator Author

In the current implementation, the tokenizer is already being loaded into the cache during the initialization phase. The request is essentially used as a key to retrieve the pre-loaded tokenizer from the cache.

While it would undergo a re-initialization process if a different model is requested, are you suggesting that we should treat such cases as an error instead?

Member

@vMaroon vMaroon Jan 3, 2026

We are dropping support for multiple models because it is not an actual use case in the current llm-d design: the indexer is bound to one EPP, and an EPP serves one base model.

When it comes to LoRAs, the request would have the LoRA name as the target model. If we kept dynamic/lazy tokenizer loading, we would need to check for every request whether it targets the base model. So in #192, loading of "missing" tokenizers was removed.

To your question: it would not be treated as an error, but the only loaded tokenizer would be used regardless of the model name.
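
A minimal Go sketch of the behavior described here, assuming a single tokenizer loaded at startup and reused for every request. The `tokenizer` interface, `indexer` type, and `tokenizerFor` method are hypothetical names chosen for illustration and are not taken from the repository.

```go
package main

import "fmt"

// tokenizer is a hypothetical minimal interface for this sketch.
type tokenizer interface {
	Name() string
}

type namedTokenizer struct{ name string }

func (t namedTokenizer) Name() string { return t.name }

// indexer holds exactly one tokenizer, loaded at startup for the base model.
type indexer struct {
	base tokenizer
}

// newIndexer loads the tokenizer eagerly so that no request triggers a load.
func newIndexer(baseModel string) *indexer {
	return &indexer{base: namedTokenizer{name: baseModel}}
}

// tokenizerFor ignores the requested model name. A LoRA adapter name and
// the base model name both resolve to the single loaded tokenizer, and a
// mismatch is not treated as an error.
func (ix *indexer) tokenizerFor(requestedModel string) tokenizer {
	_ = requestedModel // intentionally unused; kept to show the call shape
	return ix.base
}

func main() {
	ix := newIndexer("base-model")
	fmt.Println(ix.tokenizerFor("base-model").Name())
	fmt.Println(ix.tokenizerFor("my-lora-adapter").Name())
}
```

With this shape there is no per-request check of whether the target is the base model or an adapter, which is the check that lazy loading would have required.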

Collaborator Author

@hyeongyun0916 hyeongyun0916 Jan 3, 2026

I agree. That’s why in LoadTokenizerWithCacheRequest, the model is derived from the configuration rather than the request.

However, if this still feels a bit ambiguous, I can try refactoring it to make the separation even more explicit. Should I go ahead with that?

Member

Yes, thanks

hyeongyun0916 and others added 3 commits January 3, 2026 01:47
Co-authored-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Hyunkyun Moon <mhg5303@gmail.com>
…lychattemplate

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@hyeongyun0916 hyeongyun0916 requested a review from delavet as a code owner January 5, 2026 06:31
@vMaroon vMaroon requested review from liu-cong and yankay January 5, 2026 06:31
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@vMaroon
Member

vMaroon commented Jan 7, 2026

I think this is good to go after resolving conflicts.

cc @delavet @osswangxining: the changes within the UDS package are for linting only, so I assume that is fine.

…lychattemplate

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Collaborator

@sagearc sagearc left a comment

This looks very good. Just a few minor comments in the review. Thanks @hyeongyun0916!

# Parse the JSON request
request = json.loads(request_json)
key = request.pop("key")
print("mhg", key, flush=True)
Collaborator

debug print?

Collaborator Author

Oh, I'm sorry. I'll do a quick sweep of the entire PR to make sure everything else is clean.

Comment on lines +1 to +4
--index-url https://download.pytorch.org/whl/cpu
--extra-index-url https://pypi.org/simple
vllm-cpu>=0.11.0; sys_platform != 'darwin'
vllm @ git+https://github.com/vllm-project/vllm.git@v0.11.0; sys_platform == 'darwin'
Collaborator

I couldn't find any reference to vllm-cpu in the vLLM repo or docs, so I assume it is not an official package distribution. Would it be safer to install vLLM from source with CPU flags?

Collaborator Author

Following your feedback, I'll replace the vllm-cpu dependency with a setup.sh script that builds from source. Let me know if you have any other suggestions.

Collaborator Author

@hyeongyun0916 hyeongyun0916 Jan 8, 2026

I've added the setup.sh script to build from source as we discussed. It definitely takes more time to build compared to a simple pip install, but I agree it's a much safer approach given the package distribution issues.

Member

Revisiting this: after integrating this PR into the inference-scheduler build, I find a list of requirements much easier to work with. I know we'll eventually drop the embeddings, but it doesn't seem like that will make it into v0.5.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@hyeongyun0916
Collaborator Author

I see that TestInstrumentedIndexBehavior/ConcurrentOperations failed, but it seems unrelated to the changes in this PR. It looks like an existing issue that should be addressed separately.

@vMaroon
Member

vMaroon commented Jan 9, 2026

Leaving LGTM to @sagearc

@sagearc
Collaborator

sagearc commented Jan 12, 2026

Looks good to me, thanks @hyeongyun0916 !

@sagearc
Collaborator

sagearc commented Jan 12, 2026

/lgtm

@vMaroon
Member

vMaroon commented Jan 12, 2026

/approve

@github-actions github-actions Bot added the lgtm label Jan 12, 2026
@vMaroon vMaroon merged commit a8ca9ba into llm-d:main Jan 12, 2026
9 of 10 checks passed
@hhk7734 hhk7734 deleted the vllm-applychattemplate branch January 16, 2026 02:43
guygir pushed a commit to guygir/llm-d-kv-cache-manager that referenced this pull request Apr 20, 2026

Labels

lgtm Looks good to me, indicates that a PR is ready to be merged.

5 participants