Refactor: Replace transformers with vLLM#234
Conversation
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
| log.Log.Error(err, "failed to render chat template") | ||
| return err | ||
| } | ||
| addSpecialToken = false |
There was a problem hiding this comment.
There was a problem hiding this comment.
Can you add the reference in a comment?
There was a problem hiding this comment.
Pull request overview
This PR refactors the tokenization system to replace the transformers library with vLLM's tokenizer, streamlining the dependency chain and aligning with vLLM's tokenization approach.
Key Changes:
- Replaced transformers library with vLLM for tokenization and chat template rendering
- Renamed core functions and structs for consistency (RenderJinjaTemplate → ApplyChatTemplate, ChatMessage → Conversation, FetchChatTemplate → LoadTokenizerWithCache)
- Added
addSpecialTokenparameter toEncodemethods to control special token handling, with logic to disable it when chat templates are applied (as they already include special tokens)
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/preprocessing/chat_completions/tokenizer_wrapper.py | New Python wrapper using vLLM's get_tokenizer instead of transformers |
| pkg/preprocessing/chat_completions/render_jinja_template_wrapper.py | Removed old transformers-based wrapper |
| pkg/preprocessing/chat_completions/requirements.txt | Updated dependencies to use vllm-cpu instead of transformers, torch, and jinja2 |
| pkg/preprocessing/chat_completions/cgo_functions.h | Updated function signatures for renamed functions (load_tokenizer_with_cache, apply_chat_template) |
| pkg/preprocessing/chat_completions/cgo_functions.c | Implemented new C functions with updated naming and bool return type for LoadTokenizerWithCache |
| pkg/preprocessing/chat_completions/cgo_functions.go | Updated Go structs and functions: ApplyChatTemplateRequest, LoadTokenizerWithCacheRequest, and corresponding methods |
| pkg/preprocessing/chat_completions/cgo_functions_test.go | Updated all test cases to use new function names and struct definitions |
| pkg/tokenization/tokenizer.go | Added HFCachedTokenizer and LocalCachedTokenizer types, updated Encode interface with addSpecialToken parameter, refactored ApplyChatTemplate implementations |
| pkg/tokenization/tokenizer_test.go | Updated test mocks and calls to match new Encode signature |
| pkg/tokenization/uds_tokenizer.go | Updated Encode signature and renamed RenderChatTemplate to ApplyChatTemplate |
| pkg/tokenization/pool.go | Updated Task struct and processTask logic to handle addSpecialToken parameter |
| pkg/tokenization/pool_test.go | Updated mock implementations and test expectations for new signatures |
| pkg/kvcache/indexer.go | Updated GetPodScores signature to use ApplyChatTemplateRequest |
| tests/e2e/redis_mock/e2e_test.go | Updated all test cases to use new function names and added addSpecialToken parameter throughout |
| tests/e2e/redis_mock/e2e_suite_test.go | Updated promptToEngineAndRequestKeys helper to accept addSpecialToken parameter |
| examples/testdata/data.go | Updated RenderReq type to ApplyChatTemplateRequest |
| examples/kv_events/online/main.go | Simplified chat completions endpoint by removing FetchChatTemplate call and using ApplyChatTemplate directly |
| go.mod | Moved go.uber.org/zap from indirect to direct dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
| // BenchmarkRenderJinjaTemplate benchmarks the template rendering performance. | ||
| func BenchmarkRenderJinjaTemplate(b *testing.B) { | ||
| // BenchmarkApplyChatTemplate benchmarks the template rendering performance. | ||
| func BenchmarkApplyChatTemplate(b *testing.B) { |
There was a problem hiding this comment.
Are you referring to a benchmark like the one below? @sagearc
Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkApplyChatTemplate$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v
goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkApplyChatTemplate
BenchmarkApplyChatTemplate-96 4382 241736 ns/op 241602 ns/op_overall 105127 ns/op_warm 1053 B/op 9 allocs/op
PASS
ok github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions 15.210s
Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkRenderJinjaTemplate$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v
goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkRenderJinjaTemplate
[C] Py_InitializeGo - Already initialized in this process (PID: 1550466)
[C] Py_InitChatTemplateModule - Already initialized globally, returning
BenchmarkRenderJinjaTemplate-96 8854 254960 ns/op 254672 ns/op_overall 254645 ns/op_warm 15180 B/op 31 allocs/op
PASS
ok github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions 7.982s
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
|
| ContinueFinalMessage bool `json:"continue_final_message,omitempty"` | ||
| AddGenerationPrompt bool `json:"add_generation_prompt,omitempty"` | ||
| ChatTemplateKWArgs map[string]interface{} `json:"chat_template_kwargs,omitempty"` | ||
| LoadTokenizerWithCacheRequest LoadTokenizerWithCacheRequest `json:"load_tokenizer_with_cache_request,omitempty"` |
There was a problem hiding this comment.
I think we should separate between the tokenizer loading and the processing of a request.
It is true that previously tokenizer loading was lazy, but since supporting LoRAs, #192 changed the logic s.t. the model/tokenizer info is required on startup time.
There was a problem hiding this comment.
In the current implementation, the tokenizer is already being loaded into the cache during the initialization phase. The request is essentially used as a key to retrieve the pre-loaded tokenizer from the cache.
While it would undergo a re-initialization process if a different model is requested, are you suggesting that we should treat such cases as an error instead?
There was a problem hiding this comment.
We are dropping support for multiple models since it is not an actual use-case right now in the llm-d design - since the indexer is bound to one EPP and an EPP serves one base model.
When it comes to LoRAs, the request would have the LoRA name as the target model. If we keep dynamic/lazy tokenizer loading, we need to check if every request is coming to the base model or not. So in #192 "missing" tokenizer loading was removed.
To your question: it would not be treated as an error, but the only loaded tokenizer would be used regardless of the model name.
There was a problem hiding this comment.
I agree. That’s why in LoadTokenizerWithCacheRequest, the model is derived from the configuration rather than the request.
However, if this still feels a bit ambiguous, I can try refactoring it to make the separation even more explicit. Should I go ahead with that?
Co-authored-by: Edoardo Vacchi <evacchi@users.noreply.github.com> Signed-off-by: Hyunkyun Moon <mhg5303@gmail.com>
…lychattemplate Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
|
I think this is good to go after resolving conflicts. cc @delavet @osswangxining the changes within the UDS package are for linting only - assuming it is fine. |
…lychattemplate Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
sagearc
left a comment
There was a problem hiding this comment.
This looks very good. Just a few minor comments in the review. Thanks @hyeongyun0916!
| log.Log.Error(err, "failed to render chat template") | ||
| return err | ||
| } | ||
| addSpecialToken = false |
There was a problem hiding this comment.
Can you add the reference in a comment?
| # Parse the JSON request | ||
| request = json.loads(request_json) | ||
| key = request.pop("key") | ||
| print("mhg", key, flush=True) |
There was a problem hiding this comment.
oh I'm sorry, I'll do a quick sweep of the entire PR to make sure everything else is clean
| --index-url https://download.pytorch.org/whl/cpu | ||
| --extra-index-url https://pypi.org/simple | ||
| vllm-cpu>=0.11.0; sys_platform != 'darwin' | ||
| vllm @ git+https://github.com/vllm-project/vllm.git@v0.11.0; sys_platform == 'darwin' No newline at end of file |
There was a problem hiding this comment.
I couldn't find any references for vllm-cpu in vllm repo/docs, I assume it is not an official package distribution. Maybe it'll be safer to install vllm from source with cpu flags?
There was a problem hiding this comment.
Following your feedback, I'll replace the vllm-cpu dependency with a setup.sh script that builds from source. Let me know if you have any other suggestions.
There was a problem hiding this comment.
I've added the setup.sh script to build from source as we discussed. It definitely takes more time to build compared to a simple pip install, but I agree it's a much safer approach given the package distribution issues.
There was a problem hiding this comment.
Revisiting this, upon integrating this PR in the inference-scheduler build, a list of requirements is much easier. I know we'll eventually drop the embeddings, but it doesn't seem like it's making it into v0.5.
e14bc41 to
1e462fb
Compare
|
I see that TestInstrumentedIndexBehavior/ConcurrentOperations failed, but it seems unrelated to the changes in this PR. It looks like an existing issue that should be addressed separately. |
|
Leaving LGTM to @sagearc |
|
Looks good to me, thanks @hyeongyun0916 ! |
|
/lgtm |
|
/approve |
This PR refactors the tokenization system to use vLLM's tokenizer wrapper instead of the transformers library.
https://llm-d.slack.com/archives/C0A0SU5J68Y/p1764153758005369