Refactor: Replace daulet/tokenizers with vLLM tokenizer by hyeongyun0916 · Pull Request #254 · llm-d/llm-d-kv-cache

hyeongyun0916 · 2026-01-13T16:20:23Z

This PR refactors the tokenization system to use vLLM's tokenizer wrapper instead of the daulet/tokenizers library.

https://llm-d.slack.com/archives/C0A0SU5J68Y/p1764153758005369

Copilot

Pull request overview

This PR refactors the tokenization system by replacing the daulet/tokenizers Go library with vLLM's Python-based tokenizer wrapper. The change introduces a new encode function, exposed through CGO bindings, that calls vLLM's tokenizer so that tokenization behaves consistently with vLLM's inference engine.

Changes:

Removed the daulet/tokenizers dependency and replaced it with the vLLM tokenizer via Python/CGO bindings
Updated Encode interface to accept EncodeRequest struct instead of individual parameters
Added new encode Python function and corresponding C/CGO bindings
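The second bullet swaps positional arguments for a request struct. A minimal sketch of what that shape looks like from a caller's side follows. It is illustrative, not the PR's code: only `Text`, `AddSpecialTokens`, an `Offset` type, and the `([]uint32, []Offset, error)` return shape are visible on this page. The `Offset` layout (a start/end byte pair), the `Encoder` interface name, and the toy whitespace tokenizer are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// EncodeRequest bundles encode parameters into one struct instead of
// positional arguments. Only Text and AddSpecialTokens come from the PR;
// the real type may carry more fields.
type EncodeRequest struct {
	Text             string
	AddSpecialTokens bool
}

// Offset is ASSUMED here to be a [start, end) byte range into Text.
type Offset [2]uint

// Encoder is a HYPOTHETICAL name for the interface that both the CGO-backed
// and the UDS tokenizers would satisfy.
type Encoder interface {
	Encode(req *EncodeRequest) ([]uint32, []Offset, error)
}

// whitespaceEncoder is a toy stand-in: each whitespace-separated word gets
// ID 100, 101, ... and a hypothetical BOS token (ID 1) is optionally prepended.
type whitespaceEncoder struct{}

func (whitespaceEncoder) Encode(req *EncodeRequest) ([]uint32, []Offset, error) {
	if req == nil {
		return nil, nil, fmt.Errorf("nil EncodeRequest")
	}
	var ids []uint32
	var offsets []Offset
	if req.AddSpecialTokens {
		ids = append(ids, 1)                    // hypothetical BOS token
		offsets = append(offsets, Offset{0, 0}) // special tokens cover no text
	}
	word, start := 0, -1
	emit := func(end int) {
		ids = append(ids, uint32(100+word))
		offsets = append(offsets, Offset{uint(start), uint(end)})
		word++
		start = -1
	}
	for i, r := range req.Text { // i is a byte index
		if unicode.IsSpace(r) {
			if start >= 0 {
				emit(i)
			}
			continue
		}
		if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		emit(len(req.Text))
	}
	return ids, offsets, nil
}

func main() {
	var enc Encoder = whitespaceEncoder{}
	req := &EncodeRequest{Text: "What is the capital of France?", AddSpecialTokens: true}
	ids, offsets, err := enc.Encode(req)
	if err != nil {
		panic(err)
	}
	for i, off := range offsets {
		fmt.Printf("%d %q\n", ids[i], req.Text[off[0]:off[1]])
	}
}
```

With a struct, adding another option later (for example, a truncation flag) would not change the method signature, so the UDS tokenizer, the pool, and the test mocks would only need to read the new field.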

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.


File	Description
go.mod, go.sum	Removed daulet/tokenizers dependency
pkg/preprocessing/chat_completions/types.go	Added Offset type and Tokenizer struct
pkg/preprocessing/chat_completions/tokenizer_wrapper.py	Added encode function for tokenization
pkg/preprocessing/chat_completions/cgo_functions.h	Added encode function declarations
pkg/preprocessing/chat_completions/cgo_functions.c	Implemented encode function C bindings
pkg/preprocessing/chat_completions/cgo_functions.go	Added Encode Go wrapper and EncodeRequest/Response types
pkg/preprocessing/chat_completions/cgo_functions_test.go	Added comprehensive tests for encode functionality
pkg/tokenization/tokenizer.go	Refactored to use vLLM tokenizer, removed provider interfaces
pkg/tokenization/uds_tokenizer.go	Updated to use EncodeRequest struct
pkg/tokenization/pool.go	Updated to construct EncodeRequest
pkg/tokenization/tokenizer_test.go	Updated all tests to use new Encode interface
pkg/tokenization/pool_test.go	Updated mock tokenizer and test cases
pkg/tokenization/prefixstore/*.go	Updated Offset type references
tests/e2e/redis_mock/*.go	Updated all e2e tests to use new Encode interface
pkg/preprocessing/chat_completions/README.md	Updated documentation to reflect vLLM usage
docs/architecture.md	Updated dependencies documentation


Copilot · 2026-01-13T16:24:27Z

+func (u *UdsTokenizer) Encode(req *preprocessing.EncodeRequest) ([]uint32, []preprocessing.Offset, error) {
+	httpReq, err := http.NewRequestWithContext(
 		context.Background(),
 		http.MethodPost,
 		u.baseURL+"/tokenize",
-		strings.NewReader(input),
+		strings.NewReader(req.Text),
 	)


The AddSpecialTokens field from EncodeRequest is not being used or passed to the UDS tokenizer service. If the external tokenizer service needs to know whether to add special tokens, this parameter should be included in the request (e.g., as a query parameter or in the request body). If the service handles this automatically, this should be documented.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

vMaroon · 2026-01-17T14:27:22Z

 require (
 	github.com/alicebob/miniredis/v2 v2.35.0
 	github.com/cespare/xxhash/v2 v2.3.0
-	github.com/daulet/tokenizers v1.22.1


The CI, Makefile, and Dockerfile should also be updated.

hyeongyun0916 · 2026-01-17

Regarding the CI updates (removing the tokenizer), I was planning to move them into a separate PR, as discussed in the review. Would it be better to include them in this PR instead?


vMaroon · 2026-01-19T17:27:00Z

Good to go after rebase. Thanks @hyeongyun0916!

Do you have any performance benchmarks?


hyeongyun0916 · 2026-01-20T05:17:05Z

Good to go after rebase. Thanks @hyeongyun0916!

Do you have any performance benchmarks?

I’ve added the CGO benchmarks to the PR. Since daulet/tokenizers didn't have existing benchmarks, I created and ran them myself.
Although the previous daulet/tokenizers implementation (Go bindings over the Hugging Face Rust tokenizers library) is faster, this transition is essential because it paves the way for integrating vLLM's rendering logic.

vLLM tokenizer via CGO

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkEncode$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkEncode
BenchmarkEncode-96    	   12322	    104087 ns/op	    103978 ns/op_overall	    103943 ns/op_warm	    1239 B/op	      19 allocs/op
PASS
ok  	github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions	39.970s

daulet/tokenizers

package tokenization

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// BenchmarkEncode benchmarks the encode performance.
func BenchmarkEncode(b *testing.B) {
	// Tokenizer construction runs before b.ResetTimer, so it is excluded
	// from the timed loop. Fail fast instead of encoding with a nil tokenizer.
	tokenizer, err := NewCachedHFTokenizer(context.Background(),
		"ibm-granite/granite-3.3-8b-instruct", &HFTokenizerConfig{
			TokenizersCacheDir: b.TempDir(),
		})
	require.NoError(b, err, "tokenizer setup should not fail")

	// Track first iteration time and total time
	var firstIterationTime time.Duration
	var totalTime time.Duration

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		start := time.Now()
		_, _, err := tokenizer.Encode("What is the capital of France?", "", true)
		require.NoError(b, err, "Benchmark should not return errors")
		iterTime := time.Since(start)

		if i == 0 {
			firstIterationTime = iterTime
		}
		totalTime += iterTime
	}

	// Calculate both overall average and warm performance average
	overallAvg := totalTime / time.Duration(b.N)

	var warmAvg time.Duration
	if b.N > 1 {
		warmAvg = (totalTime - firstIterationTime) / time.Duration(b.N-1)
	} else {
		warmAvg = overallAvg // If only one iteration, warm avg = overall avg
	}

	// Reported after the loop: b.ResetTimer discards metrics reported before it.
	b.ReportMetric(float64(overallAvg.Nanoseconds()), "ns/op_overall")
	b.ReportMetric(float64(warmAvg.Nanoseconds()), "ns/op_warm")
}

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkEncode$ github.com/llm-d/llm-d-kv-cache/pkg/tokenization -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/tokenization
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkEncode
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
BenchmarkEncode-96    	  122052	     10399 ns/op	     10343 ns/op_overall	     10343 ns/op_warm	     184 B/op	       4 allocs/op
PASS
ok  	github.com/llm-d/llm-d-kv-cache/pkg/tokenization	8.190s
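Putting the two runs side by side (figures copied from the outputs above; both runs on the same AMD EPYC 7413 host, encoding the same short prompt):

```latex
\frac{104087\ \text{ns/op}}{10399\ \text{ns/op}} \approx 10.0, \qquad
\frac{1239\ \text{B/op}}{184\ \text{B/op}} \approx 6.7, \qquad
\frac{19\ \text{allocs/op}}{4\ \text{allocs/op}} \approx 4.8
```

So for this prompt the vLLM path is about 10x slower per call, or roughly 94 µs more per call (104087 - 10399 = 93688 ns).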

vMaroon · 2026-01-20T16:16:23Z

Sounds good. Overall, this "slowdown" will become a speedup once we move to a tokens-in architecture, where this will be the only tokenization stage on the entire serving path.

hyeongyun0916 · 2026-01-21T13:26:27Z

The Run Examples Test / run-examples (pull_request) check fails; it will pass once #265 is merged.


vMaroon · 2026-01-30T12:21:50Z

/lgtm
/approve

Copilot AI review requested due to automatic review settings January 13, 2026 16:20

hyeongyun0916 requested review from dannyharnik, elevran, kfirtoledo and vMaroon as code owners January 13, 2026 16:20

vMaroon requested review from liu-cong, sagearc and yankay January 13, 2026 16:20

Copilot started reviewing on behalf of hyeongyun0916 January 13, 2026 16:20

Copilot AI reviewed Jan 13, 2026


Replace daulet/tokenizers with vLLM tokenizer

95ef324

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 force-pushed the vllm-encode branch from b0b814b to 95ef324 January 13, 2026 16:28

hyeongyun0916 added 2 commits January 14, 2026 05:02

remove tokenizer type

4b4800e

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

edit tokenizer_wrapper

24f21f4

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 force-pushed the vllm-encode branch from 8cb392b to 24f21f4 January 15, 2026 09:42

hyeongyun0916 mentioned this pull request Jan 15, 2026

Refactor: Replace daulet/tokenizers with vLLM #221

Closed

vMaroon reviewed Jan 17, 2026


hyeongyun0916 added 2 commits January 19, 2026 04:10

Merge commit '4627a240db215d3455502d77e54d9b4079d24cc1' into vllm-encode

5aaf814

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

lint

8041b3b

Merge commit 'e2a0a64d33d3bf027eefa712095d4e779f97737c' into vllm-encode

dda0e47

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 requested a review from vMaroon January 20, 2026 14:08

vMaroon mentioned this pull request Jan 21, 2026

Update llm-d-kv-cache dependency to use the new disaggregated tokenizer service. llm-d/llm-d-router#552

Closed

hyeongyun0916 force-pushed the vllm-encode branch from 14393a2 to d804deb January 21, 2026 14:12

Merge commit 'b3499dfc0c1b11060a5cfe0b5618112dcac78db3' into vllm-encode

e9fc1a2

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 force-pushed the vllm-encode branch from d804deb to e9fc1a2 January 21, 2026 16:56

github-actions Bot added the lgtm label Jan 30, 2026

github-actions Bot approved these changes Jan 30, 2026


github-actions Bot merged commit 676e691 into llm-d:main Jan 30, 2026
5 checks passed

hhk7734 deleted the vllm-encode branch February 1, 2026 15:37

hhk7734 mentioned this pull request Apr 8, 2026

Add Moreh as a contributor to the adopters list llm-d/llm-d#1111

Merged


github-actions[bot] merged 7 commits into llm-d:main from moreh-dev:vllm-encode


3 participants
