Refactor: Replace daulet/tokenizers with vLLM tokenizer #254

Merged
github-actions[bot] merged 7 commits into llm-d:main from moreh-dev:vllm-encode on Jan 30, 2026

@hyeongyun0916
Collaborator

This PR refactors the tokenization system to use vLLM's tokenizer wrapper instead of daulet/tokenizers.

https://llm-d.slack.com/archives/C0A0SU5J68Y/p1764153758005369

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the tokenization system by replacing the daulet/tokenizers Go library with vLLM's Python-based tokenizer wrapper. The change introduces a new encode function through CGO bindings that communicates with vLLM's tokenizer, allowing for more consistent tokenization behavior with vLLM's inference engine.

Changes:

  • Removed daulet/tokenizers dependency and replaced with vLLM tokenizer via Python/CGO bindings
  • Updated Encode interface to accept EncodeRequest struct instead of individual parameters
  • Added new encode Python function and corresponding C/CGO bindings
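The interface change described above can be sketched as follows. This is a minimal illustration, not the PR's code: only `EncodeRequest`, its `Text` and `AddSpecialTokens` fields, and the `([]uint32, []Offset, error)` return shape appear in this conversation. The `Offset` layout, the `Tokenizer` interface name, and the toy `whitespaceTokenizer` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Offset is a hypothetical [start, end) byte range of a token in the input.
// The real definition lives in types.go and may differ.
type Offset [2]int

// EncodeRequest bundles the parameters that were previously passed
// individually. Only these two fields are mentioned in the PR discussion.
type EncodeRequest struct {
	Text             string
	AddSpecialTokens bool
}

// Tokenizer is an assumed name for the interface that Encode belongs to.
type Tokenizer interface {
	Encode(req *EncodeRequest) ([]uint32, []Offset, error)
}

// whitespaceTokenizer is a toy stand-in for a real tokenizer: it splits on
// whitespace and uses each word's length as its token ID.
type whitespaceTokenizer struct{}

func (whitespaceTokenizer) Encode(req *EncodeRequest) ([]uint32, []Offset, error) {
	if req == nil {
		return nil, nil, fmt.Errorf("nil EncodeRequest")
	}
	var ids []uint32
	var offsets []Offset
	if req.AddSpecialTokens {
		// Hypothetical beginning-of-sequence token with ID 0 and an empty span.
		ids = append(ids, 0)
		offsets = append(offsets, Offset{0, 0})
	}
	pos := 0
	for _, word := range strings.Fields(req.Text) {
		start := strings.Index(req.Text[pos:], word) + pos
		end := start + len(word)
		ids = append(ids, uint32(len(word)))
		offsets = append(offsets, Offset{start, end})
		pos = end
	}
	return ids, offsets, nil
}

func main() {
	var t Tokenizer = whitespaceTokenizer{}
	ids, offsets, err := t.Encode(&EncodeRequest{Text: "hello vLLM", AddSpecialTokens: true})
	fmt.Println(ids, offsets, err)
}
```

Passing a struct rather than positional arguments means new options can be added later without touching every caller, which is presumably why mocks and e2e tests had to be updated only once in this PR.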

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| go.mod, go.sum | Removed daulet/tokenizers dependency |
| pkg/preprocessing/chat_completions/types.go | Added Offset type and Tokenizer struct |
| pkg/preprocessing/chat_completions/tokenizer_wrapper.py | Added encode function for tokenization |
| pkg/preprocessing/chat_completions/cgo_functions.h | Added encode function declarations |
| pkg/preprocessing/chat_completions/cgo_functions.c | Implemented encode function C bindings |
| pkg/preprocessing/chat_completions/cgo_functions.go | Added Encode Go wrapper and EncodeRequest/Response types |
| pkg/preprocessing/chat_completions/cgo_functions_test.go | Added comprehensive tests for encode functionality |
| pkg/tokenization/tokenizer.go | Refactored to use vLLM tokenizer, removed provider interfaces |
| pkg/tokenization/uds_tokenizer.go | Updated to use EncodeRequest struct |
| pkg/tokenization/pool.go | Updated to construct EncodeRequest |
| pkg/tokenization/tokenizer_test.go | Updated all tests to use new Encode interface |
| pkg/tokenization/pool_test.go | Updated mock tokenizer and test cases |
| pkg/tokenization/prefixstore/*.go | Updated Offset type references |
| tests/e2e/redis_mock/*.go | Updated all e2e tests to use new Encode interface |
| pkg/preprocessing/chat_completions/README.md | Updated documentation to reflect vLLM usage |
| docs/architecture.md | Updated dependencies documentation |

Comment thread pkg/preprocessing/chat_completions/types.go Outdated
Comment thread pkg/tokenization/uds_tokenizer.go Outdated
Comment on lines 107 to 113
 func (u *UdsTokenizer) Encode(req *preprocessing.EncodeRequest) ([]uint32, []preprocessing.Offset, error) {
 	httpReq, err := http.NewRequestWithContext(
 		context.Background(),
 		http.MethodPost,
 		u.baseURL+"/tokenize",
-		strings.NewReader(input),
+		strings.NewReader(req.Text),
 	)
Copilot AI Jan 13, 2026

The AddSpecialTokens field from EncodeRequest is not being used or passed to the UDS tokenizer service. If the external tokenizer service needs to know whether to add special tokens, this parameter should be included in the request (e.g., as a query parameter or in the request body). If the service handles this automatically, this should be documented.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Comment thread go.mod
require (
github.com/alicebob/miniredis/v2 v2.35.0
github.com/cespare/xxhash/v2 v2.3.0
github.com/daulet/tokenizers v1.22.1
Member

The CI, Makefile, and Dockerfile should also be updated.

Collaborator Author

Regarding the CI updates (removing the tokenizer), I was planning to separate them into a different PR as discussed in the review. However, would it be better to just merge them into this PR instead?

@vMaroon
Member

vMaroon commented Jan 19, 2026

Good to go after rebase. Thanks @hyeongyun0916!

Do you have any performance benchmarks?

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@hyeongyun0916
Collaborator Author

> Good to go after rebase. Thanks @hyeongyun0916!
>
> Do you have any performance benchmarks?

I've added the CGO benchmarks to the PR. Since daulet/tokenizers had no existing benchmarks, I wrote and ran them myself.
The daulet/tokenizers implementation is faster in these runs (about 10,400 ns/op versus about 104,000 ns/op for the CGO path), but this transition is essential because it paves the way for integrating vLLM's rendering logic.

cgo tokenize

Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkEncode$ github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions -v

[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
	>  goroutine 1 [running]:
	>  runtime/debug.Stack()
	>  	/mnt/config/home/.asdf/installs/golang/1.24.7/go/src/runtime/debug/stack.go:26 +0x5e
	>  sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
	>  	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/log/log.go:60 +0xcd
	>  sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Enabled(0xc000158c40, 0x0)
	>  	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/log/deleg.go:111 +0x32
	>  github.com/go-logr/logr.Logger.Info({{0x91d2d8?, 0xc000158c40?}, 0xc000231e80?}, {0x88ff30, 0x2c}, {0x0, 0x0, 0x0})
	>  	/root/go/pkg/mod/github.com/go-logr/logr@v1.4.2/logr.go:276 +0x6e
	>  github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions_test.TestMain(0xc00027b5e0)
	>  	/mnt/config/home/docs/heimdall/third_party/heimdall-kv-cache-manager/pkg/preprocessing/chat_completions/cgo_functions_test.go:947 +0xd4
	>  main.main()
	>  	_testmain.go:79 +0xa8
goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkEncode
BenchmarkEncode-96    	   12322	    104087 ns/op	    103978 ns/op_overall	    103943 ns/op_warm	    1239 B/op	      19 allocs/op
PASS
ok  	github.com/llm-d/llm-d-kv-cache/pkg/preprocessing/chat_completions	39.970s

daulet/tokenizers

// BenchmarkEncode benchmarks the encode performance.
func BenchmarkEncode(b *testing.B) {
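	tokenizer, err := NewCachedHFTokenizer(context.Background(),
		"ibm-granite/granite-3.3-8b-instruct", &HFTokenizerConfig{
			TokenizersCacheDir: b.TempDir(),
		})
	// Fail fast if setup fails instead of dereferencing a nil tokenizer below.
	require.NoError(b, err, "tokenizer setup should not fail")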

	// Track first iteration time and total time
	var firstIterationTime time.Duration
	var totalTime time.Duration

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		start := time.Now()
		_, _, err := tokenizer.Encode("What is the capital of France?", "", true)
		require.NoError(b, err, "Benchmark should not return errors")
		iterTime := time.Since(start)

		if i == 0 {
			firstIterationTime = iterTime
		}
		totalTime += iterTime
	}

	// Calculate both overall average and warm performance average
	overallAvg := totalTime / time.Duration(b.N)

	var warmAvg time.Duration
	if b.N > 1 {
		warmAvg = (totalTime - firstIterationTime) / time.Duration(b.N-1)
	} else {
		warmAvg = overallAvg // If only one iteration, warm avg = overall avg
	}

	b.ReportMetric(float64(overallAvg.Nanoseconds()), "ns/op_overall")
	b.ReportMetric(float64(warmAvg.Nanoseconds()), "ns/op_warm")
}
Running tool: /mnt/config/home/.asdf/installs/golang/1.24.7/go/bin/go test -test.fullpath=true -benchmem -run=^$ -tags integration_tests -bench ^BenchmarkEncode$ github.com/llm-d/llm-d-kv-cache/pkg/tokenization -v

goos: linux
goarch: amd64
pkg: github.com/llm-d/llm-d-kv-cache/pkg/tokenization
cpu: AMD EPYC 7413 24-Core Processor
BenchmarkEncode
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode752981921/001/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode2631951444/002/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode1954700086/003/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/added_tokens.json
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/special_tokens_map.json
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/merges.txt
Successfully downloaded /tmp/BenchmarkEncode2123434448/004/ibm-granite/granite-3.3-8b-instruct/tokenizer.json
BenchmarkEncode-96    	  122052	     10399 ns/op	     10343 ns/op_overall	     10343 ns/op_warm	     184 B/op	       4 allocs/op
PASS
ok  	github.com/llm-d/llm-d-kv-cache/pkg/tokenization	8.190s
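
For a quick side-by-side reading of the two `BenchmarkEncode` result lines above, the ratios can be computed as in the sketch below. The figures are copied from those two runs; the `benchResult` type and `compare` helper are hypothetical names introduced only for this illustration, and a single run per implementation is not a statistically robust comparison.

```go
package main

import "fmt"

// benchResult holds the headline numbers from one `go test -bench` line.
type benchResult struct {
	name        string
	nsPerOp     float64
	bytesPerOp  float64
	allocsPerOp float64
}

// compare returns how many times larger a's figures are than b's.
func compare(a, b benchResult) (timeX, bytesX, allocsX float64) {
	return a.nsPerOp / b.nsPerOp, a.bytesPerOp / b.bytesPerOp, a.allocsPerOp / b.allocsPerOp
}

func main() {
	// Values taken from the two benchmark outputs in this comment.
	cgo := benchResult{"vLLM tokenizer via CGO", 104087, 1239, 19}
	daulet := benchResult{"daulet/tokenizers", 10399, 184, 4}

	timeX, bytesX, allocsX := compare(cgo, daulet)
	fmt.Printf("%s vs %s\n", cgo.name, daulet.name)
	fmt.Printf("time per op:   %.2fx\n", timeX)
	fmt.Printf("bytes per op:  %.2fx\n", bytesX)
	fmt.Printf("allocs per op: %.2fx\n", allocsX)
}
```

On these numbers the CGO path takes roughly ten times longer per call and allocates several times more memory on the Go side; work done inside the Python interpreter is not visible to `-benchmem`.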

@hyeongyun0916 hyeongyun0916 requested a review from vMaroon January 20, 2026 14:08
@vMaroon
Member

vMaroon commented Jan 20, 2026

Sounds good. Overall, this "slowdown" will become a speedup once we move to a tokens-in architecture, in which this will be the only tokenization stage on the entire serving path.

@hyeongyun0916
Collaborator Author

The failing check "Run Examples Test / run-examples (pull_request)" will pass once #265 is merged.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@vMaroon
Member

vMaroon commented Jan 30, 2026

/lgtm
/approve

github-actions bot added the lgtm label ("Looks good to me, indicates that a PR is ready to be merged.") on Jan 30, 2026
github-actions bot merged commit 676e691 into llm-d:main on Jan 30, 2026
5 checks passed
hhk7734 deleted the vllm-encode branch on February 1, 2026 15:37