
tokenization: pool should report unrecoverable failures#210

Closed
evacchi wants to merge 2 commits into
llm-d:mainfrom
evacchi:test-tokenization-pool-error-reporting


@evacchi evacchi commented Dec 11, 2025

While working on llm-d/llm-d-router#505, I noticed that a misconfigured tokenizer might report an error that never bubbles up, causing the test to hang indefinitely while waiting on the internal task queue.

In this PR:

  • we add a FatalInitError wrapper, representing a nonrecoverable error (e.g. initialization error of the tokenizer)
  • we add an err field to tokenizationResponse

On error:

  • if task.ResultCh != nil, we send a tokenizationResponse carrying the error ({ err })
  • Pool#processTask(), in addition to checking err != nil, checks the type of the error; if it is unrecoverable, the task is forgotten instead of being rate-limited for retry.
  • Pool#Tokenize() now returns ([]uint32, error), which is handled in GetPodScores()
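The error path described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: only the names FatalInitError, tokenizationResponse, and the err field come from the description, while `encode`, `isFatal`, `task`, `resultCh`, and the lower-case `processTask`/`tokenize` helpers are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// FatalInitError marks an error as unrecoverable.
// Assumption: the real type in pkg/tokenization/pool.go may differ in detail.
type FatalInitError struct {
	err error
}

func (fe FatalInitError) Error() string { return "fatal init error: " + fe.err.Error() }

// Unwrap lets errors.Is and errors.As see the wrapped cause.
func (fe FatalInitError) Unwrap() error { return fe.err }

// isFatal (hypothetical helper) reports whether err wraps a FatalInitError.
func isFatal(err error) bool {
	var fe FatalInitError
	return errors.As(err, &fe)
}

// tokenizationResponse carries either tokens or an error back to the caller.
type tokenizationResponse struct {
	tokens []uint32
	err    error
}

// task is a simplified stand-in for the pool's internal task type.
type task struct {
	prompt   string
	resultCh chan tokenizationResponse
}

// encode stands in for the real tokenizer; an empty model name simulates a
// misconfigured tokenizer that fails during initialization.
func encode(model, prompt string) ([]uint32, error) {
	if model == "" {
		return nil, FatalInitError{err: errors.New("no tokenizer configured")}
	}
	tokens := make([]uint32, len(prompt))
	for i := 0; i < len(prompt); i++ {
		tokens[i] = uint32(prompt[i]) // one fake token per byte
	}
	return tokens, nil
}

// processTask always answers on resultCh, even on failure, so a waiting
// caller is never left blocked.
func processTask(model string, t task) error {
	tokens, err := encode(model, t.prompt)
	if t.resultCh != nil {
		t.resultCh <- tokenizationResponse{tokens: tokens, err: err}
	}
	return err
}

// tokenize mirrors the new ([]uint32, error) signature.
func tokenize(model, prompt string) ([]uint32, error) {
	ch := make(chan tokenizationResponse, 1)
	go processTask(model, task{prompt: prompt, resultCh: ch})
	resp := <-ch
	return resp.tokens, resp.err
}

func main() {
	_, err := tokenize("", "hello")
	fmt.Println(err, isFatal(err))
}
```

The point of the sketch is that the response is sent whether or not encoding succeeded, so the caller of `tokenize` returns with an error instead of waiting forever.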

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR enhances the tokenization pool to properly handle and report unrecoverable initialization failures. Previously, misconfigured tokenizers would fail silently, causing tests to hang indefinitely while waiting on internal task queues. The changes introduce a FatalInitError wrapper to distinguish fatal initialization errors from transient failures, ensuring they're immediately reported to callers rather than being indefinitely retried.

Key changes:

  • Introduced FatalInitError type to represent unrecoverable tokenizer initialization errors
  • Modified Pool#Tokenize() to return errors to callers via the new err field in tokenizationResponse
  • Updated worker loop to forget tasks with fatal errors instead of rate-limiting them for retry
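The forget-versus-retry decision in the worker loop might look something like the sketch below. It assumes a queue exposing `Forget` and `AddRateLimited` operations; `retryQueue`, `handleResult`, and `errTransient` are hypothetical names used only for illustration, and the real pool's queue type is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// FatalInitError marks an unrecoverable error (simplified for this sketch).
type FatalInitError struct{ err error }

func (fe FatalInitError) Error() string { return "fatal init error" }
func (fe FatalInitError) Unwrap() error { return fe.err }

// errTransient is a hypothetical example of a retryable failure.
var errTransient = errors.New("transient failure")

// retryQueue is a hypothetical in-memory stand-in for a rate-limited work
// queue; it only records which items were forgotten and which were re-queued.
type retryQueue struct {
	forgotten []string
	retried   []string
}

func (q *retryQueue) Forget(item string)         { q.forgotten = append(q.forgotten, item) }
func (q *retryQueue) AddRateLimited(item string) { q.retried = append(q.retried, item) }

// handleResult decides what happens to an item after it has been processed.
func handleResult(q *retryQueue, item string, err error) {
	var fatal FatalInitError
	switch {
	case err == nil:
		q.Forget(item) // success: clear any retry state
	case errors.As(err, &fatal):
		q.Forget(item) // unrecoverable: retrying cannot help
	default:
		q.AddRateLimited(item) // possibly transient: retry with backoff
	}
}

func main() {
	q := &retryQueue{}
	handleResult(q, "ok", nil)
	handleResult(q, "fatal", fmt.Errorf("encode: %w", FatalInitError{err: errors.New("bad config")}))
	handleResult(q, "flaky", errTransient)
	fmt.Println("forgotten:", q.forgotten, "retried:", q.retried)
}
```

Using `errors.As` rather than a direct type assertion means the check still works if the fatal error has been wrapped with additional context along the way.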

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pkg/tokenization/pool.go | Added FatalInitError type, error reporting in processTask(), and error handling in worker loop; updated Tokenize() signature to return errors |
| pkg/tokenization/tokenizer.go | Wrapped tokenizer initialization errors with FatalInitError in Encode() method |
| pkg/tokenization/pool_test.go | Added TestPool_RunIntegrationFailed to verify error handling for misconfigured tokenizers; updated benchmark to handle new error return |
| pkg/kvcache/indexer.go | Updated GetPodScores() to handle and propagate tokenization errors |

Comment thread pkg/tokenization/pool.go
err error
}

func (fe FatalInitError) Error() string {

Copilot AI Dec 11, 2025

The Error method could panic if fe.err is nil. While this may not happen in normal operation, defensive programming suggests adding a nil check to prevent potential panics.

Suggested change
- func (fe FatalInitError) Error() string {
+ func (fe FatalInitError) Error() string {
+ 	if fe.err == nil {
+ 		return "fatal init error: <nil>"
+ 	}

Contributor Author

I don't think this will happen in practice... 🤔

Comment thread pkg/tokenization/pool_test.go Outdated
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

vMaroon commented Dec 13, 2025

Hi @evacchi - thank you for starting this.

I think a part of this PR will have a conflict with the ongoing #192 - perhaps we can focus this one on reporting encoding errors, and leave loading for after #192 if needed?

evacchi commented Dec 29, 2025

#192 definitely solves the issue with loading, raising an error at instantiation time instead of delaying it to task processing time. I think this issue can be considered solved.
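The difference between failing at instantiation time and failing at task-processing time can be illustrated with a small sketch. Nothing here is taken from #192: `newPool`, `loadTokenizer`, `pool`, and `tokenizer` are hypothetical names, and the empty-model check merely simulates a misconfiguration.

```go
package main

import (
	"errors"
	"fmt"
)

// tokenizer is a hypothetical placeholder for a loaded tokenizer.
type tokenizer struct{ model string }

// loadTokenizer simulates loading; an empty model name is a misconfiguration.
func loadTokenizer(model string) (*tokenizer, error) {
	if model == "" {
		return nil, errors.New("model name is empty")
	}
	return &tokenizer{model: model}, nil
}

// pool is a hypothetical, heavily simplified tokenization pool.
type pool struct{ tok *tokenizer }

// newPool loads the tokenizer eagerly, so a misconfiguration surfaces as a
// constructor error instead of as a failed task later on.
func newPool(model string) (*pool, error) {
	tok, err := loadTokenizer(model)
	if err != nil {
		return nil, fmt.Errorf("creating tokenization pool: %w", err)
	}
	return &pool{tok: tok}, nil
}

func main() {
	if _, err := newPool(""); err != nil {
		fmt.Println("startup failed:", err)
	}
}
```

With eager loading, no task is ever queued against a broken tokenizer, which removes the need for a special fatal-error path in the worker loop for this particular failure.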

@evacchi evacchi closed this Dec 29, 2025
@evacchi evacchi deleted the test-tokenization-pool-error-reporting branch December 29, 2025 10:05
guygir pushed a commit to guygir/llm-d-kv-cache-manager that referenced this pull request Apr 20, 2026