
fix: populate tokens field in BatchedEngine.generate() #229

Merged
Thump604 merged 2 commits into waybarrios:main from mmcaulif:fix/batched-engine-tokens-field on Apr 10, 2026

Conversation

Contributor

mmcaulif commented Mar 28, 2026

Encountered this while first investigating vllm-mlx for RL training: I want to be able to see the tokens produced via BatchedEngine.generate(). This PR was written exclusively with Claude, so please let me know if you have any suggestions and I will implement them manually 😃. I will also try to implement and upstream any RL-related features that are missing.

Thank you for the project, it has been very useful!

The below is AI-generated:

Summary

  • BatchedEngine.generate() always returned tokens=[] despite AsyncEngineCore tracking output_token_ids per request
  • The fix passes output.output_token_ids through to GenerationOutput
  • Adds tests/test_batched_engine.py with three unit tests covering the tokens field and other output fields

Root cause

In engine/batched.py, the GenerationOutput was constructed without the tokens field:

# before
return GenerationOutput(
    text=text,
    prompt_tokens=output.prompt_tokens,
    completion_tokens=output.completion_tokens,
    finish_reason=output.finish_reason,
)

# after
return GenerationOutput(
    text=text,
    tokens=output.output_token_ids,  # the added line: forward the tracked token IDs
    prompt_tokens=output.prompt_tokens,
    completion_tokens=output.completion_tokens,
    finish_reason=output.finish_reason,
)

Test plan

  • test_tokens_field_is_populated — verifies token IDs are forwarded correctly (sketched below, after this list)
  • test_tokens_field_empty_when_no_tokens_generated — verifies empty list is handled
  • test_other_output_fields_still_populated — verifies no regression on existing fields
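
For illustration, a hedged sketch of what the first test might check. The stand-in dataclasses mirror the fields shown in the root-cause diff; the actual test file mocks AsyncEngineCore, so this is an approximation, not the committed code:

# Hedged sketch only; stand-ins mirror the fields in the root-cause diff above.
from dataclasses import dataclass, field

@dataclass
class GenerationOutput:  # stand-in for the project's dataclass
    text: str
    prompt_tokens: int
    completion_tokens: int
    finish_reason: str
    tokens: list[int] = field(default_factory=list)

@dataclass
class RequestOutput:  # stand-in for the engine core's per-request result
    output_token_ids: list[int]
    prompt_tokens: int
    completion_tokens: int
    finish_reason: str

def test_tokens_field_is_populated():
    output = RequestOutput([11, 22, 33], prompt_tokens=5,
                           completion_tokens=3, finish_reason="stop")
    result = GenerationOutput(
        text="abc",
        tokens=output.output_token_ids,  # the line this PR adds
        prompt_tokens=output.prompt_tokens,
        completion_tokens=output.completion_tokens,
        finish_reason=output.finish_reason,
    )
    assert result.tokens == [11, 22, 33]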

🤖 Generated with Claude Code

The output_token_ids from AsyncEngineCore were tracked internally but
never forwarded to GenerationOutput, leaving tokens always []. Also
adds tests for the generate() output fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator

Thump604 left a comment

Summary

This PR correctly fixes the missing tokens field in BatchedEngine.generate() for the LLM (text-only) code path. Good catch on the root cause.

What's Good

  1. Fix is minimal and correct: line 493 in batched.py now passes tokens=output.output_token_ids to GenerationOutput.
  2. Tests are well-structured: three cases covering normal output, empty tokens, and field regression.
  3. Test mocks are properly constructed to simulate AsyncEngineCore behavior.

Issues Found

1. Incomplete Fix: MLLM Path Still Missing Tokens

The PR only fixes the LLM path (line 493), but BatchedEngine.generate() also returns GenerationOutput on line 469 for MLLM output (when media is present):

if has_media or self._is_mllm:
    output = await self._mllm_scheduler.generate(...)  # line ~459
    return GenerationOutput(
        text=clean_output_text(output.output_text),
        # tokens field missing here too
        prompt_tokens=output.prompt_tokens,
        ...
    )

The output from MLLMScheduler.generate() also returns a RequestOutput with output_token_ids. This path should also include tokens=output.output_token_ids.
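
A hedged sketch of the corresponding change (fields abbreviated as in the snippet above; this shows the suggested shape, not a verbatim diff):

if has_media or self._is_mllm:
    output = await self._mllm_scheduler.generate(...)
    return GenerationOutput(
        text=clean_output_text(output.output_text),
        tokens=output.output_token_ids,  # the suggested addition
        prompt_tokens=output.prompt_tokens,
        ...
    )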

2. Tests Don't Cover MLLM Path

All three tests call engine.generate() with only text (no images/videos), so they only exercise the fixed LLM path. The MLLM path (with media) is completely untested.

3. Missing Test: Verify Tokens Field Default Value

The GenerationOutput dataclass has tokens: list[int] = field(default_factory=list). The tests don't verify that when omitted from the constructor, it defaults correctly. This is less critical since the fix adds it explicitly, but would be safer.
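
A minimal sketch of that check, assuming GenerationOutput is importable in the test module as the existing tests already require:

def test_tokens_defaults_to_empty_list():
    # tokens omitted on purpose; default_factory should supply an empty list
    out = GenerationOutput(
        text="hi",
        prompt_tokens=1,
        completion_tokens=1,
        finish_reason="stop",
    )
    assert out.tokens == []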

Recommendation

Approve with request to fix:

  1. Apply the same tokens=output.output_token_ids fix to line 469 (MLLM path)
  2. Add a test that exercises the MLLM path (with images/videos)
  3. Optional: add a test verifying the default value

The one-line fix is correct and low-risk. The incomplete coverage is the only concern.

Collaborator

Thump604 left a comment

Recommendation

Minor issue: apply the same tokens=output.output_token_ids fix to line 469 (MLLM path) and add a test for it. The one-line fix is correct and low-risk. Incomplete coverage is the only concern.

Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @mmcaulif: brief endorsement.

The use case is real. BatchedEngine.generate() not populating tokens is a real gap for downstream consumers (RL training, custom samplers, anything that needs the token IDs in addition to the decoded text). The diff is small enough to verify by reading: the token list was already being assembled in the streaming loop, this PR plumbs it through to the final GenerationOutput. Sound shape.

Mergeable on current main.

Collaborator

Thump604 left a comment

I'd been grepping for consumers of GenerationOutput.tokens in-tree as part of an unrelated SimpleEngine refactor and was about to conclude the field was unused and propose dropping it. Your PR is evidence I was wrong: grep doesn't see out-of-tree consumers like RL training pipelines. Populating the field while keeping the empty-list default_factory keeps the change drop-in compatible.

Question, not blocking: for RL training use, would per-token logprobs also be useful on GenerationOutput, or do you compute them separately? Happy to file a follow-up if there's a gap there, since the work of plumbing through token IDs is similar to plumbing through logprobs.
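
Purely illustrative, one possible shape (the logprobs field name and type here are assumptions, not the project's actual API):

from dataclasses import dataclass, field

@dataclass
class GenerationOutput:  # hypothetical extension, not the current definition
    text: str
    tokens: list[int] = field(default_factory=list)
    logprobs: list[float] = field(default_factory=list)  # logprobs[i] pairs with tokens[i]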

Approving as-is.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.
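
For illustration, a hedged sketch of the restored shape (parameter names come from the handlers and errors above; defaults and ordering are assumptions, not the verbatim code):

def generate(self, prompt, *, temperature=1.0, top_p=1.0, top_k=0,
             min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0,
             **kwargs):
    # **kwargs absorbs future sampling params so new server-side fields
    # degrade gracefully instead of raising TypeError
    ...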

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 fixed the CompletionRequest schema side of
the same bdf7dcc drop; this commit fixes the engine-model side.
Contributor Author

mmcaulif commented

@Thump604

Populated the tokens in GenerationOutput for multimodal models. On reflection, I am not convinced at all by the AI-generated tests, so I will try to improve them. I think what would be best is some sort of dummy model that just returns the tokens [0, 1, 2, ..., i] when max_tokens=i, as sketched below.
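
Roughly this shape (names are placeholders, not what I would commit verbatim):

# Rough sketch: class and method names are placeholders.
class CountingDummyModel:
    """Ignores the prompt and emits token IDs [0, 1, ..., max_tokens - 1]."""

    def generate(self, prompt: str, max_tokens: int) -> list[int]:
        return list(range(max_tokens))

# The expected tokens are then fully deterministic:
assert CountingDummyModel().generate("anything", max_tokens=4) == [0, 1, 2, 3]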

And yes, returning per-token log_probs would be great!

Collaborator

Thump604 commented

Thanks — the missing MLLM path was the only real hole I was worried about, and that is closed now that tokens=output.output_token_ids is wired through both return sites in BatchedEngine.generate().

I still agree the current tests are weaker than ideal because they only exercise the text path, but for a fix this small I would treat the stronger dummy-model / MLLM-path test shape as a follow-up rather than a merge blocker. Same for per-token logprobs: useful, but separate scope.

From my side this now looks clean enough to merge.

Contributor Author

mmcaulif commented

Sounds good to me; feel free to merge, as it doesn't look like I am able to.

Thump604 merged commit b062186 into waybarrios:main on Apr 10, 2026