Update handling EOS token id detection by kunal-vaishnavi · Pull Request #1925 · microsoft/onnxruntime-genai

kunal-vaishnavi · 2025-12-18T08:22:41Z

Version 2

Description

This PR updates the examples to show how EOS token id detection is handled with ONNX Runtime GenAI when generating tokens. With the addition of the C# binding for GetNextTokens(), all of the published examples now cover the cases listed below in version 1 of this PR. The earlier PR described several variations of the generation loop, and each of those variations had an issue.

This PR also introduces new APIs for tracking token count and querying the generator params:

Generator.TokenCount
Params.GetSearchNumber
Params.GetSearchBool

Additionally, this PR adds some missing Tokenizer APIs for Objective-C.

Tokenizer.GetBosTokenId
Tokenizer.GetEosTokenIds
Tokenizer.GetPadTokenId

Motivation and Context

This PR is a follow-up to the issue fixed in an earlier PR. Users can combine these APIs to distinguish between the two cases that Generator.IsDone() covers: the EOS token id was generated, or the max length was reached.

For example:

bool hit_eos = generator->IsDone() && (generator->TokenCount() < static_cast<int>(params->GetSearchNumber("max_length")));
bool hit_max_length = generator->IsDone() && (generator->TokenCount() == static_cast<int>(params->GetSearchNumber("max_length")));
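
The two C++ expressions above can be folded into a single helper. The following is a minimal Python sketch of that logic only, not part of the library: `StopReason` and `stop_reason` are hypothetical names, and the caller is assumed to pass in the values it reads from IsDone(), TokenCount(), and GetSearchNumber("max_length"). The numbers in the usage lines are illustrative.

```python
from enum import Enum


class StopReason(Enum):
    """Hypothetical labels for why generation stopped."""
    NOT_DONE = "not_done"
    EOS = "eos"
    MAX_LENGTH = "max_length"


def stop_reason(is_done: bool, token_count: int, max_length: int) -> StopReason:
    """Classify the generator state from three values the caller has already queried."""
    if not is_done:
        return StopReason.NOT_DONE
    # Done with room to spare: generation ended because EOS was produced.
    if token_count < max_length:
        return StopReason.EOS
    # Done with the sequence at the limit (the PR text compares with ==).
    return StopReason.MAX_LENGTH


print(stop_reason(True, 25, 25))    # StopReason.MAX_LENGTH
print(stop_reason(True, 146, 256))  # StopReason.EOS
print(stop_reason(False, 20, 25))   # StopReason.NOT_DONE
```

Returning one value instead of two booleans keeps the two termination cases mutually exclusive at the call site.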

Version 1

Description

This PR updates how EOS token id detection is handled with ONNX Runtime GenAI when generating tokens. A new API called Generator.HitEOS() is introduced to detect whether an EOS token id has been generated. Another API called Generator.HitMaxLength() is also introduced to detect whether the max length has been hit before the generation loop has completed.

Motivation and Context

This PR is a follow-up to the issue fixed in an earlier PR. The earlier PR mentions different variations of the generation loop but all of the variations have an issue.

There are two scenarios for terminating the generation loop: 1) the EOS token id is generated and the generation loop completes, or 2) the max length is reached before the generation loop has completed. None of the variations handles both scenarios correctly.

1. Original Generation Loop

while not IsDone():
    GenerateToken()
    GetLastToken()
    PrintLastToken()

Consider scenario 1 with this loop. After GenerateToken() produces the EOS token id, GetLastToken() will attempt to retrieve that token. However, ORT GenAI does not append the EOS token id to the sequences returned to the user (see the earlier PR for why), so the last token in the sequence is still the token that was generated before the EOS token id. Thus, GetLastToken() and PrintLastToken() retrieve and print, for a second time, the last token that the user already saw.
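
The duplicate print can be reproduced without a model. The sketch below is a toy simulation in Python, not the ORT GenAI API: `FakeGenerator`, `EOS`, and the scripted tokens are hypothetical, and the class models only the behavior described above (the EOS token id is not appended to the sequence, and the generator is marked done when EOS is produced or the max length is reached).

```python
EOS = object()  # hypothetical sentinel standing in for the EOS token id


class FakeGenerator:
    """Toy stand-in for a generator; models only the behavior described in this PR."""

    def __init__(self, scripted_tokens, prompt_len, max_length):
        self._script = iter(scripted_tokens)
        self.sequence = ["<prompt>"] * prompt_len
        self.max_length = max_length
        self.done = False

    def is_done(self):
        return self.done

    def generate_token(self):
        token = next(self._script)
        if token is EOS:
            # EOS ends generation but is NOT appended to the sequence.
            self.done = True
            return
        self.sequence.append(token)
        if len(self.sequence) >= self.max_length:
            self.done = True

    def get_last_token(self):
        return self.sequence[-1]


def original_loop(gen):
    """Loop variant 1: always read the last token after generating."""
    printed = []
    while not gen.is_done():
        gen.generate_token()
        printed.append(gen.get_last_token())
    return printed


# Scenario 1: EOS arrives well before max_length.
gen = FakeGenerator(["The", " sky", " is", " blue", ".", EOS], prompt_len=2, max_length=100)
printed = original_loop(gen)
print("".join(printed))  # The sky is blue..
```

The trailing ".." comes from the final iteration: the EOS step leaves the sequence unchanged, so the loop reads the "." token a second time.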

2. Return Early Generation Loop

while not IsDone():
    GenerateToken()
    if IsDone():
        break
    GetLastToken()
    PrintLastToken()

Consider scenario 2 with this loop. After GenerateToken() produces a token and the max length has been reached, the generator's state is marked as done. Then IsDone() will be true and the newest token won't be retrieved and printed since the loop is exited early.

3. Infinite Generation Loop

while True:
    GenerateToken()
    if IsDone():
        break
    GetLastToken()
    PrintLastToken()

Consider scenario 2 with this loop. The same issue as the prior loop still applies. GenerateToken() will generate all of the tokens but once the max length is hit, IsDone() is true and the last token won't be retrieved and printed.
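
Loops 2 and 3 fail in the same way, which a small simulation makes visible. This is a hypothetical Python sketch rather than library code: `FakeGenerator` is a toy class whose only job is to mark itself done once the sequence reaches `max_length`, and the prompt length of 18 and max length of 25 are chosen to mirror the scenario 2 numbers reported in this PR.

```python
class FakeGenerator:
    """Toy stand-in for a generator that stops only when max_length is reached."""

    def __init__(self, scripted_tokens, prompt_len, max_length):
        self._script = iter(scripted_tokens)
        self.sequence = ["<prompt>"] * prompt_len
        self.max_length = max_length
        self.done = False

    def is_done(self):
        return self.done

    def generate_token(self):
        self.sequence.append(next(self._script))
        if len(self.sequence) >= self.max_length:
            self.done = True

    def get_last_token(self):
        return self.sequence[-1]


def return_early_loop(gen):
    """Loop variant 2: break as soon as the generator reports done."""
    printed = []
    while not gen.is_done():
        gen.generate_token()
        if gen.is_done():
            break
        printed.append(gen.get_last_token())
    return printed


def infinite_loop(gen):
    """Loop variant 3: same early break, unconditional loop header."""
    printed = []
    while True:
        gen.generate_token()
        if gen.is_done():
            break
        printed.append(gen.get_last_token())
    return printed


SCRIPT = ["The", " color", " of", " the", " sky", " can", " vary", " depending"]

gen2 = FakeGenerator(SCRIPT, prompt_len=18, max_length=25)
out2 = return_early_loop(gen2)
gen3 = FakeGenerator(SCRIPT, prompt_len=18, max_length=25)
out3 = infinite_loop(gen3)

print("".join(out2))      # The color of the sky can
print(gen2.sequence[-1])  # " vary" was generated but never printed
```

In both variants the seventh token is in the sequence, yet the early break skips the read that would have shown it to the user.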

Conclusion

None of these generation loop variants works because IsDone() currently covers both scenarios in one API and does not distinguish between them. One check is needed in the condition of the while loop to decide whether the loop continues, and another check is needed after token generation to decide whether the last token should be retrieved.

Solution

To fix this, a new API called Generator.HitEOS() is introduced. It returns true when the EOS token id is generated. The generation loop should be modified to the following.

while not IsDone():
    GenerateToken()
    if HitEOS():
        break
    GetLastToken()
    PrintLastToken()

If scenario 1 occurs in this loop, HitEOS() is true and the generation loop will exit early. If scenario 2 occurs in this loop, HitEOS() is false when the max length is reached. The last generated token can still be retrieved and printed. Then because the generator's state is done, IsDone() is true and the generation loop ends.
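
Both outcomes can be checked with a toy simulation. The Python sketch below is not the ORT GenAI API: `FakeGenerator`, `EOS`, and the scripted tokens are hypothetical, and `hit_eos()` here is a stand-in that models the behavior described for Generator.HitEOS(), under the assumption that the EOS token id is never appended to the sequence.

```python
EOS = object()  # hypothetical sentinel standing in for the EOS token id


class FakeGenerator:
    """Toy stand-in for a generator with separate done and EOS signals."""

    def __init__(self, scripted_tokens, prompt_len, max_length):
        self._script = iter(scripted_tokens)
        self.sequence = ["<prompt>"] * prompt_len
        self.max_length = max_length
        self.done = False
        self.eos_seen = False

    def is_done(self):
        return self.done

    def hit_eos(self):
        return self.eos_seen

    def generate_token(self):
        token = next(self._script)
        if token is EOS:
            # EOS ends generation and is not appended to the sequence.
            self.done = True
            self.eos_seen = True
            return
        self.sequence.append(token)
        if len(self.sequence) >= self.max_length:
            self.done = True

    def get_last_token(self):
        return self.sequence[-1]


def hit_eos_loop(gen):
    """Proposed loop: skip the read only when EOS was generated."""
    printed = []
    while not gen.is_done():
        gen.generate_token()
        if gen.hit_eos():
            break
        printed.append(gen.get_last_token())
    return printed


# Scenario 1: EOS before max_length, so no token is printed twice.
scenario_1 = hit_eos_loop(
    FakeGenerator(["The", " sky", " is", " blue", ".", EOS], prompt_len=2, max_length=100)
)
# Scenario 2: max_length reached, so the final token is still printed.
scenario_2 = hit_eos_loop(
    FakeGenerator(
        ["The", " color", " of", " the", " sky", " can", " vary", " depending"],
        prompt_len=18,
        max_length=25,
    )
)

print("".join(scenario_1))  # The sky is blue.
print("".join(scenario_2))  # The color of the sky can vary
```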

Here is a full end-to-end example demonstrating its usage.

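import onnxruntime_genai as og

# Load the model and create a tokenizer with a streaming decoder
model = og.Model("/path/to/model/folder")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Cap the total sequence length (prompt + generated tokens) at 25
params = og.GeneratorParams(model)
params.set_search_options(max_length=25)

generator = og.Generator(model, params)

# Encode the prompt and feed it to the generator
tokens = tokenizer.encode("<|system|>You are a helpful AI assistant.<|end|><|user|>What color is the sky?<|end|><|assistant|>")
print(f"Prompt: {len(tokens)}")
generator.append_tokens(tokens)

count = 0
while not generator.is_done():
    generator.generate_next_token()
    count += 1  # counts every generation step, including the one that produces EOS
    # hit_eos() is the API proposed in version 1 of this PR
    if generator.hit_eos():
        # EOS is not appended to the sequence, so there is no new token to print
        break

    # Stream the newly generated token to the console
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

print()
print(f"Generated: {count}")
print(f"Total: {len(tokens) + count}")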

Scenario 1

Before with loop version 1:

Prompt: 18
The color of the sky can vary depending on the viewing conditions and the presence of particles and moisture in the atmosphere. On a clear day, the sky appears blue due to Rayleigh scattering, where the atmosphere scatters sunlight in all directions and blue wavelengths are scattered more than other colors because they travel as shorter, smaller waves. This scattering causes the sky to appear blue to an observer on the ground. However, the sky can also appear various shades of blue, gray, or even take on vibrant hues like red or orange just before or just after sunrise or sunset, due to the scattering of sunlight by particles and moisture in the atmosphere..
Generated: 128
Total: 146

After with generator.hit_eos():

Prompt: 18
The color of the sky can vary depending on the viewing conditions and the presence of particles and moisture in the atmosphere. On a clear day, the sky appears blue due to Rayleigh scattering, where the atmosphere scatters sunlight in all directions and blue wavelengths are scattered more than other colors because they travel as shorter, smaller waves. This scattering causes the sky to appear blue to an observer on the ground. However, the sky can also appear various shades of blue, gray, or even take on vibrant hues like red or orange just before or just after sunrise or sunset, due to the scattering of sunlight by particles and moisture in the atmosphere.
Generated: 128
Total: 146

Scenario 2

Before with loop version 2:

Prompt: 18
The color of the sky can
Generated: 7
Total: 25

After with generator.hit_eos():

Prompt: 18
The color of the sky can vary
Generated: 7
Total: 25

apsonawane · 2025-12-19T22:58:21Z

In search_cuda.cpp line 173, we are checking for EOS. Do we need to set hit_eos there or not?

onnxruntime-genai/src/cuda/search_cuda.cpp

Line 173 in 4a13b80

assert(next_tokens_.size() == eos_seen_.size());

baijumeswani · 2026-01-13T19:47:04Z

Should we go ahead and close this PR now that the issue can be addressed through the existing API?

Copilot

Pull request overview

This PR refines EOS/max-length handling in generation loops by relying on Generator.GetNextTokens() and a new TokenCount API, and surfaces search-parameter query APIs across languages so callers can distinguish termination conditions. It also adds tokenizer BOS/EOS/PAD accessors for Objective-C and refreshes examples and docs to reflect the new recommended generation patterns and model support.

Changes:

Introduces Generator.TokenCount and GeneratorParams.GetSearchNumber/GetSearchBool at the core/C-API level and threads them through Python, C#, Java, Objective-C, with corresponding tests.
Standardizes generation loops in tests and examples to use while !IsDone() plus GetNextTokens() for streaming, and updates platform tests to assert consistency between token count and sequence lengths.
Adds missing tokenizer APIs (BOS/EOS/PAD token ids) for Objective-C and updates the README support matrix and example usage/docs, while removing the deprecated graph-capture parameter API from public headers.

Reviewed changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated 9 comments.

File	Description
test/python/test_onnxruntime_genai_e2e.py	Simplifies the Python e2e test generation loop to `while not generator.is_done()` without an inner break, reflecting the new loop guidance.
test/python/test_onnxruntime_genai_api.py	Adds assertions around `get_search_options()` and `generator.token_count()` and converts multiple generation loops to `while not generator.is_done()`.
test/platform/apple/apple_package_test/macos_package_testUITests/macos_package_uitest_cpp_api.mm	Uses C++ `GetSearchNumber`/`GetSearchBool` and `TokenCount` in the macOS UI test and updates the loop to `while (!generator->IsDone())`, plus post-generation TokenCount/sequence-length consistency checks.
test/platform/apple/apple_package_test/ios_package_testUITests/ios_package_uitest_cpp_api.mm	Mirrors the macOS C++ API UI test changes for iOS, validating search params and `TokenCount` before and after generation.
test/model_tests.cpp	Replaces numerous `while (true)` + `if (IsDone()) break;` loops with `while (!generator->IsDone())` in C++ model tests.
test/csharp/TestOnnxRuntimeGenAIAPI.cs	Updates C# tests to use `GeneratorParams.GetSearchNumber/GetSearchBool` and `Generator.TokenCount()`, and simplifies generation loops to `while (!generator.IsDone())`.
test/c_api_tests.cpp	Converts C API tests to `while (!generator->IsDone())` and adds `TokenCount` assertions in the EOS/PAD test, relying on the new C API getter.
src/search.h	Extends the `Search` interface with a virtual `ResetDone()` method and declares overrides in CPU/CUDA search implementations.
src/search.cpp	Implements `Search_Cpu::ResetDone` and `GreedySearch_Cpu::ResetDone`, and reuses `ResetDone()` from `AppendTokens` and `RewindTo` to centralize done/eos_seen reset logic.
src/python/python.cpp	Removes the deprecated `try_graph_capture_with_max_batch_size`, adds `PyGeneratorParams.get_search_options()`, and binds `generator.token_count()` in the Python wrapper.
src/ort_genai_c.h	Removes the deprecated `OgaGeneratorParamsTryGraphCaptureWithMaxBatchSize`, documents and declares C APIs for getting/setting search numbers/bools, `OgaGenerator_TokenCount`, runtime options, and tokenizer BOS/EOS/PAD accessors.
src/ort_genai_c.cpp	Implements the new C APIs for getting search numbers/bools and `OgaGenerator_TokenCount`, and wires tokenizer BOS/EOS/PAD getters to the underlying C++ APIs.
src/ort_genai.h	Adds C++ RAII wrappers `GeneratorParams::GetSearchNumber/GetSearchBool`, `OgaGenerator::TokenCount`, and a non-span `GetNextTokens()` variant while removing the deprecated graph-capture helper.
src/objectivec/oga_tokenizer.mm	Implements Objective-C methods to read BOS/EOS/PAD token ids via the C++ tokenizer and return them as scalar/NSArray values.
src/objectivec/oga_generator_params.mm	Adds Objective-C `getSearchNumber:` and `getSearchBool:` that call into `OgaGeneratorParams::GetSearchNumber/GetSearchBool`.
src/objectivec/oga_generator.mm	Introduces Objective-C `tokenCount:` that forwards to C++ `OgaGenerator::TokenCount`.
src/objectivec/include/ort_genai_objc.h	Updates the public Objective-C header to expose BOS/EOS/PAD tokenizer accessors, generator-param search getters, and generator `tokenCount:` with documentation.
src/objectivec/error_utils.h	Adds new Objective-C helper macros for catching C++ exceptions when returning `double` and `int`, but currently defines them with malformed macro signatures (compile-time bug).
src/java/src/test/java/ai/onnxruntime/genai/GeneratorParamsTest.java	Clarifies a Java test comment to indicate it validates setting a valid search option.
src/java/src/test/java/ai/onnxruntime/genai/GenerationTest.java	Uses `getSearchNumber`, `getSearchBool`, and `generator.tokenCount()` in the Java generation test alongside the updated `while (!generator.isDone())` loop.
src/java/src/main/native/ai_onnxruntime_genai_GeneratorParams.cpp	Implements JNI bridges for `GeneratorParams.getSearchNumber` and `getSearchBool` using the new C API getters.
src/java/src/main/native/ai_onnxruntime_genai_Generator.cpp	Adds a JNI method `tokenCount` that forwards to `OgaGenerator_TokenCount`.
src/java/src/main/java/ai/onnxruntime/genai/GeneratorParams.java	Exposes `getSearchNumber` and `getSearchBool` in the Java API and adds the corresponding native method declarations.
src/java/src/main/java/ai/onnxruntime/genai/Generator.java	Exposes `tokenCount()` in the Java Generator API and wires it to the new JNI-native method.
src/generators.h	Adds `GeneratorParams::GetSearchNumber/GetSearchBool` and `Generator::TokenCount()` to the core C++ generator interfaces.
src/generators.cpp	Implements search-option querying with explicit name dispatch and error on invalid names, and `Generator::TokenCount()` via `search_->GetSequenceLength()`.
src/cuda/search_cuda.h	Declares `Search_Cuda::ResetDone()` to match the extended `Search` interface.
src/cuda/search_cuda.cpp	Centralizes CUDA done/eos_seen reset logic in `Search_Cuda::ResetDone()` and reuses it from constructors, `AppendTokens`, and `RewindTo`.
src/csharp/NativeMethods.cs	Adds P/Invoke signatures for `OgaGeneratorParamsGetSearchNumber/GetSearchBool` and `OgaGenerator_TokenCount`.
src/csharp/GeneratorParams.cs	Wraps the new C# interop getters as `GetSearchNumber`/`GetSearchBool` on `GeneratorParams` and removes the deprecated graph-capture stub.
src/csharp/Generator.cs	Exposes `TokenCount()` on the C# Generator class, backed by the new native function.
src/config.h	Fixes a minor spacing typo in the `early_stopping` comment in `Config::Search`.
examples/python/phi4-mm.py	Updates the multimodal Python example’s generation loop to `while not generator.is_done()` with `get_next_tokens()` for streaming.
examples/python/phi3-qa.py	Aligns the QA Python example with the new loop pattern and keeps the timing logic around `generate_next_token`.
examples/python/model-vision.py	Uses `while not generator.is_done()` + `get_next_tokens()` in the vision example instead of `while True` with an inner `is_done()` break.
examples/python/model-qa.py	Similarly updates the text QA example loop to rely on `is_done()` only in the loop condition.
examples/python/model-generate.py	Simplifies the batch generation loop to `while not generator.is_done()` without an inner `if is_done(): break`.
examples/python/model-chat.py	Updates the chat example to use the new loop pattern while retaining timing and streaming decode via `get_next_tokens()`.
examples/python/awq-quantized-model.py	Applies the same loop update to the AWQ-quantized model example.
examples/csharp/HelloPhi4MM/Program.cs	Updates the C# multimodal example to `while (!generator.IsDone())` with streaming decode via `GetNextTokens()[0]`.
examples/csharp/HelloPhi3V/Program.cs	Mirrors the multimodal loop update for the Phi3V C# example.
examples/csharp/HelloPhi/Program.cs	Converts several Phi C# examples (batch and streaming/chat) to the new `while (!generator.IsDone())` pattern.
examples/c/src/whisper.cpp	Adjusts the Whisper C++ example’s generation loop to use `while (!generator->IsDone())`.
examples/c/src/phi4-mm.cpp	Switches the multimodal C++ example to `while (!generator->IsDone())` and uses `GetNextTokens()[0]` instead of manually indexing `GetSequenceData`.
examples/c/src/model_vision.cpp	Same as phi4-mm: uses `while (!generator->IsDone())` and `GetNextTokens()[0]` for last token retrieval.
examples/c/src/model_qa.cpp	Updates the QA C++ example loop and uses `GetNextTokens()[0]` for streaming outputs.
examples/c/src/model_chat.cpp	Updates the chat C++ example to use `while (!generator->IsDone())` and `GetNextTokens()[0]` for token streaming.
README.md	Refreshes the support matrix (model names/lines), updates the recommended Python generation loop, revises version checkout instructions, and replaces the old “Breaking API changes” section with nightly install guidance.

kunal-vaishnavi added 4 commits December 17, 2025 19:52

Add IsMaxLength API

7df4aa7

Add HitEOS and HitMaxLength APIs

9d41ed9

Add language bindings and update unit tests

b12e260

Update README

795719d

kunal-vaishnavi added the 0.11.5 label Dec 18, 2025

Fix Java build and initialize pointers

4a13b80

apsonawane reviewed Dec 18, 2025

Comment thread src/cuda/search_cuda.cpp

Add checks for beam search

3165142

apsonawane previously approved these changes Jan 13, 2026

kunal-vaishnavi added 2 commits January 21, 2026 08:45

Merge branch 'main' into kvaishnavi/fix-early-return

a3299aa

Remove HitEOS from examples since GetNextTokens works in streaming mode

74c4fac

kunal-vaishnavi dismissed apsonawane’s stale review via 74c4fac January 21, 2026 09:00

kunal-vaishnavi added 16 commits January 21, 2026 11:46

Add C++ API for GetNextTokens and use in examples

5cf30ae

Introduce TokenCount API instead

f96fa85

Add GetSearchNumber and GetSearchBool APIs

06f6703

Undo accidental change

d18fe38

Fix return types in Java bindings

93b6279

Add missing value call

87ef255

Update return type for Objective-C binding of TokenCount

d5c51a1

Add missing return in Java binding of TokenCount

deac1b1

Fix names of APIs called in C API

606b494

Add missing const references

279eaa2

Add changes suggested by C++ linter

67487c2

Add some more missing const references

09fc909

Change how Python binding is done

8d9fda8

Use fullname for pybind dict

a95ac69

Define TokenCount binding with PyGenerator instead of OgaGenerator

d70baa1

Add changes suggested by C++ linter

d160ea0

kunal-vaishnavi added 7 commits January 22, 2026 06:39

Add assertions in unit tests for new APIs

ebe9eaf

Fix language binding API names

e4bc169

Update how chunk size is obtained

5ca18c7

Fix max length in assertion

6f6c0b0

Remove default values from Java bindings

1c2ed9d

Fix C API call inside Java API

fb1f8d5

Fix value in assert

2197b1a

kunal-vaishnavi removed the 0.11.5 label Jan 22, 2026

kunal-vaishnavi added 2 commits January 22, 2026 18:45

Merge branch 'main' into kvaishnavi/fix-early-return

735b73a

Remove breaking changes documentation from README

7afeb51

baijumeswani reviewed Jan 23, 2026

baijumeswani requested a review from Copilot January 23, 2026 21:17

Copilot started reviewing on behalf of baijumeswani January 23, 2026 21:18

Copilot AI reviewed Jan 23, 2026

kunal-vaishnavi added 5 commits January 24, 2026 00:44

Construct vector in return statement

3b990df

Make changes based on PR feedback

27d42b5

Add back missing definition

4461a6e

Pin transformers to be before v5

6913f9f

Cast token count from size_t to int in C#

5a668f3

baijumeswani previously approved these changes Jan 27, 2026

Comment thread src/generators.cpp Outdated

baijumeswani enabled auto-merge (squash) January 27, 2026 17:59

Simplify getting chunk size value

c2a45e9

kunal-vaishnavi dismissed baijumeswani’s stale review via c2a45e9 January 27, 2026 18:14

kunal-vaishnavi requested a review from baijumeswani January 27, 2026 18:16

baijumeswani approved these changes Jan 27, 2026

kunal-vaishnavi mentioned this pull request Jan 27, 2026

Add RAII wrappers for ORT Model Editor API types #1953

Merged

baijumeswani merged commit 862852a into main Jan 27, 2026
15 checks passed

baijumeswani deleted the kvaishnavi/fix-early-return branch January 27, 2026 20:04

dependabot Bot mentioned this pull request Feb 16, 2026

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.4 to 0.12.0 yuniko-software/qwen3-onnx#23

Closed

dependabot Bot mentioned this pull request Mar 2, 2026

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.4 to 0.12.1 yuniko-software/qwen3-onnx#27

Open

baijumeswani merged 40 commits into main from kvaishnavi/fix-early-return
