Set `add_special_tokens` to false by default in Encode #1442

sayanshaw24 · 2025-05-02T19:56:40Z

Sets add_special_tokens from OrtxTokenizeWithOptions added in microsoft/onnxruntime-extensions#940 to false to solve chat template issue in GenAI with extra BOS tokens.

See huggingface/transformers#37686 for more context.

src/ort_genai_c.h

RyanUnderhill · 2025-05-02T22:25:38Z

Ok, meeting conclusion is that we don't need this API currently as our internal default values will do what users want. This way we avoid exposing an option that nobody knows what value to set to.

src/models/model.cpp

Co-authored-by: Ryan Hill <[email protected]>

RyanUnderhill

Not sure if you want to say in the PR comments why we default to false, just to have some history for it if we forget why we did this in the future.

sayanshaw24 · 2025-05-02T23:27:57Z

Not sure if you want to say in the PR comments why we default to false, just to have some history for it if we forget why we did this in the future.

add_special_tokens from OrtxTokenizeWithOptions is a tokenizer param we set to false in Encode in GenAI so as to not confuse the user deciding what to set it to - setting it to false omits the extra BOS token added in the case of Gemma-3. See huggingface/transformers#37686 for more context.

Sets `add_special_tokens` from `OrtxTokenizeWithOptions` added in microsoft/onnxruntime-extensions#940 to false to solve chat template issue in GenAI with extra BOS tokens. See huggingface/transformers#37686 for more context. --------- Co-authored-by: Sayan Shaw <[email protected]> Co-authored-by: Ryan Hill <[email protected]>

Update version to 0.8.0-rc2 and cherry pick these 3 changes: #1435 update ESRP settings #1434 make WebGPU name consistent #1432 Missed an all lowercase "webgpu" string #1440 Apply provider name backwards compatibility at runtime #1452 Update Extensions Commit to Support Chat Template Override for Unsupported Models #1439 Sign macos binaries #1442 Set `add_special_tokens` --------- Co-authored-by: Guenther Schmuelling <[email protected]> Co-authored-by: Sayan Shaw <[email protected]> Co-authored-by: Baiju Meswani <[email protected]> Co-authored-by: Sayan Shaw <[email protected]> Co-authored-by: kunal-vaishnavi <[email protected]>

Sayan Shaw added 2 commits May 2, 2025 12:53

add tokenize with special tokens api changes

d4bd10b

lint check

7fa3dcd

RyanUnderhill reviewed May 2, 2025

View reviewed changes

src/ort_genai_c.h Outdated Show resolved Hide resolved

remove new method, set param by default in encode

1b1dead

RyanUnderhill reviewed May 2, 2025

View reviewed changes

src/models/model.cpp Outdated Show resolved Hide resolved

sayanshaw24 changed the title ~~Add API changes for Encode with special tokens~~ Set add_special_tokens to false by default in Encode May 2, 2025

sayanshaw24 marked this pull request as ready for review May 2, 2025 23:05

label bool for readability

59f79bd

Co-authored-by: Ryan Hill <[email protected]>

RyanUnderhill approved these changes May 2, 2025

View reviewed changes

RyanUnderhill enabled auto-merge (squash) May 2, 2025 23:42

RyanUnderhill merged commit 3a334c8 into main May 2, 2025
14 checks passed

RyanUnderhill deleted the sayanshaw/special-tokens branch May 2, 2025 23:57

natke added the 0.8.0 label May 3, 2025

kunal-vaishnavi mentioned this pull request May 5, 2025

og.Tokenizer does not expose chat_template and other desirable methods and properties #1349

Closed

RyanUnderhill mentioned this pull request May 6, 2025

0.8.0 RC2 cherry picks #1438

Merged

RyanUnderhill added the cherry picked label May 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set `add_special_tokens` to false by default in Encode #1442

Set `add_special_tokens` to false by default in Encode #1442

Uh oh!

sayanshaw24 commented May 2, 2025 •

edited

Loading

Uh oh!

Uh oh!

RyanUnderhill commented May 2, 2025

Uh oh!

Uh oh!

RyanUnderhill left a comment

Uh oh!

sayanshaw24 commented May 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Set add_special_tokens to false by default in Encode #1442

Set add_special_tokens to false by default in Encode #1442

Uh oh!

Conversation

sayanshaw24 commented May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

RyanUnderhill commented May 2, 2025

Uh oh!

Uh oh!

RyanUnderhill left a comment

Choose a reason for hiding this comment

Uh oh!

sayanshaw24 commented May 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Set `add_special_tokens` to false by default in Encode #1442

Set `add_special_tokens` to false by default in Encode #1442

sayanshaw24 commented May 2, 2025 •

edited

Loading