Add backwards-compatible support for multiple EOS tokens#305
Conversation
Models like Qwen 3/3.5 define multiple EOS token IDs in their GenerationConfig (e.g. [151645, 151643]), but llguidance only supported a single EOS token. This caused models to enter infinite loops when they tried to emit an EOS token that was masked out. Changes: - Add eos_tokens: Vec<TokenId> to TokTrie with accessors and with_eos_tokens() builder - Update TokenParser to check full EOS set for mask computation, token consumption, rollback, and stop detection - Add tok_eos_extra/tok_eos_extra_count to C API LlgTokenizerInit - Python eos_token parameter now accepts int | list[int] - Add eos_tokens getter property to Python LLTokenizer - Update type stubs and all Python helper modules (hf, tiktoken, llamacpp) All existing APIs remain unchanged; single EOS usage is unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds multi-EOS-token support across the tokenizer/trie layers and the parser so models that define multiple EOS IDs (e.g., Qwen/GLM) can terminate correctly without masking alternative EOS tokens.
Changes:
- Extend
TokTrie/tokenizer wrappers to carry a list of EOS token IDs (while keeping a primary EOS for compatibility). - Update
TokenParserto treat any configured EOS token as valid for mask computation and stop detection. - Plumb multi-EOS through the C API and Python bindings/helpers (including typing updates).
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
toktrie_tiktoken/src/lib.rs |
Adds a setter to apply multiple EOS tokens to the internal TokTrie. |
toktrie_hf_tokenizers/src/lib.rs |
Tracks extra EOS tokens and applies them when constructing the TokTrie. |
toktrie/src/toktree.rs |
Adds eos_tokens storage + builder/accessor; adds unit tests for new behavior. |
parser/src/tokenparser.rs |
Switches EOS handling from single token to a token set for masking/stop logic. |
parser/src/ffi.rs |
Extends C tokenizer initialization to accept additional EOS token IDs. |
parser/llguidance.h |
Exposes the new C init fields for extra EOS token IDs. |
c_sample/c_sample.cpp |
Documents how to pass multiple EOS tokens via the C API. |
python_ext/src/py.rs |
Accepts eos_token as `int |
python_ext/src/llamatokenizer.rs |
Updates llama.cpp bridge to accept multiple EOS token IDs and apply them to TokTrie. |
python/llguidance/hf.py |
Updates helper typing/docs to allow eos_token as `int |
python/llguidance/tiktoken.py |
Updates helper typing/docs to allow eos_token as `int |
python/llguidance/llamacpp.py |
Updates helper typing/docs to allow eos_token as `int |
python/llguidance/_lib.pyi |
Updates stubs for new eos_tokens property and widened eos_token parameter type. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@hudson-ai what would the Rust mechanism be for those when creating an llg factory from a tokenzier? EDIT: also does this mean i should hold off on hacking-together that |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Document that LlgTokenizerInit must be zero-initialized before setting fields, as new fields may be appended in future versions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If you're working from a tokenizer JSON, you'll want to do something like this: use toktrie_hf_tokenizers::ByteTokenizer;
use llguidance::{ParserFactory, api::TopLevelGrammar};
use llguidance::toktrie::InferenceCapabilities;
use llguidance::earley::SlicedBiasComputer;
let mut byte_tok = ByteTokenizer::from_json_str(&tokenizer_json)?;
// Probably get this from the generation_config.json?
byte_tok.set_eos_tokens(&[151645, 151643]);
let tok_env = byte_tok.into_tok_env(None)?;
let factory = ParserFactory::new(
&tok_env,
InferenceCapabilities::default(),
&SlicedBiasComputer::general_slices(),
)?;
// factory now produces parsers that allow both EOS tokens in masks
let grammar = TopLevelGrammar::from_lark(r#"start: /[a-z]+/"#.to_string());
let parser = factory.create_parser(grammar)?;If you already have a RE: holding off on the "hack" -- I think that I can reasonably get a release out in the next week or so, but no strong guarantee. Need to await some review on this PR too. But if you don't mind doing and undoing a hack, go for it 😉 |
Handle runaway model output in "normal grammar" modality masking possible EOS tokens and producing nonsensical output once the model has completed its normal tool-calls and chat stream: - guidance-ai/llguidance#304 - guidance-ai/llguidance#305
Handle runaway model output in "normal grammar" modality masking possible EOS tokens and producing nonsensical output once the model has completed its normal tool-calls and chat stream: - guidance-ai/llguidance#304 - guidance-ai/llguidance#305
Move tok_eos_extra/tok_eos_extra_count out of LlgTokenizerInit into a new LlgTokenizerInitV2 struct that embeds the original as its 'base' field. This keeps LlgTokenizerInit identical to its pre-multi-EOS layout, avoiding any ABI break for existing C consumers. Add llg_new_tokenizer_v2() which accepts the v2 struct. The original llg_new_tokenizer() continues to work unchanged with single-EOS. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The leading struct_size field (set to sizeof(LlgTokenizerInitV2) by callers) lets the library detect which fields are present. Future fields can be appended to the struct without a v3 — callers compiled against an older header will simply have a smaller struct_size, and new fields will be treated as zero/default. llg_new_tokenizer_v2() validates struct_size >= the minimum expected size and returns an error if it's unset or too small. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the nested 'base: LlgTokenizerInit' member with flat copies of all fields so C consumers write init.vocab_size instead of init.base.vocab_size. Since v2 is the recommended struct going forward, this avoids a permanent ergonomic tax. Internally, from_init_v2() builds a temporary LlgTokenizerInit to delegate to from_init(), keeping the code DRY. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add create_tokenizer_v2() and create_byte_tokenizer_v2() that exercise LlgTokenizerInitV2 with struct_size, flat fields, and an extra EOS token. Extract run_constraint_test() helper and run the full constraint test with both v1 and v2 tokenizers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
with_eos_tokens() now asserts all token IDs are within vocab_size, preventing out-of-bounds panics during mask computation. This covers all paths (C API, Python bindings, Rust API). from_init_v2() now accepts smaller struct_size values from callers compiled against older headers. Fields beyond what struct_size covers are treated as zero/default. The minimum accepted size is the base fields through slices (matching v1 + struct_size). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert struct_size to strict check (require >= sizeof) since the function takes &LlgTokenizerInitV2 — Rust assumes the full struct is readable, so accepting smaller sizes would be UB. Update docs to note struct_size is reserved for future forward compatibility. from_init_v2(), before calling with_eos_tokens(). This gives C callers a graceful error instead of a panic across FFI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change llg_new_tokenizer_v2() to take a raw pointer instead of a Rust reference. The function reads struct_size first, then copies only min(struct_size, sizeof) bytes into a local zeroed struct. This means callers compiled against an older (smaller) header genuinely work with newer library versions — new fields default to zero. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix TokenizerWrapper path in py_new to apply eos_token override - Add TokTrie::eos_token_set() that includes all EOS tokens - Fix LLMatcher::eos_token_set() to use all EOS tokens (was singleton) - Fix LLMatcher::consume_token_inner() to accept any EOS token - Fix Matcher::compute_mask_or_eos() to use all EOS tokens - Add Python tests for multi-EOS via TokenizerWrapper mock Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add test_eos_token_set_single and test_eos_token_set_multiple in toktrie - Add test_multi_eos_mask_when_stopped in sample_parser (Matcher level) - Simplify Python mock test to only verify TokenizerWrapper override applies Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Why not make this a wrapper for create_tokenizer_v2()? Isn't the code largely the same?
There was a problem hiding this comment.
Under the hood, each function is using a different init struct, and consolidating would effectively remove the "living documentation" (slash test) of the old API. I think that removing the commented-out block is a good idea though.
Thoughts?
Replace raw new[]/delete[] allocations for token_lens and token_bytes with std::vector in both create_tokenizer_v2() and create_tokenizer(). This is exception-safe and avoids manual memory management. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of separate tok_eos + extra_eos_tokens parameters, accept a single std::vector<uint32_t> where [0] is the primary EOS and any remaining entries are extra EOS tokens. Cleaner C++ API while still mapping naturally to the underlying C struct fields. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The v2 API now has a real working example in create_tokenizer_v2() above, so the inline commented-out snippet is redundant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
|
||
| // Same as above but using the v2 API with an extra (unused) EOS token. | ||
| LlgTokenizer *create_byte_tokenizer_v2(void) { | ||
| std::vector<std::vector<uint8_t>> tokens; |
There was a problem hiding this comment.
Small point, but you can also pre-allocate a capacity in the constructor if you have a pretty good idea of the size you'll need (note that size <= capacity)
| size_t offset = 0; | ||
| for (size_t i = 0; i < tokens.size(); i++) { | ||
| memcpy(token_bytes + offset, tokens[i].data(), token_lens[i]); | ||
| memcpy(token_bytes.data() + offset, tokens[i].data(), token_lens[i]); |
There was a problem hiding this comment.
If you really want to go all C++ then there's std::copy() but memcpy() will work fine.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@riedgar-ms addressed the cpp stuff. Any other objections / comments? The C FFI itself looking good to you? |
riedgar-ms
left a comment
There was a problem hiding this comment.
No objections from me
|
@hudson-ai - I'm still pushing code for the second half of my work on this to refine grammar generation and syntax. Does this change syntax from the (<[I'd]>|<[I'd]>... ) notation of the eos ids I'm deriving from the various config sections? |
|
@sempervictus no syntax changes -- your approach shouldn't stop working, BUT you'd be able to avoid adding the tokens to the grammar at all if you want. You could rewrite pub fn build_llg_factory(
tokenizer: Tokenizer,
vocab_size: Option<usize>
) -> Result<Arc<ParserFactory>> {
let mut bt = ByteTokenizer::from_tokenizer(tokenizer)?;
// NEW
let eos_tokens = todo!();
if !eos_tokens.is_empty() {
bt.set_eos_tokens(eos_tokens);
}
let env = bt.into_tok_env(target_vocab)?;
let factory = ParserFactory::new_simple(&env)?;
Ok(Arc::new(factory))
} |
|
@sempervictus that being said, I'd definitely stick to your approach with the EOS tokens more explicitly represented in the grammar if you need fine-grained control like "use when finishing a message but when finishing a turn" or whatever that may look like. |
|
Thanks, @hudson-ai for this enhancement. It removes some of the confusion we faced when dealing with models that support multiple eos tokens. I’m curious if this behaviour is intentional or if you’d be open to accept a PR that unifies the handling.
// Masks allow all eos tokens
let mask = matcher.compute_mask_or_eos()?;
assert!(mask.is_allowed(151643));
assert!(mask.is_allowed(151645));
// but validate_tokens will only accept one:
assert_eq!(matcher.validate_tokens(&[151643]).unwrap(), 1);
assert_eq!(matcher.validate_tokens(&[151645]).unwrap(), 1); // this one is not validrunnable test case
/// Reproduces the inconsistency reported for models with multiple EOS tokens (e.g. Qwen3 Coder):
/// `compute_mask_or_eos` allows all registered EOS tokens in the stopped mask, but
/// `validate_tokens` should also accept any of them — not just the primary one.
///
/// Requires `tokenizer.json` (a Qwen3-family tokenizer) in the workspace root.
#[test]
fn test_multi_eos_validate_tokens_consistency() {
let tok_path = std::path::Path::new(env!("CARGO_MANIFEST_DIR")).join("../tokenizer.json");
let tok_bytes = std::fs::read(&tok_path).unwrap();
let mut byte_tok =
toktrie_hf_tokenizers::ByteTokenizer::from_json_bytes(&tok_bytes).unwrap();
// Qwen3 Coder: <|im_end|>=151645 is the chat EOS; <|endoftext|>=151643 is the base EOS.
byte_tok.set_eos_tokens(&[151645, 151643]);
let tok_env = byte_tok.into_tok_env(None).unwrap();
let factory = ParserFactory::new(
&tok_env,
InferenceCapabilities::default(),
&SlicedBiasComputer::general_slices(),
)
.unwrap();
let grm = TopLevelGrammar::from_lark(r#"start: "hello""#.to_string());
let mut parser = factory.create_parser(grm).unwrap();
parser.start_without_prompt();
let mut matcher = Matcher::new(Ok(parser));
let hello_tokens = tok_env.tokenize("hello");
for &tok in &hello_tokens {
matcher.consume_token(tok).unwrap();
}
// Both EOS tokens must appear in the stopped mask.
let mask = matcher.compute_mask_or_eos().unwrap();
assert!(mask.is_allowed(151643), "primary EOS 151643 should be in stopped mask");
assert!(mask.is_allowed(151645), "extra EOS 151645 should be in stopped mask");
// validate_tokens must accept the primary EOS.
assert_eq!(
matcher.validate_tokens(&[151643]).unwrap(),
1,
"validate_tokens should accept primary EOS 151643"
);
// validate_tokens must also accept the extra EOS — consistent with the mask.
matcher.reset().unwrap();
for &tok in &hello_tokens {
matcher.consume_token(tok).unwrap();
}
assert_eq!(
matcher.validate_tokens(&[151645]).unwrap(),
1,
"validate_tokens should accept extra EOS 151645"
);
}validate_tokens only expects a single eos token while the mask already supports multiple eos tokens.
The |
|
In vllm.rs we extract both eos anyway so were book-ending with (<[eos0]>|<[eos1]>) special token patterns. Reason being that some eos are specifically for strings or other block types in various models while the other is really end of message terminator. This allows the model to terminate "naturally" in our tests |
|
Hey @fellhorn! Thanks for reaching out
|
Thanks for your reply. Please find my fix in #347. |
`validate_tokens` in `parser/src/earley/parser.rs` only recognized the primary EOS token, which was overlooked in #305 when adding support for multiple EOS tokens. This PR aligns the behaviour with e.g. `compute_mask` that now allows multiple eos tokens. Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Motivation
Models like Qwen 3/3.5 define multiple EOS token IDs in their GenerationConfig (e.g.
[151645, 151643]), but llguidance only supported a single EOS token. Qwen 3 uses<|im_end|>(151645) to end turns in chat mode and<|endoftext|>(151643) as a general end-of-text marker. When llguidance is configured with only one of these (e.g. the tokenizer's default 151645), the other gets masked out. If the model tries to emit the masked EOS, it's forced to pick garbage tokens and enters an infinite repetition loop, never terminating.Closes #253
Related: #304
Changes
eos_tokens: Vec<TokenId>toTokTriewith accessors,with_eos_tokens()builder, and validation (asserts IDs are within vocab range)TokenParserto check the full EOS set for mask computation, token consumption, rollback, and stop detectionLlgTokenizerInitis unchanged. NewLlgTokenizerInitV2struct (flat, withstruct_sizefor forward compatibility) +llg_new_tokenizer_v2()function for multi-EOS supportstruct_sizeenables forward compatibility: the FFI function takes a raw pointer, reads onlystruct_sizebytes, and zero-fills any new fields the caller's header doesn't know aboutllg_new_tokenizer_v2()validates EOS token IDs against vocab size and returns an error (not a panic) for out-of-range IDseos_tokenparameter now acceptsint | list[int]across all entry pointseos_tokensgetter property to PythonLLTokenizerhf,tiktoken,llamacpp)Usage
Python:
C (v2 API):
Rust:
API compatibility
Python: Fully backwards compatible.
eos_tokenstill accepts a singleinteverywhere. The only additions are that it also acceptslist[int], and there's a neweos_tokensproperty.C API: Fully backwards compatible.
LlgTokenizerInitis identical to its pre-PR layout — zero fields added or removed.llg_new_tokenizer()is unchanged. Multi-EOS requires the newLlgTokenizerInitV2+llg_new_tokenizer_v2(), which are purely additive.Rust (published crates —
toktrie,toktrie_hf_tokenizers,toktrie_tiktoken): Only additive changes — new methods likewith_eos_tokens(),eos_tokens(),set_eos_tokens(). No existing signatures changed.Rust (
python_ext— not published):tokenv_from_llamacppchanged fromeos_token: u32toeos_tokens: &[u32]. This is technically a breaking signature change, butpython_extis only consumed internally to build the Python wheel, so no external Rust consumers are affected.Known limitations
TokRxInfo.tok_eosstill holds only the first (primary) EOS token. Code that readstok_eosdirectly rather than going throughTokTrie::eos_tokens()will only see the primary one.with_info()resetseos_tokensback tovec![info.tok_eos], silently dropping extra EOS tokens. Callers that replaceTokRxInfoafter setting multi-EOS must re-applywith_eos_tokens().