
New stream #1856

Merged
ArthurZucker merged 19 commits into main from new-stream on Aug 29, 2025

Conversation


ArthurZucker (Collaborator) commented Aug 27, 2025

New API:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream([19567, 255, 255])  # init the state with prefill
out = stream.step(tokenizer, 109)
'ั'

and:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream([19567, 255])  # init the state with prefill
out = stream.step(tokenizer, [255, 109])  # imagine an assistant that generated 2 tokens at the same time
'ั'

Non-breaking:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream(False)
stream.step(tokenizer, 19567)
stream.step(tokenizer, 255)
stream.step(tokenizer, 19567)
out = stream.step(tokenizer, 109)
out
'ั'
tokenizer.encode("อั").ids
[19567, 255, 19567, 109]
tokenizer.decode(tokenizer.encode("อั").ids)
'อั'

This could be somewhat expected, but if you initialize your stream with, say, [19567, 255, 19567] first, then you should be able to properly get 'อั' when you step with 109.

We can't go against the fact that [19567, 109] is a "valid" sequence, so in the context of token generation the first token will always be emitted on its own (it is a valid token by itself). Initializing the stream with prior context should still be helpful, however.


@ArthurZucker ArthurZucker requested a review from McPatate August 29, 2025 07:12
@ArthurZucker ArthurZucker merged commit abee958 into main Aug 29, 2025
30 checks passed
@ArthurZucker ArthurZucker deleted the new-stream branch August 29, 2025 08:06
McPatate (Member) left a comment


🔥

shenxiangzhuang pushed a commit to shenxiangzhuang/tokenizers that referenced this pull request Aug 29, 2025
* update
* update
* updates
* up
* oikay
* use stream input
* nice all test pass?
* fmt
* dev
* rename
* simplify a hell lot
* proper testing
* fix inti
* fix test
* nits
* make clippy happy now
* fmt fml
* remove the prints
* fix gate

njhill commented Aug 29, 2025

Thanks @ArthurZucker!


michaeltheologitis commented Jan 19, 2026

This is very helpful. Thank you a lot!!! 🚀 🚀 🚀

@ArthurZucker is there a way to expose a __copy__ method (or similar)? Currently, my workaround is:

# Stream that has been fed token_ids = [121, 32, ...] which I also want to copy
stream = DecodeStream(...) 

stream_copy = DecodeStream([])
for tid in token_ids:
    stream_copy.step(tokenizer, tid)

This is obviously very very very bad.

I also tried two potential speedups, but both cause later .step calls to return (unexpected?) cached strings:

  1. Using prefill: stream_copy = DecodeStream(token_ids)
  2. Batching: stream_copy.step(tokenizer, token_ids)

ArthurZucker (Collaborator, Author):

stream_copy = DecodeStream(token_ids) should work. Can you share a repro?

ArthurZucker (Collaborator, Author):

Okay, fixing this in #1930.


michaeltheologitis commented Jan 19, 2026

@ArthurZucker Thank you for the fast reply and shipping this so fast!!!

stream_copy = DecodeStream(token_ids) should work can you share a repro ?

Repro: DecodeStream prefill causes .step() to flush entire buffer

Environment:

  • Python 3.12.11
  • macOS-15.7.3-arm64-arm-64bit
  • tokenizers 0.22.1
  • transformers 4.57.1

from transformers import AutoTokenizer
from tokenizers.decoders import DecodeStream

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# Text that reproduces the DecodeStream prefill/step flush issue
text = (
    'def format_task(status):\n'
    '    icons = {\n'
    '        "pending": "○",\n'
    '        "running": "◐",\n'
)
# [822, 3402, 29918, 7662, 29898, 4882, 1125, 13, 1678, 27673, 353, 426, 13, 4706, 376, 29886, 
#  2548, 1115, 376, 31236, 613, 13, 4706, 376, 21094, 1115, 376, 229, 154, 147, 613, 13]
token_ids = tokenizer.encode(text)

stream = DecodeStream([])

for i, tid in enumerate(token_ids):
    chars = stream.step(tokenizer._tokenizer, tid)

    # 'copy' by prefilling all previous tokens that 'stream' has seen (w/o the current one)
    copy_stream = DecodeStream(token_ids[:i])
    # feed the current token
    copy_chars = copy_stream.step(tokenizer._tokenizer, tid)

    # They should match exactly
    if chars != copy_chars:
        print(f"i={i}, tid={tid}")
        print(f"Expected: {chars!r}")
        print(f"Got:      {copy_chars!r}")
        print()

Expected: DecodeStream(prefill_ids).step(tok, new_id) returns only the delta for new_id, same as building incrementally.
Actual: First .step() after prefill flushes the entire buffer (prefill + new token) as one string.

i=29, tid=147
Expected: '◐'
Got:      'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "◐'

ArthurZucker (Collaborator, Author):

Ah, for your bug: I think this is a misunderstanding about the decode stream. The equivalence you are looking for is not really possible:

In [4]: tokenizer.decode(token_ids[:29])
Out[4]: 'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "��'

as you can see, the 30th token is required to form a valid string. But when you initialize a stream with token_ids[:29], the result is not considered a valid string because it ends with ��. As such, the stream does not return anything.

See:

stream = DecodeStream(token_ids[:28])
stream.step(tokenizer._tokenizer, [token_ids[28]])

does not output anything either, while:

In [20]: stream.step(tokenizer._tokenizer, [token_ids[29]])
Out[20]: 'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "◐'

does.
This is because otherwise it would return an invalid sequence.
The issue is that the string you initialized with is already invalid; after you step it is still invalid, so the full "prefill" is still not output.
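The validity check described above comes down to UTF-8 completeness: a multi-byte character split across byte-fallback tokens only becomes decodable once all of its bytes have arrived. A pure-Python illustration, with no tokenizer involved:

```python
# "◐" is a three-byte UTF-8 character; byte-fallback tokenizers emit those
# bytes as separate tokens, so mid-character the buffer is not yet decodable.
full = "◐".encode("utf-8")   # b'\xe2\x97\x90'
partial = full[:2]           # last byte still missing

# An incomplete sequence only decodes to replacement characters ("�")...
assert "\ufffd" in partial.decode("utf-8", errors="replace")

# ...while the complete sequence decodes cleanly, which is why the stream
# withholds output until the final byte token arrives.
assert full.decode("utf-8") == "◐"
```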

michaeltheologitis:

Oh, I see, this makes a lot of sense! Thank you so much for explaining 🙂

ArthurZucker (Collaborator, Author):

of course! 🤗
