
New stream #1856

Merged
ArthurZucker merged 19 commits into main from new-stream on Aug 29, 2025

Conversation


ArthurZucker (Collaborator) commented Aug 27, 2025

New API:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream([19567, 255, 255])  # init the state with prefill
out = stream.step(tokenizer, 109)
'ั'

and:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream([19567, 255])  # init the state with prefill
out = stream.step(tokenizer, [255, 109])  # imagine an assistant that generated 2 tokens at the same time
'ั'

Non-breaking:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream(False)
stream.step(tokenizer, 19567)
stream.step(tokenizer, 255)
stream.step(tokenizer, 19567)
out = stream.step(tokenizer, 109)
out
'ั'
tokenizer.encode("อั").ids
[19567, 255, 19567, 109]
tokenizer.decode(tokenizer.encode("อั").ids)
'อั'

This could be somewhat expected, but if you initialize your stream with, say, [19567, 255, 19567] first, then you should be able to properly get 'อั' when you step with 109.

We can't go against the fact that [19567, 109] is a "valid" sequence, so in the context of token generation the first token will always be emitted on its own (it is a valid token by itself). Initializing the stream with prior context should still be helpful, however.


@ArthurZucker ArthurZucker requested a review from McPatate August 29, 2025 07:12
@ArthurZucker ArthurZucker merged commit abee958 into main Aug 29, 2025
30 checks passed
@ArthurZucker ArthurZucker deleted the new-stream branch August 29, 2025 08:06
McPatate (Member) left a comment


🔥

shenxiangzhuang pushed a commit to shenxiangzhuang/tokenizers that referenced this pull request Aug 29, 2025
* update
* update
* updates
* up
* oikay
* use stream input
* nice all test pass?
* fmt
* dev
* rename
* simplify a hell lot
* proper testing
* fix inti
* fix test
* nits
* make clippy happy now
* fmt fml
* remove the prints
* fix gate

njhill commented Aug 29, 2025

Thanks @ArthurZucker!


michaeltheologitis commented Jan 19, 2026

This is very helpful. Thank you a lot!!! 🚀 🚀 🚀

@ArthurZucker is there a way to expose a __copy__ method (or similar)? Currently, my workaround is:

# Stream that has been fed token_ids = [121, 32, ...] which I also want to copy
stream = DecodeStream(...) 

stream_copy = DecodeStream([])
for tid in token_ids:
    stream_copy.step(tokenizer, tid)

This is obviously very very very bad.

I also tried two potential speedups, but both cause later .step calls to return (unexpected?) cached strings:

  1. Using prefill: stream_copy = DecodeStream(token_ids)
  2. Batching: stream_copy.step(tokenizer, token_ids)

ArthurZucker (Collaborator, Author):

stream_copy = DecodeStream(token_ids) should work. Can you share a repro?

ArthurZucker (Collaborator, Author):

Okay, fixing this in #1930.


michaeltheologitis commented Jan 19, 2026

@ArthurZucker Thank you for the fast reply and shipping this so fast!!!

stream_copy = DecodeStream(token_ids) should work can you share a repro ?

Repro: DecodeStream prefill causes .step() to flush entire buffer

Environment:

  • Python 3.12.11
  • macOS-15.7.3-arm64-arm-64bit
  • tokenizers 0.22.1
  • transformers 4.57.1

from transformers import AutoTokenizer
from tokenizers.decoders import DecodeStream

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# Text that reproduces the DecodeStream prefill/step flush issue
text = (
    'def format_task(status):\n'
    '    icons = {\n'
    '        "pending": "○",\n'
    '        "running": "◐",\n'
)
# [822, 3402, 29918, 7662, 29898, 4882, 1125, 13, 1678, 27673, 353, 426, 13, 4706, 376, 29886, 
#  2548, 1115, 376, 31236, 613, 13, 4706, 376, 21094, 1115, 376, 229, 154, 147, 613, 13]
token_ids = tokenizer.encode(text)

stream = DecodeStream([])

for i, tid in enumerate(token_ids):
    chars = stream.step(tokenizer._tokenizer, tid)

    # 'copy' by prefilling all previous tokens that 'stream' has seen (w/o the current one)
    copy_stream = DecodeStream(token_ids[:i])
    # feed the current token
    copy_chars = copy_stream.step(tokenizer._tokenizer, tid)

    # They should match exactly
    if chars != copy_chars:
        print(f"i={i}, tid={tid}")
        print(f"Expected: {chars!r}")
        print(f"Got:      {copy_chars!r}")
        print()

Expected: DecodeStream(prefill_ids).step(tok, new_id) returns only the delta for new_id, same as building incrementally.
Actual: First .step() after prefill flushes the entire buffer (prefill + new token) as one string.

i=29, tid=147
Expected: '◐'
Got:      'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "◐'

ArthurZucker (Collaborator, Author):

Ah, for your bug: I think this is a misunderstanding about the decode stream. The equivalence you are looking for is not really possible:

In [4]: tokenizer.decode(token_ids[:29])
Out[4]: 'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "��'

as you can see, the 30th token is required to form a valid string. But when you initialize a stream with token_ids[:29], the result is not considered a valid string because it ends with ��. As such, the stream does not return anything.

See:

stream = DecodeStream(token_ids[:28])
stream.step(tokenizer._tokenizer, [token_ids[28]])

does not output anything either, while:

In [20]: stream.step(tokenizer._tokenizer, [token_ids[29]])
Out[20]: 'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "◐'

does.
This is because otherwise it would return an invalid sequence.
The issue is that the string you initialized with is already invalid; after you step it is still invalid, so the full "prefill" is still not output.
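The validity check described above comes down to UTF-8 completeness: a multi-byte character split across byte-fallback tokens only becomes decodable once all of its bytes have arrived. A pure-Python illustration, with no tokenizer involved:

```python
# "◐" is a three-byte UTF-8 character; byte-fallback tokenizers emit those
# bytes as separate tokens, so mid-character the buffer is not yet decodable.
full = "◐".encode("utf-8")   # b'\xe2\x97\x90'
partial = full[:2]           # last byte still missing

# An incomplete sequence only decodes to replacement characters ("�")...
assert "\ufffd" in partial.decode("utf-8", errors="replace")

# ...while the complete sequence decodes cleanly, which is why the stream
# withholds output until the final byte token arrives.
assert full.decode("utf-8") == "◐"
```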

michaeltheologitis:

Oh, I see, this makes a lot of sense! Thank you so much for explaining 🙂

ArthurZucker (Collaborator, Author):

of course! 🤗
