Conversation
* update
* update
* updates
* up
* oikay
* use stream input
* nice all test pass?
* fmt
* dev
* rename
* simplify a hell lot
* proper testing
* fix inti
* fix test
* nits
* make clippy happy now
* fmt fml
* remove the prints
* fix gate
Thanks @ArthurZucker!
This is very helpful. Thank you a lot!!! 🚀 🚀 🚀 @ArthurZucker is there a way to expose a stream that has been fed `token_ids = [121, 32, ...]` which I also want to copy?

```python
stream = DecodeStream(...)
stream_copy = DecodeStream([])
for tid in token_ids:
    stream_copy.step(tokenizer, tid)
```

This is obviously very very very bad. I also tried two potential speedups, but both cause later
Okay, fixing this in: #1930
@ArthurZucker Thank you for the fast reply and shipping this so fast!!!

Repro: `DecodeStream` prefill causes

Environment:

```python
from transformers import AutoTokenizer
from tokenizers.decoders import DecodeStream

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# Text that reproduces the DecodeStream prefill/step flush issue
text = (
    'def format_task(status):\n'
    '    icons = {\n'
    '        "pending": "○",\n'
    '        "running": "◐",\n'
)
# [822, 3402, 29918, 7662, 29898, 4882, 1125, 13, 1678, 27673, 353, 426, 13, 4706, 376, 29886,
#  2548, 1115, 376, 31236, 613, 13, 4706, 376, 21094, 1115, 376, 229, 154, 147, 613, 13]
token_ids = tokenizer.encode(text)

stream = DecodeStream([])
for i, tid in enumerate(token_ids):
    chars = stream.step(tokenizer._tokenizer, tid)
    # 'copy' by prefilling all previous tokens that 'stream' has seen (w/o the current one)
    copy_stream = DecodeStream(token_ids[:i])
    # feed the current token
    copy_chars = copy_stream.step(tokenizer._tokenizer, tid)
    # They should match exactly
    if chars != copy_chars:
        print(f"i={i}, tid={tid}")
        print(f"Expected: {chars!r}")
        print(f"Got: {copy_chars!r}")
        print()
```

Expected:
Ah, for your bug I think this is kind of a misunderstanding about the decode stream. The equivalence you are looking for is not really possible: as you can see, the 30th token is required to have a valid sentence. But when you initialize a stream with the prefix, see:

```python
stream = DecodeStream(token_ids[:28])
stream.step(tokenizer._tokenizer, [token_ids[28]])
```

does not output anything either, while:

```python
In [20]: stream.step(tokenizer._tokenizer, [token_ids[29]])
Out[20]: 'def format_task(status):\n    icons = {\n        "pending": "○",\n        "running": "◐'
```

does.
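The buffering behaviour described above can be illustrated without the tokenizers library at all: a streaming decoder holds on to bytes that do not yet form a complete character and only emits text once they do. A minimal sketch using Python's incremental UTF-8 decoder (the byte payloads are made up for illustration; real tokenizers map token ids to byte pieces):

```python
import codecs

# Each "token" contributes raw UTF-8 bytes; '◐' (b"\xe2\x97\x90") is
# deliberately split across the last two pieces, so the middle step
# yields nothing and the final step flushes the whole character.
pieces = [b"def foo", b"\xe2\x97", b"\x90"]

decoder = codecs.getincrementaldecoder("utf-8")()
outputs = [decoder.decode(piece) for piece in pieces]
print(outputs)  # ['def foo', '', '◐']
```

The empty string in the middle is exactly why a per-step comparison against a freshly prefilled stream cannot line up: the output for a given token depends on what the stream has buffered so far.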
Oh, I see, this makes a lot of sense! Thank you so much for explaining 🙂
Of course! 🤗
New api:

and:

Non breaking:

This could be somewhat expected, but if you initialize your stream, say with `[19567, 255, 19567]` first, then you should be able to properly get `'อั'` if you step with `109`. We can't go against the fact that `[19567, 109]` is a "valid" token, so in the context of token generation the first token will always be emitted, because it is a valid token. However, initializing the stream should still be helpful.
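The `'อั'` case comes down to the same byte-level buffering: the combining vowel only appears once its final UTF-8 byte arrives. A toy sketch with Python's incremental decoder (the split point is hypothetical; the real split depends on how the tokenizer's byte-fallback tokens such as `19567` and `109` carve up the bytes):

```python
import codecs

text = "\u0e2d\u0e31"  # 'อั': Thai 'อ' plus the combining vowel 'ั', six UTF-8 bytes
data = text.encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()
# Suppose the prefill covers the first five bytes and a later step
# supplies the last one (a hypothetical split for illustration):
prefill_out = decoder.decode(data[:5])  # 'อ' is complete; two bytes stay buffered
step_out = decoder.decode(data[5:])     # the buffered bytes plus b"\xb1" flush 'ั'
print(prefill_out, step_out)
```

With a prefilled stream, those buffered bytes are already in place, so the step that delivers the final byte can emit the character correctly.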