Generate: basic token streaming by gante · Pull Request #22449 · huggingface/transformers

gante · 2023-03-29T15:54:18Z

What does this PR do?

Adds token streaming to .generate() 🎉

Why now?

I want to showcase and communicate how much faster assisted generation can be... and for that, I need token streaming :D Non-image/video results have a much lower impact.

What's being added

This PR adds a streamer input to generate. If it is non-None, generate will call streamer.put(new_tokens) as they are being generated. streamer can, therefore, be a wide array of things. This PR adds the simplest case: print tokens as they are generated.

At first, I thought of adding a simpler stream=True option. However, the tokenizer would have to be passed into .generate(), which we have been avoiding, and it wouldn't be nearly as flexible. I've made the call to make streaming+.generate() flexible, and to keep it simple at a pipeline level.
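
To make the contract concrete, here is a minimal sketch of the streamer idea described above. It does not touch the real `generate` code: `PrintStreamer`, `toy_generate`, and the toy vocabulary are hypothetical names made up for illustration. The only thing carried over from this PR is the assumption that the generation loop calls `streamer.put(new_tokens)` whenever new tokens are available.

```python
from typing import Dict, List, Optional


class PrintStreamer:
    """Hypothetical streamer: decodes each batch of new tokens and prints it."""

    def __init__(self, vocab: Dict[int, str]):
        self.vocab = vocab
        self.pieces: List[str] = []  # kept so the streamed text can be inspected later

    def put(self, new_tokens: List[int]) -> None:
        text = "".join(self.vocab[t] for t in new_tokens)
        self.pieces.append(text)
        print(text, end="", flush=True)


def toy_generate(token_ids: List[int], streamer: Optional[PrintStreamer] = None) -> List[int]:
    """Stand-in for a decoding loop: 'generates' the given tokens one at a time."""
    output: List[int] = []
    for token in token_ids:
        output.append(token)
        if streamer is not None:  # streaming is opt-in, as with the `streamer` input
            streamer.put([token])
    return output


vocab = {0: "Hello", 1: ",", 2: " world", 3: "!"}
streamer = PrintStreamer(vocab)
result = toy_generate([0, 1, 2, 3], streamer=streamer)
```

Because the loop only needs an object with a `put` method, the same hook could feed a queue, a websocket, or a UI widget instead of `print`. That is the flexibility argued for above over a plain `stream=True` flag.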

If this PR gets accepted

The plan is to:

Communicate this feature on Twitter (w/Colab examples)
Add to pipelines, maybe with a simpler stream=True flag to start
Add Gradio examples (and, if needed, a specific streamer class)
Add the beam search case to the streamer classes (beam search is much trickier -- we should only print tokens when all candidate beams agree, which means logic needs to be added)
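
The beam search idea in the last item can be sketched as follows. This is only an illustration of "print tokens when all candidate beams agree", not code from the library: `BeamPrefixStreamer` and `common_prefix_length` are hypothetical, and beams are represented as plain lists of token ids.

```python
from typing import List


def common_prefix_length(beams: List[List[int]]) -> int:
    """Number of leading tokens shared by every beam."""
    if not beams:
        return 0
    length = 0
    for tokens in zip(*beams):  # zip stops at the shortest beam
        if any(t != tokens[0] for t in tokens):
            break
        length += 1
    return length


class BeamPrefixStreamer:
    """Hypothetical streamer that emits a token only once all beams agree on it."""

    def __init__(self) -> None:
        self.emitted: List[int] = []

    def put(self, beams: List[List[int]]) -> List[int]:
        agreed = common_prefix_length(beams)
        # Only tokens beyond what was already emitted are new.
        new_tokens = beams[0][len(self.emitted):agreed] if beams else []
        self.emitted.extend(new_tokens)
        return new_tokens


streamer = BeamPrefixStreamer()
step1 = streamer.put([[5, 7], [5, 9]])        # beams agree only on the first token
step2 = streamer.put([[5, 7, 2], [5, 7, 3]])  # now they also agree on the second
step3 = streamer.put([[5, 7, 2, 4], [5, 7, 2, 4]])
```

The sketch relies on the assumption that every new beam extends one of the current beams, so a prefix shared by all beams stays shared and emitted tokens never need to be retracted. The cost is latency: nothing is shown while the beams disagree.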

How does it look

Here's an example. Note that it is running on CPU, so we can actually see the streaming effect (3090 is too fast 😅 ). On GPU it also streams, but much faster 🔥

Screen.Recording.2023-03-29.at.16.39.55.mov

sgugger

I am not a fan of this API at all. Why is there a need for the TextStreamer to spawn a new process? The put method could directly call the print statement.

gante · 2023-03-29T18:05:38Z

@sgugger revised with the simpler implementation (no context manager nor multiprocessing) 🤗

sgugger

Better this way, thanks!

oobabooga · 2023-03-31T05:19:48Z

Just an FYI: I have been doing this by using transformers.StoppingCriteria to create a callback:

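import transformers

class Stream(transformers.StoppingCriteria):
    """Stopping criterion that never stops generation; it only reports tokens to a callback."""
    def __init__(self, callback_func=None):
        self.callback_func = callback_func

    def __call__(self, input_ids, scores) -> bool:
        if self.callback_func is not None:
            # input_ids[0]: all token ids so far for the first sequence in the batch
            self.callback_func(input_ids[0])
        return False  # never request a stop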

The callback is then used to create an iterator with the Iteratorize class here: https://github.com/oobabooga/text-generation-webui/blob/main/modules/callbacks.py#L42

Usage becomes:

import torch

def generate_with_callback(callback=None, **kwargs):
    # `shared.model` and `Iteratorize` are defined in text-generation-webui
    kwargs['stopping_criteria'].append(Stream(callback_func=callback))
    with torch.no_grad():
        shared.model.generate(**kwargs)

def generate_with_streaming(**kwargs):
    return Iteratorize(generate_with_callback, kwargs, callback=None)

with generate_with_streaming(**generate_params) as generator:
    for output in generator:
        ...  # handle each partial output here
gante · 2023-03-31T08:42:00Z

@oobabooga 🧠 That's a smart (and unexpected!) use of the stopping criteria.

I'm going to work on a standardized Gradio solution today, and a Queue+iterator was indeed my plan. If you don't mind, I will take inspiration from your code 💛

A question regarding your implementation -- you use a separate thread in Iteratorize, not a separate process. Any reason for picking a thread over a process? (Without running the code, I'd argue in favor of a separate thread for GIL purposes)

oobabooga · 2023-03-31T13:42:00Z

> If you don't mind, I will take inspiration from your code

Feel free to copy anything you want.

> Any reason for picking a thread over a process?

Honestly, I have no specific reason to give. I just spent several days trying to get the text generation to run in the background independently of where the for loop was at in the queue, and this is what ended up working. With this, I get close to as many tokens/s with streaming as without.
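
The callback-to-iterator pattern discussed in this exchange (a queue filled by a background thread and drained by a `for` loop) can be sketched as below. This is neither the `Iteratorize` class from text-generation-webui nor a streamer from the library: `QueueIterator` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real generation call that invokes a callback per token.

```python
import queue
import threading
from typing import Callable, Iterator, List


class QueueIterator:
    """Hypothetical helper: turns a callback-based producer into an iterator."""

    _SENTINEL = object()  # marks the end of the stream

    def __init__(self, producer: Callable[[Callable[[str], None]], None]):
        self._queue = queue.Queue()
        # The producer runs in a background thread so the consumer loop is not blocked.
        self._thread = threading.Thread(target=self._run, args=(producer,), daemon=True)
        self._thread.start()

    def _run(self, producer: Callable[[Callable[[str], None]], None]) -> None:
        try:
            producer(self._queue.put)  # each callback call enqueues one item
        finally:
            self._queue.put(self._SENTINEL)  # always signal completion

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        item = self._queue.get()  # blocks until the producer provides something
        if item is self._SENTINEL:
            raise StopIteration
        return item


def fake_generate(callback: Callable[[str], None]) -> None:
    """Stand-in for a generation call that reports each new token via a callback."""
    for token in ["tokens", " go", " brrrr"]:
        callback(token)


received: List[str] = [piece for piece in QueueIterator(fake_generate)]
```

The queue decouples the two sides: the producer thread keeps generating while the consumer handles earlier items, which fits the observation above that streaming throughput stays close to non-streaming throughput.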

haha tokens go brrrr

36ed5d7

gante requested a review from sgugger March 29, 2023 15:55

gante added 3 commits March 29, 2023 16:01

docstring

bb48f34

add test

be24b6c

make fixup

2b5f574

sgugger reviewed Mar 29, 2023

Comment thread src/transformers/generation/utils.py

Comment thread src/transformers/generation/streamers.py Outdated

Comment thread src/transformers/__init__.py Outdated

PR comments -- simpler implementation, proper import structure

f401488

sgugger approved these changes Mar 29, 2023

gante added 3 commits March 29, 2023 18:50

final nits

9f7c907

Add documentation; More robust word-by-word printing in TextStreamer; Doctests

8d481e8

More references to streaming in docs

1e18199

gante merged commit 228792a into huggingface:main Mar 30, 2023

gante deleted the stream_generate branch March 30, 2023 11:00

fragro mentioned this pull request Mar 30, 2023

Stream tokens output tloen/alpaca-lora#51

Open

Titaniumtown mentioned this pull request Mar 30, 2023

Proper text-streaming support via transformers library oobabooga/textgen#669

Closed

gante mentioned this pull request Mar 31, 2023

Generate: TextIteratorStreamer (streamer for gradio) #22501

Merged

vblagoje mentioned this pull request Apr 2, 2023

Generate: Enable easier TextStreamer customization #22516

Merged

5 tasks

raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023

Generate: basic token streaming (huggingface#22449)

8e0179a

* haha tokens go brrrr

ambiSk mentioned this pull request May 23, 2023

Use python generator instead of streamer for generation #23640

Open

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023

Generate: basic token streaming (huggingface#22449)

45ccbff

* haha tokens go brrrr

gante merged 8 commits into huggingface:main from gante:stream_generate
