Major bug & fix: Fix bug in batched multi sample generation #1025
Conversation
Co-authored-by: Patrice Bechard <[email protected]>
Folks, this is serious. |
Thanks @patricebechard, are you able to give approval / merge? |
Thanks so much for finding and fixing this bug! Am I understanding correctly that the pattern is enforced within each sequence, however the next token from sequence A ends up in sequence B? Could you please add a test case which fails in main and passes with your fix? Also, as an alternative, we might consider #966 |
Nope, I am not a maintainer, just trying to help :)
It's this: when you use a sampler that returns multiple sequences per batch sample (e.g. generating 10 possible outputs for each input with multinomial generation), the batch and samples are merged into a single super batch of size n_samples * batch_size, on which the constraints and generation are applied (correctly) as if it were a regular batch. Everything is fine until you try to re-split these back into a list of batch_size entries with n_samples generations for each input, i.e. a shape of (batch_size, num_samples); that step is done incorrectly. In practice the later batch entries are filled almost entirely with generations from the first batch entry; it's completely broken.
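For concreteness, here is a minimal sketch of the flattening described above (plain Python, illustrative only, not the library's actual code):

```python
# Minimal sketch of the "super batch" flattening described above.
# Plain Python, illustrative only -- not the library's actual code.
batch_size, num_samples = 2, 3
prompts = ["input 0", "input 1"]

# Each prompt is repeated num_samples times, giving a super batch of size
# batch_size * num_samples on which constraints and generation run correctly,
# as if it were a regular batch.
super_batch = [p for p in prompts for _ in range(num_samples)]
# ['input 0', 'input 0', 'input 0', 'input 1', 'input 1', 'input 1']

# Generation returns one flat list of len(super_batch) sequences; the bug is
# in how this flat list is split back into (batch_size, num_samples).
flat_generations = [f"generation for {p} #{i}"
                    for p in prompts for i in range(num_samples)]
```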
|
Making a test that fails is really hard without an option to return the inputs as part of the outputs. Having that option would make it easier though. |
I was able to reproduce your results on main and verify your fix resolves the issue!
Thank you for contributing a fix! In the future, please do not tag maintainers and other users in the PR. |
Sorry, I thought you would have liked to know that this was broken, and I wanted to be sure that you would see it. It was a very serious bug.
|
The following lines break in batched generation: the code that splits the generated samples back per input. Generation produces a single flat list `[b_0_s_0, b_0_s_1, b_0_s_2, b_0_s_3, b_1_s_0, b_1_s_1, ...]`, with `b_0_s_0` being the example generated for batch 0 and sample 0 of multi-sample batch generation. At the end of the generation code, this flat list is split into a `batch_size` quantity of sub-lists. We indeed get a list of `batch_size` sub-lists, but not the expected ones: each sub-list should contain the `num_samples` generations of one input, i.e. `[b_0_s_0, ..., b_0_s_3]`, then `[b_1_s_0, ..., b_1_s_3]`, and so on.
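A minimal sketch of the broken vs. expected split (plain Python with illustrative names; not the repository's exact code):

```python
# Illustrative sketch of the split bug (not the repository's exact code).
num_samples = 4
batch_size = 2

# Flat list: [b_0_s_0, ..., b_0_s_3, b_1_s_0, ..., b_1_s_3]
next_tokens = [f"b_{b}_s_{s}"
               for b in range(batch_size) for s in range(num_samples)]

# Broken split: the window only advances by 1 per batch element, so the
# second sub-list is mostly made of the first element's samples.
broken = [next_tokens[i : i + num_samples] for i in range(batch_size)]
# [['b_0_s_0', 'b_0_s_1', 'b_0_s_2', 'b_0_s_3'],
#  ['b_0_s_1', 'b_0_s_2', 'b_0_s_3', 'b_1_s_0']]

# Expected split: the window advances by num_samples per batch element.
expected = [next_tokens[i * num_samples : (i + 1) * num_samples]
            for i in range(batch_size)]
# [['b_0_s_0', 'b_0_s_1', 'b_0_s_2', 'b_0_s_3'],
#  ['b_1_s_0', 'b_1_s_1', 'b_1_s_2', 'b_1_s_3']]
```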
As an example, take the prompts `1 1 1 + 3 3 3 =? Solution: 1 1 1 + 3 3 3 =` and `2 2 2 2 + 4 4 4 4 =? Solution: 2 2 2 2 + 4 4 4 4 =`, which give a batch size of 2, with the `\d( \d)+` regex and multinomial generation with 10 samples (I stopped excluding the prompt from the output because the generations looked fishy). The current output is wrong: we can see the problem in that it returned `next_tokens[0 : num_samples]` and `next_tokens[1 : num_samples + 1]`, which is not what we want. The new code returns slices corresponding to the index ranges `[0 : num_samples]` and `[num_samples : 2 * num_samples]`, which is what we want.
A simpler, complete example is sketched below. When not fixed, the second list is just the first list with one item fewer at the start and one reasonable-looking generation appended at the end; when fixed, each list contains only the generations of its own input.
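Since the original snippet and its outputs are not reproduced here, the following is a hedged, self-contained stand-in (the `split_broken` / `split_fixed` helpers are hypothetical) that reproduces the broken and fixed groupings just described:

```python
# Self-contained stand-in for the simple complete example described above.
# The split_* helpers are hypothetical and only illustrate the two behaviours.
def split_broken(flat, batch_size, num_samples):
    # Old behaviour: the window advances by 1 per batch element.
    return [flat[i : i + num_samples] for i in range(batch_size)]

def split_fixed(flat, batch_size, num_samples):
    # Fixed behaviour: the window advances by num_samples per batch element.
    return [flat[i * num_samples : (i + 1) * num_samples]
            for i in range(batch_size)]

batch_size, num_samples = 2, 3
flat = [f"b_{b}_s_{s}" for b in range(batch_size) for s in range(num_samples)]

print(split_broken(flat, batch_size, num_samples))
# [['b_0_s_0', 'b_0_s_1', 'b_0_s_2'], ['b_0_s_1', 'b_0_s_2', 'b_1_s_0']]
# The second list is the first list shifted by one, with a single sample from
# batch 1 at the end -- exactly the symptom described above.

print(split_fixed(flat, batch_size, num_samples))
# [['b_0_s_0', 'b_0_s_1', 'b_0_s_2'], ['b_1_s_0', 'b_1_s_1', 'b_1_s_2']]
```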