
Concerns about the implementation of end_of_generation_condition() #7187

Closed · odelalleau opened this issue Aug 8, 2023 · 4 comments
Labels: bug (Something isn't working)

@odelalleau (Collaborator)

The code of `end_of_generation_condition()` seems weird to me => opening an issue to clarify whether it is working as intended (cc @gshennvm @yidong72).

I have two concerns:

  1. I find it weird to see `<extra_id_1>` hardcoded in this function -- I thought that the goal of providing `end_strings` as an argument was to avoid having to hardcode special strings.
  2. I am confused by something that is best explained through an example (see the sketch below). Let's say I want to use `end_strings = ["abc"]`, which is tokenized into `[1, 2, 3]` (the tokens for "a", "b", "c"), while `<extra_id_1>` is tokenized into `[0]`. It seems to me that this code will lead to `ids_1 = tokenizer.text_to_ids(f'<extra_id_1>abc') = [0, 1, 2, 3]`, `ids_2 = [0]`, and `token_id = ids_1[len(ids_2):][0] = 1`. As a result, as soon as `p` (the last token generated) is equal to 1 (i.e., as soon as an "a" is generated), we consider that the generation has ended. Am I missing something?
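
To make the concern concrete, here is a minimal sketch with a toy tokenizer (the vocabulary and `text_to_ids` below are invented for illustration; this is not NeMo's actual tokenizer):

    # Hypothetical toy vocabulary, for illustration only.
    VOCAB = {'<extra_id_1>': 0, 'a': 1, 'b': 2, 'c': 3}

    def text_to_ids(text):
        # Greedy longest-match tokenization over the toy vocabulary.
        ids, i = [], 0
        while i < len(text):
            for piece in sorted(VOCAB, key=len, reverse=True):
                if text.startswith(piece, i):
                    ids.append(VOCAB[piece])
                    i += len(piece)
                    break
            else:
                raise ValueError(f'untokenizable text at position {i}')
        return ids

    ids_1 = text_to_ids('<extra_id_1>abc')  # [0, 1, 2, 3]
    ids_2 = text_to_ids('<extra_id_1>')     # [0]
    token_id = ids_1[len(ids_2):][0]        # 1, the token for "a"
    # So any generated "a" would be treated as the end of generation.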
odelalleau added the bug label on Aug 8, 2023
@yidong72 (Collaborator)

It is because the SentencePiece tokenizer tends to add an extra token at the front of a string. The `<extra_id_1>` is added at the front so it won't merge with the string that follows during tokenization; eventually we only take the difference between `ids_1` and `ids_2`. This is a hack to deal with the SentencePiece behavior.
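
The behavior being worked around can be reproduced with the sentencepiece library itself (a rough sketch; `tokenizer.model` is a hypothetical path to a trained model, and the exact pieces depend on that model -- the point is only that a bare string typically gets a word-initial piece such as `▁abc`, whereas prefixing it with a known special token yields the mid-sentence tokenization):

    import sentencepiece as spm

    # Hypothetical model path; assumes the model defines <extra_id_1>
    # as a single special token (as T5-style models do).
    sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

    print(sp.encode('abc', out_type=str))              # e.g. ['▁abc']
    print(sp.encode('<extra_id_1>abc', out_type=str))  # e.g. ['<extra_id_1>', 'abc']
    # Slicing off the ids of '<extra_id_1>' recovers the tokenization
    # that 'abc' gets when glued to preceding text.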

@yidong72 (Collaborator)

> I am confused by something that is best explained through an example. Let's say I want to use `end_strings = ["abc"]`, which is tokenized into `[1, 2, 3]` (the tokens for "a", "b", "c"), while `<extra_id_1>` is tokenized into `[0]`. It seems to me that this code will lead to `ids_1 = tokenizer.text_to_ids(f'<extra_id_1>abc') = [0, 1, 2, 3]`, `ids_2 = [0]`, and `token_id = ids_1[len(ids_2):][0] = 1`. As a result, as soon as `p` (the last token generated) is equal to 1 (i.e., as soon as an "a" is generated), we consider that the generation has ended. Am I missing something?

This is a bug. The stop criterion is too aggressive, as it only matches the first token of the string. If we set the stop string to `['abc']`, it will stop at 'a' even when the model wants to generate some other string starting with 'a'.
This is the fix for the problem:

    for end_string in end_strings:
        # Prefix with a known token so SentencePiece tokenizes `end_string`
        # the way it would mid-sentence, then strip the prefix ids.
        ids_1 = tokenizer.text_to_ids(f'<extra_id_1>{end_string}')
        ids_2 = tokenizer.text_to_ids('<extra_id_1>')
        if len(ids_1) <= len(ids_2):
            continue
        # Only register `end_string` if it maps to exactly one token.
        if len(ids_1[len(ids_2):]) == 1:
            token_id = ids_1[len(ids_2):][0]
            end_tokens.add(token_id)
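
Running this fixed loop against the toy tokenizer from the sketch above (again, just an illustration) shows that the multi-token string "abc" is now skipped rather than mis-registered, leaving it to the string-matching check:

    end_tokens = set()
    for end_string in ['abc', '<extra_id_1>']:
        ids_1 = text_to_ids(f'<extra_id_1>{end_string}')
        ids_2 = text_to_ids('<extra_id_1>')
        if len(ids_1) <= len(ids_2):
            continue
        if len(ids_1[len(ids_2):]) == 1:
            end_tokens.add(ids_1[len(ids_2):][0])
    print(end_tokens)  # {0}: only the single-token end string is registered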

@odelalleau (Collaborator, Author)

Ah thanks, I understand now. I have a follow-up question:

The stop conditions are obtained with

conditions.append(
    any([text.endswith(end_string) for end_string in end_strings] + [p.item() in end_tokens])
)

where the tokens in `end_tokens` are `eod_id` plus any other tokens added through the process you showed above.

Let's say that `p` (the last generated token) is equal to one of those other tokens found in `end_tokens`, which triggers one of `[p.item() in end_tokens]`. Isn't it then guaranteed that one of `[text.endswith(end_string) for end_string in end_strings]` would also be True? (For instance, if `p` is the token corresponding to `<extra_id_1>`, then the string should end with `<extra_id_1>`.) If this is wrong, can you provide an example where this would not be the case?

@yidong72 (Collaborator)

It is not guaranteed. `[p.item() in end_tokens]` is intended to check for single tokens, while `[text.endswith(end_string) for end_string in end_strings]` is intended to check full strings (which span more than one token).
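
A concrete case, using the toy tokenizer again (a hedged illustration, not the exact NeMo code, and `p` is shown as a plain int rather than a tensor): a multi-token end string can only be caught by the string comparison, since no single generated token equals it.

    end_strings = ['abc']  # multi-token, so absent from end_tokens
    end_tokens = {0}       # only single-token end strings (e.g. <extra_id_1>)
    text, p = 'xyzabc', 3  # generated text so far; last token is "c"
    done = any([text.endswith(s) for s in end_strings] + [p in end_tokens])
    print(done)  # True via the string check; the token check alone misses it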

odelalleau added a commit to odelalleau/NeMo that referenced this issue Aug 18, 2023
1. Bugfix

The previous implementation did not verify that `end_string` was encoded into a single token, which could trigger the end of generation earlier than intended (see the discussion in NVIDIA#7187).

2. Optimization

The previous implementation scaled linearly with the batch size and quadratically with the length of the generated sequence, which could lead to significant overhead in some situations.

The new implementation is much more efficient in "normal" situations (where the end of generation is identified by a set of unique tokens), and raises a warning when it needs to fall back to the inefficient string-matching case.

Note that it does not behave exactly the same as before, because we skip the string comparison when the end strings all have unique tokens associated with them. For instance, if the model had generated the string
        "Some string.<|endoftext|>"
(where "<|endoftext|>" is really generated as a string, not as a single token), the previous implementation would have considered it the end of generation (assuming `end_strings` has length > 1), while the new one would not. The previous behavior was likely a bug, though, since we expect models to generate the special tokens associated with end strings when they exist (for instance, the standard case `end_strings=["<|endoftext|>"]` has always been handled by just comparing the last token to `eod_id`).

Fixes NVIDIA#7187
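
A minimal sketch of the strategy this commit message describes (the function name and signature below are assumptions for illustration, not the actual NeMo implementation):

    import warnings

    def check_end_of_generation(last_tokens, texts, single_token_ends, multi_token_ends):
        # single_token_ends: ids of end strings that encode to exactly one
        # token (precomputed once, e.g. with the loop shown earlier).
        # multi_token_ends: end strings without a unique token.
        if not multi_token_ends:
            # Fast path: an O(1) set lookup on the last generated token,
            # independent of the length of the generated sequence.
            return [t in single_token_ends for t in last_tokens]
        # Slow path: string matching over the decoded text, whose cost
        # grows with the generated sequence; warn about the overhead.
        warnings.warn('End string without a unique token: falling back to string matching.')
        return [
            t in single_token_ends
            or any(text.endswith(s) for s in multi_token_ends)
            for t, text in zip(last_tokens, texts)
        ]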
odelalleau added a commit to odelalleau/NeMo that referenced this issue Aug 18, 2023 (same commit message as above)
odelalleau added a commit to odelalleau/NeMo that referenced this issue Aug 27, 2023 (same commit message as above)
rohitrango pushed a commit to rohitrango/NeMo that referenced this issue Jun 25, 2024
* Bugfix and optimization in `end_of_generation_condition()`

(same commit message as above)

Signed-off-by: Olivier Delalleau <[email protected]>

* Add warning when model is not in eval mode during generation

Systematically calling `model.eval()` does not seem like a good idea, as it might have side effects leading to unexpected behavior. It would be better to raise an exception if one attempts to generate while in training mode, but this may break existing code => sticking to a warning for now.

Signed-off-by: Olivier Delalleau <[email protected]>

* Use "<extra_id_1>" as prefix string

Signed-off-by: Olivier Delalleau <[email protected]>

* Add TODO for potential failure mode of the string match mechanism

Signed-off-by: Olivier Delalleau <[email protected]>

---------

Signed-off-by: Olivier Delalleau <[email protected]>