
Concerns about the implementation of end_of_generation_condition() #7187

Closed · odelalleau opened this issue Aug 8, 2023 · 4 comments
Labels: bug (Something isn't working)

@odelalleau (Collaborator)

The code of `end_of_generation_condition()` seems weird to me => opening an issue to clarify whether it is working as intended (cc @gshennvm @yidong72).

I have two concerns:

  1. I find it weird to see `<extra_id_1>` hardcoded in this function -- I thought that the goal of providing `end_strings` as an argument was to avoid having to hardcode special strings.
  2. I am confused by something that is best explained through an example (see the sketch below). Let's say I want to use `end_strings = ["abc"]`, which is tokenized into `[1, 2, 3]` (the tokens for "a", "b", "c"), while `<extra_id_1>` is tokenized into `[0]`. It seems to me that this code will lead to `ids_1 = tokenizer.text_to_ids(f'<extra_id_1>abc') = [0, 1, 2, 3]`, `ids_2 = [0]`, and `token_id = ids_1[len(ids_2):][0] = 1`. As a result, as soon as `p` (the last token generated) is equal to 1 (i.e., as soon as an "a" is generated), we consider that the generation has ended. Am I missing something?
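
To make the concern concrete, here is a minimal sketch with a toy tokenizer (the vocabulary and `text_to_ids` below are invented for illustration; this is not NeMo's actual tokenizer):

    # Hypothetical toy vocabulary, for illustration only.
    VOCAB = {'<extra_id_1>': 0, 'a': 1, 'b': 2, 'c': 3}

    def text_to_ids(text):
        # Greedy longest-match tokenization over the toy vocabulary.
        ids, i = [], 0
        while i < len(text):
            for piece in sorted(VOCAB, key=len, reverse=True):
                if text.startswith(piece, i):
                    ids.append(VOCAB[piece])
                    i += len(piece)
                    break
            else:
                raise ValueError(f'untokenizable text at position {i}')
        return ids

    ids_1 = text_to_ids('<extra_id_1>abc')  # [0, 1, 2, 3]
    ids_2 = text_to_ids('<extra_id_1>')     # [0]
    token_id = ids_1[len(ids_2):][0]        # 1, the token for "a"
    # So any generated "a" would be treated as the end of generation.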
odelalleau added the bug label on Aug 8, 2023
@yidong72 (Collaborator)

It is because the SentencePiece tokenizer tends to add an extra token at the front of a string. The `<extra_id_1>` is added at the front so it won't merge with the string that follows during tokenization; eventually we only take the difference between `ids_1` and `ids_2`. This is a hack to deal with the SentencePiece behavior.
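
The behavior being worked around can be reproduced with the sentencepiece library itself (a rough sketch; `tokenizer.model` is a hypothetical path to a trained model, and the exact pieces depend on that model -- the point is only that a bare string typically gets a word-initial piece such as `▁abc`, whereas prefixing it with a known special token yields the mid-sentence tokenization):

    import sentencepiece as spm

    # Hypothetical model path; assumes the model defines <extra_id_1>
    # as a single special token (as T5-style models do).
    sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

    print(sp.encode('abc', out_type=str))              # e.g. ['▁abc']
    print(sp.encode('<extra_id_1>abc', out_type=str))  # e.g. ['<extra_id_1>', 'abc']
    # Slicing off the ids of '<extra_id_1>' recovers the tokenization
    # that 'abc' gets when glued to preceding text.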

@yidong72 (Collaborator)

> I am confused by something that is best explained through an example. Let's say I want to use `end_strings = ["abc"]`, which is tokenized into `[1, 2, 3]` (the tokens for "a", "b", "c"), while `<extra_id_1>` is tokenized into `[0]`. It seems to me that this code will lead to `ids_1 = tokenizer.text_to_ids(f'<extra_id_1>abc') = [0, 1, 2, 3]`, `ids_2 = [0]`, and `token_id = ids_1[len(ids_2):][0] = 1`. As a result, as soon as `p` (the last token generated) is equal to 1 (i.e., as soon as an "a" is generated), we consider that the generation has ended. Am I missing something?

This is a bug. The stop criterion is too aggressive, as it only matches the first token of the string. If we set the stop string to `['abc']`, it will stop at 'a' even when the model wants to generate some other string starting with 'a'.
This is the fix for the problem:

    for end_string in end_strings:
        # Prefix with a known token so SentencePiece tokenizes `end_string`
        # the way it would mid-sentence, then strip the prefix ids.
        ids_1 = tokenizer.text_to_ids(f'<extra_id_1>{end_string}')
        ids_2 = tokenizer.text_to_ids('<extra_id_1>')
        if len(ids_1) <= len(ids_2):
            continue
        # Only register `end_string` if it maps to exactly one token.
        if len(ids_1[len(ids_2):]) == 1:
            token_id = ids_1[len(ids_2):][0]
            end_tokens.add(token_id)
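
Running this fixed loop against the toy tokenizer from the sketch above (again, just an illustration) shows that the multi-token string "abc" is now skipped rather than mis-registered, leaving it to the string-matching check:

    end_tokens = set()
    for end_string in ['abc', '<extra_id_1>']:
        ids_1 = text_to_ids(f'<extra_id_1>{end_string}')
        ids_2 = text_to_ids('<extra_id_1>')
        if len(ids_1) <= len(ids_2):
            continue
        if len(ids_1[len(ids_2):]) == 1:
            end_tokens.add(ids_1[len(ids_2):][0])
    print(end_tokens)  # {0}: only the single-token end string is registered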

@odelalleau (Collaborator, Author)

Ah thanks, I understand now. I have a follow-up question:

The stop conditions are obtained with

conditions.append(
    any([text.endswith(end_string) for end_string in end_strings] + [p.item() in end_tokens])
)

where the tokens in `end_tokens` are `eod_id` plus any other tokens added through the process you showed above.

Let's say that `p` (the last generated token) is equal to one of those other tokens found in `end_tokens`, which triggers one of `[p.item() in end_tokens]`. Isn't it then guaranteed that one of `[text.endswith(end_string) for end_string in end_strings]` would also be True? (For instance, if `p` is the token corresponding to `<extra_id_1>`, then the string should end with `<extra_id_1>`.) If this is wrong, can you provide an example where this would not be the case?

@yidong72 (Collaborator)

It is not guaranteed. `[p.item() in end_tokens]` is intended to check for single tokens, while `[text.endswith(end_string) for end_string in end_strings]` is intended to check full strings (which span more than one token).
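
A concrete case, using the toy tokenizer again (a hedged illustration, not the exact NeMo code, and `p` is shown as a plain int rather than a tensor): a multi-token end string can only be caught by the string comparison, since no single generated token equals it.

    end_strings = ['abc']  # multi-token, so absent from end_tokens
    end_tokens = {0}       # only single-token end strings (e.g. <extra_id_1>)
    text, p = 'xyzabc', 3  # generated text so far; last token is "c"
    done = any([text.endswith(s) for s in end_strings] + [p in end_tokens])
    print(done)  # True via the string check; the token check alone misses it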

odelalleau added a commit to odelalleau/NeMo that referenced this issue Aug 18, 2023
1. Bugfix

The previous implementation did not verify that `end_string` was encoded into a single token, which could trigger the end of generation earlier than intended (see the discussion in NVIDIA#7187).

2. Optimization

The previous implementation scaled linearly with the batch size and quadratically with the length of the generated sequence, which could lead to significant overhead in some situations.

The new implementation is much more efficient in "normal" situations (where the end of generation is identified by a set of unique tokens), and raises a warning when it needs to fall back to the inefficient string-matching case.

Note that it does not behave exactly the same as before, because we skip the string comparison when the end strings all have unique tokens associated with them. For instance, if the model had generated the string
        "Some string.<|endoftext|>"
(where "<|endoftext|>" is really generated as a string, not as a single token), the previous implementation would have considered it the end of generation (assuming `end_strings` has length > 1), while the new one would not. The previous behavior was likely a bug, though, since we expect models to generate the special tokens associated with end strings when they exist (for instance, the standard case `end_strings=["<|endoftext|>"]` has always been handled by just comparing the last token to `eod_id`).

Fixes NVIDIA#7187
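
A minimal sketch of the strategy this commit message describes (the function name and signature below are assumptions for illustration, not the actual NeMo implementation):

    import warnings

    def check_end_of_generation(last_tokens, texts, single_token_ends, multi_token_ends):
        # single_token_ends: ids of end strings that encode to exactly one
        # token (precomputed once, e.g. with the loop shown earlier).
        # multi_token_ends: end strings without a unique token.
        if not multi_token_ends:
            # Fast path: an O(1) set lookup on the last generated token,
            # independent of the length of the generated sequence.
            return [t in single_token_ends for t in last_tokens]
        # Slow path: string matching over the decoded text, whose cost
        # grows with the generated sequence; warn about the overhead.
        warnings.warn('End string without a unique token: falling back to string matching.')
        return [
            t in single_token_ends
            or any(text.endswith(s) for s in multi_token_ends)
            for t, text in zip(last_tokens, texts)
        ]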
odelalleau added a commit to odelalleau/NeMo that referenced this issue Aug 18, 2023 (same commit message as above)
odelalleau added a commit to odelalleau/NeMo that referenced this issue Aug 27, 2023 (same commit message as above)
rohitrango pushed a commit to rohitrango/NeMo that referenced this issue Jun 25, 2024
* Bugfix and optimization in `end_of_generation_condition()`

(same commit message as above)

Signed-off-by: Olivier Delalleau <[email protected]>

* Add warning when model is not in eval mode during generation

Systematically calling `model.eval()` does not seem like a good idea, as it might have side effects leading to unexpected behavior. It would be better to raise an exception if one attempts to generate while in training mode, but this may break existing code => sticking to a warning for now.

Signed-off-by: Olivier Delalleau <[email protected]>

* Use "<extra_id_1>" as prefix string

Signed-off-by: Olivier Delalleau <[email protected]>

* Add TODO for potential failure mode of the string match mechanism

Signed-off-by: Olivier Delalleau <[email protected]>

---------

Signed-off-by: Olivier Delalleau <[email protected]>