Concerns about the implementation of end_of_generation_condition()
#7187
Comments
It is because the SentencePiece tokenizer tends to add an extra token in front of the string. The `<extra_id_1>` is added in front so it won't merge with the string that follows after tokenization. Eventually we only take the difference between `ids_1` and `ids_2`. This is a hack to deal with the SentencePiece behavior.
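A minimal sketch of that prefix trick, assuming the `sentencepiece` package; the model file name is a placeholder, and `<extra_id_1>` need not be a special token for the difference trick to apply:

```python
import sentencepiece as spm

# Placeholder model file -- any SentencePiece tokenizer shows the behavior.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

end_string = "."
ids_1 = sp.encode(f"<extra_id_1>{end_string}")  # prefix + end string
ids_2 = sp.encode("<extra_id_1>")               # prefix alone

# SentencePiece tends to insert an extra token at the start of a bare
# string; prefixing with "<extra_id_1>" pins that extra token to the
# prefix, so the difference between the two encodings is exactly the
# end string's ids.
end_string_ids = ids_1[len(ids_2):]
print(end_string_ids)
```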
This is a bug. The stop criterion is too aggressive, as it only matches the first token of the string. If we set the stop string to ['abc'], generation will stop at 'a' even when the model wants to generate some other string that starts with 'a'.

```python
for end_string in end_strings:
    ids_1 = tokenizer.text_to_ids(f'<extra_id_1>{end_string}')
    ids_2 = tokenizer.text_to_ids('<extra_id_1>')
    if len(ids_1) <= len(ids_2):
        continue
    if len(ids_1[len(ids_2) :]) == 1:
        token_id = ids_1[len(ids_2) :][0]
        end_tokens.add(token_id)
```
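To make the premature stop concrete, here is a toy reconstruction of the buggy behavior (taking the first token of a multi-token end string without a single-token check); the character-level tokenizer is invented for illustration and is not NeMo code:

```python
# Hypothetical character-level tokenizer: "<extra_id_1>" -> [0],
# "a" -> 1, "b" -> 2, "c" -> 3.
VOCAB = {"a": 1, "b": 2, "c": 3}

def text_to_ids(text):
    ids = []
    if text.startswith("<extra_id_1>"):
        ids.append(0)
        text = text[len("<extra_id_1>") :]
    ids.extend(VOCAB[ch] for ch in text)
    return ids

end_tokens = set()
for end_string in ["abc"]:
    ids_1 = text_to_ids(f"<extra_id_1>{end_string}")  # [0, 1, 2, 3]
    ids_2 = text_to_ids("<extra_id_1>")               # [0]
    if len(ids_1) <= len(ids_2):
        continue
    # Buggy behavior: take the first extra token without verifying that
    # the end string encodes to a single token.
    end_tokens.add(ids_1[len(ids_2) :][0])

print(end_tokens)  # {1}: generation stops on any "a", not only on "abc"
```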
Ah thanks, I understand now. I have a follow-up question: the stop conditions are obtained with

```python
conditions.append(
    any([text.endswith(end_string) for end_string in end_strings] + [p.item() in end_tokens])
)
```

where the tokens in […]. Let's say that […]
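For context, here is a self-contained sketch of how this per-sample condition evaluates over a batch; the texts, token ids, and batch layout below are invented for illustration and only mirror the snippet above:

```python
import torch

# Illustrative batch state (values invented): the decoded text so far and
# the last generated token id for each sample.
texts = ["The answer is 42", "Done.<|endoftext|>"]
last_tokens = torch.tensor([371, 50256])

end_strings = ["<|endoftext|>"]
end_tokens = {50256}  # token ids associated with the end strings

conditions = []
for text, p in zip(texts, last_tokens):
    # A sample may stop if its text ends with an end string OR its last
    # token is one of the end tokens.
    conditions.append(
        any([text.endswith(end_string) for end_string in end_strings]
            + [p.item() in end_tokens])
    )
print(conditions)  # [False, True]
```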
It is not guaranteed.
* Bugfix and optimization in `end_of_generation_condition()`

1. Bugfix

The previous implementation did not verify that `end_string` was encoded into a single token, which could trigger the end of generation earlier than intended (see discussion in NVIDIA#7187).

2. Optimization

The previous implementation scaled linearly with the batch size and quadratically with the length of the generated sequence, which could lead to significant overhead in some situations. The new implementation is much more efficient in "normal" situations (where the end of generation is identified by a set of unique tokens), and raises a warning when it needs to fall back to the inefficient string-matching case.

Note that it does not behave exactly the same as before, because we skip the string comparison when the end strings all have unique tokens associated with them. For instance, in the previous implementation, if the model had generated the string "Some string.<|endoftext|>" (where "<|endoftext|>" would really be generated as a string, and not as a single token), then the previous implementation would have considered it to be the end of generation (assuming `end_strings` has length > 1), while the new one would not. The previous behavior was likely a bug though, since we expect models to generate the special tokens associated with end strings when they exist (for instance, the standard case `end_strings=["<|endoftext|>"]` has always been handled by just comparing the last token to `eod_id`).

Fixes NVIDIA#7187

Signed-off-by: Olivier Delalleau <[email protected]>

* Add warning when model is not in eval mode during generation

Systematically calling `model.eval()` does not seem like a good idea, as it might have side effects leading to unexpected behavior. It would be better to raise an exception if one attempts to generate while in training mode, but this may break existing code => sticking to a warning for now.

Signed-off-by: Olivier Delalleau <[email protected]>

* Use "<extra_id_1>" as prefix string

Signed-off-by: Olivier Delalleau <[email protected]>

* Add TODO for potential failure mode of the string match mechanism

Signed-off-by: Olivier Delalleau <[email protected]>

---------

Signed-off-by: Olivier Delalleau <[email protected]>
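Based on that description, here is a hedged sketch of the fast/slow split; it is not the merged NeMo code, and the function signature, the `end_tokens is None` convention, and the `ids_to_text` tokenizer method are assumptions:

```python
import warnings
import torch

def end_of_generation_condition(tokens, prev, tokenizer, end_strings, end_tokens):
    """Return a bool tensor with True where a sample may stop generating.

    `prev` holds the last generated token id per sample; `end_tokens` is
    the set of single-token ids covering all end strings, or None when
    some end string has no unique token (hypothetical interface).
    """
    if end_tokens is not None:
        # Fast path: O(batch) comparison of the last token against the
        # unique end-of-generation token ids.
        ids = torch.tensor(sorted(end_tokens), device=prev.device)
        return (prev.unsqueeze(-1) == ids).any(dim=-1)
    # Slow fallback: decode each sequence and string-match, which scales
    # with batch size and sequence length -- hence the warning.
    warnings.warn("End strings are not all single tokens: using slow string matching.")
    texts = [tokenizer.ids_to_text(seq.tolist()) for seq in tokens]
    return torch.tensor(
        [any(text.endswith(s) for s in end_strings) for text in texts],
        device=prev.device,
    )
```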
Regarding the code of `end_of_generation_condition()` in NeMo/nemo/collections/nlp/modules/common/text_generation_strategy.py (line 138 at commit 13791d2):
I have two concerns:

1. `<extra_id_1>` is hardcoded in this function -- I thought that the goal of providing `end_strings` as an argument was to avoid having to hardcode special strings.

2. Suppose that `end_strings = ["abc"]`, which is tokenized into [1, 2, 3] (tokens for "a", "b", "c"), while `<extra_id_1>` is tokenized into [0]. It seems to me that this code will lead to `ids_1 = tokenizer.text_to_ids(f'<extra_id_1>abc') = [0, 1, 2, 3]`, `ids_2 = [0]`, and `token_id = ids_1[len(ids_2) :][0] = 1`. As a result, as soon as `p` (the last token generated) is equal to 1 (i.e., as soon as an "a" is generated), we consider that the generation has ended. Am I missing something?