Bugfix and optimization in end_of_generation_condition() #7267

Merged: 4 commits merged into NVIDIA:main from od/endofgen-fixopt on Aug 28, 2023

Conversation

odelalleau (Collaborator)

What does this PR do?

It fixes a bug in end_of_generation_condition() (#7187) and makes it significantly faster in some cases.

Collection: nlp

Changelog

  • Fixed an edge case that could make text generation stop earlier than intended
  • Sped up text generation when using custom end strings

Detailed explanation

  1. Bugfix

The previous implementation did not verify that end_string was encoded
into a single token, which could trigger the end of generation earlier
than intended (see discussion in #7187)

  2. Optimization

The previous implementation was scaling linearly with the batch size and
quadratically with the length of the generated sequence, which could
lead to a significant overhead in some situations.

The new implementation is much more efficient in "normal" situations
(where the end of generation is identified by a set of unique tokens),
and raises a warning when it needs to fall back to the inefficient
string-matching case.

Note that it does not behave exactly the same as before, because we skip
the string comparison when the end strings all have unique tokens
associated to them. For instance, in the previous implementation, if the
model had generated the string
"Some string.<|endoftext|>"
(where "<|endoftext|>" would really be generated as a string, and not as
a single token), then the previous implementation would have considered
it to be the end of generation (assuming end_strings has length > 1),
while the new one would not. The previous behavior was likely a bug
though, since we expect models to generate the special tokens associated
to end strings when they exist (for instance, the standard case
end_strings=["<|endoftext|>"] has always been handled by just
comparing the last token to eod_id).
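As a rough illustration of the fast path and the string-matching fallback described above, here is a minimal sketch (function name, signature, and variable names are illustrative, not the actual NeMo code):

import logging
from typing import List, Optional, Set

def is_generation_done(last_token_ids: List[int],
                       generated_texts: List[str],
                       end_strings: List[str],
                       end_token_ids: Optional[Set[int]]) -> List[bool]:
    # Fast path: every end string maps to a single unique token id, so one
    # membership test per sequence is enough (cost independent of sequence length).
    if end_token_ids is not None:
        return [token_id in end_token_ids for token_id in last_token_ids]
    # Slow path: at least one end string does not map to a single token; decode
    # each sequence and compare suffixes, which scales with the generated length.
    logging.warning("Not all end strings map to single tokens; falling back to string matching.")
    return [any(text.endswith(end) for end in end_strings) for text in generated_texts]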

  3. Minor change

See the commit message of ab56968 for an explanation of the warning that was added in __init__().

Tests

Hopefully there are existing tests on CI -- I have tested this myself on my own jobs.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

Fixes #7187

# tokenizers (e.g., SentencePiece) may prefix the special token with another token associated
# to an empty string. The code below is thus meant to extract the special token associated to
# `end_string` (if it exists). Note that using "This is a sequence." as reference is arbitrary.
ids_ref = tokenizer.text_to_ids("This is a sequence.")
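For context, the reference string is used roughly as follows to recover the single token associated with `end_string`, if any (the helper below is an illustrative sketch, not the PR's exact code):

def extract_end_token_id(tokenizer, end_string, reference="This is a sequence."):
    # Tokenize the reference string alone and with `end_string` appended.
    ids_ref = tokenizer.text_to_ids(reference)
    ids_with_end_string = tokenizer.text_to_ids(reference + end_string)
    # If appending `end_string` added exactly one id and left the prefix intact,
    # that id is the (special) token that can be compared against during generation.
    if len(ids_with_end_string) == len(ids_ref) + 1 and ids_with_end_string[:-1] == ids_ref:
        return ids_with_end_string[-1]
    # Otherwise the caller falls back to string matching (with a warning).
    return None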
Collaborator:

"This is a sequence." is not a good prefix string since the . character might be merged with other characters in the end_string. Ideally we should use some special token so it will always stand alone.

Collaborator:

I was about to comment exactly this, but the code checks ids_with_end_string[:-1] == ids_ref below, and . is usually an independent token in most tokenizers.

Ideally a separate special token might be better, but do we have any token that we can guarantee will always be present?

Collaborator:

Odds-wise, I feel having . as the special token is more likely to work than any other option. Worst case, we do a string check, but that still doesn't give an incorrect answer.

@odelalleau (Collaborator, author), Aug 25, 2023:

I started to write an explanation for my reasoning here, but this brought up a doubt in my mind that is very important to clear up, as it may invalidate a strong assumption made in this PR:

Is it possible that we may want to use an end string associated to a unique token (e.g., <extra_id_1>) and yet expect the model to end a response with the string "<extra_id_1>" but without generating this token?
I assumed no (similar to how we don't stop generation if the model generates the string "<|endoftext|>" instead of eos_id), but can you confirm this is correct?

@odelalleau (author):

"This is a sequence." is not a good prefix string since the . character might be merged with other characters in the end_string. Ideally we should use some special token so it will always stand alone.

So, following offline discussion, there are three cases:

  1. end_string is actually a special token for the tokenizer (e.g. <extra_id_1> typically is, when we use it). Then we are guaranteed that it will be tokenized as a single token, and we will identify it properly.

  2. end_string is not a special token for the tokenizer, and is tokenized into more than one token: in that case we don't care whether or not it is merged with the preceding ., because we will need to use string comparisons anyway.

  3. end_string is not a special token for the tokenizer, and is tokenized into a single token: in that case either we identify the single token and rely on token comparison, or we don't (because of the tokenizer merging the .) and we fall back to string matching. I would argue that the latter is safer, because (a) it will ensure we always end generation correctly, and (b) it shows a warning that may alert the user that something may be off (there's a good chance they didn't expect the tokenizer to merge end_string with other characters). And if the tokenizer merges the . with end_string it seems particularly important to be aware of it, since this sequence of characters is likely to be quite common to finish responses. So it seems to me that it's actually a better option than <extra_id_1> (though I admit I hadn't thought it through before ;))

Collaborator:

The assumption is that, whether end_string is a special token or not, the generation will end with it inclusively; it is up to the tokenizer's ids_to_text method to decide whether to show it. I think <extra_id_1> is a better prefix: if it is a special token, it will work as expected, and if it is not, the > is less likely to be merged with the end_string, which is the cleaner case you mentioned in 3. If '.' is merged with part of the end string during generation, the string match might not capture it correctly. E.g. if end_string is "hello" and '.' is merged into 'hel', so the generated tokens are '.hel', 'lo', it won't trigger the string match check because we use the endswith method.

@odelalleau (author):

Alright, let's go with <extra_id_1>, I don't think it's a big deal anyway, it shouldn't matter much in practice. This is done in c9a6d71 (I also rebased on top of main, but there are no changes in previous commits).

I still want to address the point below since a similar situation may still happen with <extra_id_1>:

> If '.' is merged with part of the end string during generation, the string match might not capture it correctly. E.g. if end_string is "hello" and '.' is merged into 'hel', so the generated tokens are '.hel', 'lo', it won't trigger the string match check because we use the endswith method.

If '.' is merged with 'hel', then this will trigger the string match because the second condition of this check won't be satisfied (we check both that there are N+1 tokens and also that the first N tokens are the same):

if len(ids_with_end_string) == len(ids_ref) + 1 and ids_with_end_string[:-1] == ids_ref:

As a result, there will be a warning displayed, and the comparison will be made with text.endswith("hello") which will match any generation ending with "hello", regardless of what tokens this corresponds to.
Note that there could still be situations where the model generates "hello" without generation ending, e.g. if it generates tokens ".hel" followed by "lo world". But this case was not handled previously either, and it is unclear that we should stop there (since we can't truncate the model output mid-token, the response would actually end with "hello world" rather than "hello").
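To make that concrete, a tiny example with invented token ids (end_string = "hello", reference "This is a sequence."):

ids_ref = [100, 101, 102, 103, 9]                    # ..., "sequence", "."
ids_with_end_string = [100, 101, 102, 103, 70, 71]   # ..., "sequence", ".hel", "lo"
# The length grew by 2 and the prefix differs, so the check above is False:
# a warning is emitted and text.endswith("hello") is used for this end string.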

Collaborator:

Yeah, I think it is a fundamental problem with the string-match method. Maybe the workaround is not to use the endswith method and instead do post-truncation of the extra characters after the generation stops. The string-match method is already a hack anyway; maybe add a TODO comment and we can come back to this in the future if we see it cause any problems.
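One possible shape of that post-truncation workaround, purely as a hedged sketch of the suggestion (not something this PR implements):

from typing import List

def truncate_after_end_string(text: str, end_strings: List[str]) -> str:
    # Cut the generated text right after the earliest occurrence of any end string,
    # instead of requiring the text to end exactly with one of them.
    cut = len(text)
    for end_string in end_strings:
        idx = text.find(end_string)
        if idx != -1:
            cut = min(cut, idx + len(end_string))
    return text[:cut]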

@odelalleau (author):

> Maybe add a TODO comment and we can come back to this in the future if we see it cause any problems.

Good idea, added in 0753075

@aklife97 (Collaborator) left a comment:

LGTM! Everything looks good to me. There are some pending comments by Yi, but once we feel those are resolved I think we should be good to merge.

Commit: Bugfix and optimization in `end_of_generation_condition()`
Commit ab56968: Add warning when model is not in eval mode during generation

Systematically calling `model.eval()` does not seem like a good idea, as it might have side effects leading to unexpected behavior. It would be better to raise an exception if one attempts to generate while in training mode, but this may break existing code, so we stick to a warning for now.

Signed-off-by: Olivier Delalleau <[email protected]>
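For reference, the warning described above could look roughly like this (an assumed sketch; the helper name and message wording are illustrative, not the actual NeMo code):

import logging
import torch.nn as nn

def warn_if_training(model: nn.Module) -> None:
    # Warn (rather than silently calling model.eval() or raising) when generation
    # is requested while the model is still in training mode.
    if model.training:
        logging.warning(
            "Generation requested while the model is in training mode; "
            "consider calling model.eval() first to avoid unexpected behavior."
        )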
@yidong72 previously approved these changes Aug 28, 2023
@yidong72 (Collaborator) left a comment:

LGTM. Thanks for refining the end-of-generation logic.

@aklife97 (Collaborator) left a comment:

LGTM, thank you!

@yidong72 merged commit 6861215 into NVIDIA:main Aug 28, 2023
10 of 11 checks passed
@odelalleau deleted the od/endofgen-fixopt branch August 28, 2023 18:30
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* Bugfix and optimization in `end_of_generation_condition()`

* Add warning when model is not in eval mode during generation

* Use "<extra_id_1>" as prefix string

* Add TODO for potential failure mode of the string match mechanism

Signed-off-by: Olivier Delalleau <[email protected]>