Fix finetuner attention masking. #146
Merged
Attention Masks
All training configurations were broken, except for training a model in mixed precision mode ("fp16") with DeepSpeed enabled, because the tokenized dataset loader was outputting its accompanying attention masks as all zeroes, meaning "ignore all training data."
A bug fixed in version 4.21.0 of the transformers library had allowed models with sufficiently extreme parameters to ignore the attention mask, which was the only way this code was previously able to function. Updating that dependency exposed the all-zero masks.
DeepSpeed's mixed precision mode seemingly reintroduced the bug that had previously allowed this code to function, which let it continue to sort of work under select circumstances even with the updated transformers library.
This PR fixes the attention masks for training and inference such that they mask away only the padding token ID.
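As a rough sketch of the intended behavior (assuming a Hugging Face tokenizer; the names below are illustrative, not the finetuner's actual code), the mask should be 1 everywhere except at padding positions:

```python
# Minimal sketch of the intended masking behavior; the tokenizer choice and
# variable names are illustrative, not this repository's actual code.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding=True,
    return_tensors="pt",
)
input_ids = batch["input_ids"]

# Correct: mask away only the positions that hold the padding token ID.
attention_mask = (input_ids != tokenizer.pad_token_id).long()

# Broken (previous behavior): an all-zero mask tells the model to ignore
# every position, i.e. "ignore all training data."
broken_mask = torch.zeros_like(input_ids)
```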
Special Tokens
Attention masking is intended to prevent accidentally training the model on padding tokens. To generate correct attention masks at training time, it is necessary to know the `pad_token` ID used in the tokenizer. The current implementation of the dataset tokenizer chooses the first available of:
- `<|endoftext|>`

The default values for `eos_token` and `pad_token` in this code are now resolved in a matching way.

Additionally, the command line arguments `--eot ''` and `--pad ''` (the Argo Workflow's defaults) now refer to the aforementioned default-picking algorithm rather than being taken literally. Before, they were resolving to whatever special token the tokenizer interpreted `''` to be.
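The full candidate list is only partially reproduced above, so the following is just a hedged sketch of "first available, with an empty CLI argument meaning use the default" resolution; the helper name, argument names, and fallback order are assumptions rather than this PR's actual code.

```python
# Hypothetical sketch of the default-picking behavior described above.
# The helper name, candidate order, and fallback list are assumptions.
def resolve_pad_token(tokenizer, pad_arg: str = "") -> str:
    # An empty --pad '' (the Argo Workflow default) means "pick a default",
    # not "use the empty string as a literal token".
    if pad_arg:
        return pad_arg
    for candidate in (tokenizer.pad_token, tokenizer.eos_token, "<|endoftext|>"):
        if candidate:  # first available wins
            return candidate
    raise ValueError("no usable padding token could be resolved")


# The pad token ID needed for attention masking then follows from the token:
# pad_token_id = tokenizer.convert_tokens_to_ids(resolve_pad_token(tokenizer, args.pad))
```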
Code Cleanup
Highlights:
- `model.half()` was removed, since it isn't right for mixed-precision training.
- `no_init()` was changed to a context manager that covers more steps of model instantiation (a generic sketch of the pattern follows below).
- … `ds_config.json` and in the logic to disable the WandB integration.
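For context on the second item, a skip-initialization context manager usually follows the pattern sketched here; this is a generic version of the idea under assumed details, and the repository's `no_init()` may differ.

```python
import contextlib
import torch


@contextlib.contextmanager
def no_init():
    """Temporarily turn torch's in-place weight initializers into no-ops so a
    large model can be instantiated quickly and then overwritten by checkpoint
    weights. Generic sketch; the actual no_init() in this repo may differ."""
    patched = ("kaiming_uniform_", "uniform_", "normal_")
    saved = {name: getattr(torch.nn.init, name) for name in patched}
    try:
        for name in patched:
            # Replace with a function that returns the tensor unchanged.
            setattr(torch.nn.init, name, lambda tensor, *args, **kwargs: tensor)
        yield
    finally:
        for name, fn in saved.items():
            setattr(torch.nn.init, name, fn)


# Usage sketch: instantiate under the context manager, then load real weights.
# with no_init():
#     model = AutoModelForCausalLM.from_config(config)
# model.load_state_dict(state_dict)
```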