Conversation

@Eta0 Eta0 commented Feb 16, 2023

Attention Masks

All training configurations were broken except for training a model in mixed-precision ("fp16") mode with DeepSpeed enabled, because the tokenized dataset loader was emitting its accompanying attention masks as all zeroes, meaning "ignore all training data."
A bug fixed in version 4.21.0 of the transformers library had allowed models with sufficiently extreme parameters to ignore the attention mask entirely, which was the only way this code had previously been able to function; updating that dependency exposed this bug.

DeepSpeed's mixed-precision mode apparently reintroduced the bug that had previously allowed this code to function, which let it continue to partially work under select circumstances even with the updated transformers library.

This PR fixes the attention masks for training and inference such that they mask away only the padding token ID.
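
For reference, the fix amounts to deriving the mask from the padding token ID rather than emitting zeroes everywhere. A minimal sketch of that behavior (the token IDs and variable names below are illustrative, not taken from the finetuner's code):

```python
import torch

# Illustrative values: assume the batch is already padded to a fixed length
# and pad_token_id is the ID the dataset tokenizer actually padded with
# (50256 is <|endoftext|> for GPT-2-style vocabularies).
pad_token_id = 50256
input_ids = torch.tensor([[101, 202, 303, 50256, 50256]])

# Mask away only the padding positions (1 = attend, 0 = ignore),
# instead of the previous all-zero mask that ignored every token.
attention_mask = (input_ids != pad_token_id).long()
# -> tensor([[1, 1, 1, 0, 0]])
```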

Special Tokens

Attention masking is intended to prevent accidentally training the model on padding tokens. To generate correct attention masks at training time, it is necessary to know the pad_token ID used in the tokenizer.
The current implementation of the dataset tokenizer chooses the first available of:

  1. An explicitly specified padding token string,
  2. The model's default padding token, from the Hugging Face model config, or
  3. <|endoftext|>

The default values for eos_token and pad_token in this code are now resolved in a matching way.

Additionally, the command line arguments --eot '' and --pad '' (the Argo Workflow's defaults) now invoke the default-picking algorithm described above rather than being taken literally. Previously, they resolved to whatever special token the tokenizer happened to interpret '' as.
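
The resolution order described above amounts to roughly the following. This is a hedged sketch, not the script's actual code; the function name, arguments, and use of AutoTokenizer are stand-ins:

```python
from transformers import AutoTokenizer

def resolve_pad_token(explicit_pad: str, model_name: str) -> str:
    """Explicit argument > model default (if one exists) > "<|endoftext|>"."""
    if explicit_pad:
        # An explicitly specified padding token string wins outright;
        # '' (the Argo Workflow default) falls through to the defaults below.
        return explicit_pad
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is not None:
        # The model's own default padding token, if it defines one.
        return tokenizer.pad_token
    return "<|endoftext|>"
```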

Code Cleanup

Highlights:

  • Explicit casting of models to half precision via model.half() was removed, since it is not appropriate for mixed-precision training (see the sketch after this list).
  • no_init() was changed to a context manager that covers more steps of model instantiation.
  • Logging was changed to consistently use units of mebibytes ("MiB").
    • Previously, three different (but similar) units, all labeled "mb", were used in different parts of the code ($10^{6}$ bytes, $2^{10} \times 10^{3}$ bytes, and $2^{20}$ bytes).
  • Replaced usages of deprecated aliases in ds_config.json and in the logic that disables the WandB integration.
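
Regarding the removed model.half() cast, the usual alternative looks roughly like the following; the model name is a placeholder and this is not the repo's exact loading code:

```python
import torch
from transformers import AutoModelForCausalLM

MODEL = "EleutherAI/gpt-neo-125M"  # placeholder model name

# Mixed-precision training: keep the fp32 master weights and let the
# fp16 engine (DeepSpeed or the trainer's AMP support) handle the casts,
# rather than eagerly calling model.half().
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Pure fp16 inference: load the weights directly in half precision
# instead of constructing in fp32 and casting afterwards.
model_fp16 = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
```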

Eta0 added 4 commits February 15, 2023 18:50
Most training configurations were broken because the tokenized dataset
loader was outputting its accompanying attention masks as all zeroes,
meaning "ignore all training data."

A bug fixed in version 4.21.0 of the transformers library had allowed
models with sufficiently extreme parameters to ignore
the attention mask, which was the only way this code
was previously able to function.

DeepSpeed's mixed precision mode seemed to have reintroduced this bug,
which additionally allowed it to function under select circumstances
even with an updated transformers library.

This commit fixes the attention masks for training and inference
such that they mask away only the padding token ID.
### Tokens
This resolves default values for eos_token and pad_token in a way that
matches the dataset tokenizer, with the following priority:
Command line argument > model default (if one exists) > "<|endoftext|>".

Additionally, the command line arguments --eot '' and --pad ''
are now correctly interpreted as referring to the default
eos_token and pad_token, rather than being interpreted literally as '',
which would in most cases resolve to another special token internally.

The resolution of a correct pad_token (with an ID matching the padding
token used in the tokenized dataset) is necessary for attention masking
to work correctly during training.

### 16-bit model loading
Explicitly casting models to half-precision as model.half() is
not preferred for mixed-precision training, and throws an error
during training if attempted without DeepSpeed.
It doesn't appear to even save VRAM with DeepSpeed enabled,
so this cast is removed.

### Misc.
Additionally, no_init() is changed to a context manager that covers
more steps of model instantiation.
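
For context on the no_init() change: such a context manager is typically built by temporarily turning torch's in-place initializers into no-ops, so that module construction inside the block skips random weight initialization (the weights are overwritten by the checkpoint load anyway). The sketch below shows the general pattern as an assumption, not this repo's exact implementation:

```python
import contextlib
import torch

@contextlib.contextmanager
def no_init():
    # Temporarily replace common in-place initializers with no-ops so that
    # modules constructed inside the block skip random weight initialization.
    names = ("uniform_", "normal_", "kaiming_uniform_", "kaiming_normal_")
    saved = {n: getattr(torch.nn.init, n) for n in names}
    try:
        for n in names:
            setattr(torch.nn.init, n, lambda tensor, *args, **kwargs: tensor)
        yield
    finally:
        for n, fn in saved.items():
            setattr(torch.nn.init, n, fn)

# Usage sketch: wrap more of model instantiation, not just a single call.
# with no_init():
#     model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
```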
@Eta0 Eta0 added the python Pull requests that update Python code label Feb 16, 2023
@Eta0 Eta0 requested a review from wbrown February 16, 2023 02:37

@wbrown wbrown left a comment

Very nicely done. 👍

@wbrown wbrown merged commit a90c8b6 into wbrown.finetuning-sampling Feb 16, 2023
@wbrown wbrown deleted the eta.fix-finetuner branch February 16, 2023 16:49