Conversation

@Eta0 Eta0 commented Feb 16, 2023

Attention Masks

All training configurations were broken except for training a model in mixed-precision ("fp16") mode with DeepSpeed enabled, because the tokenized dataset loader was emitting its accompanying attention masks as all zeroes, meaning "ignore all training data."
A bug fixed in version 4.21.0 of the transformers library had allowed models with sufficiently extreme parameters to ignore the attention mask entirely, which was the only way this code had previously been able to function; updating that dependency exposed this bug.

DeepSpeed's mixed-precision mode apparently reintroduced the bug that had previously allowed this code to function, which let it continue to partially work under select circumstances even with the updated transformers library.

This PR fixes the attention masks for training and inference such that they mask away only the padding token ID.
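
For reference, the fix amounts to deriving the mask from the padding token ID rather than emitting zeroes everywhere. A minimal sketch of that behavior (the token IDs and variable names below are illustrative, not taken from the finetuner's code):

```python
import torch

# Illustrative values: assume the batch is already padded to a fixed length
# and pad_token_id is the ID the dataset tokenizer actually padded with
# (50256 is <|endoftext|> for GPT-2-style vocabularies).
pad_token_id = 50256
input_ids = torch.tensor([[101, 202, 303, 50256, 50256]])

# Mask away only the padding positions (1 = attend, 0 = ignore),
# instead of the previous all-zero mask that ignored every token.
attention_mask = (input_ids != pad_token_id).long()
# -> tensor([[1, 1, 1, 0, 0]])
```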

Special Tokens

Attention masking is intended to prevent accidentally training the model on padding tokens. To generate correct attention masks at training time, it is necessary to know the pad_token ID used in the tokenizer.
The current implementation of the dataset tokenizer chooses the first available of:

  1. An explicitly specified padding token string,
  2. The model's default padding token, from the Hugging Face model config, or
  3. <|endoftext|>

The default values for eos_token and pad_token in this code are now resolved in a matching way.

Additionally, the command line arguments --eot '' and --pad '' (the Argo Workflow's defaults) now invoke the default-picking algorithm described above rather than being taken literally. Previously, they resolved to whatever special token the tokenizer happened to interpret '' as.
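
The resolution order described above amounts to roughly the following. This is a hedged sketch, not the script's actual code; the function name, arguments, and use of AutoTokenizer are stand-ins:

```python
from transformers import AutoTokenizer

def resolve_pad_token(explicit_pad: str, model_name: str) -> str:
    """Explicit argument > model default (if one exists) > "<|endoftext|>"."""
    if explicit_pad:
        # An explicitly specified padding token string wins outright;
        # '' (the Argo Workflow default) falls through to the defaults below.
        return explicit_pad
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is not None:
        # The model's own default padding token, if it defines one.
        return tokenizer.pad_token
    return "<|endoftext|>"
```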

Code Cleanup

Highlights:

  • Explicit casting of models to half precision via model.half() was removed, since it is not appropriate for mixed-precision training (see the sketch after this list).
  • no_init() was changed to a context manager that covers more steps of model instantiation.
  • Logging was changed to consistently use units of mebibytes ("MiB").
    • Previously, three different (but similar) units, all labeled "mb", were used in different parts of the code ($10^{6}$ bytes, $2^{10} \times 10^{3}$ bytes, and $2^{20}$ bytes).
  • Replaced usages of deprecated aliases in ds_config.json and in the logic that disables the WandB integration.
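
Regarding the removed model.half() cast, the usual alternative looks roughly like the following; the model name is a placeholder and this is not the repo's exact loading code:

```python
import torch
from transformers import AutoModelForCausalLM

MODEL = "EleutherAI/gpt-neo-125M"  # placeholder model name

# Mixed-precision training: keep the fp32 master weights and let the
# fp16 engine (DeepSpeed or the trainer's AMP support) handle the casts,
# rather than eagerly calling model.half().
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Pure fp16 inference: load the weights directly in half precision
# instead of constructing in fp32 and casting afterwards.
model_fp16 = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
```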

Eta0 added 4 commits February 15, 2023 18:50
Most training configurations were broken because the tokenized dataset
loader was outputting its accompanying attention masks as all zeroes,
meaning "ignore all training data."

A bug fixed in version 4.21.0 of the transformers library had allowed
models with sufficiently extreme parameters to ignore
the attention mask, which was the only way this code
was previously able to function.

DeepSpeed's mixed precision mode seemed to have reintroduced this bug,
which additionally allowed it to function under select circumstances
even with an updated transformers library.

This commit fixes the attention masks for training and inference
such that they mask away only the padding token ID.
### Tokens
This resolves default values for eos_token and pad_token in a way that
matches the dataset tokenizer, with the following priority:
Command line argument > model default (if one exists) > "<|endoftext|>".

Additionally, the command line arguments --eot '' and --pad ''
are now correctly interpreted as referring to the default
eos_token and pad_token, rather than being interpreted literally as '',
which would in most cases resolve to another special token internally.

The resolution of a correct pad_token (with an ID matching the padding
token used in the tokenized dataset) is necessary for attention masking
to work correctly during training.

### 16-bit model loading
Explicitly casting models to half-precision as model.half() is
not preferred for mixed-precision training, and throws an error
during training if attempted without DeepSpeed.
It doesn't appear to even save VRAM with DeepSpeed enabled,
so this cast is removed.

### Misc.
Additionally, no_init() is changed to a context manager that covers
more steps of model instantiation.
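
For context on the no_init() change: such a context manager is typically built by temporarily turning torch's in-place initializers into no-ops, so that module construction inside the block skips random weight initialization (the weights are overwritten by the checkpoint load anyway). The sketch below shows the general pattern as an assumption, not this repo's exact implementation:

```python
import contextlib
import torch

@contextlib.contextmanager
def no_init():
    # Temporarily replace common in-place initializers with no-ops so that
    # modules constructed inside the block skip random weight initialization.
    names = ("uniform_", "normal_", "kaiming_uniform_", "kaiming_normal_")
    saved = {n: getattr(torch.nn.init, n) for n in names}
    try:
        for n in names:
            setattr(torch.nn.init, n, lambda tensor, *args, **kwargs: tensor)
        yield
    finally:
        for n, fn in saved.items():
            setattr(torch.nn.init, n, fn)

# Usage sketch: wrap more of model instantiation, not just a single call.
# with no_init():
#     model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
```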
@Eta0 Eta0 added the python Pull requests that update Python code label Feb 16, 2023
@Eta0 Eta0 requested a review from wbrown February 16, 2023 02:37

@wbrown wbrown left a comment

Very nicely done. 👍

@wbrown wbrown merged commit a90c8b6 into wbrown.finetuning-sampling Feb 16, 2023
@wbrown wbrown deleted the eta.fix-finetuner branch February 16, 2023 16:49