feat: add validation split, wandb logging, multigpu compatibility #49
Conversation
@othertea -- Looks great overall! Our main dataset for training will have a fixed train-val-test split, so I would slightly modify `train_val_test_split` by adding an if statement at the beginning so that:
- If "train", "val", and "test" are already keys of the input dataset, we keep those
- If there is only "train", then we keep the logic in the current version of `train_val_test_split`
- Otherwise we throw an exception (similar to what you have here) -- see the sketch after this list
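A minimal sketch of the suggested branching, assuming the input is a Hugging Face `DatasetDict`; the fraction parameters and the error message are illustrative, not the repository's actual signature:

```python
from datasets import DatasetDict

def train_val_test_split(dataset: DatasetDict, val_frac: float = 0.1, test_frac: float = 0.1) -> DatasetDict:
    splits = set(dataset.keys())
    if {"train", "val", "test"} <= splits:
        # A fixed split already ships with the dataset: keep it as-is.
        return dataset
    if splits == {"train"}:
        # Only "train" exists: carve out val and test (the current logic).
        train_valtest = dataset["train"].train_test_split(test_size=val_frac + test_frac)
        val_test = train_valtest["test"].train_test_split(test_size=test_frac / (val_frac + test_frac))
        return DatasetDict({
            "train": train_valtest["train"],
            "val": val_test["train"],
            "test": val_test["test"],
        })
    raise ValueError(f"Unexpected splits in dataset: {sorted(splits)}")
```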
Thanks for the review @pascalnotin! I've updated the PR with the requested changes, let me know if it looks good.
Thanks @othertea! All looks good, except perhaps this line: https://github.com/OpenBioML/protein-lm-scaling/pull/49/files#diff-3a431ad4e3eb2be8f601ceac9e42bf2166f4f220ff2291198220be1da7b8a077R103
Shouldn't it instead be `split_dict["test"] = train_valtest["test"]`?
@pascalnotin yes, thank you for catching that! It's been fixed.
Thank you @othertea -- looks good to me, merged the PR into main.
Summary
This PR contains a variety of updates to the training process, mainly:
- a validation split
- wandb logging
- multi-GPU compatibility
Additional changes
In addition to the above, other changes include:
- `get_training_args` returns a Hugging Face `TrainingArguments` instead of a custom pydantic `BaseModel`, which had been suggested in feat: initial restructure of training script #26 (comment). We can still turn it back into a `BaseModel` later, when we start going beyond what is supported by HuggingFace, but I kept having to use more `TrainingArguments` fields, which required adding each of them into our custom `BaseModel` (see the sketch after this list).
- Set `ddp_find_unused_parameters` to `false` in the `toy_hf.yaml` config to avoid the message:
  > Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
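A hedged sketch of what the updated `get_training_args` might look like, assuming a plain config dict loaded from `toy_hf.yaml`; the config keys and defaults are illustrative, while the keyword arguments are real `TrainingArguments` fields:

```python
from transformers import TrainingArguments

def get_training_args(config: dict) -> TrainingArguments:
    # Return the Hugging Face class directly rather than wrapping the
    # same fields in a custom pydantic BaseModel.
    return TrainingArguments(
        output_dir=config["output_dir"],
        per_device_train_batch_size=config.get("per_device_train_batch_size", 8),
        num_train_epochs=config.get("num_train_epochs", 1),
        report_to=["wandb"],               # wandb logging
        ddp_find_unused_parameters=False,  # silences the DDP warning quoted above
    )
```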
Yet to do
- Address the warning:
  > Parameter 'function'=<function get_dataset.<locals>.<lambda> at 0x7f51080e2d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

  which I believe will be resolved by merging Adding Huggingface compatible tokenizers #48 and properly incorporating that tokenizer (sketched below).
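A minimal sketch of the caching issue, assuming the dataset is tokenized via `Dataset.map`; the column name and the `gpt2` tokenizer are stand-ins for the project's own:

```python
from datasets import Dataset
from transformers import AutoTokenizer

dataset = Dataset.from_dict({"sequence": ["MKTAYIAK", "GAVLILLV"]})

# Mapping a lambda that closes over a non-serializable tokenizer cannot be
# fingerprinted, so datasets falls back to a random hash and recomputes the
# transform on every run:
#   dataset.map(lambda ex: custom_tokenizer(ex["sequence"]))

# With a picklable, Hugging Face-compatible tokenizer (the goal of #48),
# the transform hashes deterministically and the mapped result is cached:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(example):
    return tokenizer(example["sequence"])

dataset = dataset.map(tokenize)
```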