
feat: add validation split, wandb logging, multigpu compatibility #49

Merged: 12 commits into OpenBioML:main, Sep 24, 2023

Conversation

@othertea (Collaborator) commented Sep 11, 2023

Summary

This PR contains a variety of updates to the training process, mainly:

  • Provides a way to create and use an (optional) randomly chosen validation/test split. We can update this later once we have non-random splits, as planned in Create cross-validation split on input dataset #18.
  • Provides a way to log metrics to Weights & Biases.
  • Makes some changes needed for single-node multi-GPU training.
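As a rough illustration of the first bullet, the random split can be sketched in plain Python (function and parameter names here are illustrative, not the PR's actual code; the repo presumably operates on HuggingFace datasets rather than lists):

```python
import random

def random_train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Randomly partition a list of examples into train/val/test subsets."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)  # seeded shuffle for reproducibility
    n_val = int(len(examples) * val_frac)
    n_test = int(len(examples) * test_frac)
    return {
        "val": [examples[i] for i in idx[:n_val]],
        "test": [examples[i] for i in idx[n_val:n_val + n_test]],
        "train": [examples[i] for i in idx[n_val + n_test:]],
    }

splits = random_train_val_test_split(list(range(100)))
print({k: len(v) for k, v in splits.items()})
# {'val': 10, 'test': 10, 'train': 80}
```

With a HuggingFace `datasets.Dataset`, the same effect can be had with its built-in `train_test_split(test_size=..., seed=...)` applied twice.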

Additional changes

Beyond the above, the additional changes include:

  • HuggingFace datasets are no longer streamed.
  • get_training_args now returns a HuggingFace TrainingArguments instead of a custom pydantic BaseModel, as had been suggested in feat: initial restructure of training script #26 (comment). We can turn it back into a BaseModel later once we go beyond what HuggingFace supports, but I kept needing more TrainingArguments fields, each of which then had to be added to our custom BaseModel.
  • The README now has directions for launching a single-node multi-GPU job on the Stability cluster, which I have verified runs. Note that this is still a toy run.
  • ddp_find_unused_parameters is set to false in the toy_hf.yaml config to avoid this message: "Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())"
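For context, a hypothetical excerpt of such a toy config might look like the following (the field names and layout are assumptions for illustration, not the repo's actual toy_hf.yaml schema; `report_to` and `ddp_find_unused_parameters` are real HuggingFace TrainingArguments fields):

```yaml
# Illustrative sketch only -- not the actual toy_hf.yaml
training_arguments:
  output_dir: checkpoints/toy_run
  report_to: wandb                   # log metrics to Weights & Biases
  ddp_find_unused_parameters: false  # skip the extra autograd-graph traversal in DDP
```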

Yet to do

  • Training with this new code prints
    "Parameter 'function'=<function get_dataset.<locals>.<lambda> at 0x7f51080e2d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed."
    which I believe will be resolved by merging adding Huggingface compatible tokenizers #48 and properly incorporating that tokenizer.
  • This PR does not currently contain a config yaml to do a proper run, only toy runs.
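As background on that warning: `datasets.Dataset.map` fingerprints its transform by pickling it, and a lambda defined inside a function (as in `get_dataset.<locals>.<lambda>`) cannot be pickled, so caching falls back to a random hash. One common fix is moving the transform to a module-level function; a minimal sketch, with illustrative names that are not the repo's actual code:

```python
# A module-level function pickles cleanly, unlike a locally defined lambda,
# so `Dataset.map` can fingerprint and cache its results deterministically.
def add_lengths(batch):
    # Stand-in for the real tokenization transform.
    return {"length": [len(seq) for seq in batch["sequence"]]}

# Instead of:  dataset.map(lambda batch: ..., batched=True)
# prefer:      dataset.map(add_lengths, batched=True)
print(add_lengths({"sequence": ["ACDE", "FG"]}))  # {'length': [4, 2]}
```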

@othertea othertea marked this pull request as ready for review September 11, 2023 03:01

@pascalnotin (Collaborator) left a comment:


@othertea -- Looks great overall! Our main dataset for training will have a fixed train-val-test split, so I would slightly modify `train_val_test_split` by adding an if statement at the beginning so that:

  1. If "train", "val", and "test" are already keys of the input dataset, we keep those splits.
  2. If only "train" is present, we keep the logic in the current version of train_val_test_split.
  3. Otherwise, we throw an exception (similar to what you have here).
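The suggested branching can be sketched as follows. The function name matches the PR's `train_val_test_split`, but the split-key handling, fraction parameters, and the `_random_split` helper are assumptions standing in for the repo's actual code:

```python
import random

def _random_split(examples, val_frac, test_frac, seed):
    """Stand-in for the PR's existing random-split logic."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_val, n_test = int(len(idx) * val_frac), int(len(idx) * test_frac)
    take = lambda sl: [examples[i] for i in sl]
    return {"val": take(idx[:n_val]),
            "test": take(idx[n_val:n_val + n_test]),
            "train": take(idx[n_val + n_test:])}

def train_val_test_split(dataset_dict, val_frac=0.1, test_frac=0.1, seed=0):
    keys = set(dataset_dict)
    if {"train", "val", "test"} <= keys:
        # Case 1: fixed splits already provided -- keep them as-is.
        return dataset_dict
    if keys == {"train"}:
        # Case 2: only "train" present -- fall back to the random split.
        return _random_split(dataset_dict["train"], val_frac, test_frac, seed)
    # Case 3: anything else is ambiguous -- fail loudly.
    raise ValueError(f"Unexpected dataset splits: {sorted(keys)}")
```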

@othertea (Collaborator, Author):

Thanks for the review, @pascalnotin! I've updated the PR with the requested changes; let me know if it looks good.


@pascalnotin (Collaborator) left a comment:


Thanks @othertea ! All looks good, except perhaps this line: https://github.com/OpenBioML/protein-lm-scaling/pull/49/files#diff-3a431ad4e3eb2be8f601ceac9e42bf2166f4f220ff2291198220be1da7b8a077R103

Shouldn't it instead be: `split_dict["test"] = train_valtest["test"]`?

@othertea (Collaborator, Author):

@pascalnotin yes, thank you for catching that! It's been fixed.

@pascalnotin pascalnotin merged commit 1289b12 into OpenBioML:main Sep 24, 2023
@pascalnotin (Collaborator):

Thank you @othertea -- looks good to me, merged PR into main.
