feat: add validation split, wandb logging, multigpu compatibility #49
Conversation
@othertea -- Looks great overall! Our main dataset for training will have a fixed train-val-test split, so I would slightly modify `train_val_test_split` by adding an if statement at the beginning so that:
- If "train", "val", and "test" are already keys of the input dataset, we keep those
- If there is only "train", then we keep the logic in the current version of `train_val_test_split`
- Otherwise we throw an exception (similar to what you have here) -- see the sketch after this list
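A minimal sketch of the suggested branching, assuming the input is a Hugging Face `DatasetDict`; the fraction parameters and the error message are illustrative, not the repository's actual signature:

```python
from datasets import DatasetDict

def train_val_test_split(dataset: DatasetDict, val_frac: float = 0.1, test_frac: float = 0.1) -> DatasetDict:
    splits = set(dataset.keys())
    if {"train", "val", "test"} <= splits:
        # A fixed split already ships with the dataset: keep it as-is.
        return dataset
    if splits == {"train"}:
        # Only "train" exists: carve out val and test (the current logic).
        train_valtest = dataset["train"].train_test_split(test_size=val_frac + test_frac)
        val_test = train_valtest["test"].train_test_split(test_size=test_frac / (val_frac + test_frac))
        return DatasetDict({
            "train": train_valtest["train"],
            "val": val_test["train"],
            "test": val_test["test"],
        })
    raise ValueError(f"Unexpected splits in dataset: {sorted(splits)}")
```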
Thanks for the review @pascalnotin! I've updated the PR with the requested changes, let me know if it looks good.
Thanks @othertea! All looks good, except perhaps this line: https://github.com/OpenBioML/protein-lm-scaling/pull/49/files#diff-3a431ad4e3eb2be8f601ceac9e42bf2166f4f220ff2291198220be1da7b8a077R103
Shouldn't it instead be `split_dict["test"] = train_valtest["test"]`?
@pascalnotin yes, thank you for catching that! It's been fixed.
Thank you @othertea -- looks good to me, merged the PR into main.
Summary
This PR contains a variety of updates to the training process, mainly:
- a validation split
- wandb logging
- multi-GPU compatibility
Additional changes
In addition to the above, other changes include:
- `get_training_args` returns a Hugging Face `TrainingArguments` instead of a custom pydantic `BaseModel`, which had been suggested in feat: initial restructure of training script #26 (comment). We can still turn it back into a `BaseModel` later, when we start going beyond what is supported by HuggingFace, but I kept having to use more `TrainingArguments` fields, which required adding each of them into our custom `BaseModel` (see the sketch after this list).
- Set `ddp_find_unused_parameters` to `false` in the `toy_hf.yaml` config to avoid the message:
  > Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
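A hedged sketch of what the updated `get_training_args` might look like, assuming a plain config dict loaded from `toy_hf.yaml`; the config keys and defaults are illustrative, while the keyword arguments are real `TrainingArguments` fields:

```python
from transformers import TrainingArguments

def get_training_args(config: dict) -> TrainingArguments:
    # Return the Hugging Face class directly rather than wrapping the
    # same fields in a custom pydantic BaseModel.
    return TrainingArguments(
        output_dir=config["output_dir"],
        per_device_train_batch_size=config.get("per_device_train_batch_size", 8),
        num_train_epochs=config.get("num_train_epochs", 1),
        report_to=["wandb"],               # wandb logging
        ddp_find_unused_parameters=False,  # silences the DDP warning quoted above
    )
```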
Yet to do
- Address the warning:
  > Parameter 'function'=<function get_dataset.<locals>.<lambda> at 0x7f51080e2d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

  which I believe will be resolved by merging Adding Huggingface compatible tokenizers #48 and properly incorporating that tokenizer (sketched below).
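A minimal sketch of the caching issue, assuming the dataset is tokenized via `Dataset.map`; the column name and the `gpt2` tokenizer are stand-ins for the project's own:

```python
from datasets import Dataset
from transformers import AutoTokenizer

dataset = Dataset.from_dict({"sequence": ["MKTAYIAK", "GAVLILLV"]})

# Mapping a lambda that closes over a non-serializable tokenizer cannot be
# fingerprinted, so datasets falls back to a random hash and recomputes the
# transform on every run:
#   dataset.map(lambda ex: custom_tokenizer(ex["sequence"]))

# With a picklable, Hugging Face-compatible tokenizer (the goal of #48),
# the transform hashes deterministically and the mapped result is cached:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(example):
    return tokenizer(example["sequence"])

dataset = dataset.map(tokenize)
```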