Hyperparameter Optimization Initialization #92

weiji14 · 2019-02-12T01:25:59Z

Start fine-tuning our Super Resolution Generative Adversarial Network's hyperparameters for better results, using a Tree-Structured Parzen Estimator (TPE), which is a Bayesian optimization approach.

List of hyperparameters to tune:

Learning rate (i.e. how fast the neural network learns)
Number of residual blocks (i.e. model depth)
Residual scaling factor (see also Szegedy et al., 2016)
Number of training epochs
Batch Size

TODO:

Install a hyperparameter optimization framework (e.g. Optuna) (0f00f3f)
Refactor srgan_train.ipynb to use an 'objective' function for hyperparameter tuning (e629cb4)
Find a way to do early pruning in Optuna for experiments with exploding/vanishing gradients (ace6aa1)
Set these 'good' hyperparameters as the default in the 'objective' function's definition (d569727)
Run experiments in parallel with the 2 Tesla P100 GPUs we have (a2866b6)
Get better results by brute force fine tuning on re-recentred hyperparameter search space (9c65fb4)
Write integration tests for srgan_train.ipynb using the new 'objective' function and other helper functions (37b90f6)

Hyperparameter Optimization framework! https://github.com/pfnet/optuna

Enabling hyperparameter optimization in the srgan_train.ipynb script by putting pretty much everything inside functions, with all of it wrapped inside a huge objective() function. Using a Tree-Structured Parzen Estimator, something Bayesian :) Suggested hyperparameters are declared at runtime using optuna.trial.Trial, and experiments are logged to Comet.ML. Datasets are now loaded using load_data_into_memory() with a better GPU_enabled check. Train/Dev iterators are made using get_train_dev_iterators(). Generator/Discriminator models and optimizers are declared using compile_srgan_model(), similar to what was removed in ee1e9df. Training of one epoch is now wrapped inside a trainer() function, which will need to be refactored later into a nicer Chainer Updater. Some refactoring tweaks were made to how the model weights and architecture are saved and loaded, basically making it a lot more explicit, with some changes made to deepbedmap.ipynb to reflect that.
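The define-by-run pattern described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the search ranges, the `train_and_evaluate` helper and its toy score are placeholders, minimising the returned value is an assumption, and `suggest_float` is the method name in recent Optuna releases (the 2019 release used here had differently named suggest methods).

```python
def train_and_evaluate(learning_rate, num_residual_blocks, num_epochs):
    """Hypothetical stand-in for the real training loop; returns a toy score."""
    # A made-up function so the sketch runs without a GPU or any data.
    return (
        abs(learning_rate - 6e-4) * 1e4
        + abs(num_residual_blocks - 8)
        + abs(num_epochs - 45) / 10
    )


def objective(trial):
    """Hyperparameters are requested from the trial at runtime (define-by-run)."""
    # The ranges below are illustrative assumptions, not the PR's search space.
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-3)
    num_residual_blocks = trial.suggest_int("num_residual_blocks", 4, 12)
    num_epochs = trial.suggest_int("num_epochs", 30, 60)
    return train_and_evaluate(learning_rate, num_residual_blocks, num_epochs)


def run_study(n_trials=20):
    """Minimise the objective with Optuna's TPE sampler (requires optuna)."""
    import optuna  # imported lazily so objective() can be tried without Optuna

    sampler = optuna.samplers.TPESampler(seed=42)
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Calling `run_study()` returns a dictionary of the best parameters found; because `objective()` only needs an object with `suggest_*` methods, it can also be exercised on its own in a test.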

Bumps [comet-ml](https://www.comet.ml) from 1.0.42 to 1.0.45. Signed-off-by: dependabot[bot] <support@dependabot.com>

Storing the Optuna hyperparameter optimization study details in an SQLite database for resuming later. Also improved the efficiency of the search for optimal hyperparameters by pruning unpromising experimental trials with vanishing/exploding gradients. Specifically, the pruning occurs when PSNR becomes negative, or when the Generator's Loss becomes NaN (usually a big number), or when the Discriminator's Loss becomes NaN (usually 0 or a small number). Note that these new 50 experimental runs with code sha 'faa96f1b' (e.g. at https://www.comet.ml/weiji14/deepbedmap/d6b2bf37408a45ad8331ae587c6aeb99) are resumed from the previous 100 runs stored in train2.db (renamed to train.db here), i.e. those recorded in comet.ml experiment with code sha 'f3ff3d57' (e.g. at https://www.comet.ml/weiji14/deepbedmap/18aeab738d9a4c56815d24cd712aafc6). Tentatively, the best hyperparameters so far are {'batch_size': 64, 'learning_rate': 0.0006500000000000001, 'num_epochs': 43, 'num_residual_blocks': 8}. Also made a quick update of the ONNX model architecture opset version.
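The pruning rule described above boils down to a small predicate. The sketch below is an assumption about how such a check might look, with a hypothetical function name; inside an Optuna objective, a `True` result would typically be followed by raising Optuna's `TrialPruned` exception.

```python
import math


def should_prune(psnr, generator_loss, discriminator_loss):
    """Return True when the metrics point to exploding/vanishing gradients."""
    if psnr < 0:  # a negative PSNR marks the trial as unpromising
        return True
    if math.isnan(generator_loss) or math.isnan(discriminator_loss):
        return True  # either loss turning into NaN also triggers pruning
    return False


# Inside an objective one might then write (requires optuna):
#     if should_prune(psnr, g_loss, d_loss):
#         raise optuna.exceptions.TrialPruned()
```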

Setting our hyperparameters to those found in the best trial, and focusing our hyperparameter search around that. Namely, we are using a deeper ESRGAN model with num_residual_blocks=8, batch_size=64, learning_rate=6.5e-4, num_epochs=45 (close to 43). Ran 10 experimental runs on a fresh train.db database (i.e. discarding the old one) to check that things work, and training does seem to be quite stable within the hyperparameter search space of learning_rate between 4e-4 and 8e-4 and num_epochs between 30 and 60. Also updated a test on generator model parameter count, and set cupy.random.seed if cupy is available (when using GPU) for slightly better reproducibility. The .npz weight file download had some issues because we hardcoded retrieving the asset at index 0, but it recently switched to index 1. So I just added an assert check to make sure the file extension ends with ".npz" and used the opportunity to refactor the code to use the new Comet_ML API library instead of requests.

Some sensible code refactoring for stuff coming up like brute force training, adding integration tests and modularizing code blocks better. Added an enable_livelossplot boolean flag to objective function, defaulting to False so that we can train faster. Added an enable_comet_logging boolean flag that may be useful for testing the objective function in some unit/integration test. The get_train_dev_iterators function now has a proper seed argument in place (was using the global seed before...). Final notebook cell prints top ten RMSE values (smallest ones) instead of last ten by time.
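Reporting the best runs instead of the most recent ones amounts to sorting by RMSE before slicing. A minimal sketch, assuming each experiment is a dictionary with a hypothetical `rmse_test` key (the records below are made up for illustration):

```python
def top_n_by_rmse(experiments, n=10):
    """Return the n experiments with the smallest RMSE, best (lowest) first."""
    return sorted(experiments, key=lambda exp: exp["rmse_test"])[:n]


# Hypothetical records, listed in the order the runs finished.
experiments = [
    {"name": "run-1", "rmse_test": 71.3},
    {"name": "run-2", "rmse_test": 48.9},
    {"name": "run-3", "rmse_test": 102.4},
    {"name": "run-4", "rmse_test": 55.0},
]

best_two = top_n_by_rmse(experiments, n=2)
print([exp["name"] for exp in best_two])  # prints ['run-2', 'run-4']
```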

Training model on 2 GPUs in parallel e.g. via `CUDA_VISIBLE_DEVICES=1 jupyter nbconvert --ExecutePreprocessor.timeout=None --execute srgan_train.ipynb --to notebook --output model/logs/srgan_train_device1.ipynb &` for device1, swapping 1 to 0 for device0. Set cupy seed more properly, instead of only for device 0 as before. Removed a potential problem with the hardcoded sending of the neural network model to gpu_id=0. Note that there is a chance of collision when two processes access the same database... Results after ~90 training runs are still not that great, may need to deepen the network more or start tuning more hyperparameters again. Also patch d569727's _download_deepbedmap_model_weights_from_comet() which had a hardcoded fix for a hardcoded problem. Developed a for-loop check that replaces the hardcoded way of downloading the npz parameter weights file.
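The for-loop check mentioned above could look something like this. It is only a sketch: the function name is hypothetical, and the shape of the asset metadata (a list of dictionaries with a `fileName` key) is an assumption rather than something confirmed by this PR.

```python
def find_npz_asset(assets):
    """Return the first asset whose file name ends with '.npz'."""
    for asset in assets:
        # Match on the extension instead of relying on a fixed list index.
        if asset["fileName"].endswith(".npz"):
            return asset
    raise ValueError("no .npz weights file found among the experiment assets")


# Hypothetical asset listing in which the weights file is not at index 0.
assets = [
    {"fileName": "srgan_generator_model_architecture.onnx"},
    {"fileName": "srgan_generator_model_weights.npz"},
]
print(find_npz_asset(assets)["fileName"])
```

Searching by extension keeps the download working regardless of the order in which the assets are listed.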

Make things go faster by skipping quilt download and using cached track data after first run. Patch a2866b6 to prevent parallel GPU experimental trials colliding with one another, simply by using a unique trial_id when creating files for testing. Also changed batch_size suggestion to use an integer exponent instead of categorical, which might help with the Bayesian model? Now trying batch sizes 64 and 128, with range of num_residual_blocks between 8 and 12. Was going to train for 25 runs but killed it at 10 as there were a lot of pruned ones... Also fix get_deepbedmap_test_result() returning an np.float64 instead of a float, which somehow turns into a string when uploaded to comet.ml, causing sorting issues?!!
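Suggesting an integer exponent rather than a categorical value can be sketched as below. The helper name is hypothetical; the exponent bounds of 6 and 7 are chosen only because they map to the batch sizes 64 and 128 mentioned above.

```python
def suggest_batch_size(trial, low_exponent=6, high_exponent=7):
    """Ask the trial for an integer exponent and convert it to a batch size."""
    exponent = trial.suggest_int("batch_size_exponent", low_exponent, high_exponent)
    return 2 ** exponent  # exponent 6 gives 64, exponent 7 gives 128
```

An ordered integer range lets the sampler treat neighbouring batch sizes as related, whereas a categorical choice carries no notion of order.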

Adding residual scaling as a new hyperparameter to tune, and brute force training 400 times. The residual scaling factor was previously set at 0.2 as suggested by [Wang et al. 2018](https://arxiv.org/abs/1809.00219)'s ESRGAN paper, but here we try a range between 0.1 and 0.3 as mentioned in [Szegedy et al., 2016](https://arxiv.org/abs/1602.07261)'s paper. [Lim et al., 2017](https://arxiv.org/abs/1707.02921)'s EDSR paper actually used a residual scaling of 0.1. In our case, it appears that 0.3 is better? Best result in the 400 trials is an RMSE_test of 36.7995 using params {'batch_size_exponent': 7, 'learning_rate': 5e-4, 'num_epochs': 46, 'num_residual_blocks': 11, 'residual_scaling': 0.3}. However, another hyperparameter setting that gives good RMSE_test values below 50 is batch_size=128, learning_rate=5e-4, num_epochs=~45, num_residual_blocks=10, residual_scaling=0.3. For reference, our cubic interpolation benchmark is 62.24. Also patch ace6aa1 to prune unpromising experimental trials based on NaN metrics from the training set instead of the dev/validation set, a problem noticed in the last commit at 2cf27d4 which saw some validation metrics return NaN while training metrics were still valid numbers. Another patch for a2866b6 to set a different Tree-Structured Parzen Estimator (TPE) seed for each GPU so that there is more variety in each Optuna trial.
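One way to give each GPU process its own TPE seed is to derive it from the device number. This is a sketch under assumptions: both helper names and the base seed are hypothetical, and it presumes `CUDA_VISIBLE_DEVICES` holds a plain device index such as `0` or `1`.

```python
import os


def sampler_seed_for_device(base_seed=42, environ=os.environ):
    """Derive a per-GPU seed from the first index in CUDA_VISIBLE_DEVICES."""
    # Fall back to device 0 when the variable is unset or empty.
    devices = environ.get("CUDA_VISIBLE_DEVICES", "0") or "0"
    return base_seed + int(devices.split(",")[0])


def make_sampler(base_seed=42):
    """Build a TPE sampler seeded differently on each GPU (requires optuna)."""
    import optuna  # imported lazily so the seed helper works without Optuna

    return optuna.samplers.TPESampler(seed=sampler_seed_for_device(base_seed))
```

With two processes launched under `CUDA_VISIBLE_DEVICES=0` and `CUDA_VISIBLE_DEVICES=1`, the samplers would start from different seeds and so propose different trials.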

Setting new hyperparameters from our previous tuning at 9c65fb4, and testing our objective function (at least a mirror of it) in a behave integration test (for one epoch). Using a bigger batch size (128) and deeper model with more training epochs (for better convergence), specifically {"batch_size_exponent": 7, "num_residual_blocks": 10, "residual_scaling": 0.3, "learning_rate": 5e-4, "num_epochs": 100}. We report here an RMSE_test of 49.26 from experiment at https://www.comet.ml/weiji14/deepbedmap/315bd591ab944c1ebf87bce44cb83c21. Visual inspection of results already shows a pretty good surface, except for the strange striped artifact some distance away from the border. May need to train and test this configuration a few more times? The objective function can now run properly as a standalone function using the optuna.trial.FixedTrial! For speed, and because the Continuous Integration server does not have a GPU, we run our srgan_train integration test on a test dataset with a total of only 1 tile. This required a hacky modification to the load_dataset_into_memory() function, which now outputs a dev_iter of size 1.

Closes #92 Hyperparameter Optimization Initialization - using the Optuna framework.

➕ Add Optuna

0f00f3f

Hyperparameter Optimization framework! https://github.com/pfnet/optuna

weiji14 added the enhancement ✨ (New feature or request) and model 🏗️ (Pull requests that update neural network model) labels Feb 12, 2019

weiji14 added this to the v0.6.0 milestone Feb 12, 2019

weiji14 self-assigned this Feb 12, 2019

weiji14 and others added 2 commits February 13, 2019 12:30

⬆️ Bump comet-ml from 1.0.42 to 1.0.45 (#101)

c65128b

Bumps [comet-ml](https://www.comet.ml) from 1.0.42 to 1.0.45. Signed-off-by: dependabot[bot] <support@dependabot.com>

weiji14 force-pushed the model/init_optuna branch from 60f235f to c65128b Compare February 15, 2019 23:29

weiji14 added 7 commits February 18, 2019 09:07

weiji14 changed the title ~~WIP Hyperparameter Optimization Initialization~~ Hyperparameter Optimization Initialization Feb 21, 2019

weiji14 merged commit 37b90f6 into master Feb 21, 2019

weiji14 added a commit that referenced this pull request Feb 21, 2019

Merge branch 'model/init_optuna' (#92)

935ac00

Closes #92 Hyperparameter Optimization Initialization - using the Optuna framework.

weiji14 deleted the model/init_optuna branch February 21, 2019 22:26

weiji14 added a commit that referenced this pull request Feb 21, 2019

🔀 Merge branch 'model/init_optuna' (#92)

bdb8a0d

Closes #92 Hyperparameter Optimization Initialization - using the Optuna framework.

weiji14 mentioned this pull request Mar 19, 2019

Retune ESRGAN hyperparameters on stronger discriminator #129

Merged

3 tasks
