Hydra configuration #2
Conversation
Any other ideas to add to the example?
One idea is to separate the dataclasses configuring core PL objects into a different module, to make it clear that those should be reused rather than copied by every single user.
```python
@dataclass
class LightningTrainerConf:
    # callbacks: Optional[List[Callback]] = None
```
I think we are ready to start tackling left-over todos.
So for this one, I feel we would need some help from the user. For example, how do we support a list of callback objects? 1) The user would need to define a structured config for their callback, 2) they would need to create a YAML list of these configs, 3) we would need to instantiate their objects and populate the list, and 4) pass that list into the Trainer.
Taking the callback from the example:

```python
class MyPrintingCallback(Callback):
    def on_init_start(self, trainer):
        print('Starting to init trainer!')

    def on_init_end(self, trainer):
        print('trainer is init now')

    def on_train_end(self, trainer, pl_module):
        print('do something when training ends')
```
This style lets you reuse the callback configs more easily:

```yaml
callbacks:
  print:
    cls: MyPrintingCallback
  s3_checkpoint:
    cls: S3Checkpoint
    params:
      bucket_name: ???

callbacks_list:
  - ${callbacks.print}
  - ${callbacks.checkpoint}
```
The code can instantiate the callbacks like:

```python
callbacks = [hydra.utils.instantiate(callback) for callback in cfg.callbacks_list]
```
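A minimal sketch of how this could fit into the training script, assuming the YAML above lives in the primary config and that the classes named under `cls` are importable (file and field names here are illustrative, not the final API):

```python
import hydra
from omegaconf import DictConfig
from pytorch_lightning import Trainer


@hydra.main(config_name="config")
def main(cfg: DictConfig) -> None:
    # Instantiate every callback listed under callbacks_list; each entry is a
    # config node carrying the callback class plus its constructor params.
    callbacks = [hydra.utils.instantiate(cb) for cb in cfg.callbacks_list]

    # Hand the instantiated callbacks to the Trainer as a plain Python list.
    trainer = Trainer(callbacks=callbacks)
    # trainer.fit(model) would follow here; model setup is omitted for brevity.


if __name__ == "__main__":
    main()
```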
Currently, config groups are mutually exclusive: you can only load one config from each config group.
Once facebookresearch/hydra#499 is done we will be able to do something better.
For now, the example can put all the callback configs in the primary config file, or break them out into a callbacks.yaml that can be added to the defaults list as `- callbacks`.
PS: I did not try this so there may be unforeseen problems.
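A hedged sketch of the second option, assuming the callback block above is moved as-is into a callbacks.yaml next to the primary config (file names and the extra trainer key are purely illustrative):

```yaml
# conf/config.yaml
defaults:
  - callbacks   # loads conf/callbacks.yaml, which holds the callback configs shown above

trainer:
  max_epochs: 10   # unrelated example key, just to show the composed result
```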
Question about interpolation: when I define `callbacks: ${callbacks.print}`, it resolves to `{'cls': 'MyPrintingCallback'}`. But when I put the callbacks into a list format, it resolves to `{'callbacks': ['${callbacks.print}']}`. Is there anything special you need to do for variable interpolation in a list?
Okay, I figured out the above issue. The solution you mentioned did not work; I was getting an error about an incorrect type when trying to use the object `cfg.callbacks_list[0]`. I put up a working solution in the most recent commit by just listing out the names of these fields.
What error? Can you be more specific?
Never mind, I fixed the example and now it works like your example.
One idea would be to place it in pytorch-lightning/pytorch_lightning/trainer/, where we can then reuse the Hydra configuration.
Also, a quick aside: do you know much about the submitit plugin? In terms of how environment setup works for running with shared environments?
Yes, but let's not use this important issue for unrelated discussions.
The leftover todos relate to Union types, which have been typed as Any or the more generic of the two types.
How would you like to handle your comment about separating this example into a part to be included and versioned with PL? Maybe I misunderstood the comment previously, but my intention here was just putting the trainer conf into the core folder for merging into PL.
I think all (most?) of the dataclasses belong in the core:
we can condition the registration with Hydra's ConfigStore on the presence of Hydra. One thing I don't like about this soft dependency is that it makes it hard to require a specific version of Hydra. I guess you could check the Hydra version at runtime, but it's not great.
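A minimal sketch of that conditional registration, assuming a trainer dataclass like the one above (the group/name strings and default field values are illustrative):

```python
from dataclasses import dataclass


@dataclass
class LightningTrainerConf:
    # Illustrative subset of Trainer arguments only.
    max_epochs: int = 1000
    gpus: int = 0


try:
    # Soft dependency: only register the structured configs when Hydra is installed.
    from hydra.core.config_store import ConfigStore

    cs = ConfigStore.instance()
    cs.store(group="trainer", name="lightning_trainer", node=LightningTrainerConf)
except ImportError:
    # Hydra is not available; PL keeps working without the registered configs.
    pass
```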
I agree with most of your comments about moving things into core. My only hesitation is that users can define multiple optimizers, each of which may use a different scheduler. At the moment, I am not sure the ConfigStore API makes it easy to reuse this optimizer config under a different group.
I am not sure I follow your association between optimizer and (LR) scheduler.
I did not explain that too well. Basically, I wanted to highlight the fact that you could have multiple optimizers/schedulers in a single configuration file. Since multiple configurations are not currently supported for a single group, this would be an issue. That said, for most cases users should be fine with one optimizer for now, and they can extend the template given here for more advanced cases.
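For the common single-optimizer case, a hedged sketch of how those groups could be selected from the defaults list (group and option names are illustrative):

```yaml
# conf/config.yaml
defaults:
  - optimizer: adam      # picks conf/optimizer/adam.yaml
  - scheduler: step_lr   # picks conf/scheduler/step_lr.yaml
```

Anyone needing several optimizers at once would currently have to list those configs outside the group mechanism, much like the callback list above, until multiple selections per group are supported.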
Can you add a second example that reuses the code here to do something useful (train MNIST, for example)?
For the sake of testing, is there a way for me to use Hydra's compositional config feature without having Hydra take control of the logger, output directory, etc.? For example, one current inconvenience is that each time I run the
You can use the Compose API, but you are forfeiting many important features. This is definitely not the recommended pattern for people to follow for something like this.
@romesc-CMU I thought I fixed the re-download issue in a recent commit. Can you pull and try again?
Ok, let me re-pull and check it out again. Maybe I was configuring it incorrectly, since I did modify a few things to simplify testing.
Regarding creating a second example: this example is already training MNIST.
I'm on the most up-to-date commit and this fixed the dataset I/O issue 👍. So far while testing
I read up on the structured configs and the Config Stores are making a lot more sense. I'm also understanding the purpose of
Does that imply this entire example is achievable with the earlier version of Hydra that doesn't include structured configs, with the main advantage of having them being the type checking? (Which I do like.)
Without structured configs you achieve that with YAML config files. Please go through the basic Hydra tutorial to get a handle on the basic functionality.
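For comparison, a hedged sketch of the YAML-only route, assuming a conf/config.yaml that defines a trainer.max_epochs field (names are illustrative); composition and overrides work the same, you just lose the type checking that Structured Configs add:

```python
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config")
def train(cfg: DictConfig) -> None:
    # With plain YAML configs the values arrive untyped, so a typo or a wrong
    # type only surfaces here, at use time, rather than at composition time.
    print(cfg.trainer.max_epochs)


if __name__ == "__main__":
    train()
```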
@romesc-CMU, by the way, did you try making some command line, config, and runtime config access errors to see what happens? :)
One problem still to resolve is that PL flattens all the hparams passed into the model and tries to log them to whatever logger is defined. A new issue is that when I pass these parameters some are missing, and this results in an exception. Specifically, since I defined `target` instead of `cls`, I get a missing-`cls` error when the logger tries to flatten all the config settings.
Also, we are having a problem after rebasing with the TensorBoard logger. This relates to this issue: Lightning-AI#2519. The problem is that TensorBoard is trying to serialize objects, but the branch used to determine whether the object is a DictConfig is never hit. The reason is that the hparams object stores a plain dict of the parameters passed into the model, so the check that the hparams object is a Container type is never true. https://github.com/PyTorchLightning/pytorch-lightning/blob/9759491940c4108ac8ef01e0b53b31f03a69b4d6/pytorch_lightning/core/saving.py#L364
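One possible workaround (untested here, and not necessarily the fix this PR will land on) is to resolve the config into primitive containers before handing it to the model as hparams, so the logger only ever sees plain dicts:

```python
from omegaconf import DictConfig, OmegaConf


def to_loggable_hparams(cfg: DictConfig) -> dict:
    # Resolve interpolations and convert nested DictConfig/ListConfig nodes into
    # plain dicts/lists that the TensorBoard logger can serialize without the
    # Container-type special case ever needing to fire.
    return OmegaConf.to_container(cfg, resolve=True)
```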
The final issue is with None objects: facebookresearch/hydra#785. I am also experiencing this, and the suggested workaround does not work.
I am going to look at this one today.
None bug fixed in the example!
Okay, I removed the double definition of `cls` after the removal of the `cls` field in the recent Hydra MR. The only issue left is the parameter saving for TensorBoard, which already has an issue and an MR fix in the works. I am going to send this MR over to the PL team for initial discussions.
MR?
Merge request, I guess?
This directory consists of an example of configuring Pytorch Lightning with [Hydra](https://hydra.cc/). Hydra is a tool that allows for the easy configuration of complex applications.
The core of this directory is a set of structured configs used for Pytorch Lightning, which are imported via `from pytorch_lightning.trainer.trainer_conf import PLConfig`. Within the PL config there are 5 configurations: 1) Trainer Configuration, 2) Profiler Configuration, 3) Early Stopping Configuration, 4) Logger Configuration and 5) Checkpoint Configuration. All of these are basically mirrors of the arguments that make up these objects. These configurations are used to instantiate the objects using Hydra's instantiation utility.
Please capitalize "Structured Configs" too, to be consistent with how it's used in the Hydra documentation.
## Hydra Pytorch Lightning Example

This directory consists of an example of configuring Pytorch Lightning with [Hydra](https://hydra.cc/). Hydra is a tool that allows for the easy configuration of complex applications.
Suggested change: replace "Hydra is a tool that allows for the easy configuration of complex applications." with "Hydra is a framework that allows for the easy configuration of complex applications."
Aside from the PyTorch Lightning configuration, we have included a few other important configurations. Optimizer and Scheduler are easy off-the-shelf configurations for configuring your optimizer and learning rate scheduler. You can add them to your config defaults list as needed and use them to configure these objects. Additionally, we provide the arch and data configurations for changing model and data hyperparameters.
If these configurations are adopted into the core of PL this should move into the PR description and out of the README.
Hi! Thanks for a wonderful example! I ran into a problem when I try to run
@rakhimovv, it would help if you show how it fails.
produces
If I understand correctly, the problem is that data split initialization should happen in `def setup(self, stage)`, not in `def prepare_data(self)`. But even if I copy-paste the code from `prepare_data` into `setup` in hydra_config_model.py, sometimes it works and sometimes it fails with the error attached below. The possible reason, I suppose, is that several processes try to download and write data to the same folder. I assume this is because value interpolation does not work in ddp mode, as the datasets are saved into experiment_folder/datasets, not into ${hydra:runtime.cwd}/datasets.
P.S.
works fine, but it uses the ddp_spawn mode, not ddp.
Thanks for reporting. You can follow along here.
@rakhimovv I have noticed the above issue before and, as you mentioned, changing to use setup is important. Additionally, I would just hardcode the data path for now, until we work through a fix that properly sorts all this out.
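As an interim alternative to hardcoding, one hedged option (untested in this ddp setup) is to resolve the data directory against the original launch directory in code rather than through interpolation, for example:

```python
from hydra.utils import to_absolute_path


def resolve_data_dir(relative_dir: str = "datasets") -> str:
    # Hydra changes the working directory per run; to_absolute_path resolves the
    # path against the original launch directory, so every ddp process downloads
    # into the same shared folder instead of its own experiment folder.
    return to_absolute_path(relative_dir)
```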
@anthonytec2 So what is the desired solution? We faced the same issue, and in the end we overrode the
Terrible hack :) If you are using PL directly, you should wait for a fix there, because it's the one that is spawning the process.