
v1.5.0 breaks wandb hyperparameter sweeps in Colab #10336

Closed
garrett361 opened this issue Nov 3, 2021 · 5 comments · Fixed by #10610
Labels
bug Something isn't working help wanted Open to be worked on logger: wandb Weights & Biases priority: 0 High priority task

Comments

@garrett361

🐛 Bug

After upgrading to v1.5.0, wandb hyperparameter sweeps run in Colab notebooks fail, raising one UserWarning and one ValueError:

UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
ValueError('signal only works in main thread')

The wandb sweep terminates without performing any training runs. Downgrading to PyTorch Lightning v1.4.9 resolves the issue.

To Reproduce

Reproduced at the end of this minimally modified BoringModel Colab notebook:
https://colab.research.google.com/drive/16nylTt8jGAbiSfq7zr7YlLOohfIbFqHq?usp=sharing
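For orientation, a rough sketch of the failing pattern follows; the notebook above is the actual reproduction, and TinyModel, the sweep config, and the project name here are illustrative placeholders only:

import torch
from torch.utils.data import DataLoader, TensorDataset
import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class TinyModel(pl.LightningModule):
    # Minimal stand-in for the notebook's BoringModel.
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

def train():
    wandb.init()  # the sweep agent fills wandb.config for this trial
    model = TinyModel(lr=wandb.config.lr)
    trainer = pl.Trainer(max_epochs=1, logger=WandbLogger())
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    # Under v1.5.0 this is where the sweep dies: the agent appears to run
    # `train` off the main thread, and registering signal handlers there
    # raises ValueError("signal only works in main thread").
    trainer.fit(model, data)

sweep_id = wandb.sweep(
    {"method": "grid", "parameters": {"lr": {"values": [1e-3, 1e-2]}}},
    project="sweep-demo",
)
wandb.agent(sweep_id, function=train, count=2)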

Expected behavior

The wandb sweep should run its training loops without terminating due to these errors.

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.9.0+cu111
    • pytorch-lightning: 1.5.0
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Sat Jun 5 09:50:34 PDT 2021


@garrett361 garrett361 added bug Something isn't working help wanted Open to be worked on labels Nov 3, 2021
@tchaton tchaton added logger: wandb Weights & Biases priority: 0 High priority task labels Nov 3, 2021
@tchaton tchaton self-assigned this Nov 15, 2021
@cemde commented Nov 17, 2021

I have the same issue with Ray Tune on a personal cluster.
I am running the tutorial from: https://docs.ray.io/en/latest/tune/tutorials/tune-pytorch-lightning.html#changing-the-cli-output
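For context, roughly the shape of that tutorial's trainable under the standard Tune/Lightning integration; the model, metric name, and search space below are placeholders, not the tutorial's exact code:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback

class TinyModel(pl.LightningModule):
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self.layer(batch[0]).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

def train_tune(config):
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    # Tune runs this function in a worker thread of its trial process, so
    # PL 1.5.x hits the same "signal only works in main thread" error here.
    trainer.fit(TinyModel(config["lr"]), data, data)

tune.run(train_tune, config={"lr": tune.loguniform(1e-4, 1e-1)}, num_samples=2)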

Environment:
Linux Ubuntu
CUDA 11.1
Python 3.8.10
PyTorch 1.10.0
pytorch-lightning 1.5.2

Traceback (most recent call last):
  File "...path.../ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "...path.../ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "...path.../ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "...path.../ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1942608, ip=163.1.88.197, repr=<ray.tune.function_runner.ImplicitFunc object at 0x7f8d28b2dfd0>)
  File "...path.../ray/tune/trainable.py", line 224, in train_buffered
    result = self.train()
  File "...path.../ray/tune/trainable.py", line 283, in train
    result = self.step()
  File "...path.../ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "...path.../ray/tune/function_runner.py", line 528, in _report_thread_runner_error
    raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1942608, ip=163.1.88.197, repr=<ray.tune.function_runner.ImplicitFunc object at 0x7f8d28b2dfd0>)
  File "...path.../ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "...path.../ray/tune/function_runner.py", line 330, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "...path.../ray/tune/function_runner.py", line 599, in _trainable_func
    output = fn()
  File "...path.../ray/tune/utils/trainable.py", line 353, in _inner
    inner(config, checkpoint_dir=None)
  File "...path.../ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/users-1/cornelius/LearnRay/main.py", line 162, in train_mnist_tune
    trainer.fit(model)
  File "...path.../pytorch_lightning/trainer/trainer.py", line 737, in fit
    self._call_and_handle_interrupt(
  File "...path.../pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "...path.../pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
    self.training_type_plugin.start_training(self)
  File "...path.../pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
    return self._run_train()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1301, in _run_train
    self._pre_training_routine()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1291, in _pre_training_routine
    self.signal_connector.register_signal_handlers()
  File "...path.../pytorch_lightning/trainer/connectors/signal_connector.py", line 47, in register_signal_handlers
    signal.signal(signal.SIGUSR1, HandlersCompose(sigusr1_handlers))
  File "/users-1/cornelius/anaconda3/envs/MLR/lib/python3.8/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

Downgrading to 1.4.5 worked for me.

Should I open a separate issue, or is this the same bug?

Thanks for everyone's hard work. I love Lightning :)

@tchaton (Contributor) commented Nov 18, 2021

Dear @garrett361 @cemde

Would you mind trying out this PR from Lightning: #10610?
Hopefully, it resolves this error.

If it does, the fix will be released with 1.5.3 next Tuesday.

Best,
T.C

@cemde commented Nov 18, 2021

@tchaton
Thanks for the quick patch! I installed PL from the bugfix_10336 branch and I no longer get the original error message. I now get this:

ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=2677701, ip=123.4.5.6, repr=<types.ImplicitFunc object at 0x7f874f0504c0>)
  File "..path../ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "..path../ray/tune/function_runner.py", line 330, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "..path../ray/tune/function_runner.py", line 599, in _trainable_func
    output = fn()
  File "..path../ray/tune/utils/trainable.py", line 353, in _inner
    inner(config, checkpoint_dir=None)
  File "..path../ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/users-1/cornelius/jem/pascal_baseline_tune.py", line 46, in train_tune
    trainer.fit(model, data)
  File "..path../pytorch_lightning/trainer/trainer.py", line 719, in fit
    self._call_and_handle_interrupt(
  File "..path../pytorch_lightning/trainer/trainer.py", line 671, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "..path../pytorch_lightning/trainer/trainer.py", line 754, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "..path../pytorch_lightning/trainer/trainer.py", line 1156, in _run
    self._dispatch()
  File "..path../pytorch_lightning/trainer/trainer.py", line 1235, in _dispatch
    self.training_type_plugin.start_training(self)
  File "..path../pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "..path../pytorch_lightning/trainer/trainer.py", line 1245, in run_stage
    return self._run_train()
  File "..path../pytorch_lightning/trainer/trainer.py", line 1267, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "..path../pytorch_lightning/trainer/trainer.py", line 1331, in _run_sanity_check
    self._evaluation_loop.run()
  File "..path../pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "..path../pytorch_lightning/loops/dataloader/evaluation_loop.py", line 139, in on_run_end
    self._on_evaluation_end()
  File "..path../pytorch_lightning/loops/dataloader/evaluation_loop.py", line 201, in _on_evaluation_end
    self.trainer.call_hook("on_validation_end", *args, **kwargs)
  File "..path../pytorch_lightning/trainer/trainer.py", line 1451, in call_hook
    callback_fx(*args, **kwargs)
  File "..path../pytorch_lightning/trainer/callback_hook.py", line 221, in on_validation_end
    callback.on_validation_end(self, self.lightning_module)
  File "..path../ray/tune/integration/pytorch_lightning.py", line 118, in on_validation_end
    self._handle(trainer, pl_module)
  File "..path../ray/tune/integration/pytorch_lightning.py", line 200, in _handle
    report_dict = self._get_report_dict(trainer, pl_module)
  File "..path../ray/tune/integration/pytorch_lightning.py", line 177, in _get_report_dict
    if trainer.running_sanity_check:
AttributeError: 'Trainer' object has no attribute 'running_sanity_check'

But I assume this is an issue from the PL 1.6.0dev code base (Trainer no longer has running_sanity_check, which Ray's callback still uses)?!

So it looks like the original signal error is fixed.
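For anyone else hitting this AttributeError before Ray updates its integration, a hedged sketch of a version-tolerant check, assuming sanity_checking is the Trainer property that replaced running_sanity_check in newer PL releases (is_sanity_checking is a hypothetical helper, not part of either API):

def is_sanity_checking(trainer):
    # Prefer the newer attribute, fall back to the old one, default to False.
    return bool(
        getattr(trainer, "sanity_checking",
                getattr(trainer, "running_sanity_check", False))
    )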

@garrett361 (Author)
@cemde Looks fixed to me. The only difference I see compared with wandb sweeps under PL 1.4.9 is a new UserWarning:

UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
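In a sweep, reusing the run that the agent already started is often exactly what you want, so the warning can usually be ignored. If a fresh run per logger is preferred instead, a minimal sketch of what the warning itself suggests (make_logger and the project name are illustrative, not an official API):

import wandb
from pytorch_lightning.loggers import WandbLogger

def make_logger(project="sweep-demo"):
    # Close any run already open in this process so the new WandbLogger
    # starts its own run instead of reusing the existing one.
    if wandb.run is not None:
        wandb.finish()
    return WandbLogger(project=project)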

Thanks for the quick response and hard work!

@tchaton (Contributor) commented Nov 19, 2021

Hey @cemde,

Would you mind opening another issue with a reproducible script?
