
v1.5.0 breaks wandb hyperparameter sweeps in Colab #10336

Closed
garrett361 opened this issue Nov 3, 2021 · 5 comments · Fixed by #10610
Labels
bug Something isn't working help wanted Open to be worked on logger: wandb Weights & Biases priority: 0 High priority task

Comments

@garrett361

🐛 Bug

After upgrading to v1.5.0, wandb hyperparameter sweeps run in Colab notebooks fail, raising one UserWarning and one ValueError:

UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
ValueError('signal only works in main thread')

The wandb sweep terminates without performing any training runs. Downgrading to PyTorch Lightning v1.4.9 resolves the issue.

To Reproduce

Reproduced at the end of this minimally modified BoringModel Colab notebook:
https://colab.research.google.com/drive/16nylTt8jGAbiSfq7zr7YlLOohfIbFqHq?usp=sharing
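For orientation, a rough sketch of the failing pattern follows; the notebook above is the actual reproduction, and TinyModel, the sweep config, and the project name here are illustrative placeholders only:

import torch
from torch.utils.data import DataLoader, TensorDataset
import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class TinyModel(pl.LightningModule):
    # Minimal stand-in for the notebook's BoringModel.
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

def train():
    wandb.init()  # the sweep agent fills wandb.config for this trial
    model = TinyModel(lr=wandb.config.lr)
    trainer = pl.Trainer(max_epochs=1, logger=WandbLogger())
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    # Under v1.5.0 this is where the sweep dies: the agent appears to run
    # `train` off the main thread, and registering signal handlers there
    # raises ValueError("signal only works in main thread").
    trainer.fit(model, data)

sweep_id = wandb.sweep(
    {"method": "grid", "parameters": {"lr": {"values": [1e-3, 1e-2]}}},
    project="sweep-demo",
)
wandb.agent(sweep_id, function=train, count=2)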

Expected behavior

The wandb sweep should run its training loops without terminating due to these errors.

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.9.0+cu111
    • pytorch-lightning: 1.5.0
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Sat Jun 5 09:50:34 PDT 2021


@garrett361 garrett361 added bug Something isn't working help wanted Open to be worked on labels Nov 3, 2021
@tchaton tchaton added logger: wandb Weights & Biases priority: 0 High priority task labels Nov 3, 2021
@tchaton tchaton self-assigned this Nov 15, 2021
@cemde commented Nov 17, 2021

I have the same issue with Ray Tune on a personal cluster.
I am running the tutorial from: https://docs.ray.io/en/latest/tune/tutorials/tune-pytorch-lightning.html#changing-the-cli-output
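For context, roughly the shape of that tutorial's trainable under the standard Tune/Lightning integration; the model, metric name, and search space below are placeholders, not the tutorial's exact code:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback

class TinyModel(pl.LightningModule):
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self.layer(batch[0]).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

def train_tune(config):
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    # Tune runs this function in a worker thread of its trial process, so
    # PL 1.5.x hits the same "signal only works in main thread" error here.
    trainer.fit(TinyModel(config["lr"]), data, data)

tune.run(train_tune, config={"lr": tune.loguniform(1e-4, 1e-1)}, num_samples=2)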

Environment:
Linux Ubuntu
CUDA 11.1
Python 3.8.10
PyTorch 1.10.0
pytorch-lightning 1.5.2

Traceback (most recent call last):
  File "...path.../ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "...path.../ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "...path.../ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "...path.../ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1942608, ip=163.1.88.197, repr=<ray.tune.function_runner.ImplicitFunc object at 0x7f8d28b2dfd0>)
  File "...path.../ray/tune/trainable.py", line 224, in train_buffered
    result = self.train()
  File "...path.../ray/tune/trainable.py", line 283, in train
    result = self.step()
  File "...path.../ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "...path.../ray/tune/function_runner.py", line 528, in _report_thread_runner_error
    raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1942608, ip=163.1.88.197, repr=<ray.tune.function_runner.ImplicitFunc object at 0x7f8d28b2dfd0>)
  File "...path.../ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "...path.../ray/tune/function_runner.py", line 330, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "...path.../ray/tune/function_runner.py", line 599, in _trainable_func
    output = fn()
  File "...path.../ray/tune/utils/trainable.py", line 353, in _inner
    inner(config, checkpoint_dir=None)
  File "...path.../ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/users-1/cornelius/LearnRay/main.py", line 162, in train_mnist_tune
    trainer.fit(model)
  File "...path.../pytorch_lightning/trainer/trainer.py", line 737, in fit
    self._call_and_handle_interrupt(
  File "...path.../pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "...path.../pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
    self.training_type_plugin.start_training(self)
  File "...path.../pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
    return self._run_train()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1301, in _run_train
    self._pre_training_routine()
  File "...path.../pytorch_lightning/trainer/trainer.py", line 1291, in _pre_training_routine
    self.signal_connector.register_signal_handlers()
  File "...path.../pytorch_lightning/trainer/connectors/signal_connector.py", line 47, in register_signal_handlers
    signal.signal(signal.SIGUSR1, HandlersCompose(sigusr1_handlers))
  File "/users-1/cornelius/anaconda3/envs/MLR/lib/python3.8/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

Downgrading to 1.4.5 worked for me.

Should I open a separate issue, or is this the same bug?

Thanks for everyone's hard work. I love Lightning :)

@tchaton (Contributor) commented Nov 18, 2021

Dear @garrett361 @cemde

Would you mind trying out this PR from Lightning: #10610?
Hopefully, it resolves this error.

If it does, the fix will be released with 1.5.3 next Tuesday.

Best,
T.C

@cemde commented Nov 18, 2021

@tchaton
Thanks for the quick patch! I installed PL from the bugfix_10336 branch and I no longer get the original error message. I now get this:

ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=2677701, ip=123.4.5.6, repr=<types.ImplicitFunc object at 0x7f874f0504c0>)
  File "..path../ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "..path../ray/tune/function_runner.py", line 330, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "..path../ray/tune/function_runner.py", line 599, in _trainable_func
    output = fn()
  File "..path../ray/tune/utils/trainable.py", line 353, in _inner
    inner(config, checkpoint_dir=None)
  File "..path../ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/users-1/cornelius/jem/pascal_baseline_tune.py", line 46, in train_tune
    trainer.fit(model, data)
  File "..path../pytorch_lightning/trainer/trainer.py", line 719, in fit
    self._call_and_handle_interrupt(
  File "..path../pytorch_lightning/trainer/trainer.py", line 671, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "..path../pytorch_lightning/trainer/trainer.py", line 754, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "..path../pytorch_lightning/trainer/trainer.py", line 1156, in _run
    self._dispatch()
  File "..path../pytorch_lightning/trainer/trainer.py", line 1235, in _dispatch
    self.training_type_plugin.start_training(self)
  File "..path../pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "..path../pytorch_lightning/trainer/trainer.py", line 1245, in run_stage
    return self._run_train()
  File "..path../pytorch_lightning/trainer/trainer.py", line 1267, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "..path../pytorch_lightning/trainer/trainer.py", line 1331, in _run_sanity_check
    self._evaluation_loop.run()
  File "..path../pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "..path../pytorch_lightning/loops/dataloader/evaluation_loop.py", line 139, in on_run_end
    self._on_evaluation_end()
  File "..path../pytorch_lightning/loops/dataloader/evaluation_loop.py", line 201, in _on_evaluation_end
    self.trainer.call_hook("on_validation_end", *args, **kwargs)
  File "..path../pytorch_lightning/trainer/trainer.py", line 1451, in call_hook
    callback_fx(*args, **kwargs)
  File "..path../pytorch_lightning/trainer/callback_hook.py", line 221, in on_validation_end
    callback.on_validation_end(self, self.lightning_module)
  File "..path../ray/tune/integration/pytorch_lightning.py", line 118, in on_validation_end
    self._handle(trainer, pl_module)
  File "..path../ray/tune/integration/pytorch_lightning.py", line 200, in _handle
    report_dict = self._get_report_dict(trainer, pl_module)
  File "..path../ray/tune/integration/pytorch_lightning.py", line 177, in _get_report_dict
    if trainer.running_sanity_check:
AttributeError: 'Trainer' object has no attribute 'running_sanity_check'

But I assume this is an issue from the PL 1.6.0dev code base (Trainer no longer has running_sanity_check, which Ray's callback still uses)?!

So it looks like the original signal error is fixed.
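For anyone else hitting this AttributeError before Ray updates its integration, a hedged sketch of a version-tolerant check, assuming sanity_checking is the Trainer property that replaced running_sanity_check in newer PL releases (is_sanity_checking is a hypothetical helper, not part of either API):

def is_sanity_checking(trainer):
    # Prefer the newer attribute, fall back to the old one, default to False.
    return bool(
        getattr(trainer, "sanity_checking",
                getattr(trainer, "running_sanity_check", False))
    )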

@garrett361 (Author)
@cemde Looks fixed to me. The only difference I see compared with wandb sweeps under PL 1.4.9 is a new UserWarning:

UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
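In a sweep, reusing the run that the agent already started is often exactly what you want, so the warning can usually be ignored. If a fresh run per logger is preferred instead, a minimal sketch of what the warning itself suggests (make_logger and the project name are illustrative, not an official API):

import wandb
from pytorch_lightning.loggers import WandbLogger

def make_logger(project="sweep-demo"):
    # Close any run already open in this process so the new WandbLogger
    # starts its own run instead of reusing the existing one.
    if wandb.run is not None:
        wandb.finish()
    return WandbLogger(project=project)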

Thanks for the quick response and hard work!

@tchaton (Contributor) commented Nov 19, 2021

Hey @cemde,

Would you mind opening another issue with a reproducible script?
