v1.5.0 breaks wandb hyperparameter sweeps in Colab #10336
Comments
I have the same issue with Ray Tune on a personal cluster.

Traceback (most recent call last):
File "...path.../ray/tune/trial_runner.py", line 890, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "...path.../ray/tune/ray_trial_executor.py", line 788, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "...path.../ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "...path.../ray/worker.py", line 1625, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1942608, ip=163.1.88.197, repr=<ray.tune.function_runner.ImplicitFunc object at 0x7f8d28b2dfd0>)
File "...path.../ray/tune/trainable.py", line 224, in train_buffered
result = self.train()
File "...path.../ray/tune/trainable.py", line 283, in train
result = self.step()
File "...path.../ray/tune/function_runner.py", line 381, in step
self._report_thread_runner_error(block=True)
File "...path.../ray/tune/function_runner.py", line 528, in _report_thread_runner_error
raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1942608, ip=163.1.88.197, repr=<ray.tune.function_runner.ImplicitFunc object at 0x7f8d28b2dfd0>)
File "...path.../ray/tune/function_runner.py", line 262, in run
self._entrypoint()
File "...path.../ray/tune/function_runner.py", line 330, in entrypoint
return self._trainable_func(self.config, self._status_reporter,
File "...path.../ray/tune/function_runner.py", line 599, in _trainable_func
output = fn()
File "...path.../ray/tune/utils/trainable.py", line 353, in _inner
inner(config, checkpoint_dir=None)
File "...path.../ray/tune/utils/trainable.py", line 344, in inner
trainable(config, **fn_kwargs)
File "/users-1/cornelius/LearnRay/main.py", line 162, in train_mnist_tune
trainer.fit(model)
File "...path.../pytorch_lightning/trainer/trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "...path.../pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "...path.../pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "...path.../pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "...path.../pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
self.training_type_plugin.start_training(self)
File "...path.../pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "...path.../pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
return self._run_train()
File "...path.../pytorch_lightning/trainer/trainer.py", line 1301, in _run_train
self._pre_training_routine()
File "...path.../pytorch_lightning/trainer/trainer.py", line 1291, in _pre_training_routine
self.signal_connector.register_signal_handlers()
File "...path.../pytorch_lightning/trainer/connectors/signal_connector.py", line 47, in register_signal_handlers
signal.signal(signal.SIGUSR1, HandlersCompose(sigusr1_handlers))
File "/users-1/cornelius/anaconda3/envs/MLR/lib/python3.8/signal.py", line 47, in signal
handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

Downgrading to 1.4.5 worked for me. Should I open an independent issue or is this the same bug? Thanks for everyone's hard work. I love lightning :)
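For context on the error above: Python's `signal.signal()` may only be called from the main thread, while Ray Tune runs the trainable function on a worker thread, so Lightning's signal-handler registration fails there. A minimal sketch (mine, not from the issue) that reproduces the same ValueError outside of Ray or Lightning:

```python
# Minimal reproduction of "ValueError: signal only works in main thread":
# registering a signal handler from a worker thread, which is what happens
# when trainer.fit() runs inside a Ray Tune trial thread.
import signal
import threading

def register_handler():
    # Roughly what Lightning's SignalConnector attempts during _pre_training_routine()
    signal.signal(signal.SIGUSR1, lambda signum, frame: None)

t = threading.Thread(target=register_handler)
t.start()
t.join()
# The worker thread raises ValueError; Python reports it via threading.excepthook.
```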
Dear @garrett361 @cemde, Would you mind trying out this PR: #10610 from Lightning? If it resolves the issue, the fix would be released with 1.5.3 next Tuesday. Best,
@tchaton But I assume this is an issue from the PL 1.6.0dev code base?! So it looks like it is fixed.
@cemde Looks fixed to me. The only difference I see is a UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`. Thanks for the quick response and hard work!
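For anyone hitting the same UserWarning in a sweep, here is a small sketch of the workaround the warning itself suggests (the function and project names are illustrative, not from this thread):

```python
# Sketch of the workaround suggested by the UserWarning above: finish any
# run left over from the previous sweep trial before creating a new
# WandbLogger, so each trial logs to its own fresh run.
import wandb
from pytorch_lightning.loggers import WandbLogger

def train_fn():
    if wandb.run is not None:
        wandb.finish()  # close the run carried over from the previous trial
    logger = WandbLogger(project="my-sweep")  # hypothetical project name
    # ... build the LightningModule and Trainer(logger=logger), then fit ...
```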
Hey @cemde, would you mind opening another issue with a reproducible script?
🐛 Bug
After upgrading to v1.5.0, `wandb` hyperparameter sweeps performed in Colab notebooks fail with one `UserWarning` and one `ValueError` raised: the `wandb` sweep terminates without performing training runs. Downgrading to `pl` v1.4.9 resolves the issue.

To Reproduce
Reproduced at the end of this minimally modified BoringModel Colab notebook:
https://colab.research.google.com/drive/16nylTt8jGAbiSfq7zr7YlLOohfIbFqHq?usp=sharing
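The notebook above is the authoritative reproduction; the sketch below only illustrates the general shape of such a sweep (the model, sweep parameters, and project name are stand-ins, not taken from the notebook):

```python
# Illustrative wandb sweep driving a Lightning training run; stand-in model
# and hyperparameters, not the code from the linked notebook.
import torch
import wandb
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger

class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

class BoringModel(LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)
    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)
    def train_dataloader(self):
        return DataLoader(RandomDataset(), batch_size=8)

def train():
    # Each sweep trial gets its own wandb run and a fresh Trainer.
    with wandb.init() as run:
        model = BoringModel(lr=run.config["learning_rate"])
        trainer = Trainer(max_epochs=1, logger=WandbLogger())
        trainer.fit(model)

sweep_config = {
    "method": "random",
    "parameters": {"learning_rate": {"values": [1e-3, 1e-4]}},
}
sweep_id = wandb.sweep(sweep_config, project="boring-sweep")
wandb.agent(sweep_id, function=train, count=2)
```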
Expected behavior
The `wandb` sweep should run training loops without terminating due to these errors.

Environment
Additional context