-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Error when logging to MLFlow deleted experiment #20556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error when logging to MLFlow deleted experiment #20556
Conversation
for more information, see https://pre-commit.ci
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #20556 +/- ##
=========================================
- Coverage 88% 79% -9%
=========================================
Files 267 264 -3
Lines 23380 23325 -55
=========================================
- Hits 20481 18366 -2115
- Misses 2899 4959 +2060 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thank you!
A test is not strictly needed but would be great to have one that verifies the condition here https://github.com/Lightning-AI/pytorch-lightning/blob/master/tests/tests_pytorch/loggers/test_mlflow.py
* Make error MLFlow deleted experiment explicit --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> (cherry picked from commit 29ed24f)
* Make error MLFlow deleted experiment explicit --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> (cherry picked from commit 29ed24f)
What does this PR do?
Addresses issue #20555.
When logging to an MLFlow experiment name that has been deleted (e.g. deleted from the MLFlow UI), the MLFlow experiment still exists in the backend, so
expt = self._mlflow_client.get_experiment_by_name(self._experiment_name)
retrieves it and Lightning proceeds as if the experiment already exists. When Lightning actually tries to log to the experiment, it stalls as MLFlow returns many 500 errors. Eventually it times out and the program crashes.This PR simply modifies the check to ensure that the run exists and also is not deleted. This will cause the experiment creation to fail, but with an error message that explicitly states that it cannot be created because there is a deleted experiment with the same name.
This could be handled differently to not crash the run (e.g. increment the experiment name by post-pending _1, _2, etc.) but I think the explicit failure with a clear message is my preferred behaviour.
Fixes #20555>
📚 Documentation preview 📚: https://pytorch-lightning--20556.org.readthedocs.build/en/20556/