Change auto-device selection for Jupyter notebook environments #18291

Merged
awaelchli merged 19 commits into master from feature/auto-device-notebooks
Sep 8, 2023

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Aug 11, 2023

What does this PR do?

Fixes #18281

This PR changes the devices="auto" behavior when running inside a Jupyter notebook.

Before:

# Selects all GPUs available on the machine, and sets `strategy="ddp_fork"`
Trainer(devices="auto")

Now:

# Selects only one GPU, no ddp
Trainer(devices="auto")

We've seen multiple reports similar to #18281 in which users hit segmentation faults and CUDA errors. These are too hard to address at the moment. For a better user experience, we propose switching the default back to a single device and adding an info message stating that multiple devices can be used, but that the feature is currently considered "unstable".
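
The selection rule described above can be sketched in plain Python. This is only an illustration of the intended behavior, not the code in `connector.py` or `accelerator_connector.py`; the function name `resolve_auto_devices`, its parameters, and the message wording are all hypothetical.

```python
def resolve_auto_devices(num_available: int, interactive: bool) -> int:
    """Return how many devices `devices="auto"` would pick (illustrative only).

    `num_available` is the number of GPUs visible on the machine and
    `interactive` is True when running inside a Jupyter notebook.
    Both names are assumptions made for this sketch.
    """
    if num_available < 0:
        raise ValueError("num_available must be non-negative")

    if interactive and num_available > 1:
        # Behavior proposed in this PR: pick a single device in notebooks and
        # tell the user that multi-device training is possible but unstable.
        print(
            f"{num_available} devices are available, but only one is selected "
            "in interactive environments. Multi-device training in notebooks "
            "is currently considered unstable."
        )
        return 1

    # Outside notebooks (or with at most one device) nothing changes:
    # "auto" resolves to every available device.
    return num_available
```

With four GPUs, `resolve_auto_devices(4, interactive=True)` yields one device, while the same call with `interactive=False` still yields all four.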

cc @Borda @justusschock @carmocca @awaelchli

@github-actions github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels Aug 11, 2023
@awaelchli awaelchli changed the title WIP: Change auto-device selection for Jupyter notebook environments Change auto-device selection for Jupyter notebook environments Aug 18, 2023
@awaelchli awaelchli marked this pull request as ready for review August 18, 2023 01:13
@awaelchli awaelchli added this to the 2.1 milestone Aug 18, 2023
@github-actions
Contributor

github-actions bot commented Aug 18, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
[pytorch-lightning (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=173683&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success
[pytorch-lightning (GPUs) (testing PyTorch latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=173683&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/trainer/connectors/test_accelerator_connector.py, src/lightning/fabric/connector.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 fabric: Docs
Check ID Status
docs-make (fabric, doctest) success
docs-make (fabric, html) success

These checks are required after the changes to src/lightning/fabric/connector.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/trainer/connectors/accelerator_connector.py, docs/source-pytorch/advanced/speed.rst, docs/source-pytorch/extensions/plugins.rst, docs/source-pytorch/starter/style_guide.rst.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11) success
fabric-cpu (windows-2022, lightning, 3.9, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success
fabric-cpu (macOS-12, fabric, 3.11, 2.0) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.0) success
fabric-cpu (windows-2022, fabric, 3.11, 2.0) success

These checks are required after the changes to src/lightning/fabric/connector.py, tests/tests_fabric/test_connector.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
[lightning-fabric (GPUs) (testing Fabric latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=173685&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success
[lightning-fabric (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=173685&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success

These checks are required after the changes to src/lightning/fabric/connector.py, tests/tests_fabric/test_connector.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli awaelchli added the strategy: ddp DistributedDataParallel label Aug 18, 2023
Contributor

@carmocca carmocca left a comment

I must say that I would prefer the alternative of keeping the current behavior - which is consistent with all other environments - and suggesting the possibility of running with devices=1 if things go wrong.

It would also avoid having to revert this breaking change again (meaning another breaking change) if support for multi-GPU environments with notebooks improves in the future.

However, I don't have a strong enough opinion to block this, so here's my stamp to proceed if you feel comfortable making this change
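
Either way, a notebook user can avoid depending on the `"auto"` default by passing `devices` explicitly. The following is a minimal sketch, assuming the Lightning 2.x `Trainer` arguments `accelerator`, `devices`, and `strategy`; the helper name `notebook_trainer_kwargs` is hypothetical, and the use of `"ddp_notebook"` as the multi-GPU strategy is an assumption of this sketch rather than something stated in this PR.

```python
from typing import Any, Dict


def notebook_trainer_kwargs(num_gpus: int = 1) -> Dict[str, Any]:
    """Build explicit Trainer arguments for a notebook session (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    kwargs: Dict[str, Any] = {"accelerator": "gpu", "devices": num_gpus}
    if num_gpus > 1:
        # Opt in to multi-GPU training explicitly. This is the configuration
        # the PR describes as "unstable" inside notebooks.
        kwargs["strategy"] = "ddp_notebook"
    return kwargs


# Usage (requires Lightning and at least one GPU, so it is left commented out):
# from lightning.pytorch import Trainer
# trainer = Trainer(**notebook_trainer_kwargs(1))   # the safe fallback, devices=1
# trainer = Trainer(**notebook_trainer_kwargs(2))   # explicit multi-GPU opt-in
```

Spelling out the device count keeps a notebook's behavior the same regardless of which default a given Lightning version applies to `devices="auto"`.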

@carmocca carmocca added the breaking change Includes a breaking change label Aug 18, 2023
@mergify mergify bot added the ready PRs ready to be merged label Aug 18, 2023
@codecov

codecov bot commented Aug 22, 2023

Codecov Report

Merging #18291 (9f8f749) into master (fbdbe63) will decrease coverage by 23%.
The diff coverage is 100%.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #18291     +/-   ##
=========================================
- Coverage      85%      62%    -23%     
=========================================
  Files         427      422      -5     
  Lines       33138    33054     -84     
=========================================
- Hits        28188    20406   -7782     
- Misses       4950    12648   +7698     

@mergify mergify bot added has conflicts and removed ready PRs ready to be merged labels Aug 23, 2023
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Sep 8, 2023
@awaelchli awaelchli merged commit 4dfc09c into master Sep 8, 2023
@awaelchli awaelchli deleted the feature/auto-device-notebooks branch September 8, 2023 15:49
Labels
breaking change: Includes a breaking change
fabric: lightning.fabric.Fabric
pl: Generic label for PyTorch Lightning package
ready: PRs ready to be merged
strategy: ddp: DistributedDataParallel
Development

Successfully merging this pull request may close these issues.

Should the Trainer select all GPUs in a Jupyter environment?
3 participants