
Disable memory sharing on model parameters in ddp-spawn #18238

Merged
29 commits merged into master from bugfix/tensor-memory-sharing on Aug 15, 2023
Conversation

@awaelchli (Contributor) commented Aug 6, 2023

What does this PR do?

Fixes #17399

The torch.multiprocessing.spawn launcher (strategy="ddp_spawn") by default enables memory sharing for all tensors passed through the spawning function, including the tensors inside modules. The underlying storage of these tensors is therefore shared across all processes, and any process can read from or write to them. This can lead to inconsistencies, as demonstrated in the repro example in the linked issue. This PR disables that by cloning the weights in each process, detaching them from shared memory (see the sketch after the note below).

Note:
This only applies when running on CPU; when running on GPU, memory is not shared.
Hopefully, this will also help avoid flakiness in tests.
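
As a rough illustration of the fix (a minimal sketch, not the PR's actual implementation; the helper name is hypothetical), a spawned worker on CPU can detach a module's tensors from shared memory by cloning them:

```python
import torch
from torch import nn


@torch.no_grad()
def _clone_out_of_shared_memory(module: nn.Module) -> nn.Module:
    """Replace shared-memory storage of all parameters and buffers with private copies.

    After torch.multiprocessing.spawn on CPU, tensors passed to the worker live in
    shared memory, so an in-place write in one process is visible to every other process.
    Cloning gives each process its own storage.
    """
    for tensor in list(module.parameters()) + list(module.buffers()):
        # is_shared() reports True for all CUDA tensors, so restrict the check to CPU
        # tensors, which are the only ones the spawn launcher places in shared memory.
        if tensor.device.type == "cpu" and tensor.is_shared():
            tensor.data = tensor.data.clone()
    return module
```

Called on the model at the start of each spawned worker (only relevant for CPU runs), this keeps one rank's in-place updates from leaking into the other ranks through shared weights.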

cc @Borda @carmocca @justusschock @awaelchli

@awaelchli added the bug, strategy: ddp spawn, and fun labels Aug 6, 2023
@awaelchli added this to the 2.0.x milestone Aug 6, 2023
@github-actions bot added the fabric and pl labels Aug 6, 2023
@awaelchli modified the milestones: 2.0.x, v1.9.x Aug 6, 2023
@awaelchli force-pushed the bugfix/tensor-memory-sharing branch from a3733be to d10e1c4 on August 8, 2023 22:22
@awaelchli marked this pull request as ready for review August 9, 2023 21:44
@awaelchli changed the title from "Disable memory sharing on model parameters in ddp-spawn [TPU]" to "WIP: Disable memory sharing on model parameters in ddp-spawn [TPU]" Aug 10, 2023
@awaelchli requested a review from Borda as a code owner August 13, 2023 15:01
@awaelchli changed the title from "WIP: Disable memory sharing on model parameters in ddp-spawn [TPU]" to "WIP: Disable memory sharing on model parameters in ddp-spawn" Aug 13, 2023
@awaelchli changed the title from "WIP: Disable memory sharing on model parameters in ddp-spawn" to "Disable memory sharing on model parameters in ddp-spawn" Aug 13, 2023
src/lightning/pytorch/CHANGELOG.md — review thread resolved (outdated)
src/lightning/fabric/CHANGELOG.md — review thread resolved (outdated)
@mergify bot removed the "has conflicts" label Aug 13, 2023
src/lightning/fabric/CHANGELOG.md — review thread resolved (outdated)
@mergify bot added the "ready" label Aug 15, 2023
@Borda merged commit a0ca2c8 into master Aug 15, 2023
@Borda deleted the bugfix/tensor-memory-sharing branch August 15, 2023 12:39
Borda pushed a commit that referenced this pull request Aug 28, 2023
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

(cherry picked from commit a0ca2c8)
lantiga pushed a commit that referenced this pull request Aug 30, 2023
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

(cherry picked from commit a0ca2c8)
@awaelchli added the strategy: ddp label and removed the strategy: ddp spawn label Nov 4, 2023
Labels
bug, fabric, fun, pl, ready, reproducibility, strategy: ddp
Projects
None yet
Development
Successfully merging this pull request may close these issues:
Cannot replicate training results with seed_everything and strategy=ddp_spawn
3 participants