Support DDPPlugin to be used on CPU #6208

Merged: 59 commits, Jul 2, 2021
Changes from all commits
Commits (59)
a87149e
Skip test due to 'Python bus error'
carmocca Jun 24, 2021
807e636
Debug NCCL
carmocca Jun 24, 2021
76f9766
Remove NCCL_DEBUG statement
carmocca Jun 24, 2021
e1c4007
Revert "Skip test due to 'Python bus error'"
carmocca Jun 24, 2021
706ec17
fix
awaelchli Feb 25, 2021
f5e64a9
add test
awaelchli Feb 26, 2021
4d2f8b0
changelog
awaelchli Feb 26, 2021
c914724
yapf
awaelchli Feb 26, 2021
02ea55b
patch os environ
awaelchli Feb 26, 2021
5db42a4
make a special test
awaelchli Feb 27, 2021
9da2cf8
destroy pg
awaelchli Feb 28, 2021
db3d3b7
debug
awaelchli Feb 28, 2021
0b9d595
revert
awaelchli Feb 28, 2021
ed4b626
revert
awaelchli Feb 28, 2021
b0d4992
problematic test
awaelchli Feb 28, 2021
9a689ac
skip
awaelchli Feb 28, 2021
817674c
try the fixture
awaelchli Mar 18, 2021
589bf44
test
awaelchli Apr 8, 2021
0d81028
update sensitive test
awaelchli Apr 18, 2021
69bc1e1
update changelog
awaelchli Apr 18, 2021
eb175f4
remove comment
awaelchli Apr 18, 2021
198fef5
update wrong test
awaelchli Apr 18, 2021
a079b69
update test name
awaelchli Apr 18, 2021
e6f3be0
parameterization
awaelchli Apr 26, 2021
435909c
Revert "parameterization"
awaelchli Apr 26, 2021
62e7d90
remove conftest
awaelchli Apr 26, 2021
38eb366
ignore test
awaelchli Apr 27, 2021
355e92d
teardown
awaelchli Apr 27, 2021
b0c52a2
fix merge
awaelchli May 15, 2021
3cbb471
deep speed parameterization
awaelchli Jun 24, 2021
e8619f4
uncomment test
awaelchli Jun 24, 2021
8b8f59f
update chlog
awaelchli Jun 24, 2021
dcf6138
update changelog
awaelchli Jun 24, 2021
899e62c
split tests
awaelchli Jun 29, 2021
4bc5592
update test
awaelchli Jun 29, 2021
2849bbf
update test comments
awaelchli Jun 29, 2021
dfea688
unroll test
awaelchli Jun 29, 2021
29be6a0
unroll test
awaelchli Jun 29, 2021
fd3de54
Merge branch 'master' into bugfix/ddp_cpu
awaelchli Jun 29, 2021
09632e6
unroll test
awaelchli Jun 29, 2021
aebf46d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 29, 2021
8c27163
increase shm
awaelchli Jun 30, 2021
f93841f
Merge branch 'master' into bugfix/ddp_cpu
awaelchli Jun 30, 2021
6cc68c1
sudo
awaelchli Jun 30, 2021
476f525
unroll ipu
awaelchli Jun 30, 2021
b4e7ea6
Revert "sudo"
awaelchli Jun 30, 2021
bed8500
Revert "increase shm"
awaelchli Jun 30, 2021
e389a37
x
awaelchli Jul 1, 2021
c3ed6bd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 1, 2021
664aebb
Merge branch 'master' into bugfix/ddp_cpu
carmocca Jul 1, 2021
1e2ed6a
find guilty test
awaelchli Jul 1, 2021
f65e299
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 1, 2021
25bc0e9
POPTORCH_WAIT_FOR_IPU=1
awaelchli Jul 2, 2021
dbbda56
move test
awaelchli Jul 2, 2021
af1d383
redo parameterize for ipu
awaelchli Jul 2, 2021
6ba5a41
de-comment test
awaelchli Jul 2, 2021
da7b2bd
move chlog
awaelchli Jul 2, 2021
7f2ba1c
Update tests/accelerators/test_accelerator_connector.py
awaelchli Jul 2, 2021
5b40e21
Update tests/accelerators/test_accelerator_connector.py
awaelchli Jul 2, 2021
3 changes: 2 additions & 1 deletion .azure-pipelines/ipu-tests.yml
@@ -81,7 +81,7 @@ jobs:
- bash: |
source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh

export POPTORCH_WAIT_FOR_IPU=1
python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
env:
MKL_THREADING_LAYER: "GNU"
@@ -90,6 +90,7 @@ jobs:
- bash: |
source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
export POPTORCH_WAIT_FOR_IPU=1
bash tests/special_tests.sh
env:
MKL_THREADING_LAYER: "GNU"
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -362,6 +362,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states ([#8145](https://github.com/PyTorchLightning/pytorch-lightning/pull/8145))


- Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator ([#6208](https://github.com/PyTorchLightning/pytorch-lightning/pull/6208))


## [1.3.7] - 2021-06-22

- Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error ([#7975](https://github.com/PyTorchLightning/pytorch-lightning/pull/7975))
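For context, a minimal usage sketch of the behaviour the changelog entry above describes, distilled from the test added to tests/accelerators/test_accelerator_connector.py later in this PR. The BoringModel import is the helper used in the Lightning test suite and is an assumption here; any LightningModule works in its place.

```python
import torch

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin
from tests.helpers.boring_model import BoringModel  # test-suite helper; any LightningModule works

# A user-provided DDPPlugin instance is now honoured together with accelerator="ddp_cpu":
# the custom plugin is kept and its processes run on CPU devices.
model = BoringModel()
trainer = Trainer(
    plugins=[DDPPlugin(find_unused_parameters=True)],
    accelerator="ddp_cpu",
    num_processes=2,
    fast_dev_run=True,
)
assert isinstance(trainer.training_type_plugin, DDPPlugin)
assert trainer.training_type_plugin.parallel_devices == [torch.device("cpu")] * 2
trainer.fit(model)
```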
4 changes: 1 addition & 3 deletions pytorch_lightning/plugins/training_type/ddp.py
@@ -209,11 +209,9 @@ def _call_children_scripts(self):
if self.parallel_devices is None:
raise MisconfigurationException("you selected (distribute_backend = ddp) but did not set Trainer(gpus=?)")

os.environ["PL_TRAINER_GPUS"] = ",".join([str(device.index) for device in self.parallel_devices])
os.environ["PL_IN_DDP_SUBPROCESS"] = "1"

num_gpus = len(self.parallel_devices)
os.environ["WORLD_SIZE"] = f"{num_gpus * self.num_nodes}"
os.environ["WORLD_SIZE"] = f"{self.num_processes * self.num_nodes}"

self.interactive_ddp_procs = []

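A small illustration of the new world-size arithmetic in _call_children_scripts: the process count, not the length of a GPU device list, now determines WORLD_SIZE, which also covers the ddp_cpu case where the parallel devices are plain CPU placeholders. The values below are illustrative, not taken from the diff.

```python
import os

import torch

# Illustrative configuration: Trainer(accelerator="ddp_cpu", num_processes=2, num_nodes=1)
num_processes, num_nodes = 2, 1
parallel_devices = [torch.device("cpu")] * num_processes  # CPU devices carry no .index
assert len(parallel_devices) == num_processes  # for ddp_cpu, the device list mirrors the process count

# New computation from the diff above: driven by the process count.
os.environ["WORLD_SIZE"] = f"{num_processes * num_nodes}"
assert os.environ["WORLD_SIZE"] == "2"
```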
54 changes: 36 additions & 18 deletions tests/accelerators/test_accelerator_connector.py
@@ -18,6 +18,7 @@

import pytest
import torch
import torch.distributed

from pytorch_lightning import Trainer
from pytorch_lightning.accelerators.accelerator import Accelerator
@@ -385,6 +386,35 @@ def on_fit_start(self, trainer, pl_module):
trainer.fit(model)


@RunIf(special=True)
def test_accelerator_choice_ddp_cpu_and_plugin(tmpdir):
""" Test that accelerator="ddp_cpu" can work together with an instance of DDPPlugin. """
_test_accelerator_choice_ddp_cpu_and_plugin(tmpdir, ddp_plugin_class=DDPPlugin)


@RunIf(special=True)
def test_accelerator_choice_ddp_cpu_and_plugin_spawn(tmpdir):
""" Test that accelerator="ddp_cpu" can work together with an instance of DDPPSpawnPlugin. """
_test_accelerator_choice_ddp_cpu_and_plugin(tmpdir, ddp_plugin_class=DDPSpawnPlugin)


def _test_accelerator_choice_ddp_cpu_and_plugin(tmpdir, ddp_plugin_class):

model = BoringModel()
trainer = Trainer(
default_root_dir=tmpdir,
plugins=[ddp_plugin_class(find_unused_parameters=True)],
fast_dev_run=True,
accelerator='ddp_cpu',
num_processes=2,
)
assert isinstance(trainer.training_type_plugin, ddp_plugin_class)
assert isinstance(trainer.accelerator, CPUAccelerator)
assert trainer.training_type_plugin.num_processes == 2
assert trainer.training_type_plugin.parallel_devices == [torch.device("cpu")] * 2
trainer.fit(model)


@mock.patch.dict(
os.environ, {
"SLURM_NTASKS": "2",
@@ -396,11 +426,8 @@ def on_fit_start(self, trainer, pl_module):
}
)
@mock.patch('torch.cuda.device_count', return_value=0)
@mock.patch('pytorch_lightning.plugins.DDPPlugin.setup_distributed', autospec=True)
def test_accelerator_choice_ddp_cpu_custom_cluster(device_count_mock, setup_distributed_mock):
"""
Test that we choose the custom cluster even when SLURM or TE flags are around
"""
def test_accelerator_choice_ddp_cpu_custom_cluster(_, tmpdir):
""" Test that we choose the custom cluster even when SLURM or TE flags are around """

class CustomCluster(LightningEnvironment):

@@ -410,25 +437,16 @@ def master_address(self):
def creates_children(self) -> bool:
return True

class CB(Callback):

def on_fit_start(self, trainer, pl_module):
assert isinstance(trainer.accelerator, CPUAccelerator)
assert isinstance(trainer.training_type_plugin, DDPPlugin)
assert isinstance(trainer.training_type_plugin.cluster_environment, CustomCluster)
raise SystemExit()

model = BoringModel()
trainer = Trainer(
default_root_dir=tmpdir,
plugins=[CustomCluster()],
fast_dev_run=True,
accelerator='ddp_cpu',
num_processes=2,
callbacks=[CB()],
)

with pytest.raises(SystemExit):
trainer.fit(model)
assert isinstance(trainer.accelerator, CPUAccelerator)
assert isinstance(trainer.training_type_plugin, DDPPlugin)
assert isinstance(trainer.training_type_plugin.cluster_environment, CustomCluster)


@mock.patch.dict(
12 changes: 10 additions & 2 deletions tests/accelerators/test_ddp.py
@@ -126,8 +126,16 @@ def setup(self, stage: Optional[str] = None) -> None:


@RunIf(min_gpus=2, min_torch="1.8.1", special=True)
@pytest.mark.parametrize("precision", [16, 32])
def test_ddp_wrapper(tmpdir, precision):
def test_ddp_wrapper_16(tmpdir):
_test_ddp_wrapper(tmpdir, precision=16)


@RunIf(min_gpus=2, min_torch="1.8.1", special=True)
def test_ddp_wrapper_32(tmpdir):
_test_ddp_wrapper(tmpdir, precision=32)


def _test_ddp_wrapper(tmpdir, precision):
"""
Test parameters to ignore are carried over for DDP.
"""
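This file, and several below (progress bar, pruning, checkpoint frequency, DeepSpeed, data loading), applies the same refactor: @pytest.mark.parametrize on a special=True test is unrolled into separate test functions that call a shared private helper, presumably because tests/special_tests.sh runs each special test individually and handles plain test functions more reliably than parametrized ones. A generic sketch of the pattern with placeholder names, not code from the diff:

```python
import pytest

# Stand-in for Lightning's @RunIf(special=True) marker, used here only for illustration.
special = pytest.mark.special


def _run_feature_test(tmpdir, flag):
    """Shared body holding the actual test logic (placeholder)."""
    assert tmpdir is not None and isinstance(flag, bool)


# Instead of @pytest.mark.parametrize("flag", [False, True]) on a single test,
# each parameter combination becomes its own top-level special test.
@special
def test_feature_0(tmpdir):
    _run_feature_test(tmpdir, flag=False)


@special
def test_feature_1(tmpdir):
    _run_feature_test(tmpdir, flag=True)
```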
39 changes: 25 additions & 14 deletions tests/callbacks/test_progress_bar.py
@@ -543,22 +543,33 @@ def test_progress_bar_can_be_pickled():


@RunIf(min_gpus=2, special=True)
@pytest.mark.parametrize([
"total_train_samples",
"train_batch_size",
"total_val_samples",
"val_batch_size",
"val_check_interval",
], [
(8, 4, 2, 1, 0.2),
(8, 4, 2, 1, 0.5),
])
def test_progress_bar_max_val_check_interval(
total_train_samples, train_batch_size, total_val_samples, val_batch_size, val_check_interval, tmpdir
):
def test_progress_bar_max_val_check_interval_0(tmpdir):
_test_progress_bar_max_val_check_interval(
tmpdir,
total_train_samples=8,
train_batch_size=4,
total_val_samples=2,
val_batch_size=1,
val_check_interval=0.2
)


@RunIf(min_gpus=2, special=True)
def test_progress_bar_max_val_check_interval_1(tmpdir):
_test_progress_bar_max_val_check_interval(
tmpdir,
total_train_samples=8,
train_batch_size=4,
total_val_samples=2,
val_batch_size=1,
val_check_interval=0.5
)

world_size = 2

def _test_progress_bar_max_val_check_interval(
tmpdir, total_train_samples, train_batch_size, total_val_samples, val_batch_size, val_check_interval
):
world_size = 2
train_data = DataLoader(RandomDataset(32, total_train_samples), batch_size=train_batch_size)
val_data = DataLoader(RandomDataset(32, total_val_samples), batch_size=val_batch_size)

41 changes: 36 additions & 5 deletions tests/callbacks/test_pruning.py
@@ -162,13 +162,44 @@ def test_pruning_callback(


@RunIf(special=True, min_gpus=2)
@pytest.mark.parametrize("parameters_to_prune", [False, True])
@pytest.mark.parametrize("use_global_unstructured", [False, True])
def test_pruning_callback_ddp(tmpdir, use_global_unstructured: bool, parameters_to_prune: bool):
def test_pruning_callback_ddp_0(tmpdir):
train_with_pruning_callback(
tmpdir,
parameters_to_prune=parameters_to_prune,
use_global_unstructured=use_global_unstructured,
parameters_to_prune=False,
use_global_unstructured=False,
accelerator="ddp",
gpus=2,
)


@RunIf(special=True, min_gpus=2)
def test_pruning_callback_ddp_1(tmpdir):
train_with_pruning_callback(
tmpdir,
parameters_to_prune=False,
use_global_unstructured=True,
accelerator="ddp",
gpus=2,
)


@RunIf(special=True, min_gpus=2)
def test_pruning_callback_ddp_2(tmpdir):
train_with_pruning_callback(
tmpdir,
parameters_to_prune=True,
use_global_unstructured=False,
accelerator="ddp",
gpus=2,
)


@RunIf(special=True, min_gpus=2)
def test_pruning_callback_ddp_3(tmpdir):
train_with_pruning_callback(
tmpdir,
parameters_to_prune=True,
use_global_unstructured=True,
accelerator="ddp",
gpus=2,
)
13 changes: 11 additions & 2 deletions tests/checkpointing/test_checkpoint_callback_frequency.py
@@ -107,8 +107,17 @@ def training_step(self, batch, batch_idx):

@mock.patch('torch.save')
@RunIf(special=True, min_gpus=2)
@pytest.mark.parametrize(['k', 'epochs', 'val_check_interval', 'expected'], [(1, 1, 1.0, 1), (2, 2, 0.3, 5)])
def test_top_k_ddp(save_mock, tmpdir, k, epochs, val_check_interval, expected):
def test_top_k_ddp_0(save_mock, tmpdir):
_top_k_ddp(save_mock, tmpdir, k=1, epochs=1, val_check_interval=1.0, expected=1)


@mock.patch('torch.save')
@RunIf(special=True, min_gpus=2)
def test_top_k_ddp_1(save_mock, tmpdir):
_top_k_ddp(save_mock, tmpdir, k=2, epochs=2, val_check_interval=0.3, expected=5)


def _top_k_ddp(save_mock, tmpdir, k, epochs, val_check_interval, expected):

class TestModel(BoringModel):

9 changes: 9 additions & 0 deletions tests/conftest.py
@@ -18,6 +18,7 @@
from http.server import SimpleHTTPRequestHandler

import pytest
import torch.distributed
import torch.multiprocessing as mp


@@ -41,6 +42,14 @@ def restore_env_variables():
os.environ.update(env_backup)


@pytest.fixture(scope="function", autouse=True)
def teardown_process_group():
""" Ensures that the distributed process group gets closed before the next test runs. """
yield
if torch.distributed.is_available() and torch.distributed.is_initialized():
torch.distributed.destroy_process_group()


def pytest_configure(config):
config.addinivalue_line("markers", "spawn: spawn test in a separate process using torch.multiprocessing.spawn")

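The autouse teardown_process_group fixture above keeps distributed state from leaking between tests. A hedged sketch of a test that relies on it; the gloo backend and file-based init method are assumptions for the example, not taken from the diff:

```python
import torch.distributed as dist


def test_uses_a_process_group(tmpdir):
    # Single-process group purely for illustration.
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{tmpdir}/dist_store",
        rank=0,
        world_size=1,
    )
    assert dist.is_initialized()
    # No explicit destroy_process_group() here: the autouse fixture tears the group
    # down after the test, so the next test starts from a clean, uninitialized state.
```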
12 changes: 10 additions & 2 deletions tests/plugins/test_deepspeed_plugin.py
@@ -640,8 +640,16 @@ def test_deepspeed_multigpu_stage_3_checkpointing_full_weights_manual(tmpdir):


@RunIf(min_gpus=2, deepspeed=True, special=True)
@pytest.mark.parametrize('offload_optimizer', [True, False])
def test_deepspeed_multigpu_stage_2_accumulated_grad_batches(tmpdir, offload_optimizer):
def test_deepspeed_multigpu_stage_2_accumulated_grad_batches(tmpdir):
_deepspeed_multigpu_stage_2_accumulated_grad_batches(tmpdir, offload_optimizer=False)


@RunIf(min_gpus=2, deepspeed=True, special=True)
def test_deepspeed_multigpu_stage_2_accumulated_grad_batches_offload_optimizer(tmpdir):
_deepspeed_multigpu_stage_2_accumulated_grad_batches(tmpdir, offload_optimizer=True)


def _deepspeed_multigpu_stage_2_accumulated_grad_batches(tmpdir, offload_optimizer):
"""
Test to ensure with Stage 2 and multiple GPUs, accumulated grad batches works.
"""
10 changes: 7 additions & 3 deletions tests/trainer/test_data_loading.py
@@ -98,9 +98,13 @@ def check_replace_distributed_sampler(tmpdir, save_preds_on_dl_idx, accelerator,


@RunIf(min_gpus=2, special=True)
@pytest.mark.parametrize("mode", [1, 2])
def test_replace_distributed_sampler_custom_dataloader_custom_batch_sampler(tmpdir, mode):
check_replace_distributed_sampler(tmpdir, True, "ddp", 2, 2, mode)
def test_replace_distributed_sampler_custom_dataloader_custom_batch_sampler_0(tmpdir):
check_replace_distributed_sampler(tmpdir, True, "ddp", 2, 2, mode=1)


@RunIf(min_gpus=2, special=True)
def test_replace_distributed_sampler_custom_dataloader_custom_batch_sampler_1(tmpdir):
check_replace_distributed_sampler(tmpdir, True, "ddp", 2, 2, mode=2)


@pytest.mark.parametrize("num_workers", [0, 1])