[fix] Add barrier to accelerator's teardown #6814
Conversation
We should add a test to make sure this case isn't forgotten again; however, I can't seem to reproduce it. I took the reproduction script from the original issue and modified it a bit since the API has changed, but it runs fine:

```python
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
# USE THIS MODEL TO REPRODUCE A BUG YOU REPORT
# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
import glob
import os

import torch
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import Dataset


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('loss', loss)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('x', loss, sync_dist=True)

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('y', loss, sync_dist=True)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():

    class TestModel(BoringModel):

        def validation_step(self, batch, batch_idx):
            output = self.layer(batch)
            loss = self.loss(batch, output)
            self.log('x', loss)

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    tmp_dir = 'temp/'
    if os.path.exists(tmp_dir):
        os.rmdir(tmp_dir)
    checkpoint = ModelCheckpoint(
        dirpath=tmp_dir,
        monitor='x',
        mode='min',
        save_top_k=1
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=2,
        accelerator='ddp',
        gpus=2,
        callbacks=[checkpoint]
    )
    trainer.fit(model, train_data, val_data)

    checkpoints = list(sorted(glob.glob(os.path.join(tmp_dir, "*.ckpt"), recursive=True)))
    print("checkpoints", checkpoints)
    print(checkpoint.best_model_path)
    assert os.path.exists(
        checkpoint.best_model_path), f'Could not find checkpoint at rank {trainer.global_rank}'


if __name__ == '__main__':
    run_test()
```
@SeanNaren n00b question: what's the recommended way to write a multi-GPU/CPU test? Are these special tests? Using the ...
LGTM !
Hey @ananthsub, if you use ... Best,
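For illustration, one generic way to gate a multi-GPU test with plain pytest (a sketch only; the test name and body are placeholders, and Lightning's own suite relies on its special-test setup, which may differ):

```python
import pytest
import torch


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="requires at least 2 GPUs")
def test_best_checkpoint_path_available_on_all_ranks():
    # Placeholder body: a real test would spawn a DDP Trainer (as in the
    # reproduction script above) and assert that `checkpoint.best_model_path`
    # exists on every rank once `fit()` returns.
    assert torch.cuda.device_count() >= 2
```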
Force-pushed from 306e876 to 7e9e49c.
@SeanNaren as for why it's hard to reproduce: this happens when we load checkpoint weights: https://github.com/PyTorchLightning/pytorch-lightning/blob/19e67d18c472c3a03dec4dd9bfcef031e9ca8719/pytorch_lightning/trainer/trainer.py#L992-L995 But I think this is the wrong spot to apply it. We should synchronize before we return to the user, because they could do anything else afterward, not just call back into ...
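To make the ordering concrete, here is a minimal sketch of the race and the fix idea using plain `torch.distributed` (illustrative only; the actual PR places the barrier in the accelerator's teardown, and Lightning's internal barrier helper may differ):

```python
import os
import torch.distributed as dist


def teardown_sync(best_model_path: str) -> None:
    """Sketch: block until all ranks reach this point before returning to user code."""
    # Rank 0 is typically the only process writing the best checkpoint to disk.
    # Without a barrier, non-zero ranks can leave `fit()` and immediately try to
    # read `best_model_path` (e.g. via `trainer.test()`) before the file exists.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # every rank waits here until rank 0 has finished saving
    # After the barrier (and assuming a shared filesystem), the checkpoint is
    # visible to every rank.
    assert os.path.exists(best_model_path)
```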
LGTM !
@ananthsub There's a failing test. Link
@kaushikb11 @tchaton do the Azure pipelines need any different filesystem access? Or is https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=5034&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=d9f671c5-a304-5675-5394-961fd7f98b9b a real failure that we need to debug, meaning this PR isn't fixing the issue?
@ananthsub It's strange. Is the test passing on your machine?
> But I think this is the wrong spot to apply it.

Should we remove that test barrier then?
```
trainer.fit(module)
trainer.test() <-- best checkpoint path needs to be available on all ranks
```
nit: no indentation
```suggestion
trainer.fit(module)
trainer.test() <-- best checkpoint path needs to be available on all ranks
```
Force-pushed from 9239ee3 to 9abbfea.
Codecov Report
```diff
@@            Coverage Diff            @@
##           master   #6814     +/-   ##
=========================================
+ Coverage      44%     87%     +43%
=========================================
  Files         197     194       -3
  Lines       12585   12395     -190
=========================================
+ Hits         5501   10776    +5275
+ Misses       7084    1619    -5465
```
Force-pushed from 9c0022c to ffa65f7.
Looks like one of the special tests is timing out, meaning not every process is entering the barrier? That seems bizarre.
@awaelchli @Borda @SeanNaren do we need to look for free ports across tests?
I'm not sure, some tests have ...
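For context, a common way to pick a free rendezvous port for DDP tests (an illustrative sketch, not necessarily what Lightning's test utilities actually do) is to let the OS choose one and export it as `MASTER_PORT`:

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for an ephemeral port that is currently free."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 -> the kernel assigns a free port
        return s.getsockname()[1]


# Each test run gets its own rendezvous port, so parallel or back-to-back
# DDP tests don't collide on a stale MASTER_PORT.
os.environ["MASTER_PORT"] = str(find_free_port())
```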
Force-pushed from 12787b1 to 3a6b41d.
What does this PR do?
There doesn't seem to be any synchronization before we return from the main trainer API entry points (fit/test/validate/predict). This was previously added in #4323 to handle cases like calling `trainer.test()` right after `trainer.fit()`, where the best checkpoint path needs to be available on all ranks.
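As a concrete sketch of that pattern (built from the reproduction script above; `model`, `train_data`, `val_data`, and `checkpoint` are the objects defined there, and the assertion is illustrative):

```python
trainer.fit(model, train_data, val_data)   # rank 0 writes the best checkpoint to disk

# Without a barrier before fit() returns, non-zero ranks can reach this point while
# rank 0 is still saving, so the file may not yet be visible on a shared filesystem.
assert os.path.exists(checkpoint.best_model_path)

trainer.test()                             # every rank tries to load best_model_path
```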
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃