Error with ddp #4033
Comments
Hi! Thanks for your contribution, great first issue!

Hey! Can you try to reproduce the error by running our base model for testing?

This issue doesn't reproduce with the testing model.

Not 100% sure, but can you try initializing

I just removed the tokenizer from the script and the error still reproduces.

But you are using it here.

I replaced it with a number (20000).

Can you update the code above?

Then that's a good place to start looking for differences.
Test code:

```python
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
# USE THIS MODEL TO REPRODUCE A BUG YOU REPORT
# --------------------------------------------
# --------------------------------------------
# --------------------------------------------
import os

import torch
from torch.utils.data import Dataset

from pytorch_lightning import Trainer, LightningModule


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class RandomIteratableDataset(torch.utils.data.IterableDataset):
    def __init__(self, size, length):
        self.len = length
        self.size = size
        self.data = torch.randn(length, size)

    def __iter__(self):
        yield torch.randn(20000, self.size)

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():
    class TestModel(BoringModel):
        def on_train_epoch_start(self) -> None:
            print('override any method to prove your bug')

    # fake data
    train_data = torch.utils.data.DataLoader(RandomIteratableDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomIteratableDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomIteratableDataset(32, 64))

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        distributed_backend='ddp'
    )
    trainer.fit(model, train_data, val_data)
    trainer.test(test_dataloaders=test_data)


if __name__ == '__main__':
    run_test()
```

Code for reproducing the error:

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from linformer_pytorch import LinformerLM
from torch.utils.data import DataLoader


class MyDummyDataset(torch.utils.data.IterableDataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 354800

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            yield torch.randint(20000, (5120,)), torch.randint(20000, (5120,))
        else:
            yield torch.randint(20000, (5120,)), torch.randint(20000, (5120,))


class LiLinformer(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = LinformerLM(
            num_tokens=20000,  # Number of tokens in the LM
            input_size=5120,  # Dimension 1 of the input
            channels=128,  # Dimension 2 of the input
            dim_d=None,  # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
            dim_k=128,  # The second dimension of the P_bar matrix from the paper
            dim_ff=128,  # Dimension in the feed forward network
            dropout_ff=0.15,  # Dropout for feed forward network
            nhead=16,  # Number of attention heads
            depth=12,  # How many times to run the model
            dropout=0.1,  # How much dropout to apply to P_bar after softmax
            activation="gelu",  # What activation to use. Currently, only gelu and relu supported, and only on ff network.
            checkpoint_level="C2",  # What checkpoint level to use. For more information, see below.
            parameter_sharing="none",  # What level of parameter sharing to use. For more information, see below.
            k_reduce_by_layer=0,  # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
            full_attention=False,  # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
            include_ff=True,  # Whether or not to include the Feed Forward layer
            w_o_intermediate_dim=None,  # If not None, have 2 w_o matrices, such that instead of `dim*nead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
            emb_dim=128,  # If you want the embedding dimension to be different than the channels for the Linformer
        )

    def forward(self, inputs):
        output = self.model(inputs)
        return output

    def training_step(self, inputs, mm):
        inp, labels = inputs
        loss_mx = labels != -100
        output = self.model(inp)
        output = output[loss_mx].view(-1, 20000)
        labels = labels[loss_mx].view(-1)
        loss = F.cross_entropy(output, labels)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def run_main():
    lt_model = LiLinformer()
    train_data = torch.utils.data.DataLoader(MyDummyDataset(), batch_size=4)
    val_data = torch.utils.data.DataLoader(MyDummyDataset(), batch_size=4)
    trainer = pl.Trainer(max_epochs=2, gpus=2, distributed_backend='ddp')
    trainer.fit(lt_model, train_data, val_data)


if __name__ == "__main__":
    run_main()
```

I take LinformerLM from this repo. Also, if I try to start training this model without Lightning, using just vanilla PyTorch, I also get the error. After this simple fix, vanilla PyTorch DDP runs without the error, but Lightning still reproduces it.
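For reference, a minimal sketch of what such a vanilla-PyTorch DDP comparison run could look like (an assumption, not the author's actual script; it reuses the `MyDummyDataset` and `LiLinformer` classes defined above):

```python
# Sketch only: spawn one process per GPU and train the same LinformerLM
# configuration without Lightning, to compare against the Lightning DDP run.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader


def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # reuse the exact model configuration from the LightningModule above
    model = LiLinformer().model.cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

    loader = DataLoader(MyDummyDataset(), batch_size=4)
    for inp, labels in loader:
        inp, labels = inp.cuda(rank), labels.cuda(rank)
        output = ddp_model(inp).view(-1, 20000)
        loss = F.cross_entropy(output, labels.view(-1))
        optimizer.zero_grad()
        loss.backward()  # this is where the DDP error would surface, if at all
        optimizer.step()
        break  # a single step is enough for the comparison

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```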
Looking at the error message at the top of what you posted:

I found that your LiLinformer model uses checkpointing (`checkpoint_level="C2"`), which looks like the cause of the issue. I'm not sure if something needs to be enabled to make checkpointing possible in Lightning. I would first make sure that your model does not do what is described in point 2) of the hint above.
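For context, point 2) of that hint refers to parameters that take part in more than one re-entrant backward pass, which is what activation checkpointing the same weights more than once produces under DDP. Below is a minimal illustrative sketch of that pattern (an assumption for illustration only, not code from this issue; the module name is invented):

```python
# Illustrative sketch: checkpointing the SAME submodule twice makes its weight
# participate in two re-entrant backward passes, so DDP's reducer can see the
# parameter "marked ready" more than once per iteration.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class SharedCheckpointedBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 32)

    def forward(self, x):
        x = checkpoint(self.layer, x)  # first re-entrant backward uses self.layer.weight
        x = checkpoint(self.layer, x)  # ...and so does the second one
        return x


if __name__ == "__main__":
    # single-process "gloo" group, just enough to construct DDP on CPU
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(SharedCheckpointedBlock())
    out = model(torch.randn(4, 32, requires_grad=True))
    out.sum().backward()  # may raise "Expected to mark a variable ready only once"

    dist.destroy_process_group()
```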
Hey, author of the other repo here. I think that error 2 is happening:

Here, the model is being run with checkpointing and parameter sharing, and this is causing the error to appear, since the same variable is being marked ready more than once. Switch the
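The suggestion above is cut off in the quote; a hedged guess, based on the constructor arguments in the reproduction script, is to turn off activation checkpointing so that no Linformer parameter is reused across re-entrant backward passes, e.g. in `LiLinformer.__init__`:

```python
# Hedged guess at the suggested change (the original comment is truncated):
# disable activation checkpointing in LinformerLM.
self.model = LinformerLM(
    num_tokens=20000,
    input_size=5120,
    channels=128,
    checkpoint_level="C0",    # assumption: "C0" means no checkpointing in linformer_pytorch
    parameter_sharing="none",
    # ... remaining arguments exactly as in the reproduction script above
)
```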
Thanks all for the help. This issue is not connected with Lightning; it comes from vanilla torch. Sorry for the misreport.
🐛 Bug
When I run DDP with the sample script I get the following error:

but with dp everything runs normally.
Code sample
Environment
- CUDA:
  - GPU:
    - TITAN RTX
    - TITAN RTX
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.19.1
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 0.10.0
  - tqdm: 4.50.2
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019