🐛 Bug
It seems like the exact bug described in #2620 is back: when calling trainer.test() after trainer.fit(), LightningDataModule.setup() is called twice. This is especially problematic when using random_split in setup(), because it leads to training samples leaking into the test set (in the worst case of a very large training set compared to the test set, the re-drawn test set will most likely consist exclusively of training samples).
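To make the leakage concrete, here is a small standalone sketch (not part of the original report; the 64/16/16 proportions are just illustrative) showing that a second, independent random_split over the same data produces a test subset made up mostly of the first split's training samples:

from torch.utils.data import random_split

# 96 samples, split 64/16/16 into train/val/test, twice in a row
data = list(range(96))
train1, _, _ = random_split(data, [64, 16, 16])
_, _, test2 = random_split(data, [64, 16, 16])

train1_indices = set(train1.indices)
leaked = sum(1 for i in test2.indices if i in train1_indices)
print(f"{leaked}/{len(test2)} samples of the second test split were "
      "training samples in the first split")  # typically around 10-11 of 16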
To Reproduce
import os

import torch
from pytorch_lightning import LightningDataModule, LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset, random_split


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()

    def prepare_data(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        pass

    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        print("\nSetting up!!!\n")
        dataset = RandomDataset(32, 96)
        self.data_train, self.data_val, self.data_test = random_split(
            dataset, [32, 32, 32]
        )

    def train_dataloader(self):
        return DataLoader(self.data_train, batch_size=2)

    def val_dataloader(self):
        return DataLoader(self.data_val, batch_size=2)

    def test_dataloader(self):
        return DataLoader(self.data_test, batch_size=2)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    datamodule = MyDataModule()
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
    )
    trainer.fit(model=model, datamodule=datamodule)
    trainer.test(model=model, datamodule=datamodule)


if __name__ == "__main__":
    run()
Expected behavior
I am expecting MyDataModule.setup() to be called once, but it is called twice (easily verified by the print statement in setup()).
Environment
CUDA:
  - GPU:
    - Tesla V100S-PCIE-32GB
    - Tesla V100S-PCIE-32GB
    - Tesla V100S-PCIE-32GB
    - Tesla V100S-PCIE-32GB
  - available: True
  - version: 11.1
Packages:
  - numpy: 1.20.3
  - pyTorch_debug: False
  - pyTorch_version: 1.9.1
  - pytorch-lightning: 1.4.8
  - tqdm: 4.60.0
System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.11
  - version: #1 SMP Wed Jul 21 11:57:15 UTC 2021
Additional context
Maybe I misunderstand how prepare_data() and setup() are supposed to be used, especially in DDP mode, but the comment in the docs clearly says to do the val/train/test split in setup(), which currently leads to the described problem of training data ending up in the test split when a random split is used.
setup is currently called once per fit or test, not once overall. However, this behavior is changing in Lightning v1.6, where setup will be called unconditionally across these methods (#7301). Why is it changing?
Imagine the situation where you make edits to your datamodule in between calls to the trainer; this can lead to silent errors:
dm = MyDataModule()
trainer = Trainer(...)
model = MyLightningModule()
trainer.fit(model, dm)
# update dm with new training dataset
dm.train_dataset = ...
trainer.fit(model, dm) # <-- datamodule.setup() isn't called again!
So instead, you can write your datamodule so that the setup hooks are idempotent:
class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.data_train = None
        self.data_val = None
        self.data_test = None

    def setup(self, stage):
        if self.data_train is None:
            self.data_train = ...
        if self.data_val is None:
            self.data_val = ...
        if self.data_test is None:
            self.data_test = ...
You can further optimize this by using the stage argument passed to setup to determine which attributes need to be initialized. For instance, you only need to initialize self.data_test if you're testing; you don't need to initialize it during fitting. This can save you some extra memory and time.
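As a rough sketch of that idea (an illustration, not taken from the thread; it assumes the stage strings "fit" and "test" that the Trainer passes to setup, and the ... placeholders stand in for however each split is built):

class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.data_train = None
        self.data_val = None
        self.data_test = None

    def setup(self, stage=None):
        # Train/val data are only needed when fitting, test data only when testing;
        # the None checks keep each branch idempotent across repeated setup calls.
        if stage in (None, "fit") and self.data_train is None:
            self.data_train = ...
            self.data_val = ...
        if stage in (None, "test") and self.data_test is None:
            self.data_test = ...

Note that for the random_split case in this report, all three subsets come from a single call, so what actually prevents the re-draw is guarding that one call with a single None check; the stage checks are purely an optimization.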
#6420 is an issue to discuss whether we should have dedicated setup and prepare_data functions for each of the entry points, to better isolate this.