Extend `val_check_interval` support (#8135)

Comments
Dear @amoshyc, thanks for reporting this issue. Would you be interested in creating a fix PR? Best.
Hi @tchaton, sorry for the late reply. I've been busy with my research and haven't had time to fix it.
Dear @MirMustafaAli, yes, feel free to work on this one and ping us if you get blocked :) Best,
Thanks
@tchaton Feedback required on the solution and on how to proceed.
```python
train_set = RandomDataset(1, 32)
valid_set = RandomDataset(1, 32)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=2)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=2)
```

Given this definition, your dataset has 16 batches per epoch (32 samples with a batch size of 2). Since you set `val_check_interval=10`, validation is checked against the batch counter within each epoch, not against the global step.

The issue here is that the docs definition is ambiguous:

> val_check_interval: How often to check the validation set. Use float to check within a training epoch, use int to check every n steps (batches).

"use int to check every n steps (batches)" does not explicitly state whether this counts within an epoch or across epochs. The current implementation counts within an epoch. Changing this would be a breaking change, and I can imagine users wanting to be able to choose which one to do. cc @PyTorchLightning/core-contributors

Repro (bug report model):

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def on_validation_start(self):
        print("global_step:", self.global_step)


def run():
    train_set = RandomDataset(1, 32)
    valid_set = RandomDataset(1, 32)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=2)
    valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_steps=100,
        val_check_interval=10,
        num_sanity_val_steps=0,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader)


if __name__ == "__main__":
    run()
```
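To make the within-epoch counting concrete, here is a small standalone sketch of the behaviour described above. This is my own simplification for illustration, not Lightning's actual implementation, and `validation_steps` is a hypothetical helper:

```python
# Hedged sketch (NOT Lightning's source) of within-epoch counting: the batch
# counter resets at every epoch boundary, so validation drifts away from
# multiples of the global step as soon as an epoch ends.
def validation_steps(max_steps, batches_per_epoch, val_check_interval):
    """Return the global steps at which validation would run."""
    triggers = []
    global_step = 0
    while global_step < max_steps:
        for batch_idx in range(batches_per_epoch):
            global_step += 1
            if global_step > max_steps:
                break
            # the check uses the per-epoch batch index, not the global step
            if (batch_idx + 1) % val_check_interval == 0:
                triggers.append(global_step)
        # batch_idx implicitly resets here at the epoch boundary
    return triggers

# 16 batches per epoch (32 samples / batch size 2), as in the repro:
print(validation_steps(100, 16, 10))    # [10, 26, 42, 58, 74, 90]
# a dataset large enough that one epoch covers all 100 steps:
print(validation_steps(100, 1000, 10))  # [10, 20, 30, ..., 100]
```

With the small dataset, only one validation fires per 16-step epoch, which matches the inconsistent `global_step` printouts in the repro.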
@carmocca note there is also the
I think so. These are all the options.
Also similar to #12000
@carmocca is it possible to have one option preferred over another? These are a few things we might have to change. Should we implement it?
Hey @MirMustafaAli, I believe the simplest way to resolve this issue is the one shared by @carmocca.
@tchaton
Yes, we should describe all the possible options and the flags to set to achieve each of them.
Shall I start with these options?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
So if I understand correctly, right now there is no good way to handle training with max_steps & val_check_interval where max_steps > len(dataloader)? Ideally we want to always trigger validation after the final step. So for instance, if I train 30,000 steps, validate every 3000 steps, and the length of the dataloader is 20,000, validation will currently happen after the following steps:
Right now for this workflow, I manually run validation after trainer.fit finishes and save an additional checkpoint after step 29999.

In terms of potential workarounds, I could have my training dataset lie about its length and report some ridiculously high number, then take the modulo of the requested index with the true size of the dataset in order to make it valid. Then all steps would fit inside a single "epoch" and things should just work. On the other hand, even if the dataset reports a multiple of its original length, this still has an impact on sampling, as you no longer train on every single sample once before training on them again. This could be addressed by using a custom sampler which is aware of the original dataset length:

```python
import numpy as np
import torch.distributed as dist


class PsuedoEpochSampler(object):
    def __init__(self, l: int, n: int):
        """
        Parameters
        ----------
        l
            Length of the dataset (multiple of n and the original length of the dataset)
        n
            Dataset reports its original length multiplied by this number
        """
        assert l % n == 0
        self._l = l // n
        self._n = n
        self._g = None

    def _generator(self):
        idx = np.arange(self._l)
        for ni in range(0, self._n):
            np.random.shuffle(idx)
            for idxi in idx:
                yield int(idxi)

    def __iter__(self):
        self._g = self._generator()
        return self

    def __next__(self):
        return next(self._g)

    def __len__(self):
        return self._l * self._n


class PsuedoEpochSamplerDistributed(object):
    def __init__(self, l: int, n: int, s: int = 0):
        """
        Parameters
        ----------
        l
            Length of the dataset (multiple of n and the original length of the dataset)
        n
            Dataset reports its original length multiplied by this number
        s
            Seed for the initial shuffle that splits the dataset between ranks
        """
        assert l % n == 0
        self._l = l // n
        self._n = n
        self._g = None
        idx = np.arange(self._l)
        # Initial shuffle for splitting the dataset between ranks
        state_o = np.random.get_state()
        np.random.seed(s)
        np.random.shuffle(idx)
        np.random.set_state(state_o)
        self._idx = np.array_split(idx, dist.get_world_size())
        self._rank_len = min([len(_) for _ in self._idx])
        for i in range(0, len(self._idx)):
            self._idx[i] = self._idx[i][:self._rank_len]

    def _generator(self):
        idx = self._idx[dist.get_rank()].copy()
        for ni in range(0, self._n):
            np.random.shuffle(idx)
            for idxi in idx:
                yield int(idxi)

    def __iter__(self):
        self._g = self._generator()
        return self

    def __next__(self):
        return next(self._g)

    def __len__(self):
        return self._n * self._rank_len
```
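As a self-contained illustration of the pseudo-epoch idea, here is a minimal single-process sketch (my own simplified rewrite, not the classes above; `PseudoEpochSampler` is a hypothetical stand-in). The key property is that every true index is drawn exactly once per pass, even though the dataset reports an inflated length:

```python
import random

class PseudoEpochSampler:
    """Draws each true index once per pass while the dataset
    reports n times its real length."""

    def __init__(self, reported_len, n):
        assert reported_len % n == 0
        self.true_len = reported_len // n
        self.n = n

    def __iter__(self):
        idx = list(range(self.true_len))
        for _ in range(self.n):  # one shuffle per pseudo-epoch
            random.shuffle(idx)
            yield from idx

    def __len__(self):
        return self.true_len * self.n

# dataset pretends to hold 128 items while really holding 32:
order = list(PseudoEpochSampler(reported_len=128, n=4))
# each true index 0..31 appears exactly 4 times, once per 32-draw pass
```

Passing such a sampler to a `DataLoader` via its `sampler` argument keeps the every-sample-once-per-pass guarantee while letting all steps fit inside a single long "epoch".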
What's the status on this? Like @csvance pointed out, whenever a new epoch starts, it throws off the validation-by-update-step count.
It's labeled as "help wanted", meaning, waiting for somebody to start working on it. |
Is it fine if I start working on it then? |
Yes. You can take one of the not possible/implemented options I described here: |
Okay, I think I will work on `val_check_interval` support.
@carmocca Any news about this issue? This feature is really useful for scenarios in which the training set is small and validation is expensive (e.g. few-shot generation tasks).
🐛 Bug
When I train the model by specifying a number of training steps instead of epochs, `val_check_interval` behaves strangely. Please see the following colab: https://colab.research.google.com/drive/1I0ySRH03T9LdXHoGwCp3Q242dHEHWP_0?usp=sharing
In the code, I log the `global_step` on each validation. I set the Trainer's `max_steps` to 100 and `val_check_interval` to 10. But when I run cells In[4] and In[5], the outputs are different. The only difference between In[4] and In[5] is the number of samples in the dataset, which should not be the reason.
In[4]:
Out[4]:
In[5]:
Out[5]:
Expected behavior
Since I specify `max_steps` and set `val_check_interval` to an integer, I expect the result to be the same as Out[4] no matter how many samples are in the dataset. The doc says that `val_check_interval` specifies the number of training steps between validations, so Out[5] should be the same as Out[4]. I also expect the number of times validation is performed to be the same. BTW, the x-axis in TensorBoard is also wrong; you can see that in Out[9].
Environment
cc @Borda @tchaton