
Behavior of the trainer.fit function #4821

Closed
sebaleme opened this issue Nov 23, 2020 · 3 comments
Labels
  • bug: Something isn't working
  • question: Further information is requested
  • waiting on author: Waiting on user action, correction, or update
  • won't fix: This will not be worked on

Comments

@sebaleme

sebaleme commented Nov 23, 2020

I am using the Lightning environment to train my NN model, and I am getting issues with the outputs. I decided to check the training dataset to make sure it is not corrupted during data preparation. However, depending on where I print the inputs, I get different results. If I print the data directly after the PyTorch DataLoader, I get what I expect.

Code

def fit(self, f_train_data, f_val_data):
        """ Calls the fit function of the model """

        train_dataloader = DataLoader(f_train_data.get_torch_tensors(), batch_size=self._batch_size, num_workers=4)
        val_dataloader = DataLoader(f_val_data.get_torch_tensors(), batch_size=self._batch_size, num_workers=4)

        for i, (input_x, target) in enumerate(train_dataloader):
            if i < 10:
                print(input_x)

        # fit() automatically stores a checkpoint of the weights after each epoch.
        history = self.trainer.fit(self._lightening_model, train_dataloader, val_dataloader)
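As a side note, here is a small sketch (purely illustrative, the reference_batches list is not part of my actual code) of how these first batches could additionally be kept as reference copies for a later value comparison:

    # Illustrative only: keep detached copies of the first 10 batches so they can
    # later be compared against the tensors the model receives in forward().
    reference_batches = []
    for i, (input_x, target) in enumerate(train_dataloader):
        if i >= 10:
            break
        reference_batches.append(input_x.detach().clone())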

In any case, printing directly from the DataLoader like this gives me the first 10 training batches exactly as they were provided in the files.
But if I print the data when it is fed into my model, I get something different:

Code

class ResContextBlock(nn.Module):
    def __init__(self, in_filters, out_filters):
        super(ResContextBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_filters, out_filters, kernel_size=(1, 1), stride=1)
        self.act1 = nn.LeakyReLU()

    def forward(self, x):
        # print(x.shape)
        shortcut = self.conv1(x)
        print(x)  # debug print: what the model actually receives here
        shortcut = self.act1(shortcut)
        # ... (rest of forward omitted)

In this case, the first iteration is OK: the first batch of the training dataset is displayed properly, but the next ones are modified. They do not contain the values from the bag file I am using. My guess is that somewhere between the trainer.fit call and the model's forward() method the data get modified. I checked the trainer.fit() function from Lightning, but couldn't find where that would happen.
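To pin this down with a value-level comparison instead of eyeballing printed tensors, a minimal sketch could look like the following (it assumes the reference_batches copies from the sketch above are reachable from the model; values_match and the seen counter are just hypothetical names):

    import torch

    # Illustrative check: compare a tensor seen inside the model against the
    # expected batch, ignoring the move to the GPU that happens before the
    # model is called.
    def values_match(x, expected):
        return torch.equal(x.detach().cpu(), expected)

    # e.g. inside forward(), with a simple counter kept on the module:
    #   if self.seen < len(reference_batches):
    #       print("batch unchanged:", values_match(x, reference_batches[self.seen]))
    #       self.seen += 1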

To describe the data variables: train_dataloader yields pairs of sample tensors and label tensors, and x is directly a sample tensor. So the starting point for analysing the Lightning code would be where the sample batches are extracted before being fed into the model:

    def training_step(self, batch, batch_idx):
        """ Callback for a training step in PyTorch Lightning """
        x, y = batch
        print(x.shape)
        print(x)
        y_hat = self.model(x)
        loss = self._loss_func(y_hat, y.long())
        acc = self.train_acc(y_hat, y)

        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=False)
        self.log("train_acc", acc, on_step=False, on_epoch=True, prog_bar=False, logger=False)
        return loss

When I print here, I also get correct values, so the x extracted from the batch is OK. The question is what happens to it at the line y_hat = self.model(x).
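One way to see exactly what arrives at self.model(x), without touching the model code itself, would be a forward pre-hook on the wrapped model (just a debugging sketch with a hypothetical inspect_inputs helper, not something that is in my current setup):

    # Debugging sketch: a forward pre-hook prints whatever tensor actually reaches
    # the wrapped model right before its forward() runs. It could be registered
    # once, e.g. in the LightningModule's __init__.
    def inspect_inputs(module, inputs):
        x = inputs[0]
        print("input to", type(module).__name__, ":", x.shape, x.dtype, x.device)

    handle = self.model.register_forward_pre_hook(inspect_inputs)
    # and removed again later with handle.remove()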

Validation sanity check: 0it [00:00, ?it/s]
Epoch 0: 0%| | 0/220 [00:00<?, ?it/s]
torch.Size([1, 1, 160, 1000])
tensor([[[[21.6000, 21.6000, 21.7500, ..., 13.5000, 13.3500, 13.6500],
[21.4500, 21.4500, 21.4500, ..., 13.3500, 13.8000, 13.8000],
[21.6000, 21.6000, 21.6000, ..., 13.8000, 13.8000, 13.6500],
...,
[ 9.7500, 9.6000, 9.6000, ..., 3.6000, 3.7500, 3.7500],
[ 9.4500, 9.4500, 9.4500, ..., 3.7500, 3.7500, 3.7500],
[ 9.4500, 9.3000, 9.4500, ..., 3.6000, 3.7500, 3.6000]]]],
device='cuda:0')

I could provide more insight into my code if needed, but I don't want to overload the question with too many details.

What's your environment?

  • OS: Ubuntu 18.04
  • PyTorch Lightning: 1.0.4
@sebaleme sebaleme added the question Further information is requested label Nov 23, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@tchaton
Contributor

tchaton commented Nov 27, 2020

Hey @sebaleme,

Would you mind reproducing this behaviour using the bug_report_model.py script?
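A minimal reproduction in the spirit of that script could look roughly like the sketch below (only an outline; the exact contents of bug_report_model.py differ between releases):

    # Rough outline of a self-contained reproduction, modelled on the BoringModel
    # pattern used by bug_report_model.py (details may differ between releases).
    import torch
    from torch.utils.data import DataLoader, Dataset

    import pytorch_lightning as pl


    class RandomDataset(Dataset):
        def __init__(self, size, length):
            self.data = torch.randn(length, size)

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return len(self.data)


    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            # print(x) here and compare it with what the DataLoader yields directly
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            return self(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.layer.parameters(), lr=0.1)


    if __name__ == "__main__":
        train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
        trainer = pl.Trainer(max_epochs=1, limit_train_batches=10)
        trainer.fit(BoringModel(), train_data)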

Best regards,
T.C

@tchaton tchaton added bug Something isn't working waiting on author Waiting on user action, correction, or update labels Nov 27, 2020
@stale

stale bot commented Dec 27, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Dec 27, 2020
@stale stale bot closed this as completed Jan 3, 2021