
Behavior of the trainer.fit function #4821

Closed
sebaleme opened this issue Nov 23, 2020 · 3 comments
Labels
  • bug: Something isn't working
  • question: Further information is requested
  • waiting on author: Waiting on user action, correction, or update
  • won't fix: This will not be worked on

Comments

@sebaleme

sebaleme commented Nov 23, 2020

I am using the Lightning environment to train my NN model, and I am getting issues with the outputs. I decided to check the training dataset to make sure it is not corrupted during data preparation. However, depending on where I print the inputs, I get different results. If I print the data directly after the PyTorch DataLoader, I get what I expect.

Code

def fit(self, f_train_data, f_val_data):
        """ Calls the fit function of the model """

        train_dataloader = DataLoader(f_train_data.get_torch_tensors(), batch_size=self._batch_size, num_workers=4)
        val_dataloader = DataLoader(f_val_data.get_torch_tensors(), batch_size=self._batch_size, num_workers=4)

        for i, (input_x, target) in enumerate(train_dataloader):
            if i < 10:
                print(input_x)

        # fit() automatically stores a checkpoint of the weights after each epoch.
        history = self.trainer.fit(self._lightening_model, train_dataloader, val_dataloader)
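As a side note, here is a small sketch (purely illustrative, the reference_batches list is not part of my actual code) of how these first batches could additionally be kept as reference copies for a later value comparison:

    # Illustrative only: keep detached copies of the first 10 batches so they can
    # later be compared against the tensors the model receives in forward().
    reference_batches = []
    for i, (input_x, target) in enumerate(train_dataloader):
        if i >= 10:
            break
        reference_batches.append(input_x.detach().clone())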

In any case, printing directly from the DataLoader like this gives me the first 10 training batches exactly as they were provided in the files.
But if I print the data when it is fed into my model, I get something different:

Code

class ResContextBlock(nn.Module):
    def __init__(self, in_filters, out_filters):
        super(ResContextBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_filters, out_filters, kernel_size=(1, 1), stride=1)
        self.act1 = nn.LeakyReLU()

    def forward(self, x):
        # print(x.shape)
        shortcut = self.conv1(x)
        print(x)  # debug print: what the model actually receives here
        shortcut = self.act1(shortcut)
        # ... (rest of forward omitted)

In this case, the first iteration is OK: the first batch of the training dataset is displayed properly, but the next ones are modified. They do not contain the values from the bag file I am using. My guess is that somewhere between the trainer.fit call and the model's forward() method the data get modified. I checked the trainer.fit() function from Lightning, but couldn't find where that would happen.
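To pin this down with a value-level comparison instead of eyeballing printed tensors, a minimal sketch could look like the following (it assumes the reference_batches copies from the sketch above are reachable from the model; values_match and the seen counter are just hypothetical names):

    import torch

    # Illustrative check: compare a tensor seen inside the model against the
    # expected batch, ignoring the move to the GPU that happens before the
    # model is called.
    def values_match(x, expected):
        return torch.equal(x.detach().cpu(), expected)

    # e.g. inside forward(), with a simple counter kept on the module:
    #   if self.seen < len(reference_batches):
    #       print("batch unchanged:", values_match(x, reference_batches[self.seen]))
    #       self.seen += 1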

To describe the data variables: train_dataloader yields pairs of sample tensors and label tensors, and x is directly a sample tensor. So the starting point for analysing the Lightning code would be where the sample batches are extracted before being fed into the model:

    def training_step(self, batch, batch_idx):
        """ Callback for a training step in PyTorch Lightning """
        x, y = batch
        print(x.shape)
        print(x)
        y_hat = self.model(x)
        loss = self._loss_func(y_hat, y.long())
        acc = self.train_acc(y_hat, y)

        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=False)
        self.log("train_acc", acc, on_step=False, on_epoch=True, prog_bar=False, logger=False)
        return loss

When I print here, I also get correct values, so the x extracted from the batch is OK. The question is what happens to it at the line y_hat = self.model(x).
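One way to see exactly what arrives at self.model(x), without touching the model code itself, would be a forward pre-hook on the wrapped model (just a debugging sketch with a hypothetical inspect_inputs helper, not something that is in my current setup):

    # Debugging sketch: a forward pre-hook prints whatever tensor actually reaches
    # the wrapped model right before its forward() runs. It could be registered
    # once, e.g. in the LightningModule's __init__.
    def inspect_inputs(module, inputs):
        x = inputs[0]
        print("input to", type(module).__name__, ":", x.shape, x.dtype, x.device)

    handle = self.model.register_forward_pre_hook(inspect_inputs)
    # and removed again later with handle.remove()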

Validation sanity check: 0it [00:00, ?it/s]
Epoch 0: 0%| | 0/220 [00:00<?, ?it/s]
torch.Size([1, 1, 160, 1000])
tensor([[[[21.6000, 21.6000, 21.7500, ..., 13.5000, 13.3500, 13.6500],
[21.4500, 21.4500, 21.4500, ..., 13.3500, 13.8000, 13.8000],
[21.6000, 21.6000, 21.6000, ..., 13.8000, 13.8000, 13.6500],
...,
[ 9.7500, 9.6000, 9.6000, ..., 3.6000, 3.7500, 3.7500],
[ 9.4500, 9.4500, 9.4500, ..., 3.7500, 3.7500, 3.7500],
[ 9.4500, 9.3000, 9.4500, ..., 3.6000, 3.7500, 3.6000]]]],
device='cuda:0')

I could provide more insight into my code if needed, but I don't want to overload the question with too many details.

What's your environment?

  • OS: Ubuntu 18.04
  • PyTorch Lightning: 1.0.4
@sebaleme sebaleme added the question Further information is requested label Nov 23, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@tchaton
Contributor

tchaton commented Nov 27, 2020

Hey @sebaleme,

Would you mind reproducing this behaviour using the bug_report_model.py script?
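A minimal reproduction in the spirit of that script could look roughly like the sketch below (only an outline; the exact contents of bug_report_model.py differ between releases):

    # Rough outline of a self-contained reproduction, modelled on the BoringModel
    # pattern used by bug_report_model.py (details may differ between releases).
    import torch
    from torch.utils.data import DataLoader, Dataset

    import pytorch_lightning as pl


    class RandomDataset(Dataset):
        def __init__(self, size, length):
            self.data = torch.randn(length, size)

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return len(self.data)


    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            # print(x) here and compare it with what the DataLoader yields directly
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            return self(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.layer.parameters(), lr=0.1)


    if __name__ == "__main__":
        train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
        trainer = pl.Trainer(max_epochs=1, limit_train_batches=10)
        trainer.fit(BoringModel(), train_data)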

Best regards,
T.C

@tchaton tchaton added bug Something isn't working waiting on author Waiting on user action, correction, or update labels Nov 27, 2020
@stale

stale bot commented Dec 27, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Dec 27, 2020
@stale stale bot closed this as completed Jan 3, 2021