
Cannot replicate training results with seed_everything and deterministic flag = True with DDP #3335

Closed
junwen-austin opened this issue Sep 2, 2020 · 16 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), waiting on author (Waiting on user action, correction, or update)

Comments

@junwen-austin

🐛 Bug

I noticed this when I was adding more metric calculations to the LightningModule, for example a confusion matrix at the end of the validation/test epoch. Comparing runs from before and after I added these functions (which do not appear to depend on any random seed), I noticed the training results are not exactly the same.

However, once the functions were added, re-running gave me the same training results each time.

To Reproduce

Code sample

Expected behavior

The training results should be identical even when deterministic functions (such as additional metrics) are added.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 GPUs DDP
  • Any other relevant information:

Additional context

junwen-austin added the bug and help wanted labels on Sep 2, 2020
@awaelchli
Contributor

Thanks for the report. We're also seeing something similar in our CI; unfortunately, we don't have a lead yet.

I don't quite understand how you came to the conclusion that it is related to metrics. Did you try several runs with and without the metric? I don't think it is the metrics, but rather just a ddp issue.

awaelchli added the distributed label on Sep 4, 2020
@junwen-austin
Author

@awaelchli Thanks for the follow-up. No, I do not think this issue has anything to do with the metrics I was adding; it is more likely related to DDP.

@awaelchli
Contributor

awaelchli commented Sep 20, 2020

@junwen-austin I may have fixed this problem with the linked PR. But to be sure it closes this issue, could you let me know the trainer flags you used? Did you use distributed_backend="ddp_spawn" (which is also the default) together with pytorch_lightning.seed_everything?
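
For reference, a minimal sketch of the configuration being asked about (the model, dataloaders and hyperparameters are hypothetical placeholders, not code from this issue):

import pytorch_lightning as pl

pl.seed_everything(1234)                  # seeds python, numpy and torch in every process

trainer = pl.Trainer(
    gpus=4,
    distributed_backend="ddp",            # "ddp" rather than the "ddp_spawn" default
    deterministic=True,                   # puts cudnn into deterministic mode
    max_epochs=5,
)
trainer.fit(MyModel(), train_loader, val_loader)   # MyModel and the loaders are placeholders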

@junwen-austin
Author

@awaelchli Thanks for letting me know about the progress. I used ddp as the backend, not ddp_spawn. Thanks

@awaelchli
Contributor

Hmm, that's unfortunate, because my fix only applies to ddp_spawn.
For ddp, I ran several scripts (including my research code) and always got deterministic behavior, so I am not sure what the cause is in your case :(

@junwen-austin
Author

@awaelchli It is rather a tricky one. I have two versions of the Lightning model, v1 and v2. The only difference between them is that I added an additional metric (a confusion matrix in this case) in v2, and I noticed the training/validation/test results are slightly off, with both cases using ddp as the backend, the same seed for seed_everything, and the deterministic flag set to True.

Since the additional confusion-matrix code has nothing to do with randomization, I expect to get exactly the same results.
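
To illustrate that point, a confusion matrix is a pure function of the predictions and targets, with no random state involved. A small self-contained check (not code from the project):

import torch

y      = torch.tensor([0, 1, 1, 0, 1])
y_pred = torch.tensor([0, 1, 0, 0, 1])

def confusion(y_pred, y, num_classes=2):
    # count (target, prediction) pairs; no RNG is touched anywhere
    idx = y * num_classes + y_pred
    return torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

# calling it twice on the same inputs gives bit-identical results
assert torch.equal(confusion(y_pred, y), confusion(y_pred, y))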


@awaelchli
Contributor

awaelchli commented Sep 20, 2020

But the confusion matrix is not used for training, right? It is just there for visualization?
So, just to clarify, to be 100% sure we're talking about the same thing:

  • you get the same training loss if you rerun v1 multiple times
  • you get the same training loss if you rerun v2 multiple times
  • the training loss between v1 and v2 is, however, different

sorry if I misunderstood

@junwen-austin
Author

@awaelchli

  • The confusion matrix is used at the end of the validation/test loop, but not during the training loop.

  • You get the same training loss if you rerun v1 multiple times: Yes, also the same validation/test loss.

  • You get the same training loss if you rerun v2 multiple times: Yes, also the same validation/test loss.

  • The training loss between v1 and v2 is however different: Yes, also a different validation/test loss.

Sorry about any miscommunication on my part. Thanks

@edenlightning
Contributor

@awaelchli able to repro? any clue?

@awaelchli
Contributor

awaelchli commented Sep 22, 2020

Unfortunately not. I tried to reproduce it by adding the confusion matrix to the validation epoch end as described, but this gives me the same val and train loss, as expected when setting the seed. @junwen-austin I think I need example code from you, otherwise I don't know where to look for the problem.

@junwen-austin
Author

@awaelchli Thanks for looking into it. Sure, the pseudocode looks like this:

In v1, there is no confusion matrix at all, nor any calculation related to it.

In v2, I added the following in the __init__ function:
self.confusion_matrix = ConfusionMatrix()

Then, at the end of the validation/test epoch, I calculate the confusion matrix based on the synced y and y_pred and then log each component of it. Note: in both v1 and v2, y and y_pred are synced before calculating any metric.

def validation_epoch_end(self, result):
    y = result['y']
    y_pred = result['y_pred']

    if self.trainer.use_ddp:
        y = self.sync_across_gpus(y)          # gather across DDP processes
        y_pred = self.sync_across_gpus(y_pred)

    # calculate other metrics
    ...

    # start of the code only in v2:
    cm = self.confusion_matrix(y_pred, y)

    # log each component of cm (assuming a TensorBoard logger):
    self.logger.experiment.add_scalar('TP', cm[1, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FP', cm[0, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FN', cm[1, 0], global_step=self.global_step)
    self.logger.experiment.add_scalar('TN', cm[0, 0], global_step=self.global_step)
    # end of the code only in v2


@awaelchli
Contributor

I had it almost like this, but without the DDP sync. Can you share the whole runnable script? At this point I can only make guesses.

@junwen-austin
Author

junwen-austin commented Sep 23, 2020

@awaelchli
Unfortunately, I cannot share the entire script, but I will provide more information in the hope of giving you what you need.

I have the following helper functions within the LightningModule:

def calculate(self, batch):
    data, y = batch
    logit = self.model(data)
    loss = torch.nn.functional.cross_entropy(logit, y)
    y_hat = torch.softmax(logit, dim=1)       # y_hat is the class probability
    _, y_pred = torch.max(y_hat, dim=1)       # y_pred is the predicted label

    return loss, y, y_hat, y_pred

def compute_metrics(self, y, y_hat, y_pred):
    # compute metrics given y, y_hat, y_pred
    acc = self.accuracy(y_pred, y) / self.trainer.world_size
    f1 = self.f1(y_pred, y) / self.trainer.world_size
    cm = self.confusion_matrix(y_pred, y)     # added in v2

    return acc, f1, cm

@staticmethod
def log_metrics(result, kind, scalars=('loss', 'accuracy', 'f1')):
    for scalar in scalars:
        # log the value stored on the result under that name
        result.log(scalar + '/' + kind, getattr(result, scalar),
                   on_step=False, on_epoch=True, sync_dist=True, sync_dist_op='mean')

def train_valid_test_step(self, batch, kind):
    loss, y, y_hat, y_pred = self.calculate(batch)
    if kind == 'train':
        result = pl.TrainResult(loss)
    else:
        result = pl.EvalResult(checkpoint_on=loss)
        result.val_loss = loss

    result.y = y
    result.y_hat = y_hat
    result.y_pred = y_pred

    return result

def sync_across_gpus(self, t):                # t is a tensor
    gather_t_tensor = [torch.ones_like(t) for _ in range(self.trainer.world_size)]
    torch.distributed.all_gather(gather_t_tensor, t)
    return torch.cat(gather_t_tensor)

def valid_test_epoch_end(self, result, kind):
    loss = result.val_loss
    y = result.y
    y_hat = result.y_hat
    y_pred = result.y_pred

    if self.trainer.use_ddp:
        loss = self.sync_across_gpus(loss)
        y = self.sync_across_gpus(y)
        y_pred = self.sync_across_gpus(y_pred)
        y_hat = self.sync_across_gpus(y_hat)

    acc, f1, cm = self.compute_metrics(y, y_hat, y_pred)
    result.loss = loss.mean()
    result.accuracy = acc
    result.f1 = f1
    result.cm = cm                            # added in v2

    self.log_metrics(result, kind, scalars=('loss', 'accuracy', 'f1'))

    # added in v2 (assuming a TensorBoard logger)
    self.logger.experiment.add_scalar('TP', cm[1, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FP', cm[0, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FN', cm[1, 0], global_step=self.global_step)
    self.logger.experiment.add_scalar('TN', cm[0, 0], global_step=self.global_step)

    return result

def training_step(self, batch, batch_idx):
    return self.train_valid_test_step(batch, 'train')

def training_step_end(self, result):
    loss = result.minimize
    result.log('loss/train', loss, on_step=True, sync_dist=True, sync_dist_op='mean')
    return result

def validation_step(self, batch, batch_idx):
    return self.train_valid_test_step(batch, 'valid')

def validation_epoch_end(self, result):
    return self.valid_test_epoch_end(result, 'valid')

def test_step(self, batch, batch_idx):
    return self.train_valid_test_step(batch, 'test')

def test_epoch_end(self, result):
    return self.valid_test_epoch_end(result, 'test')

@awaelchli
Contributor

Still no luck reproducing this. Since you cannot share all the code, I tried to combine what you show here with an existing example, but no luck.

Maybe if you tried yourself to adapt our MNIST example to your confusion matrix use case, we would get a reproducible script. I tried, but I failed.
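
One way such an adaptation might look (a sketch only; the module below and the metric import path are assumptions, not code from this thread):

import torch
import pytorch_lightning as pl
from pytorch_lightning.metrics.classification import ConfusionMatrix

class LitMNISTWithCM(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(28 * 28, 10)
        self.confusion_matrix = ConfusionMatrix()      # the only difference vs. the "v1" model

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.cross_entropy(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'y': y, 'y_pred': self(x).argmax(dim=1)}

    def validation_epoch_end(self, outputs):
        y = torch.cat([o['y'] for o in outputs])
        y_pred = torch.cat([o['y_pred'] for o in outputs])
        cm = self.confusion_matrix(y_pred, y)
        # log something derived from the confusion matrix, as in the report
        self.logger.experiment.add_scalar('cm_trace', cm.trace(), global_step=self.global_step)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

Running this twice with the same seed_everything value, deterministic=True and distributed_backend="ddp", and then once more with the confusion-matrix lines removed, should show whether the reported difference appears.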

@junwen-austin
Author

@awaelchli

Thanks, I will do that and let you know.

edenlightning removed this from the 0.9.x milestone on Oct 2, 2020
edenlightning added the waiting on author label and removed the Blocked on ... label on Oct 2, 2020
@junwen-austin
Author

@awaelchli My apologies, I cannot replicate the issue I saw in my work project. Let's close the issue for now; if I spot a similar one, I will document it properly instead of trying to reconstruct it from the top of my head. Thanks.
