
Cannot replicate training results with seed_everything and deterministic flag = True with DDP #3335

Closed
junwen-austin opened this issue Sep 2, 2020 · 16 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), waiting on author (Waiting on user action, correction, or update)

Comments

@junwen-austin

🐛 Bug

I noticed this when I was adding more metric calculations to the LightningModule, for example a confusion matrix at the end of the validation/test epoch. Comparing runs from before and after I added these functions (which do not appear to depend on any random seed), I noticed the training results are not exactly the same.

However, once the functions were added, re-running gave me the same training results each time.

To Reproduce

Code sample

Expected behavior

The training results should be identical even when deterministic functions (such as additional metrics) are added.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 GPUs DDP
  • Any other relevant information:

Additional context

junwen-austin added the bug and help wanted labels on Sep 2, 2020
@awaelchli
Contributor

Thanks for the report. We're also seeing something similar in our CI; unfortunately, we don't have a lead yet.

I don't quite understand how you came to the conclusion that it is related to metrics. Did you try several runs with and without the metric? I don't think it is the metrics, but rather just a ddp issue.

awaelchli added the distributed label on Sep 4, 2020
@junwen-austin
Author

@awaelchli Thanks for the follow-up. No, I do not think this issue has anything to do with the metrics I was adding; it is more likely related to DDP.

@awaelchli
Contributor

awaelchli commented Sep 20, 2020

@junwen-austin I may have fixed this problem with the linked PR. But to be sure it closes this issue, could you let me know the trainer flags you used? Did you use distributed_backend="ddp_spawn" (which is also the default) together with pytorch_lightning.seed_everything?
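
For reference, a minimal sketch of the configuration being asked about (the model, dataloaders and hyperparameters are hypothetical placeholders, not code from this issue):

import pytorch_lightning as pl

pl.seed_everything(1234)                  # seeds python, numpy and torch in every process

trainer = pl.Trainer(
    gpus=4,
    distributed_backend="ddp",            # "ddp" rather than the "ddp_spawn" default
    deterministic=True,                   # puts cudnn into deterministic mode
    max_epochs=5,
)
trainer.fit(MyModel(), train_loader, val_loader)   # MyModel and the loaders are placeholders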

@junwen-austin
Author

@awaelchli Thanks for letting me know about the progress. I used ddp as the backend, not ddp_spawn. Thanks

@awaelchli
Contributor

Hmm, that's unfortunate, because my fix only applies to ddp_spawn.
For ddp, I ran several scripts (including my research code) and always got deterministic behavior, so I am not sure what the cause is in your case :(

@junwen-austin
Author

@awaelchli It is rather a tricky one. I have two versions of the Lightning model, v1 and v2. The only difference between them is that I added an additional metric (a confusion matrix in this case) in v2, and I noticed the training/validation/test results are slightly off, with both cases using ddp as the backend, the same seed for seed_everything, and the deterministic flag set to True.

Since the additional confusion-matrix code has nothing to do with randomization, I expect to get exactly the same results.
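
To illustrate that point, a confusion matrix is a pure function of the predictions and targets, with no random state involved. A small self-contained check (not code from the project):

import torch

y      = torch.tensor([0, 1, 1, 0, 1])
y_pred = torch.tensor([0, 1, 0, 0, 1])

def confusion(y_pred, y, num_classes=2):
    # count (target, prediction) pairs; no RNG is touched anywhere
    idx = y * num_classes + y_pred
    return torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

# calling it twice on the same inputs gives bit-identical results
assert torch.equal(confusion(y_pred, y), confusion(y_pred, y))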


@awaelchli
Contributor

awaelchli commented Sep 20, 2020

But the confusion matrix is not used for training, right? It is just there for visualization?
So, just to clarify, to be 100% sure we're talking about the same thing:

  • you get the same training loss if you rerun v1 multiple times
  • you get the same training loss if you rerun v2 multiple times
  • the training loss between v1 and v2 is, however, different

sorry if I misunderstood

@junwen-austin
Author

@awaelchli

  • The confusion matrix is used at the end of the validation/test loop, but not during the training loop.

  • You get the same training loss if you rerun v1 multiple times: Yes, also the same validation/test loss.

  • You get the same training loss if you rerun v2 multiple times: Yes, also the same validation/test loss.

  • The training loss between v1 and v2 is however different: Yes, also a different validation/test loss.

Sorry about any miscommunication on my part. Thanks

@edenlightning
Contributor

@awaelchli able to repro? any clue?

@awaelchli
Contributor

awaelchli commented Sep 22, 2020

Unfortunately not. I tried to reproduce it by adding the confusion matrix to the validation epoch end as described, but this gives me the same val and train loss, as expected when setting the seed. @junwen-austin I think I need example code from you, otherwise I don't know where to look for the problem.

@junwen-austin
Author

@awaelchli Thanks for looking into it. Sure, the pseudocode looks like this:

In v1, there is no confusion matrix at all, nor any calculation related to it.

In v2, I added the following in the __init__ function:
self.confusion_matrix = ConfusionMatrix()

Then, at the end of the validation/test epoch, I calculate the confusion matrix based on the synced y and y_pred and then log each component of it. Note: in both v1 and v2, y and y_pred are synced before calculating any metric.

def validation_epoch_end(self, result):
    y = result['y']
    y_pred = result['y_pred']

    if self.trainer.use_ddp:
        y = self.sync_across_gpus(y)          # gather across DDP processes
        y_pred = self.sync_across_gpus(y_pred)

    # calculate other metrics
    ...

    # start of the code only in v2:
    cm = self.confusion_matrix(y_pred, y)

    # log each component of cm (assuming a TensorBoard logger):
    self.logger.experiment.add_scalar('TP', cm[1, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FP', cm[0, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FN', cm[1, 0], global_step=self.global_step)
    self.logger.experiment.add_scalar('TN', cm[0, 0], global_step=self.global_step)
    # end of the code only in v2


@awaelchli
Contributor

I had it almost like this, but without the DDP sync. Can you share the whole runnable script? At this point I can only make guesses.

@junwen-austin
Author

junwen-austin commented Sep 23, 2020

@awaelchli
Unfortunately, I cannot share the entire script, but I will provide more information in the hope of giving you what you need.

I have the following helper functions within the LightningModule:

def calculate(self, batch):
    data, y = batch
    logit = self.model(data)
    loss = torch.nn.functional.cross_entropy(logit, y)
    y_hat = torch.softmax(logit, dim=1)       # y_hat is the class probability
    _, y_pred = torch.max(y_hat, dim=1)       # y_pred is the predicted label

    return loss, y, y_hat, y_pred

def compute_metrics(self, y, y_hat, y_pred):
    # compute metrics given y, y_hat, y_pred
    acc = self.accuracy(y_pred, y) / self.trainer.world_size
    f1 = self.f1(y_pred, y) / self.trainer.world_size
    cm = self.confusion_matrix(y_pred, y)     # added in v2

    return acc, f1, cm

@staticmethod
def log_metrics(result, kind, scalars=('loss', 'accuracy', 'f1')):
    for scalar in scalars:
        # log the value stored on the result under that name
        result.log(scalar + '/' + kind, getattr(result, scalar),
                   on_step=False, on_epoch=True, sync_dist=True, sync_dist_op='mean')

def train_valid_test_step(self, batch, kind):
    loss, y, y_hat, y_pred = self.calculate(batch)
    if kind == 'train':
        result = pl.TrainResult(loss)
    else:
        result = pl.EvalResult(checkpoint_on=loss)
        result.val_loss = loss

    result.y = y
    result.y_hat = y_hat
    result.y_pred = y_pred

    return result

def sync_across_gpus(self, t):                # t is a tensor
    gather_t_tensor = [torch.ones_like(t) for _ in range(self.trainer.world_size)]
    torch.distributed.all_gather(gather_t_tensor, t)
    return torch.cat(gather_t_tensor)

def valid_test_epoch_end(self, result, kind):
    loss = result.val_loss
    y = result.y
    y_hat = result.y_hat
    y_pred = result.y_pred

    if self.trainer.use_ddp:
        loss = self.sync_across_gpus(loss)
        y = self.sync_across_gpus(y)
        y_pred = self.sync_across_gpus(y_pred)
        y_hat = self.sync_across_gpus(y_hat)

    acc, f1, cm = self.compute_metrics(y, y_hat, y_pred)
    result.loss = loss.mean()
    result.accuracy = acc
    result.f1 = f1
    result.cm = cm                            # added in v2

    self.log_metrics(result, kind, scalars=('loss', 'accuracy', 'f1'))

    # added in v2 (assuming a TensorBoard logger)
    self.logger.experiment.add_scalar('TP', cm[1, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FP', cm[0, 1], global_step=self.global_step)
    self.logger.experiment.add_scalar('FN', cm[1, 0], global_step=self.global_step)
    self.logger.experiment.add_scalar('TN', cm[0, 0], global_step=self.global_step)

    return result

def training_step(self, batch, batch_idx):
    return self.train_valid_test_step(batch, 'train')

def training_step_end(self, result):
    loss = result.minimize
    result.log('loss/train', loss, on_step=True, sync_dist=True, sync_dist_op='mean')
    return result

def validation_step(self, batch, batch_idx):
    return self.train_valid_test_step(batch, 'valid')

def validation_epoch_end(self, result):
    return self.valid_test_epoch_end(result, 'valid')

def test_step(self, batch, batch_idx):
    return self.train_valid_test_step(batch, 'test')

def test_epoch_end(self, result):
    return self.valid_test_epoch_end(result, 'test')

@awaelchli
Contributor

Still no luck reproducing this. Since you cannot share all the code, I tried to combine what you show here with an existing example, but no luck.

Maybe if you tried yourself to adapt our MNIST example to your confusion matrix use case, we would get a reproducible script. I tried, but I failed.
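
One way such an adaptation might look (a sketch only; the module below and the metric import path are assumptions, not code from this thread):

import torch
import pytorch_lightning as pl
from pytorch_lightning.metrics.classification import ConfusionMatrix

class LitMNISTWithCM(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(28 * 28, 10)
        self.confusion_matrix = ConfusionMatrix()      # the only difference vs. the "v1" model

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.cross_entropy(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'y': y, 'y_pred': self(x).argmax(dim=1)}

    def validation_epoch_end(self, outputs):
        y = torch.cat([o['y'] for o in outputs])
        y_pred = torch.cat([o['y_pred'] for o in outputs])
        cm = self.confusion_matrix(y_pred, y)
        # log something derived from the confusion matrix, as in the report
        self.logger.experiment.add_scalar('cm_trace', cm.trace(), global_step=self.global_step)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

Running this twice with the same seed_everything value, deterministic=True and distributed_backend="ddp", and then once more with the confusion-matrix lines removed, should show whether the reported difference appears.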

@junwen-austin
Author

@awaelchli

Thanks, I will do that and let you know.

edenlightning removed this from the 0.9.x milestone on Oct 2, 2020
edenlightning added the waiting on author label and removed the Blocked on ... label on Oct 2, 2020
@junwen-austin
Author

@awaelchli My apologies, I cannot replicate the issue I saw in my work project. Let's close the issue for now; if I spot a similar one, I will document it properly instead of trying to reconstruct it from the top of my head. Thanks.
